My question is not about a specific code snippet but more general, so please bear with me:
How should I organize the data I'm analyzing, and which tools should I use to manage it?
I'm using python and numpy to analyse data. Because the python documentation indicates that dictionaries are very optimized in python, and also due to the fact that the data itself is very structured, I stored it in a deeply nested dictionary.
Here is a skeleton of the dictionary: the position in the hierarchy defines the nature of the element, and each new line defines the contents of a key in the precedent level:
[AS091209M02] [AS091209M01] [AS090901M06] ...
[100113] [100211] [100128] [100121]
[R16] [R17] [R03] [R15] [R05] [R04] [R07] ...
[1263399103] ...
[ImageSize] [FilePath] [Trials] [Depth] [Frames] [Responses] ...
[N01] [N04] ...
[Sequential] [Randomized]
[Ch1] [Ch2]
Edit: To explain a bit better my data set:
[individual] ex: [AS091209M02]
[imaging session (date string)] ex: [100113]
[Region imaged] ex: [R16]
[timestamp of file] ex [1263399103]
[properties of file] ex: [Responses]
[regions of interest in image ] ex [N01]
[format of data] ex [Sequential]
[channel of acquisition: this key indexes an array of values] ex [Ch1]
The type of operations I perform is for instance to compute properties of the arrays (listed under Ch1, Ch2), pick up arrays to make a new collection, for instance analyze responses of N01 from region 16 (R16) of a given individual at different time points, etc.
This structure works well for me and is very fast, as promised. I can analyze the full data set pretty quickly (and the dictionary is far too small to fill up my computer's ram : half a gig).
My problem comes from the cumbersome manner in which I need to program the operations of the dictionary. I often have stretches of code that go like this:
for mk in dic.keys():
for rgk in dic[mk].keys():
for nk in dic[mk][rgk].keys():
for ik in dic[mk][rgk][nk].keys():
for ek in dic[mk][rgk][nk][ik].keys():
#do something
which is ugly, cumbersome, non reusable, and brittle (need to recode it for any variant of the dictionary).
I tried using recursive functions, but apart from the simplest applications, I ran into some very nasty bugs and bizarre behaviors that caused a big waste of time (it does not help that I don't manage to debug with pdb in ipython when I'm dealing with deeply nested recursive functions). In the end the only recursive function I use regularly is the following:
def dicExplorer(dic, depth = -1, stp = 0):
'''prints the hierarchy of a dictionary.
if depth not specified, will explore all the dictionary
'''
if depth - stp == 0: return
try : list_keys = dic.keys()
except AttributeError: return
stp += 1
for key in list_keys:
else: print '+%s> [\'%s\']' %(stp * '---', key)
dicExplorer(dic[key], depth, stp)
I know I'm doing this wrong, because my code is long, noodly and non-reusable. I need to either use better techniques to flexibly manipulate the dictionaries, or to put the data in some database format (sqlite?). My problem is that since I'm (badly) self-taught in regards to programming, I lack practical experience and background knowledge to appreciate the options available. I'm ready to learn new tools (SQL, object oriented programming), whatever it takes to get the job done, but I am reluctant to invest my time and efforts into something that will be a dead end for my needs.
So what are your suggestions to tackle this issue, and be able to code my tools in a more brief, flexible and re-usable manner?
Addendum: apart of doing something with a particular sub-dictionary of the data dictionary, here are some examples of operations I implemented for the dataset dic, or a sub dictionary of it:
actually I have some recursive functions that worked well:
def normalizeSeqDic(dic, norm_dic = {}, legend = ()):
'''returns a normalized dictionary from a seq_amp_dic. Normalization is performed using the first time point as reference
'''
try :
list_keys = dic.keys()
for key in list_keys:
next_legend = legend + (key,)
normalizeSeqDic(dic[key], norm_dic, next_legend)
except AttributeError:
# normalization
# unpack list
mk, ek, nk, tpk = legend
#assign values to amplitude dict
if mk not in norm_dic: norm_dic[mk] = {}
if ek not in norm_dic[mk]: norm_dic[mk][ek] = {}
if nk not in norm_dic[mk][ek]: norm_dic[mk][ek][nk] = {}
if tpk not in norm_dic[mk][ek][nk]: norm_dic[mk][ek][nk][tpk] = {}
new_array = []
for x in range(dic.shape[0]):
new_array.append(dic[x][1:]/dic[x][0])
new_array = asarray(new_array)
norm_dic[mk][ek][nk][tpk] = new_array
return norm_dic
def poolDic(dic):
'''returns a dic in which all the values are pooled, and root (mk) keys are fused
these pooled dics can later be combined into another dic
'''
pooled_dic = {}
for mk in dic.keys():
for ek in dic[mk].keys():
for nk in dic[mk][ek].keys():
for tpk in dic[mk][ek][nk].keys():
#assign values to amplitude dict
if ek not in pooled_dic: pooled_dic[ek] = {}
if nk not in pooled_dic[ek]: pooled_dic[ek][nk] = {}
if tpk not in pooled_dic[ek][nk]:
pooled_dic[ek][nk][tpk] = dic[mk][ek][nk][tpk]
else: pooled_dic[ek][nk][tpk]= vstack((pooled_dic[ek][nk][tpk], dic[mk][ek][nk][tpk]))
return pooled_dic
def timePointsDic(dic):
'''Determines the timepoints for each individual key at root
'''
tp_dic = {}
for mk in dic.keys():
tp_list = []
for rgk in dic[mk].keys():
tp_list.extend(dic[mk][rgk]['Neuropil'].keys())
tp_dic[mk]=tuple(sorted(list(set(tp_list))))
return tp_dic
for some operations I found no other way than to flatten the dictionary:
def flattenDic(dic, label):
'''flattens a dic to produce a list of of tuples containing keys and 'label' values
'''
flat_list = []
for mk in dic.keys():
for rgk in dic[mk].keys():
for nk in dic[mk][rgk].keys():
for ik in dic[mk][rgk][nk].keys():
for ek in dic[mk][rgk][nk][ik].keys():
flat_list.append((mk, rgk, nk, ik, ek, dic[mk][rgk][nk][ik][ek][label])
return flat_list
def extractDataSequencePoints(flat_list, mk, nk, tp_list):
'''produces a list containing arrays of time point values
time_points is a list of the time points wished (can have 2 or 3 elements)
'''
nb_tp = len(tp_list)
# build tp_seq list
tp_seq = []
tp1, tp2, tp3 = [], [], []
if nk == 'Neuropil':
tp1.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil' and x[3] == tp_list[0])
tp2.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil'and x[3] == tp_list[1])
else:
tp1.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[0])
tp2.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[1])
if nb_tp == 3:
if nk == 'Neuropil':
tp3.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil'and x[3] == tp_list[2])
else:
tp3.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[2])
for x in tp1:
for y in tp2:
if x[0:3] == y[0:3] :
if nb_tp == 3:
for z in tp3:
if x[0:3] == z[0:3] :
tp_seq.append(asarray([x[4],y[4],z[4]]))
else:
tp_seq.append(asarray([x[4],y[4]]))
return tp_seq