views:

155

answers:

5

My question is not about a specific code snippet but more general, so please bear with me:

How should I organize the data I'm analyzing, and which tools should I use to manage it?

I'm using python and numpy to analyse data. Because the python documentation indicates that dictionaries are very optimized in python, and also due to the fact that the data itself is very structured, I stored it in a deeply nested dictionary.

Here is a skeleton of the dictionary: the position in the hierarchy defines the nature of the element, and each new line defines the contents of a key in the precedent level:

[AS091209M02] [AS091209M01] [AS090901M06] ... 
[100113] [100211] [100128] [100121] 
[R16] [R17] [R03] [R15] [R05] [R04] [R07] ... 
[1263399103] ... 
[ImageSize] [FilePath] [Trials] [Depth] [Frames] [Responses] ... 
[N01] [N04] ... 
[Sequential] [Randomized] 
[Ch1] [Ch2]

Edit: To explain a bit better my data set:

[individual] ex: [AS091209M02]
[imaging session (date string)] ex: [100113]
[Region imaged] ex: [R16]
[timestamp of file] ex [1263399103]  
[properties of file] ex: [Responses]
[regions of interest in image ] ex [N01]
[format of data] ex [Sequential]
[channel of acquisition: this key indexes an array of values] ex [Ch1]

The type of operations I perform is for instance to compute properties of the arrays (listed under Ch1, Ch2), pick up arrays to make a new collection, for instance analyze responses of N01 from region 16 (R16) of a given individual at different time points, etc.

This structure works well for me and is very fast, as promised. I can analyze the full data set pretty quickly (and the dictionary is far too small to fill up my computer's ram : half a gig).

My problem comes from the cumbersome manner in which I need to program the operations of the dictionary. I often have stretches of code that go like this:

for mk in dic.keys():
    for rgk in dic[mk].keys():
        for nk in dic[mk][rgk].keys():
            for ik in dic[mk][rgk][nk].keys():
                for ek in dic[mk][rgk][nk][ik].keys():
                    #do something

which is ugly, cumbersome, non reusable, and brittle (need to recode it for any variant of the dictionary).

I tried using recursive functions, but apart from the simplest applications, I ran into some very nasty bugs and bizarre behaviors that caused a big waste of time (it does not help that I don't manage to debug with pdb in ipython when I'm dealing with deeply nested recursive functions). In the end the only recursive function I use regularly is the following:

def dicExplorer(dic, depth = -1, stp = 0):
    '''prints the hierarchy of a dictionary.
    if depth not specified, will explore all the dictionary
    '''
    if depth - stp == 0: return
    try : list_keys = dic.keys()
    except AttributeError: return
    stp += 1
    for key in list_keys:
        else: print '+%s> [\'%s\']' %(stp * '---', key)
        dicExplorer(dic[key], depth, stp)

I know I'm doing this wrong, because my code is long, noodly and non-reusable. I need to either use better techniques to flexibly manipulate the dictionaries, or to put the data in some database format (sqlite?). My problem is that since I'm (badly) self-taught in regards to programming, I lack practical experience and background knowledge to appreciate the options available. I'm ready to learn new tools (SQL, object oriented programming), whatever it takes to get the job done, but I am reluctant to invest my time and efforts into something that will be a dead end for my needs.

So what are your suggestions to tackle this issue, and be able to code my tools in a more brief, flexible and re-usable manner?

Addendum: apart of doing something with a particular sub-dictionary of the data dictionary, here are some examples of operations I implemented for the dataset dic, or a sub dictionary of it:

actually I have some recursive functions that worked well:

def normalizeSeqDic(dic, norm_dic = {}, legend = ()):
    '''returns a normalized dictionary from a seq_amp_dic. Normalization is performed using the first time point as reference
    '''
    try : 
        list_keys = dic.keys()
        for key in list_keys:
            next_legend = legend + (key,) 
            normalizeSeqDic(dic[key], norm_dic, next_legend)
    except AttributeError:
        # normalization
        # unpack list
        mk, ek, nk, tpk = legend
        #assign values to amplitude dict
        if mk not in norm_dic: norm_dic[mk] = {}
        if ek not in norm_dic[mk]: norm_dic[mk][ek] = {}
        if nk not in norm_dic[mk][ek]: norm_dic[mk][ek][nk] = {}
        if tpk not in norm_dic[mk][ek][nk]: norm_dic[mk][ek][nk][tpk] = {}
        new_array = []
        for x in range(dic.shape[0]):
            new_array.append(dic[x][1:]/dic[x][0])
        new_array = asarray(new_array)
        norm_dic[mk][ek][nk][tpk] = new_array
    return norm_dic

def poolDic(dic):
    '''returns a dic in which all the values are pooled, and root (mk) keys are fused
    these pooled dics can later be combined into another dic
    '''
    pooled_dic = {}
    for mk in dic.keys():
        for ek in dic[mk].keys():
            for nk in dic[mk][ek].keys():
                for tpk in dic[mk][ek][nk].keys():
                    #assign values to amplitude dict
                    if ek not in pooled_dic: pooled_dic[ek] = {}
                    if nk not in pooled_dic[ek]: pooled_dic[ek][nk] = {}
                    if tpk not in pooled_dic[ek][nk]:
                        pooled_dic[ek][nk][tpk] = dic[mk][ek][nk][tpk]
                    else: pooled_dic[ek][nk][tpk]= vstack((pooled_dic[ek][nk][tpk], dic[mk][ek][nk][tpk]))
    return pooled_dic

def timePointsDic(dic):
    '''Determines the timepoints for each individual key at root
    '''
    tp_dic = {}
    for mk in dic.keys():
        tp_list = []
        for rgk in dic[mk].keys():
            tp_list.extend(dic[mk][rgk]['Neuropil'].keys())
        tp_dic[mk]=tuple(sorted(list(set(tp_list))))
    return tp_dic

for some operations I found no other way than to flatten the dictionary:

def flattenDic(dic, label):
    '''flattens a dic to produce a list of of tuples containing keys and 'label' values
    '''
    flat_list = []
    for mk in dic.keys():
        for rgk in dic[mk].keys():
            for nk in dic[mk][rgk].keys():
                for ik in dic[mk][rgk][nk].keys():
                    for ek in dic[mk][rgk][nk][ik].keys():
                        flat_list.append((mk, rgk, nk, ik, ek, dic[mk][rgk][nk][ik][ek][label])
    return flat_list

def extractDataSequencePoints(flat_list, mk, nk, tp_list):
        '''produces a list containing arrays of time point values
        time_points is a list of the time points wished (can have 2 or 3 elements)
        '''
        nb_tp = len(tp_list)
        # build tp_seq list
        tp_seq = []
        tp1, tp2, tp3 = [], [], []
        if nk == 'Neuropil':
            tp1.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil' and x[3] == tp_list[0])
            tp2.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil'and  x[3] == tp_list[1])
        else:
            tp1.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[0])
            tp2.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[1])
        if nb_tp == 3:
            if nk == 'Neuropil':
                tp3.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil'and x[3] == tp_list[2])
            else:
                tp3.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[2])
        for x in tp1:
            for y in tp2:
                if x[0:3] == y[0:3] :
                    if nb_tp == 3:
                        for z in tp3:
                            if x[0:3] == z[0:3] :
                                tp_seq.append(asarray([x[4],y[4],z[4]]))
                    else:
                        tp_seq.append(asarray([x[4],y[4]]))
        return tp_seq
+1  A: 

You can make your loops look better, by replacing:

for mk in dic.keys():
    for rgk in dic[mk].keys():
        for nk in dic[mk][rgk].keys():
            for ik in dic[mk][rgk][nk].keys():
                for ek in dic[mk][rgk][nk][ik].keys():
                    #do something

with

for mv in dic.values():
    for rgv in mv.values():
        for nv in rgv.values():
            for iv in nv.values():
                for ev in iv.values():
                    #do something

You thus gain access to all the values with a relatively terse code. If you also need some keys, you can do something like:

for (mk, mv) in dic.items():
    # etc.

Depending on your needs, you might also consider creating and then using a single dictionary with tuple keys:

dic[(mk, rgk, nv, ik, ek)]
EOL
Iterating a dictionary object returns the dictionary's keys. You can't then iterate the key and expect to be accessing the value.
MattH
I tried the first block, but at for rgv in mv: I see that rgv iterates through the mv string, instead of the mv.keys()?
AlexandreS
@MattH: you're right, my bad! Fixed. Thanks!
EOL
A: 

You ask: How should I organize the data I'm analyzing, and which tools should I use to manage it?

I suspect that a dictionary, for all its optimization, is not the right answer to that question. I think you'd be better off using XML or, if there is a Python binding for it, HDF5, even NetCDF. Or, as you suggest yourself, a database.

If your project is of sufficient duration and usefulness to warrant learning how to use such technologies, then I think you'll find that learning them now and getting the data structures right is a better path to success than wrestling with the wrong data structures for the entire project. Learning XML, or HDF5, or SQL, or whatever you choose, is building up your general expertise and making you better able to tackle the next project. Sticking with awkward, problem-specific and idiosyncratic data structures leads to the same set of problems next time around.

High Performance Mark
Thanks for letting me know about HDF5 and NetCDF. Python has bindings for both, and the licenses are usable. I think I will start by reading about HDF5, and if it looks promising I will study how to make use of it with my data.These kinds of pointers are exactly what I was hoping for when I wrote my question.
AlexandreS
A: 

You could write a generator function that allows you to iterate over all elements of a certain level:

def elementsAt(dic, level):
    if not hasattr(dic, 'itervalues'):
        return
    for element in dic.itervalues():
        if level == 0:
            yield element
        else:
            for subelement in elementsAt(element, level - 1):
                yield subelement

Which can then be used as following:

for element in elementsAt(dic, 4):
    # Do something with element

If you also need to filter elements, you could first get all elements that need to be filtered (say, the 'rgk' level):

for rgk in getElementsAt(dic, 1):
    if isValid(rgk):
        for ek in getElementsAt(rgk, 2):
            # Do something with ek

At least that'll make it a little easier to work with a hierarchy of dictionaries. Using more descriptive names would help, too.

Pieter Witvoet
thanks for showing me the use of dic.itervalues(), which I had not appreciated so far. The problem is that it does not really address the issue of deeply nested and inflexible for chains.
AlexandreS
+6  A: 

"I stored it in a deeply nested dictionary"

And, as you've seen, it doesn't work out well.

What's the alternative?

  1. Composite keys and a shallow dictionary. You have an 8-part key: ( individual, imaging session, Region imaged, timestamp of file, properties of file, regions of interest in image, format of data, channel of acquisition ) which maps to an array of values.

    { ('AS091209M02', '100113', 'R16', '1263399103', 'Responses', 'N01', 'Sequential', 'Ch1' ): array, 
    ...
    

    The issue with this is search.

  2. Proper class structures. Actually, a full-up Class definition may be overkill.

"The type of operations I perform is for instance to compute properties of the arrays (listed under Ch1, Ch2), pick up arrays to make a new collection, for instance analyze responses of N01 from region 16 (R16) of a given individual at different time points, etc."

Recommendation

First, use a namedtuple for your ultimate object.

Array = namedtuple( 'Array', 'individual, session, region, timestamp, properties, roi, format, channel, data' )

Or something like that. Build a simple list of these named tuple objects. You can then simply iterate over them.

Second, use many simple map-reduce operations on this master list of the array objects.

Filtering:

for a in theMasterArrrayList:
    if a.region = 'R16' and interest = 'N01':
        # do something on these items only.

Reducing by Common Key:

individual_dict = defaultdict(list)
for a in theMasterArrayList:
    individual_dict[ a.individual ].append( a )

This will create a subset in the map that has exactly the items you want.

You can then do indiidual_dict['AS091209M02'] and have all of their data. You can do this for any (or all) of the available keys.

region_dict = defaultdict(list)
for a in theMasterArrayList:
    region_dict[ a.region ].append( a )

This does not copy any data. It's fast and relatively compact in memory.

Mapping (or transforming) the array:

for a in theMasterArrayList:
    someTransformationFunction( a.data )

If the array is itself a list, you're can update that list without breaking the tuple as a whole. If you need to create a new array from an existing array, you're creating a new tuple. There's nothing wrong with this, but it is a new tuple. You wind up with programs like this.

def region_filter( array_list, region_set ):
    for a in array_list:
        if a.region in region_set:
            yield a

def array_map( array_list, someConstant ):
    for a in array_list:
        yield Array( *(a[:8] + (someTranformation( a.data, someConstant ),) )

def some_result( array_list, region, someConstant ):
    for a in array_map( region_filter( array_list, region ), someConstant ):
        yield a

You can build up transformations, reductions, mappings into more elaborate things.

The most important thing is creating only the dictionaries you need from the master list so you don't do any more filtering than is minimally necessary.

BTW. This can be mapped to a relational database trivially. It will be slower, but you can have multiple concurrent update operations. Except for multiple concurrent updates, a relational database doesn't offer any features above this.

S.Lott
This is very helpful. I had already resorted to flattening the data dictionary or a (subsets of it) + list comprehensions to filter the data, but your description of the use of named tuple enables a much cleaner syntax and code than what I have. Plus it involves minimal learning to get the desired results, so I think I'll go for that, at least for the moment.
AlexandreS
A: 

I will share some thoughts about this. Instead of this function:

for mk in dic.keys():
    for rgk in dic[mk].keys():
        for nk in dic[mk][rgk].keys():
            for ik in dic[mk][rgk][nk].keys():
                for ek in dic[mk][rgk][nk][ik].keys():
                    #do something

Which you would like to simply write as:

for ek in deep_loop(dic):
    do_something

There are 2 ways. One is functional, second is generator-like. The second one is:

def deep_loop(dic):
    for mk in dic.keys():
        for rgk in dic[mk].keys():
            for nk in dic[mk][rgk].keys():
                for ik in dic[mk][rgk][nk].keys():
                    for ek in dic[mk][rgk][nk][ik].keys():
                        yield ek

This allows you to capture the logic of going through the dictionary. It is very easy to modify this function to support different ways of going through the structure. It depends on the way your structure changes, if it is just a depth of the loop or something different. Could you post some more advanced examples on what requirements on going through the tree you have? Like filtering, search etc.? The depth would look like this (untested) - it will yield a pair of (tuple of keys), (value):

def deep_loop(dic, depth):
    if depth == 0:
        yield (), dic
    for subkey, subval in dic.items():
        for ktuple, value in deep_loop(subval, depth-1):
            yield (subkey,)+ktuple, value

Now it gets easier:

for (k1,k2,k3,k4), value in deep_loop(dic, 4):
    # do something

There are other ways to customize this, you could add a named tuple type as a parameter of deep_loop. Deep_loop could autodetect the depth from the named tuple and return the named tuple.

ondra
I posted additional examples of what I managed to implement as an addendum, hope that helps. Your first solution packs the search process in a function, but does not solve most of my problems, as searches are often variable, and I would have to define many functions to describe each search. Your second option addresses this point, but uses recursion, which frightens me a bit after the bad experiences I had with it (although I managed to write some working recursive functions). But thanks for showing me how you coupled dic.items() with generator expressions, that's useful to me.
AlexandreS