
I'm just starting with NumPy so I may be missing some core concepts...

What's the best way to create a NumPy array from a dictionary whose values are lists?

Something like this:

d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }

Should turn into something like:

data = [
  [10,20,30,?,?],
  [50,60,?,?,?],
  [100,200,300,400,500]
]

I'm going to do some basic statistics on each row, e.g.:

deviations = numpy.std(data, axis=1)

Questions:

  • What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large; a couple of million keys, each with ~20 items.

  • The number of values for each 'row' is different. If I understand correctly, numpy wants uniform sizes, so what do I fill in for the missing items to make std() happy?

Update: One thing I forgot to mention - while the Python techniques are reasonable (e.g. looping over a few million items is fast), they're constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.

+6  A: 

You don't need to create numpy arrays to call numpy.std(). You can call numpy.std() in a loop over all the values of your dictionary; each list will be converted to a numpy array on the fly to compute the standard deviation.

The downside of this method is that the main loop will be in Python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory as you won't have to store 0 values where you have variable-size arrays.
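For instance, a minimal sketch of that per-row loop, assuming d is the dictionary from the question (Python 2 syntax to match the rest of the thread):

import numpy

deviations = {}
for key, row in d.iteritems():
    # the plain Python list is converted to an ndarray inside numpy.std()
    deviations[key] = numpy.std(row)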

  • If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the Python list -> numpy array conversion only once.
  • If you find that this is still too slow, try to use Psyco to optimize the Python loop.
  • If this is still too slow, try using Cython together with the numpy module. This tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with the sum function).
  • An alternative to Cython would be to use SWIG with numpy.i.
  • If you want to use only numpy and have everything computed at C level, try grouping all the records of the same size together in different arrays and calling numpy.std() on each of them. It should look like the following example.

Example with O(N) complexity:

import numpy
# group the rows by length so that each group forms a rectangular array,
# then let numpy compute the std of every row in a group in a single call
list_size_1 = []
list_size_2 = []
for row in data.itervalues():
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)
list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis=1)
std_2 = numpy.std(list_size_2, axis=1)
Mapad
I'm doing the numpy.std in a loop now, and you're right, the memory savings are important. I would like to at least do a speed comparison with the numpy version though.
Parand
The problem is that numpy.std() was made to accept only fixed-size arrays. So the only way I see to do this test is to group all the records of the same size together and call numpy.std() on each of them.
Mapad
Shouldn't CPython really be Cython? Have I got it wrong?
batbrat
Yes, correct. Fixed.
Mapad
Grouping same sized records, simple but effective. I like it.
Parand
+2  A: 

While there are already some pretty reasonable ideas present here, I believe the following is worth mentioning.

Filling missing data with any default value would spoil the statistical characteristics (std, etc.). Evidently that's why Mapad proposed the nice trick with grouping same-sized records. The problem with it (assuming no a priori data on record lengths is at hand) is that it involves even more computations than the straightforward solution:

  1. at least O(N*logN) 'len' calls and comparisons for sorting with an effective algorithm
  2. O(N) checks on the second pass through the list to obtain the groups (their beginning and end indexes on the 'vertical' axis)

Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
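For reference, enabling Psyco is usually just the following two lines (assuming the psyco package is installed; it only runs on 32-bit x86 builds of Python 2):

import psyco
psyco.full()    # JIT-compiles every Python function that runs afterwards

# ... then run the plain-Python loop over the dictionary as above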

It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification - not to generate the whole list, but to iterate through the dictionary, converting each row into a numpy.array and performing the required computations. Like this:

for row in data.itervalues():
    np_row = numpy.array(row)    
    this_row_std = numpy.std(np_row)
    # compute any other statistic descriptors needed and then save to some list

In any case, a few million loops in Python won't take as long as one might expect. Besides, this doesn't look like a routine computation, so who cares if it takes an extra second/minute if it is run once in a while or even just once.


A generalized variant of what was suggested by Mapad:

from numpy import array, mean, std

def get_statistical_descriptors(a):
    ax = len(a.shape) - 1              # last axis: one statistic per row
    functions = [mean, std]
    return [f(a, axis=ax) for f in functions]


def process_long_list_stats(data):
    import numpy

    groups = {}

    # group row keys by row length
    for key, row in data.iteritems():
        size = len(row)
        try:
            groups[size].append(key)
        except KeyError:
            groups[size] = [key]

    results = []

    for gr_keys in groups.itervalues():
        gr_rows = numpy.array([data[k] for k in gr_keys])
        stats = get_statistical_descriptors(gr_rows)
        results.extend(zip(gr_keys, zip(*stats)))

    return dict(results)
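A quick usage sketch with the dictionary from the question (the values shown are what numpy's default, population std should give):

d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
stats = process_long_list_stats(d)
# stats maps each key to a (mean, std) tuple,
# e.g. stats[2] == (55.0, 5.0) and stats[1] is roughly (20.0, 8.165)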
Maleev
Thanks Maleev, this is essentially what I ended up doing. One thing I forgot to mention - while looping in Python is fast, I believe I'm only using a single CPU with this method. Matrix operations hit all CPUs, so they're attractive.
Parand
Why would you need to sort the rows before grouping vectors by length? Only grouping is needed. Moreover, I would be careful with the big-O notation: here N ~ 1,000,000, but a Python program can be ~100 times slower than its C equivalent, so N is not really tending to infinity and the constant factors matter.
Mapad
To Parand: You're right, taking multi-threading into account really does make sense. To Mapad: If I'm not terribly mistaken, grouping is essentially equivalent to sorting. How then do you suggest grouping them?
Maleev
Python code, whether it is just a loop over the rows or the grouping, is executed in either case. So, talking exclusively of the Python code's asymptotic complexity, we've got a difference of p*O(NlogN) - p*O(N) = p*O(NlogN). Besides, the C code looping through rows inside groups adds another c*O(N) to it.
Maleev
You say c << p. Sure. But that still leaves the p*O(NlogN) difference, unless you prove you can really do the grouping in O(N) in the average and worst case.
Maleev
I agree with your simplifications; I just want to recall that they assume N >> p (p being the computation time introduced by a Python loop, compared to a C loop, to process a record). Since all the processing here takes O(N) (see my example), I would not throw away the p when using big-O notation to check the complexity.
Mapad