Hi,

I would like to write a nice function to aggregate data in an array (it's a numpy record array, but that does not change anything).

You have an array of data that you want to aggregate along one axis: for example, an array with dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)], and you want the mean income per job.

I wrote this function; in the example it would be called as aggregate(data, 'job', 'income', mean):


def aggregate(data, key, value, func):
    data_per_key = {}
    for k,v in zip(data[key], data[value]):
        if k not in data_per_key.keys():
            data_per_key[k]=[]
        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]


The problem is that I don't find it very elegant; I would like to do it in one line. Do you have any ideas?

Thanks for your answers,
Louis

PS: I would like to keep func in the call so that you can also ask for the median, minimum...

A: 

http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method

This should help make it a little prettier, more Pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections of that handout.
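
One possible reading of this advice, as a minimal sketch: it uses dict.setdefault to drop the explicit membership test. Whether the linked sections recommend exactly this form is an assumption, and the plain dict standing in for the record array is a hypothetical sample, not the asker's data.

```python
def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        # setdefault returns the list already stored under k,
        # or stores and returns a new empty list if k is missing
        data_per_key.setdefault(k, []).append(v)
    return [(k, func(vs)) for k, vs in data_per_key.items()]


# Hypothetical sample: a dict of columns stands in for the record array
sample = {
    'job': ['Digger', 'Planter', 'Waterer', 'Planter', 'Digger'],
    'income': [1, 2, 3, 3, 7],
}
print(aggregate(sample, 'job', 'income', max))
```

Any object that supports data[key] and data[value] lookups returning equal-length sequences should work here, including a numpy record array.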

skyl
+2  A: 

Your `if k not in data_per_key.keys()` could be rewritten as `if k not in data_per_key`, but you can do even better with `defaultdict`. Here's a version that uses `defaultdict` to get rid of the existence check:

import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k,v in zip(data[key], data[value]):
        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]
Hank Gay
I'd change the last line to `return [(k, func(v)) for k, v in data_per_key.items()]`
gnibbler
That's a good call, but I was trying to highlight the `defaultdict` stuff by making that the only change. Your return is definitely better, though.
Hank Gay
Thanks for the defaultdict trick, and also for the improved final iteration!
Louis
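
Putting the defaultdict version together with the suggested change to the return line gives something like the sketch below. The sample data is a hypothetical dict of columns used only to keep the example self-contained, and statistics.median is just one choice for func.

```python
import collections
import statistics


def aggregate(data, key, value, func):
    # Missing keys start out as an empty list, so no existence check is needed
    data_per_key = collections.defaultdict(list)
    for k, v in zip(data[key], data[value]):
        data_per_key[k].append(v)
    # Iterate over (key, values) pairs directly instead of looking each key up again
    return [(k, func(vs)) for k, vs in data_per_key.items()]


# Hypothetical sample: a dict of columns stands in for the record array
sample = {
    'job': ['Digger', 'Planter', 'Waterer', 'Planter', 'Digger'],
    'income': [1, 2, 3, 3, 7],
}
print(aggregate(sample, 'job', 'income', statistics.median))
```

Because func is passed in by the caller, the same function serves for mean, median, min, or any other callable that accepts a list of values.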
A: 

Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:

import numpy as np
import matplotlib.mlab

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

# Group rows by 'job' and apply np.mean to 'income', storing it as 'avg_income'
result=matplotlib.mlab.rec_groupby(data, ('job',), (('income',np.mean,'avg_income'),))

yields

('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)

matplotlib.mlab.rec_groupby returns a recarray:

print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]
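
As far as I know, rec_groupby is no longer shipped with recent matplotlib releases, so here is a rough NumPy-only sketch of the same grouping. It is not the rec_groupby implementation: the helper name aggregate and its signature follow the question, and the approach (np.unique with return_inverse, then np.rec.fromarrays to build a recarray) is my own assumption about a reasonable substitute.

```python
import numpy as np


def aggregate(data, key, value, func):
    # keys: sorted unique group labels; inverse[i]: index into keys for row i
    keys, inverse = np.unique(data[key], return_inverse=True)
    # Apply func to the values of each group in turn
    results = [func(data[value][inverse == i]) for i in range(len(keys))]
    # Pack the labels and aggregated values into a record array
    return np.rec.fromarrays([keys, results], names=[key, value])


data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', 'U8'), ('job', 'U8'), ('income', np.uint32)])

result = aggregate(data, 'job', 'income', np.mean)
for row in result:
    print(row)
```

The groups come back in sorted key order because np.unique sorts its output; swapping np.mean for np.median or np.min changes only the aggregation step.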
unutbu
That's exactly what I was looking for: the job done in one line! Moreover, it directly returns an array! Perfect!
Louis