I have several (10 or so) CSV-formatted data sets. Each column of a data set represents one aspect of a running system (available RAM, CPU usage, open TCP connections, and so forth). Each row contains the values for these columns at one moment in time.

The data sets were captured during individual runs of the same test. The number of rows is not guaranteed to be the same in each data set (i.e., some tests ran longer than others).

I want to produce a new CSV file that represents the "average" value, across all data sets, for a given time offset and a given column. Ideally, values missing in one data set would be ignored. If necessary, though, missing values could be assumed to be the same as the last known value, or the average of known values for that row.

A simplified example:

+---------------+    +---------------+       +---------------+
|     Set 1     |    |     Set 2     |       |    Average    |
+---+-----+-----+    +---+-----+-----+       +---+-----+-----+
| t |  A  |  B  |    | t |  A  |  B  |       | t |  A  |  B  |
+---+-----+-----+    +---+-----+-----+       +---+-----+-----+
| 1 | 10  | 50  |    | 1 | 12  | 48  |       | 1 | 11  | 49  |   
| 2 | 13  | 58  |    | 2 |  7  | 60  |       | 2 | 10  | 59  |   
| 3 |  9  | 43  |    | 3 | 17  | 51  |  =>   | 3 | 13  | 47  |   
| 4 | 14  | 61  |    | 4 | 12  | 57  |       | 4 | 13  | 59  |   
| : |  :  |  :  |    | : |  :  |  :  |       | : |  :  |  :  |   
| 7 |  4  | 82  |    | 7 | 10  | 88  |       | 7 |  7  | 85  |
+---+-----+-----+    | 8 | 15  | 92  |       | 8 | 15  | 92  |
                     | 9 |  6  | 63  |       | 9 |  6  | 63  |
                     +---+-----+-----+       +---+-----+-----+

I'm new to numpy, having picked it up specifically for this project. What's the best way to do this? For data sets with the same number of rows (which I've been forcing by chopping longer data sets short), I just do:

d_avg = sum(dsets) / float(len(dsets))

where "dsets" is a list of the ndarrays containing the data from each CSV file. This works well, but I don't want to discard the data from the longer runs.

I can also resize the shorter runs to the length of the longest, but then all of the new fields are filled with None. Later operations fail when they try to add (for example) a float and a NoneType.
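
To make the failure concrete, a minimal illustration (the numbers are just placeholders):

import numpy as np

# A shorter run padded out with None ends up as an object array...
padded = np.array([[1, 10, 50],
                   [2, 13, 58],
                   [None, None, None]], dtype=object)

# ...and arithmetic on it fails as soon as it touches a None, e.g.
# padded + padded raises "TypeError: unsupported operand type(s) for +".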

Any suggestions?

A: 

Well, one way to do it would be to iterate over each row of each data set and append each column's value to a list stored in a dictionary keyed by the time index. You then iterate over the dictionary and take the average of each list stored there.

This isn't particularly efficient -- the other option is to find the longest array, iterate over it, and query the other data sets to build a temporary array to average. That way you save the second pass over the dictionary.
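
For reference, a rough sketch of that dictionary approach in plain Python (the function name and the [t, col1, col2, ...] row layout are just for illustration):

from collections import defaultdict

def average_datasets(dsets):
    by_time = defaultdict(list)          # t -> list of rows observed at that t
    for dset in dsets:
        for row in dset:
            by_time[row[0]].append(row[1:])
    averaged = []
    for t in sorted(by_time):
        columns = zip(*by_time[t])       # regroup the collected rows column-wise
        averaged.append([t] + [sum(col) / float(len(col)) for col in columns])
    return averaged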

Marcel Levy
I was really hoping that numpy, with its array-oriented efficiency, would provide a way to do exactly that. You're right, though; I'll have to fall back to the method you suggest if there's no existing operation for it.
Lemur
If you're really wanting to stay in numpy, take a look at masked arrays here: http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
Marcel Levy
It's not so much numpy itself that I want; it's clean, easy-to-understand code! Frankly, I'd drop Python for (hypothetically) R if that meant an elegant solution. But I know even less about R than numpy. Thanks for the tip on masked arrays. I'll check it out.
Lemur
+1  A: 

Edit: I've revised my method, abandoning scipy.nanmean in favor of masked arrays.

If it is unclear what the code is doing at any point, first try putting print statements in. If it is still unclear, feel free to ask; I'll do my best to explain. The tricky part is getting the t-values merged. (That is done with the numpy array's searchsorted method.)
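
In case searchsorted is unfamiliar, here is a tiny illustration using the same t-values as the toy data below: for each t in a data set, it returns the index where that t sits in the merged, sorted t_values array.

import numpy as np

t_values = np.array([1., 2., 3., 4., 7., 8., 9.])   # merged, sorted t-values
t_set1 = np.array([1., 2., 3., 4., 7.])             # the t-column of set1
print(t_values.searchsorted(t_set1))
# [0 1 2 3 4]  -- the rows in the merged grid where set1's rows belong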

Playing with numpy has led me to believe that its speed advantages may not appear until the data sets get quite big (you may need at least 10,000 rows per data set). Below that, a pure Python solution may be both easier to write and faster.

Here are the toy datasets I used:

% cat set1
1, 10, 50
2, 13, 58
3,9,43
4,14,61
7, 4, 82

% cat set2
1, 12, 48
2, 7, 60
3,17,51
4,12,57
7,10,88
8,15,92
9,6,63

And here is the code:

#!/usr/bin/env python
import numpy as np
from functools import reduce   # reduce is a builtin on Python 2; needs the import on Python 3

filenames=('set1','set2')   # change this to list all your csv files
column_names=('t','a','b')

# slurp the csv data files into a list of numpy arrays
data=[np.loadtxt(filename, delimiter=',') for filename in filenames]

# Find the complete list of t-values
# For each elt in data, elt[a,b] is the value in the a_th row and b_th column
t_values=np.array(list(reduce(set.union,(set(elt[:,0]) for elt in data))))
t_values.sort()
# print(t_values)
# [ 1.  2.  3.  4.  7.  8.  9.]

num_rows=len(t_values)
num_columns=len(column_names)
num_datasets=len(filenames)

# For each data set, we compute the indices of the t_values that are used.
idx=[(t_values.searchsorted(data[n][:,0])) for n in range(num_datasets)]

# One 3-D masked array: axis 0 is the merged t-values, axis 1 the csv columns,
# and axis 2 indexes the data sets.
data2=np.ma.zeros((num_rows,num_columns,num_datasets))
for n in range(num_datasets):
    data2[idx[n],:,n]=data[n][:,:]
# Slots that were never filled above are still zero, so masking zeros hides them
# (note that this would also mask any genuine zero values in the data).
data2=np.ma.masked_equal(data2, 0)
averages=data2.mean(axis=-1)
print(averages)
# [[1.0 11.0 49.0]
#  [2.0 10.0 59.0]
#  [3.0 13.0 47.0]
#  [4.0 13.0 59.0]
#  [7.0 7.0 85.0]
#  [8.0 15.0 92.0]
#  [9.0 6.0 63.0]]
unutbu
Nice! I didn't know about 'loadtxt'. I was using the 'tabular' module, which turned out to be overkill. Thanks.
Lemur
+2  A: 

Why not just use numpy's ma (masked array) module?

import numpy as N
from functools import reduce   # builtin on Python 2; needs the import on Python 3

# the longest run determines the number of rows
maxLen = reduce(lambda a,b : max(a, b.shape[0]),
                dSets, 0)
# one masked 3-D array: (time, columns, data set)
all = N.ma.zeros((maxLen,)+ dSets[0].shape[1:] + (len(dSets),),
                     dtype=float)      # set the dtype to whatever
all.mask = True
for i, set in enumerate(dSets):
    all.mask[:len(set),...,i] = False
    all[:len(set),...,i] = set

mean = all.mean(axis=-1)

Of course, this only works if you can guarantee that the times line up across all arrays, i.e. dSets[i][r,0] == dSets[j][r,0] for all i, j and every row r present in both.

Rupert Nash
Even if the time isn't the same, you can use masked arrays. You just need to be smarter in setting up the masked array so the data for each time is in the same row.
AFoglia
This works great. Thanks! One thing: the reduce/lambda construct can fail when an early value is the highest ('int' has no attribute 'shape'). Replaced it with: maxLen = max([a.shape[0] for a in dSets])
Lemur
Yes, you're right, I ballsed up the lambda. Edited to correct. Cheers!
Rupert Nash