I have many csv files which each contain roughly identical matrices. Each matrix is 11 columns by either 5 or 6 rows. The columns are variables and the rows are test conditions. Some of the matrices do not contain data for the last test condition, which is why there are 5 rows in some matrices and six rows in other matrices.

My application is in Python 2.6 using numpy and scipy.

My question is this:
How can I most efficiently create a summary matrix that contains the means of each cell across all of the identical matrices?

The summary matrix would have the same structure as all of the other matrices, except that the value in each cell in the summary matrix would be the mean of the values stored in the identical cell across all of the other matrices. If one matrix does not contain data for the last test condition, I want to make sure that its contents are not treated as zeros when the averaging is done. In other words, I want the means of all the non-zero values.

Can anyone show me a brief, flexible way of organizing this code so that it does everything I want with as little code as possible and also remains as flexible as possible in case I want to re-use it later with other data structures?

I know how to pull all the csv files in and how to write output. I just don't know the most efficient way to structure the flow of data in the script, including whether to use Python arrays or numpy arrays, how to structure the operations, etc.

I have tried coding this in a number of different ways, but they all seem to be rather code intensive and inflexible if I later want to use this code for other data structures.

+2  A: 

You could use masked arrays. Say N is the number of csv files. You can store all your data in a masked array A, of shape (N,11,6).

import numpy as np

N = 3  # number of csv files (example value)
A = np.ma.zeros((N, 11, 6))
A.mask = np.zeros(A.shape, dtype=bool)  # fill the mask with False: nothing is masked
A.mask = (A.data == 0)  # another way of masking: mask all data equal to zero
A[0, 0, 0] = np.ma.masked  # mask a single value
A[1, 2, 3] = 12.  # fill a value, as with a regular array

The mean along the first axis, ignoring masked values, is then given by:

A.mean(axis=0)  # the returned shape is (11, 6)
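
To make the missing sixth row concrete, here is a minimal sketch of one way to stack matrices of 5 or 6 rows into a single masked array and average them. The helper name `stack_as_masked` and the sample values are hypothetical, and it assumes a reasonably recent NumPy under Python 3 rather than the Python 2.6 setup in the question. It keeps the matrices in rows-by-columns order, so the stacked shape is (N, 6, 11), the transpose of the (N,11,6) layout above.

```python
import numpy as np

def stack_as_masked(matrices, n_rows=6, n_cols=11):
    """Stack 2-D arrays with up to n_rows rows into one masked array.

    Hypothetical helper: rows that a matrix does not provide stay masked,
    so they are skipped (not counted as zeros) when averaging.
    """
    # Start fully masked; assigning real data below unmasks those cells.
    stacked = np.ma.masked_all((len(matrices), n_rows, n_cols), dtype=float)
    for i, m in enumerate(matrices):
        m = np.atleast_2d(np.asarray(m, dtype=float))
        stacked[i, :m.shape[0], :] = m
    return stacked

# Made-up stand-ins for matrices read from the csv files.
full = np.full((6, 11), 2.0)   # has data for all six test conditions
short = np.full((5, 11), 4.0)  # last test condition is missing

stacked = stack_as_masked([full, short])
summary = stacked.mean(axis=0)  # cell-wise mean over the unmasked values only
```

In this sketch the first five rows of `summary` average both matrices, while the sixth row comes from `full` alone. With real files, each matrix could be read with something like `np.loadtxt(path, delimiter=",")` before being passed to the helper, assuming the files hold only numeric cells.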
François