I have several (10 or so) CSV-formatted data sets. Each column of a data set represents one aspect of a running system (available RAM, CPU usage, open TCP connections, and so forth). Each row contains the values for these columns at one moment in time.
The data sets were captured during individual runs of the same test. The number of rows is not guaranteed to be the same in each data set (some runs lasted longer than others, so they produced more samples).
I want to produce a new CSV file that represents the "average" value, across all data sets, for a given time offset and a given column. Ideally, values missing in one data set would be ignored. If necessary, though, missing values could be assumed to be the same as the last known value, or the average of known values for that row.
A simplified example:
+---+----+----+    +---+----+----+    +---+----+----+
|    Set 1    |    |    Set 2    |    |   Average   |
+---+----+----+    +---+----+----+    +---+----+----+
| t | A  | B  |    | t | A  | B  |    | t | A  | B  |
+---+----+----+    +---+----+----+    +---+----+----+
| 1 | 10 | 50 |    | 1 | 12 | 48 |    | 1 | 11 | 49 |
| 2 | 13 | 58 |    | 2 |  7 | 60 |    | 2 | 10 | 59 |
| 3 |  9 | 43 |    | 3 | 17 | 51 | => | 3 | 13 | 47 |
| 4 | 14 | 61 |    | 4 | 12 | 57 |    | 4 | 13 | 59 |
| : | :  | :  |    | : | :  | :  |    | : | :  | :  |
| 7 |  4 | 82 |    | 7 | 10 | 88 |    | 7 |  7 | 85 |
+---+----+----+    | 8 | 15 | 92 |    | 8 | 15 | 92 |
                   | 9 |  6 | 63 |    | 9 |  6 | 63 |
                   +---+----+----+    +---+----+----+
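To make the example concrete in numpy terms (the array names here are mine, and rows 5-6 are elided just as in the tables), the two sets above would look like:

import numpy as np

# Columns are t, A, B; one row per sample.
set1 = np.array([[1, 10, 50],
                 [2, 13, 58],
                 [3,  9, 43],
                 [4, 14, 61],
                 [7,  4, 82]], dtype=float)

set2 = np.array([[1, 12, 48],
                 [2,  7, 60],
                 [3, 17, 51],
                 [4, 12, 57],
                 [7, 10, 88],
                 [8, 15, 92],
                 [9,  6, 63]], dtype=float)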
I'm new to numpy, having picked it up specifically for this project. What's the best way to do this? For data sets with the same number of rows (which I've been forcing by chopping longer data sets short), I just do:
d_avg = sum(dsets) / float(len(dsets))
where "dsets" is a list of the ndarrays containing the data from each CSV file. This works well, but I don't want to discard the data from the longer runs.
I can also resize the shorter runs to the length of the longest, but the new fields are then filled with None. Later operations then raise a TypeError when adding (for example) a float and None.
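What I think I want instead is to pad the short runs with NaN (a float, unlike None) and then average while ignoring the padding. A minimal sketch of what I mean, assuming every set has the same columns and that row i corresponds to the same time offset in every set:

import numpy as np

def average_padded(dsets):
    """Row-wise average of 2-D arrays of unequal length, ignoring the padding."""
    n_rows = max(d.shape[0] for d in dsets)
    n_cols = dsets[0].shape[1]
    # Stack all runs into one 3-D array, padding short runs with NaN.
    stacked = np.full((len(dsets), n_rows, n_cols), np.nan)
    for i, d in enumerate(dsets):
        stacked[i, :d.shape[0], :] = d
    # nanmean skips NaNs, so a short run simply stops contributing
    # to the average after its last row.
    return np.nanmean(stacked, axis=0)

Applied to set1 and set2 above, this reproduces the Average table (rows 8 and 9 come from Set 2 alone), but I don't know whether it's idiomatic numpy.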
Any suggestions?