I have a bunch of CSV datasets, about 10 GB each. I'd like to generate histograms from their columns, but it seems the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.

Does numpy support online binning? I'm hoping for something that iterates over my csv line by line and bins values as it reads them. This way at most one line is in memory at any one time.

Wouldn't be hard to roll my own, but wondering if someone already invented this wheel.

+1  A: 

Here's a way to bin your values directly:

import numpy as NP

column_of_values = NP.random.randint(10, 99, 10)

# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])

binned_values = NP.digitize(column_of_values, bins)

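'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs. An index of 0 means the value is below the first bin edge, and an index of len(bins) means it is at or above the last edge.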

'bincount' will give you (obviously) the bin counts:

NP.bincount(binned_values)
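To connect this to the question about online binning, here is a minimal sketch of how the digitize/bincount pair might be applied chunk by chunk. The helper name count_chunk and the sample chunks are made up for illustration. The one detail that matters is passing minlength to bincount, so every chunk yields a count vector of the same length and the vectors can simply be added up.

```python
import numpy as np

bins = np.array([0.0, 20.0, 50.0, 75.0])

def count_chunk(values, bins):
    # Hypothetical helper: digitize gives 0 for values below bins[0]
    # and len(bins) for values at or above bins[-1].
    idx = np.digitize(values, bins)
    # minlength fixes the output size at len(bins) + 1, even when a
    # chunk has no values in the highest bins.
    return np.bincount(idx, minlength=len(bins) + 1)

# Running total over all chunks; only one chunk is held at a time.
total = np.zeros(len(bins) + 1, dtype=np.int64)
for chunk in ([12, 35, 60], [5, 80, 98], [22, 49]):
    total += count_chunk(np.asarray(chunk), bins)

print(total)
```

Without minlength, a chunk whose values all fall in the low bins would return a shorter array and the addition would fail.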

Given the size of your data set, using NumPy's 'loadtxt' to build a generator might be useful:

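# note: loadtxt reads the whole file into memory before the generator runs
data_array = NP.loadtxt("data_file.txt", delimiter=",")
def fnx():
  # yield one column of the loaded array at a time
  for i in range(data_array.shape[1]):
    yield data_array[:, i]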
doug
But wouldn't loadtxt load the entire file in memory first? That's exactly the problem I want to avoid.
whitman
+2  A: 

As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:

import numpy as np
datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')
for i in range(100):
    d = np.random.randn(1000,1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp

I'm guessing performance will be an issue with such large files, and the overhead of calling histogram on each line might be too slow. @doug's suggestion of a generator seems like a good way to address that problem.
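For what it's worth, here is a rough sketch of how the same accumulation might look when the data comes from a CSV file rather than from random numbers. It reads a fixed number of rows at a time with the standard csv module, so histogram is called once per chunk rather than once per line. The function name streaming_histogram, the chunk_size parameter, and the sample data are all assumptions for illustration. The sketch also assumes the file has no header row and that the chosen column is numeric.

```python
import csv
import io
from itertools import islice

import numpy as np

def streaming_histogram(lines, column, edges, chunk_size=10000):
    """Accumulate a histogram of one CSV column, chunk_size rows at a time."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    reader = csv.reader(lines)
    while True:
        # Pull at most chunk_size rows; the rest of the file stays unread.
        rows = list(islice(reader, chunk_size))
        if not rows:
            break
        values = np.array([float(row[column]) for row in rows])
        # Reusing the same edges for every chunk makes the counts additive.
        chunk_counts, _ = np.histogram(values, bins=edges)
        counts += chunk_counts
    return counts

# Small in-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO("1.0,10\n2.5,20\n-3.0,30\n4.9,40\n0.0,50\n")
edges = np.linspace(-5, 5, 6)  # 6 edges give 5 bins
hist = streaming_histogram(sample, column=0, edges=edges, chunk_size=2)
print(hist)
```

Note that np.histogram silently drops values outside the range of the edges, so the edges have to be chosen to cover the data in advance (or found with a first pass over the file for the min and max).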

mtrw
Good solution. If you want to make it a tad faster, you can do `myhist += htemp` (I guess that it's faster because it updates the histogram in place).
EOL
Thanks @EOL. I forget some of the nice Python features because I haven't switched completely from Octave. And then there are the advanced features like generators that I have yet to learn.
mtrw