views: 709 · answers: 2

I have a series of large text files (up to 1 gig) that are output from an experiment that need to be analysed in Python. They would be best loaded into a 2D numpy array, which presents the first question:

  • As the number of rows is unknown at the beginning of the loading, how can a very large numpy array be most efficiently built, row by row?

Simply adding each row to the array would be inefficient in memory terms, as two large arrays would momentarily co-exist. The same problem would seem to occur if you use numpy.append. The stack functions are promising, but ideally I would want to grow the array in place.

This leads to the second question:

  • What is the best way to observe the memory usage of a Python program that heavily uses numpy arrays?

To study the above problem, I've used the usual memory profiling tools - heapy and pympler - but am only getting the size of the outer array objects (80 bytes) and not the data they contain. Aside from a crude measure of how much memory the Python process is using, how can I get at the "full" size of the arrays as they grow?
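(As a minimal illustration of the gap, assuming a plain float64 array: numpy itself reports the size of the underlying data buffer through the nbytes attribute, which appears to be exactly the part the object-level profilers are missing here.)

    import numpy as np

    a = np.zeros((1000, 1000))     # one million float64 values
    print(a.nbytes)                # 8000000 -- bytes in the data buffer
    print(a.itemsize * a.size)     # the same figure, computed by hand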

Local details: OSX 10.6, Python 2.6, but general solutions are welcome.

+2  A: 

There's no way to ensure you can grow the array in place other than creating an empty array (numpy.empty) of the maximum possible size and then using a view of that at the end. You can't start small, because there's no guarantee that you can expand the block of memory the array occupies without clobbering some other data. (And all of this is much lower level than Python allows you to get from inside the interpreter.)
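A minimal sketch of that pre-allocation approach, assuming some upper bound on the row count is known in advance (the function name and the max_rows/ncols parameters are illustrative, not anything numpy provides):

    import numpy as np

    def load_with_preallocation(rows, max_rows, ncols):
        # rows: any iterable of length-ncols sequences (e.g. parsed lines)
        buf = np.empty((max_rows, ncols))   # one big allocation up front
        nrows = 0
        for row in rows:
            buf[nrows] = row                # raises IndexError if the bound is exceeded
            nrows += 1
        return buf[:nrows]                  # a view onto the filled part; no copy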

Your best bet is probably numpy.fromiter. Looking at the source, as the number of items increases, the array is expanded by a little over 50% each time. If you can easily get the number of rows (say, from counting the lines), you can even pass it a count.
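A sketch of the fromiter route, assuming whitespace-delimited numeric rows with a fixed, known number of columns (load_rows and its parameters are made-up names; the flatten-then-reshape step is there because fromiter builds 1-D arrays):

    import numpy as np

    def load_rows(path, ncols, nrows=None):
        # Stream a whitespace-delimited numeric file into an (n, ncols) array.
        def tokens(f):
            for line in f:
                for tok in line.split():
                    yield float(tok)

        count = -1 if nrows is None else nrows * ncols   # -1 means "read everything"
        with open(path) as f:
            flat = np.fromiter(tokens(f), dtype=np.float64, count=count)
        return flat.reshape(-1, ncols)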

AFoglia
A: 

One possible option is to do a single pass through the file first to count the number of rows, without loading them.

The other option is to double your table size each time, which has two benefits:

  1. You will only reallocate memory log(n) times, where n is the number of rows.
  2. You only need 50% more RAM than your largest table size.

If you take the dynamic route, you could measure the length of the first row in bytes, then guess the number of rows by calculating (num bytes in file / num bytes in first row). Start with a table of this size.
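A sketch combining both ideas - guess an initial row count from the first line's byte length, then double whenever the guess falls short (the function name and parsing details are illustrative; it assumes whitespace-delimited numbers and a fixed column count):

    import os
    import numpy as np

    def load_doubling(path, ncols):
        with open(path) as f:
            first = f.readline()
            guess = max(1, os.path.getsize(path) // max(len(first), 1))  # rows ~ file size / first-row size
            table = np.empty((guess, ncols))
            table[0] = [float(x) for x in first.split()]
            nrows = 1
            for line in f:
                if nrows == table.shape[0]:              # buffer full: double it
                    bigger = np.empty((table.shape[0] * 2, ncols))
                    bigger[:nrows] = table
                    table = bigger
                table[nrows] = [float(x) for x in line.split()]
                nrows += 1
        return table[:nrows]                             # trim with a view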

Tom Leys
Thanks all. Given the size of the file, I was reluctant to do an initial pass just to count lines, but it seems the easiest and most efficient way to solve the memory problem.
Paul-Michael Agapow
I had a co-worker ask a similar question recently, and I came up with another possibility that could save you the initial pass. If you know the approximate size of an "element" in the file, you can divide the file size by it. Add some padding for safety, and you can then allocate the whole block up front and write into it. To hide the extra, uninitialized elements, you can use a view of only the elements with data. You will need to make sure you don't go over. It's not perfect, but if your file reads are slow, and your data is consistently laid out, it might work.
AFoglia