I have a series of large text files (up to 1 GB each), output from an experiment, that need to be analysed in Python. They would be best loaded into a 2D numpy array, which presents the first question:
- As the number of rows is unknown at the beginning of the loading, how can a very large numpy array be most efficiently built, row by row?
Simply adding the row to the array would be inefficient in memory terms, as two large arrays would momentarily co-exist. The same problem would seem to occur if you use numpy.append. The stack functions are promising, but ideally I would want to grow the array in place.
This leads to the second question:
- What is the best way to observe the memory usage of a Python program that heavily uses numpy arrays?
To study the above problem, I've used the usual memory profiling tools - heapy and pympler - but I am only getting the size of the outer array objects (80 bytes), not the data they contain. Aside from crudely measuring how much memory the Python process is using, how can I get at the "full" size of the arrays as they grow?
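For concreteness, this is roughly how I am measuring (a sketch only; the exact figures depend on the numpy and pympler versions in use):

```python
import numpy as np
from pympler import asizeof

a = np.zeros((10000, 1000))    # the element buffer alone is ~80 MB of float64

# The report covers only the ndarray Python object (~80 bytes in my case),
# not the underlying data buffer, so it barely changes as the array grows.
print(asizeof.asizeof(a))
```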
Local details: OSX 10.6, Python 2.6, but general solutions are welcome.