I have 43 text files that consume "232.2 MB on disk (232,129,355 bytes) for 43 items". I want to read them into memory (see code below). The problem I am having is that each file, which is about 5.3 MB on disk, causes Python to use an additional ~100 MB of system memory. If I check the size of the dict with getsizeof() (see sample output below), it is tiny: when Python is up to 3 GB of system memory, getsizeof() on the dict reports only 6424 bytes. I don't understand what is using the memory.

What is using up all the memory?

This differs from the linked related question, where the memory use reported by Python was "correct". I am not very interested in other solutions (a DB, etc.); I am more interested in understanding what is happening so I know how to avoid it in the future. That said, using other Python built-ins (array rather than lists) is a great suggestion if it helps. I have heard suggestions of using guppy to find out what is using the memory.
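
For reference, a minimal guppy check would look something like this (assuming the guppy package is installed; hpy().heap() prints live objects grouped by type with their total sizes):

from guppy import hpy

heapmon = hpy()
alldata = ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
print heapmon.heap()   # live objects grouped by type, with total sizes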

sample output:

Loading into memory: ME49_800.txt
ME49_800.txt has 228484 rows of data
ME49_800.txt has 0 rows of masked data
ME49_800.txt has 198 rows of outliers
ME49_800.txt has 0 modified rows of data
280bytes of memory used for ME49_800.txt
43 files of 43 using 12568 bytes of memory
120

Sample data:

CellHeader=X    Y   MEAN    STDV    NPIXELS
  0   0 120.0   28.3     25
  1   0 6924.0  1061.7   25
  2   0 105.0   17.4     25

Code:

import csv, os, glob
import sys


def read_data_file(filename):
    reader = csv.reader(open(filename, "U"), delimiter='\t')
    fname = os.path.split(filename)[1]
    data = []
    mask = []
    outliers = []
    modified = []

    maskcount = 0
    outliercount = 0
    modifiedcount = 0

    for row in reader:
        if '[MASKS]' in row:
            maskcount = 1
        if '[OUTLIERS]' in row:
            outliercount = 1
        if '[MODIFIED]' in row:
            modifiedcount = 1
        if row:
            if not any((maskcount, outliercount, modifiedcount)):
                data.append(row)
            elif not any((not maskcount, outliercount, modifiedcount)):
                mask.append(row) 
            elif not any((not maskcount, not outliercount, modifiedcount)):
                outliers.append(row)  
            elif not any((not maskcount, not outliercount, not modifiedcount)):
                modified.append(row)
            else: print '***something went wrong***'

    data = data[1:]
    mask = mask[3:]
    outliers = outliers[3:]
    modified = modified[3:]
    filedata = dict(zip((fname + '_data', fname + '_mask', fname + '_outliers', fname+'_modified'), (data, mask, outliers, modified)))
    return filedata


def ImportDataFrom(folder):

    alldata = dict()
    infolder = glob.glob( os.path.join(folder, '*.txt') )
    numfiles = len(infolder)
    print 'Importing files from: ', folder
    print 'Importing ' + str(numfiles) + ' files from: ', folder

    for infile in infolder:
        fname = os.path.split(infile)[1]
        print "Loading into memory: " + fname

        filedata = read_data_file(infile)
        alldata.update(filedata)

        print fname + ' has ' + str(len(filedata[fname + '_data'])) + ' rows of data'
        print fname + ' has ' + str(len(filedata[fname + '_mask'])) + ' rows of masked data'
        print fname + ' has ' + str(len(filedata[fname + '_outliers'])) + ' rows of outliers'
        print fname + ' has ' + str(len(filedata[fname +'_modified'])) + ' modified rows of data'
        print str(sys.getsizeof(filedata)) + 'bytes of memory used for ' + fname
        print str(len(alldata)/4) + ' files of ' + str(numfiles) + ' using ' + str(sys.getsizeof(alldata)) + ' bytes of memory'
        #print alldata.keys()
        print str(sys.getsizeof(ImportDataFrom))
        print ' ' 

    return alldata


ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
+2  A: 

This line specifically gets the size of the function object:

print str(sys.getsizeof(ImportDataFrom))

that's unlikely to be what you're interested in.

The size of a container does not include the size of the data it contains. Consider, for example:

>>> import sys
>>> d={}
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*99
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*9999
>>> sys.getsizeof(d)
140

If you want the size of the container plus the size of all contained things, you have to write your own (presumably recursive) function that reaches inside containers and digs for every byte. Or, you can use third-party libraries such as Pympler or guppy.
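
For illustration, a rough recursive sizer might look like the sketch below (it only handles the container types used here; Pympler's asizeof.asizeof does the same job more thoroughly):

import sys

def total_size(obj, seen=None):
    # Rough recursive sizer: sums sys.getsizeof over an object and
    # everything it references, counting each object only once.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.iteritems())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

# Or, with Pympler installed:
# from pympler import asizeof
# print asizeof.asizeof(alldata)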

Alex Martelli
OK, I guess I need to look at Pympler or guppy. Would you expect what I am doing to be a problem? Is there a better way? I guess I can do more research.
Vincent
Yes. Note also that getsizeof(filedata) will be roughly the same every time, only alldata keeps growing with each file.
Beni Cherniavsky-Paskin
@Vincent, I don't see anything _wrong_ offhand (except with the attempts at measuring space) -- dicts are just memory-expensive. You may consider moving `alldata` to the `shelve` module from the standard library -- that module essentially lets dictionaries live on disk (as long as keys are just strings, but values can be anything picklable).
Alex Martelli
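
A minimal sketch of that shelve suggestion, reusing read_data_file from the question (the shelf filename 'alldata.shelve' and the function name are just examples; shelve keys must be strings, values anything picklable):

import os, glob, shelve

def ImportDataToShelf(folder, shelfname='alldata.shelve'):
    # Same loop as ImportDataFrom, but each file's dict goes straight
    # into a shelf on disk instead of accumulating in RAM.
    alldata = shelve.open(shelfname)
    for infile in glob.glob(os.path.join(folder, '*.txt')):
        alldata.update(read_data_file(infile))
    alldata.close()
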
+3  A: 

The dictionary itself is very small - the bulk of the data is the whole content of the files, stored in lists containing one tuple per line. The 20x size increase is bigger than I expected, but it seems to be real. Splitting a 27-byte line from your example input into a tuple gives me 309 bytes (counting recursively, on a 64-bit machine). Add to this some unknown memory-allocation overhead, and 20x is not impossible.

Alternatives: for a more compact representation, you want to convert the strings to integers/floats and pack them tightly (without all those pointers and separate objects). I'm talking about not just one row (although that's a start), but a whole list of rows together - so each file will be represented by just four 2D arrays of numbers. The array module is a start, but really what you need here are numpy arrays:

import numpy

# Using explicit field types for compactness and access by name
# (e.g. data[i]['mean'] == data[i][2]).
fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]
# The simplest way is to build lists as you do now, and convert them
# to numpy arrays when done.
data = numpy.array(data, dtype=fields)
mask = numpy.array(mask, dtype=fields)
...

This gives me 40 bytes spent per row (measured on the .data attribute; sys.getsizeof reports a constant 80-byte overhead for the array object itself but doesn't see the actual data it holds). This is still about 1.5x the size of the original files, but it should easily fit into RAM.
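
For example, a minimal sketch of plugging this into read_data_file (the helper name rows_to_array is just illustrative; each field is converted explicitly, since csv.reader yields lists of strings):

import numpy

fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]

def rows_to_array(rows):
    # rows are lists of strings as produced by csv.reader, so each
    # field is converted explicitly before packing into one array.
    typed = [(int(r[0]), int(r[1]), float(r[2]), float(r[3]), int(r[4]))
             for r in rows]
    return numpy.array(typed, dtype=fields)

# e.g. inside read_data_file, after the parsing loop:
# data = rows_to_array(data[1:])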

I see two of your fields are labeled "x" and "y" - if your data is dense, you could arrange it by them, so that data[x,y] == ..., instead of storing (x,y,...) records. Besides being slightly more compact, it would be the most sensible structure, allowing easier processing.
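
A minimal sketch of that dense arrangement, assuming every (x, y) pair in the rectangle is present and coordinates start at 0 (arr stands for a structured array like the one above):

import numpy

# arr is a structured array like the one above; if every (x, y) pair in
# the rectangle is present, the values can live in 2D grids indexed by
# position, so x and y never need to be stored explicitly.
nx, ny = arr['x'].max() + 1, arr['y'].max() + 1
mean_grid = numpy.zeros((nx, ny))
mean_grid[arr['x'], arr['y']] = arr['mean']   # mean_grid[x, y] == mean at (x, y)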

If you need to handle even more data than will fit in your RAM, pytables is a good library for efficient access to compact (even compressed) tabular data in files. (It's much better at this than general SQL DBs.)
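
A minimal PyTables sketch of that idea (the file name, node name, and typed_rows variable are just examples; typed_rows would hold (x, y, mean, stdv, npixels) tuples):

import tables

class Measurement(tables.IsDescription):
    # One row per pixel; column types mirror the numpy fields above.
    x = tables.Int32Col()
    y = tables.Int32Col()
    mean = tables.Float64Col()
    stdv = tables.Float64Col()
    npixels = tables.Int32Col()

h5 = tables.open_file('measurements.h5', mode='w')
table = h5.create_table('/', 'me49_800', Measurement, 'ME49_800.txt data')
table.append(typed_rows)   # typed_rows: a list of (x, y, mean, stdv, npixels) tuples
h5.close()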

Beni Cherniavsky-Paskin
I will try to implement this. They are going to need to be in numpy arrays eventually. Thanks for the input. I'll let you know what I end up with.
Vincent
This worked great. Finished the import without exceeding 1.5 GB. Thank you. I'm not sure what you mean by the "dense" data and data[x,y]; I am sure it is simple. I will need to sort (x,y) based on a key (a gene sequence - this is microarray data, and the mean and std are array measurements). I will probably be keeping the nucleotide sequence data in a separate array.
Vincent
By "dense" I meant having all (or most) combinations of (x,y) in some rectangle. If you can just arrange the data by (x,y) - or whatever other filed - then you don't need to store the x and y, because they are obvious from the location. Think [(0, 'a'), (1, 'b')] vs just ['a', 'b']. I'm not exactly following your use case, so can't say if this is applicable.
Beni Cherniavsky-Paskin