I have 43 text files that consume "232.2 MB on disk (232,129,355 bytes) for 43 items". I want to read them into memory (see code below). The problem I am having is that each file, which is about 5.3 MB on disk, causes Python to use an additional ~100 MB of system memory. If I check the size of the dict with getsizeof() (see sample output below), it is tiny: when Python is up to 3 GB of system memory, getsizeof() on the dict reports only 6424 bytes. I don't understand what is using the memory.

What is using up all the memory?

This differs from the linked related question, where the memory use reported by Python was "correct". I am not very interested in other solutions (a DB, etc.); I am more interested in understanding what is happening so I know how to avoid it in the future. That said, using other Python built-ins (array rather than lists) is a great suggestion if it helps. I have heard suggestions of using guppy to find out what is using the memory.
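
For reference, a minimal guppy check would look something like this (assuming the guppy package is installed; hpy().heap() prints live objects grouped by type with their total sizes):

from guppy import hpy

heapmon = hpy()
alldata = ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
print heapmon.heap()   # live objects grouped by type, with total sizes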

sample output:

Loading into memory: ME49_800.txt
ME49_800.txt has 228484 rows of data
ME49_800.txt has 0 rows of masked data
ME49_800.txt has 198 rows of outliers
ME49_800.txt has 0 modified rows of data
280bytes of memory used for ME49_800.txt
43 files of 43 using 12568 bytes of memory
120

Sample data:

CellHeader=X    Y   MEAN    STDV    NPIXELS
  0   0 120.0   28.3     25
  1   0 6924.0  1061.7   25
  2   0 105.0   17.4     25

Code:

import csv, os, glob
import sys


def read_data_file(filename):
    reader = csv.reader(open(filename, "U"), delimiter='\t')
    fname = os.path.split(filename)[1]
    data = []
    mask = []
    outliers = []
    modified = []

    maskcount = 0
    outliercount = 0
    modifiedcount = 0

    for row in reader:
        if '[MASKS]' in row:
            maskcount = 1
        if '[OUTLIERS]' in row:
            outliercount = 1
        if '[MODIFIED]' in row:
            modifiedcount = 1
        if row:
            if not any((maskcount, outliercount, modifiedcount)):
                data.append(row)
            elif not any((not maskcount, outliercount, modifiedcount)):
                mask.append(row) 
            elif not any((not maskcount, not outliercount, modifiedcount)):
                outliers.append(row)  
            elif not any((not maskcount, not outliercount, not modifiedcount)):
                modified.append(row)
            else: print '***something went wrong***'

    data = data[1:]
    mask = mask[3:]
    outliers = outliers[3:]
    modified = modified[3:]
    filedata = dict(zip((fname + '_data', fname + '_mask', fname + '_outliers', fname+'_modified'), (data, mask, outliers, modified)))
    return filedata


def ImportDataFrom(folder):

    alldata = dict()
    infolder = glob.glob( os.path.join(folder, '*.txt') )
    numfiles = len(infolder)
    print 'Importing files from: ', folder
    print 'Importing ' + str(numfiles) + ' files from: ', folder

    for infile in infolder:
        fname = os.path.split(infile)[1]
        print "Loading into memory: " + fname

        filedata = read_data_file(infile)
        alldata.update(filedata)

        print fname + ' has ' + str(len(filedata[fname + '_data'])) + ' rows of data'
        print fname + ' has ' + str(len(filedata[fname + '_mask'])) + ' rows of masked data'
        print fname + ' has ' + str(len(filedata[fname + '_outliers'])) + ' rows of outliers'
        print fname + ' has ' + str(len(filedata[fname +'_modified'])) + ' modified rows of data'
        print str(sys.getsizeof(filedata)) + 'bytes of memory used for ' + fname
        print str(len(alldata)/4) + ' files of ' + str(numfiles) + ' using ' + str(sys.getsizeof(alldata)) + ' bytes of memory'
        #print alldata.keys()
        print str(sys.getsizeof(ImportDataFrom))
        print ' ' 

    return alldata


ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
+2  A: 

This line specifically gets the size of the function object:

print str(sys.getsizeof(ImportDataFrom))

that's unlikely to be what you're interested in.

The size of a container does not include the size of the data it contains. Consider, for example:

>>> import sys
>>> d={}
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*99
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*9999
>>> sys.getsizeof(d)
140

If you want the size of the container plus the size of all contained things, you have to write your own (presumably recursive) function that reaches inside containers and digs for every byte. Or, you can use third-party libraries such as Pympler or guppy.
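
For illustration, a rough recursive sizer might look like the sketch below (it only handles the container types used here; Pympler's asizeof.asizeof does the same job more thoroughly):

import sys

def total_size(obj, seen=None):
    # Rough recursive sizer: sums sys.getsizeof over an object and
    # everything it references, counting each object only once.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.iteritems())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

# Or, with Pympler installed:
# from pympler import asizeof
# print asizeof.asizeof(alldata)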

Alex Martelli
OK, I guess I need to look at Pympler or guppy. Would you expect what I am doing to be a problem? Is there a better way? I guess I can do more research.
Vincent
Yes. Note also that getsizeof(filedata) will be roughly the same every time, only alldata keeps growing with each file.
Beni Cherniavsky-Paskin
@Vincent, I don't see anything _wrong_ offhand (except with the attempts at measuring space) -- dicts are just memory-expensive. You may consider moving `alldata` to the `shelve` module from the standard library -- that module essentially lets dictionaries live on disk (as long as keys are just strings, but values can be anything picklable).
Alex Martelli
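
A minimal sketch of that shelve suggestion, reusing read_data_file from the question (the shelf filename 'alldata.shelve' and the function name are just examples; shelve keys must be strings, values anything picklable):

import os, glob, shelve

def ImportDataToShelf(folder, shelfname='alldata.shelve'):
    # Same loop as ImportDataFrom, but each file's dict goes straight
    # into a shelf on disk instead of accumulating in RAM.
    alldata = shelve.open(shelfname)
    for infile in glob.glob(os.path.join(folder, '*.txt')):
        alldata.update(read_data_file(infile))
    alldata.close()
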
+3  A: 

The dictionary itself is very small - the bulk of the data is the whole content of the files, stored in lists containing one tuple per line. The 20x size increase is bigger than I expected, but it seems to be real. Splitting a 27-byte line from your example input into a tuple gives me 309 bytes (counting recursively, on a 64-bit machine). Add to this some unknown memory-allocation overhead, and 20x is not impossible.

Alternatives: for a more compact representation, you want to convert the strings to integers/floats and pack them tightly (without all those pointers and separate objects). I'm talking about not just one row (although that's a start), but a whole list of rows together - so each file will be represented by just four 2D arrays of numbers. The array module is a start, but really what you need here are numpy arrays:

import numpy

# Using explicit field types for compactness and access by name
# (e.g. data[i]['mean'] == data[i][2]).
fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]
# The simplest way is to build lists as you do now, and convert them
# to numpy arrays when done.
data = numpy.array(data, dtype=fields)
mask = numpy.array(mask, dtype=fields)
...

This gives me 40 bytes spent per row (measured on the .data attribute; sys.getsizeof reports a constant 80-byte overhead for the array object itself but doesn't see the actual data it holds). This is still about 1.5x the size of the original files, but it should easily fit into RAM.
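
For example, a minimal sketch of plugging this into read_data_file (the helper name rows_to_array is just illustrative; each field is converted explicitly, since csv.reader yields lists of strings):

import numpy

fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]

def rows_to_array(rows):
    # rows are lists of strings as produced by csv.reader, so each
    # field is converted explicitly before packing into one array.
    typed = [(int(r[0]), int(r[1]), float(r[2]), float(r[3]), int(r[4]))
             for r in rows]
    return numpy.array(typed, dtype=fields)

# e.g. inside read_data_file, after the parsing loop:
# data = rows_to_array(data[1:])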

I see two of your fields are labeled "x" and "y" - if your data is dense, you could arrange it by them, so that data[x,y] == ..., instead of storing (x,y,...) records. Besides being slightly more compact, it would be the most sensible structure, allowing easier processing.
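
A minimal sketch of that dense arrangement, assuming every (x, y) pair in the rectangle is present and coordinates start at 0 (arr stands for a structured array like the one above):

import numpy

# arr is a structured array like the one above; if every (x, y) pair in
# the rectangle is present, the values can live in 2D grids indexed by
# position, so x and y never need to be stored explicitly.
nx, ny = arr['x'].max() + 1, arr['y'].max() + 1
mean_grid = numpy.zeros((nx, ny))
mean_grid[arr['x'], arr['y']] = arr['mean']   # mean_grid[x, y] == mean at (x, y)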

If you need to handle even more data than will fit in your RAM, pytables is a good library for efficient access to compact (even compressed) tabular data in files. (It's much better at this than general SQL DBs.)
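
A minimal PyTables sketch of that idea (the file name, node name, and typed_rows variable are just examples; typed_rows would hold (x, y, mean, stdv, npixels) tuples):

import tables

class Measurement(tables.IsDescription):
    # One row per pixel; column types mirror the numpy fields above.
    x = tables.Int32Col()
    y = tables.Int32Col()
    mean = tables.Float64Col()
    stdv = tables.Float64Col()
    npixels = tables.Int32Col()

h5 = tables.open_file('measurements.h5', mode='w')
table = h5.create_table('/', 'me49_800', Measurement, 'ME49_800.txt data')
table.append(typed_rows)   # typed_rows: a list of (x, y, mean, stdv, npixels) tuples
h5.close()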

Beni Cherniavsky-Paskin
I will try to implement this. They are going to need to be in numpy arrays eventually. Thanks for the input. I'll let you know what I end up with.
Vincent
This worked great. Finished the import without exceeding 1.5 GB. Thank you. I'm not sure what you mean by the "dense" data and data[x,y]; I am sure it is simple. I will need to sort (x,y) based on a key (a gene sequence - this is microarray data, and the mean and std are array measurements). I will probably be keeping the nucleotide sequence data in a separate array.
Vincent
By "dense" I meant having all (or most) combinations of (x,y) in some rectangle. If you can just arrange the data by (x,y) - or whatever other filed - then you don't need to store the x and y, because they are obvious from the location. Think [(0, 'a'), (1, 'b')] vs just ['a', 'b']. I'm not exactly following your use case, so can't say if this is applicable.
Beni Cherniavsky-Paskin