I have 43 text files that consume "232.2 MB on disk (232,129,355 bytes) for 43 items". I want to read them into memory (see code below). The problem I am having is that each file, which is about 5.3 MB on disk, causes Python to use an additional 100 MB of system memory. I checked the size of the dict with sys.getsizeof() (see sample output). When Python is up to 3 GB of system memory, sys.getsizeof() on the dict reports only 6424 bytes. I don't understand what is using the memory.
What is using up all the memory?
The related question differs in that the memory use reported by Python there was "correct". I am not very interested in other solutions (DB ....); I am more interested in understanding what is happening so I know how to avoid it in the future. That said, using other Python built-ins, such as array rather than lists, is a great suggestion if it helps. I have also heard suggestions to use guppy to find what is using the memory.
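For context on the numbers above: sys.getsizeof() reports only the size of the container object itself, not the objects it refers to, so a dict holding huge lists still looks tiny. A minimal sketch of a recursive measurement follows. The helper name deep_getsizeof and the synthetic rows are hypothetical, made up for illustration, and the result is only an approximation of what the interpreter actually allocates.

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Hypothetical helper: approximate size of obj plus everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        # Count every distinct object only once, even if referenced many times.
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)  # shallow size of this one object
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_getsizeof(key, seen) + deep_getsizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_getsizeof(item, seen)
    return size

# Synthetic stand-in for one parsed file: 1000 rows of 5 string cells each.
rows = [[str(x), str(y), "120.0", "28.3", "25"]
        for x in range(100) for y in range(10)]
table = {"example_data": rows}

shallow = sys.getsizeof(table)
deep = deep_getsizeof(table)
print("shallow: %d bytes" % shallow)
print("deep:    %d bytes" % deep)
```

The shallow figure stays small no matter how many rows the lists hold, which would be consistent with the few-hundred-byte values in the sample output.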
sample output:
Loading into memory: ME49_800.txt
ME49_800.txt has 228484 rows of data
ME49_800.txt has 0 rows of masked data
ME49_800.txt has 198 rows of outliers
ME49_800.txt has 0 modified rows of data
280bytes of memory used for ME49_800.txt
43 files of 43 using 12568 bytes of memory
120
Sample data:
CellHeader=X Y MEAN STDV NPIXELS
0 0 120.0 28.3 25
1 0 6924.0 1061.7 25
2 0 105.0 17.4 25
Code:
import csv, os, glob
import sys
def read_data_file(filename):
    reader = csv.reader(open(filename, "U"), delimiter='\t')
    fname = os.path.split(filename)[1]
    data = []
    mask = []
    outliers = []
    modified = []
    # Flags that record which section markers have been seen so far
    maskcount = 0
    outliercount = 0
    modifiedcount = 0
    for row in reader:
        if '[MASKS]' in row:
            maskcount = 1
        if '[OUTLIERS]' in row:
            outliercount = 1
        if '[MODIFIED]' in row:
            modifiedcount = 1
        if row:
            # Route each non-empty row to the list for the current section
            if not any((maskcount, outliercount, modifiedcount)):
                data.append(row)
            elif not any((not maskcount, outliercount, modifiedcount)):
                mask.append(row)
            elif not any((not maskcount, not outliercount, modifiedcount)):
                outliers.append(row)
            elif not any((not maskcount, not outliercount, not modifiedcount)):
                modified.append(row)
            else: print '***something went wrong***'
    # Drop the header rows at the start of each section
    data = data[1:]
    mask = mask[3:]
    outliers = outliers[3:]
    modified = modified[3:]
    filedata = dict(zip((fname + '_data', fname + '_mask', fname + '_outliers', fname+'_modified'), (data, mask, outliers, modified)))
    return filedata
def ImportDataFrom(folder):
    alldata = dict()
    infolder = glob.glob( os.path.join(folder, '*.txt') )
    numfiles = len(infolder)
    print 'Importing files from: ', folder
    print 'Importing ' + str(numfiles) + ' files from: ', folder
    for infile in infolder:
        fname = os.path.split(infile)[1]
        print "Loading into memory: " + fname
        filedata = read_data_file(infile)
        alldata.update(filedata)
        print fname + ' has ' + str(len(filedata[fname + '_data'])) + ' rows of data'
        print fname + ' has ' + str(len(filedata[fname + '_mask'])) + ' rows of masked data'
        print fname + ' has ' + str(len(filedata[fname + '_outliers'])) + ' rows of outliers'
        print fname + ' has ' + str(len(filedata[fname +'_modified'])) + ' modified rows of data'
        print str(sys.getsizeof(filedata)) +'bytes'' of memory used for '+ fname
        print str(len(alldata)/4) + ' files of ' + str(numfiles) + ' using ' + str(sys.getsizeof(alldata)) + ' bytes of memory'
        #print alldata.keys()
        print str(sys.getsizeof(ImportDataFrom))
        print ' '
    return alldata
ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
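To illustrate the array-rather-than-lists suggestion mentioned above, here is a rough sketch that stores each column of the sample format (X, Y, MEAN, STDV, NPIXELS) in a typed array from the standard array module and compares it with rows kept as lists of strings. The synthetic rows and the names xs, ys, means, stdvs and npix are assumptions for illustration only, and the byte counts are estimates rather than exact process memory.

```python
import sys
from array import array

# Synthetic rows in the same shape csv.reader yields: lists of strings.
n = 1000
rows = [[str(i), "0", str(100.0 + i), "28.3", "25"] for i in range(n)]

# Estimated cost of the list-of-strings layout: the outer list, each row list,
# and each cell string. Every cell is counted separately, because csv.reader
# creates a new string object for every cell it reads.
list_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(row) + sum(sys.getsizeof(cell) for cell in row)
    for row in rows
)

# Column-wise storage: one typed array per column ('i' = C int, 'd' = C double).
xs = array('i', (int(row[0]) for row in rows))
ys = array('i', (int(row[1]) for row in rows))
means = array('d', (float(row[2]) for row in rows))
stdvs = array('d', (float(row[3]) for row in rows))
npix = array('i', (int(row[4]) for row in rows))

# Payload bytes held by the arrays (excludes the small per-array object header).
array_bytes = sum(len(col) * col.itemsize
                  for col in (xs, ys, means, stdvs, npix))

print("lists of strings: about %d bytes" % list_bytes)
print("typed arrays:     about %d bytes" % array_bytes)
```

The trade-off is that the values are converted to numbers once at load time, so this only fits sections whose columns are purely numeric.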