I am trying to serialize a list of dictionaries to a CSV text file using Python's csv module. My list has about 13,000 elements; each is a dictionary with ~100 keys containing simple text and numbers. My function "dictlist2file" simply calls DictWriter to serialize this, but I am getting out-of-memory errors.

My function is:

import csv
import time

def dictlist2file(dictrows, filename, fieldnames, delimiter='\t',
                  lineterminator='\n', extrasaction='ignore'):
    out_f = open(filename, 'w')

    # Write out the header; if no fieldnames were given, derive them
    # from the first row's keys so DictWriter gets a real field list
    if fieldnames is None:
        fieldnames = sorted(dictrows[0].keys())
    out_f.write(delimiter.join(fieldnames) + lineterminator)

    print "dictlist2file: serializing %d entries to %s" \
          % (len(dictrows), filename)
    t1 = time.time()
    # Write out one row per dictionary
    data = csv.DictWriter(out_f, fieldnames,
                          delimiter=delimiter,
                          lineterminator=lineterminator,
                          extrasaction=extrasaction)
    data.writerows(dictrows)
    out_f.close()
    t2 = time.time()
    print "dictlist2file: took %.2f seconds" % (t2 - t1)

When I try this on my list of dictionaries, I get the following output:

dictlist2file: serializing 13537 entries to myoutput_file.txt
Python(6310) malloc: *** mmap(size=45862912) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
...
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/csv.py", line 149, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/csv.py", line 141, in _dict_to_list
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
MemoryError

Any idea what could be causing this? The list has only 13,000 elements, and the dictionaries themselves are very small and flat (~100 keys of strings and numbers), so I don't see why this would lead to a memory error or why it should be so slow: it takes minutes just to reach the MemoryError.

Thanks for your help.

+1  A: 

You could be tripping over an internal Python issue. I'd report it at bugs.python.org.

owenmarshall
+1  A: 

DictWriter.writerows(...) takes all the dicts you pass to it and builds, in memory, an entire new list of lists, one per row. So if you have a lot of data, I can see how a MemoryError would pop up. Two ways you might proceed (see the sketch after this list):

  1. Iterate over the list yourself and call DictWriter.writerow once per dict, although this will mean a lot of writes.
  2. Batch the rows into smaller lists and call DictWriter.writerows on each batch. That means fewer write calls than option 1 while still avoiding one huge in-memory allocation.
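A minimal sketch of option 2, assuming the data DictWriter and dictrows list from the question; the batch size of 500 is an arbitrary illustrative choice, not something from the original answer:

BATCH_SIZE = 500  # arbitrary; tune to taste
for start in xrange(0, len(dictrows), BATCH_SIZE):
    # Only one batch's worth of rows is materialized by writerows at a time
    data.writerows(dictrows[start:start + BATCH_SIZE])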
Corey Porter
Using "for row in dictrows: data.writerow(row)" does not make a difference. I don't understand why memory is an issue -- it's only 13,000 dictionaries, each one still quite small and not nested at all, containing only strings and numbers. Is there a faster alternative to the csv module?
A: 

I don't have an answer to what is happening with csv, but I found that the following substitute serializes the dictionaries to a file in a few seconds:

for row in dictrows:
    # str() guards against non-string values (the dicts contain numbers too)
    out_f.write("%s%s" % (delimiter.join([str(row[name]) for name in fieldnames]),
                          lineterminator))

where dictrows is a generator of dictionaries (produced in my case by csv's DictReader) and fieldnames is the list of field names.

Any idea why csv doesn't perform similarly would be greatly appreciated. Thanks.

A: 

You say that even looping over data.writerow(single_dict) still hits the problem. Put in code to print the row count every 100 rows: how many dicts does it process before the MemoryError? Run more or fewer other processes to soak up more or less memory -- does the place where it fails vary?
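A sketch of that diagnostic, assuming the data DictWriter and dictrows list from the question:

row_count = 0
for row in dictrows:
    data.writerow(row)
    row_count += 1
    if row_count % 100 == 0:
        # Shows how far we get before the MemoryError strikes
        print "wrote %d rows" % row_count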

What is max(len(d) for d in dictrows)? How long are the strings in the dicts?

How much free memory do you have anyway?

Update: See whether DictWriter is the problem; eliminate it and use basic csv functionality:

writer = csv.writer(.....)
for d in dictrows:
    row = [d[fieldname] for fieldname in fieldnames]
    writer.writerow(row)
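For concreteness, the writer might be constructed like this; the out_f, delimiter, and lineterminator names are borrowed from the question's function and are not given in the original answer:

writer = csv.writer(out_f, delimiter=delimiter,
                    lineterminator=lineterminator)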
John Machin