views: 984
answers: 6

I have a number of large (~100 MB) files which I'm regularly processing. Although I try to delete unneeded data structures during processing, memory consumption is still a bit too high. So I was wondering: is there a way to manipulate large data 'efficiently', e.g.:

def read(self, filename):
    fc = read_100_mb_file(filename)
    self.process(fc)

def process(self, content):
    # do some processing of file content
    pass

Is there a duplication of data structures? Isn't it more memory-efficient to use an instance attribute like self.fc?

How should I garbage-collect? I know about the gc module, but do I call it after I del fc, for example? Does the garbage collector get invoked after a del statement at all? When should I use garbage collection?

Update:

P.S. 100 MB is not a problem in itself, but float conversion and further processing add significantly more to both the working set and the virtual size (I'm on Windows).

+3  A: 

Before you start tearing your hair out over the garbage collector, you might be able to avoid the 100 MB hit of loading the entire file into memory by using a memory-mapped file object. See the mmap module.
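
Something along these lines (an untested sketch; process_record is just a placeholder for your own processing):

import mmap

# Map the file instead of reading it all into memory at once.
with open("data.in", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        line = mm.readline()
        while line:
            process_record(line)   # placeholder for your own processing
            line = mm.readline()
    finally:
        mm.close()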

Crashworks
100 MB is just fine; the problem starts when it hits 1.7 GB of virtual memory.
SilentGhost
Yikes! That sounds more like you are hanging onto references to many things, so that the garbage collector can't clean them up. This can happen if you save off a reference to your intermediate data in the processing class.
Crashworks
that's exactly my question!
SilentGhost
+3  A: 

Don't read the entire 100 MB file in at once. Use streams to process a little bit at a time. Check out this blog post about handling large CSV and XML files: http://lethain.com/entry/2009/jan/22/handling-very-large-csv-and-xml-files-in-python/

Here is a sample of the code from the article.

from __future__ import with_statement # for python 2.5

with open('data.in','r') as fin:
    with open('data.out','w') as fout:
        for line in fin:
            fout.write(','.join(line.split(' ')))
Sam Corder
It doesn't seem to scale in terms of code; I don't just need to rearrange bits, there's more processing involved.
SilentGhost
Once you have parsed a detail line and done your reduction calculations, make sure you aren't hanging on to any of the objects created from parsing the details. Python GC is reference-based: as long as there is a reference to an object, it won't get GC'ed.
Sam Corder
Just to add: if you have two objects that refer to each other, they will never be garbage collected unless one of them lets go of the reference to the other. Check for this kind of circular reference if you see your memory usage ballooning and you think the objects should be out of scope.
Sam Corder
@Sam Corder: Cyclic garbage collection has long since been added to Python.
Torsten Marek
@Torsten Marek: Very cool. Thanks for the correction.
Sam Corder
+2  A: 

So, from your comments I assume that your file looks something like this:

item1,item2,item3,item4,item5,item6,item7,...,itemn

which you reduce to a single value by repeated application of some combination function. As a solution, only read a single value at a time:

def read_values(f):
    buf = []
    while True:
        c = f.read(1)
        if c == ",":
            yield parse("".join(buf))
            buf = []
        elif c == "":
            yield parse("".join(buf))
            return
        else:
            buf.append(c)

with open("some_file", "r") as f:
     agg = initial
     for v in read_values(f):
         agg = combine(agg, v)

This way, memory consumption stays constant, unless agg grows in time.

A few notes:

  1. Provide appropriate implementations of initial, parse and combine.
  2. Don't read the file byte by byte; instead, read into a fixed-size buffer, parse values from the buffer, and read more as you need it (see the sketch at the end of this answer).
  3. This is basically what the builtin reduce function does, but I've used an explicit for loop above for clarity. Here's the same thing using reduce:

    with open("some_file", "r") as f:
        agg = reduce(combine, read_values(f), initial)
    

I hope I interpreted your problem correctly.
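
A rough buffered variant of read_values (untested sketch; parse is the same placeholder as above, and the 64 KB chunk size is arbitrary):

def read_values_buffered(f, chunk_size=64 * 1024):
    # Read fixed-size chunks and split values out of them,
    # instead of going byte-by-byte.
    leftover = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        pieces = (leftover + chunk).split(",")
        leftover = pieces.pop()          # last piece may be incomplete
        for piece in pieces:
            yield parse(piece)
    if leftover:
        yield parse(leftover)            # final value after EOF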

Torsten Marek
I'm sorry if I put it clumsily, but by reduce I meant "make 32 KB from 100 MB".
SilentGhost
No, I didn't mean that, I meant the reduce builtin.
Torsten Marek
I've added a `reduce` example.
J.F. Sebastian
btw, `f.read()` should be `f.read(1)` in your code. And open("somefile", r) -> open("somefile", "r").
J.F. Sebastian
@J.F.: Ah, the joys of coding without testing. I've actually tried out the code and used f.read(1) there.
Torsten Marek
+1: Process incrementally
S.Lott
A: 

First of all, don't touch the garbage collector. That's not the problem, nor the solution.

It sounds like the real problem you're having is not with the file reading at all, but with the data structures that you're allocating as you process the files. Consider using del to remove structures that you no longer need during processing. Also, you might consider using marshal to dump some of the processed data to disk while you work through the next 100 MB of input files.
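
For example, something like this could spill intermediate results to disk and reload them later (a sketch; partial_results is a made-up name, and marshal only handles simple built-in types):

import marshal

# Dump intermediate results so they don't have to stay in memory.
with open("partial.dat", "wb") as out:
    marshal.dump(partial_results, out)
del partial_results            # drop the in-memory copy

# ...later, when the data is needed again:
with open("partial.dat", "rb") as inp:
    partial_results = marshal.load(inp)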

For file reading, you have basically two options: Unix-style files as streams, or memory-mapped files. For stream-based files, the default Python file object is already buffered, so the simplest code is also probably the most efficient:

  with open("filename", "r") as f:
    for line in f:
       # do something with a line of the files

Alternatively, you can use f.read(size) to read blocks of the file. However, usually you do this to gain CPU performance, by multithreading the processing part of your script, so that you can read and process at the same time. It doesn't help with memory usage, though; in fact, it uses more memory.
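
A sketch of reading in blocks (process_block is a placeholder for your own processing; the block size is arbitrary):

with open("filename", "r") as f:
    while True:
        block = f.read(64 * 1024)    # read a fixed-size block
        if not block:
            break
        process_block(block)         # placeholder for your own processing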

The other option is mmap, which looks like this:

  with open("filename", "r+") as f:
    map = mmap.mmap(f.fileno(), 0)
    line = map.readline()
    while line != '':
       # process a line
       line = map.readline()

This sometimes outperforms streams, but it also won't improve memory usage.

+6  A: 

I'd suggest looking at the presentation by David Beazley on using generators in Python. This technique allows you to handle a lot of data and do complex processing quickly, without blowing up your memory use. IMO, the trick isn't holding a huge amount of data in memory as efficiently as possible; the trick is avoiding loading a huge amount of data into memory at the same time.
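
As a rough illustration of that style (a sketch; the comma-separated float format is made up), a pipeline of generators keeps only one record in memory at a time:

def read_lines(filename):
    # Yield one line at a time; the whole file is never in memory.
    with open(filename, "r") as f:
        for line in f:
            yield line

def parse_records(lines):
    # Turn each comma-separated line into a tuple of floats.
    for line in lines:
        yield tuple(float(x) for x in line.split(","))

def running_sum(records):
    total = 0.0
    for rec in records:
        total += sum(rec)
    return total

# The whole pipeline streams: only one line/record is alive at any moment.
result = running_sum(parse_records(read_lines("data.in")))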

Ryan Ginstrom
Gah, as soon as I saw the question I jumped in to answer with a link to the Beazley stuff and saw you'd given it as an answer already. Oh well, I'll have to vote you up +1 instead! Just wish I could give it more than +1.
Van Gale
A: 

In your example code, data is being stored in the fc variable. If you don't keep a reference to fc around, your entire file contents will be removed from memory when the read method ends.

If they are not, then you are keeping a reference somewhere. Maybe the reference is being created in read_100_mb_file, maybe in process. If there is no reference, the CPython implementation will deallocate it almost immediately.
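
To illustrate, using the hypothetical read_100_mb_file from your question:

class Processor(object):
    def read(self, filename):
        fc = read_100_mb_file(filename)
        self.process(fc)
        # fc goes out of scope when read() returns; with no other
        # references, CPython frees the contents right away.

    def read_and_keep(self, filename):
        # Storing the contents on self keeps them alive for as long
        # as this Processor instance lives.
        self.fc = read_100_mb_file(filename)
        self.process(self.fc)

    def process(self, content):
        pass  # placeholder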

There are some tools to help you find where this reference is: guppy, dowser, pysizer...
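
For instance, with guppy (assuming it is installed) you can print a summary of what is currently occupying the heap:

from guppy import hpy

h = hpy()
print(h.heap())    # breakdown of live objects by type and size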

nosklo