views: 639

answers: 8

It's said that Python automatically manages memory. I'm confused because I have a Python program that consistently uses more than 2GB of memory.

It's a simple multi-threaded binary data downloader and unpacker.

def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count = struct.unpack("!I", data[:4])[0]
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)

class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()

There are 15 threads running. Each request fetches about 15MB of data, unpacks it, and saves it to a local text file. How can this program consume more than 2GB of memory? Do I need to do any memory recycling in this case? How can I see how much memory each object or function uses?

I would appreciate any advice or tips on how to keep a Python program running in a memory-efficient mode.

Edit: Here is the output of "cat /proc/meminfo"

MemTotal:        7975216 kB
MemFree:          732368 kB
Buffers:           38032 kB
Cached:          4365664 kB
SwapCached:        14016 kB
Active:          2182264 kB
Inactive:        4836612 kB
+1  A: 

You can make this program more memory efficient by not reading all 15MB from the TCP connection, but instead processing each record as it is read. This will make the remote servers wait for you, of course, but that's okay.
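
A rough sketch of that idea, assuming 30-byte fixed-length records (the record size and the actual unpacking are placeholders here, since they aren't shown in the question):

    import struct
    import urllib2

    RECORD_SIZE = 30  # assumption: the poster mentions 30-byte chunks

    def GetData(url):
        response = urllib2.urlopen(urllib2.Request(url))
        # read only the 4-byte header first, then one record at a time
        count = struct.unpack("!I", response.read(4))[0]
        for i in xrange(count):
            record = response.read(RECORD_SIZE)
            # UNPACK FIXED LENGTH OF BINARY DATA HERE
            yield record  # replace with the unpacked (field1, field2, field3)
        response.close()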

Python is just not very memory efficient. It wasn't built for that.

vy32
+4  A: 

The last line should surely be f.close()? Those trailing parens are kinda important.

cjrh
Yes, it's f.close(). It was a typo in the original post.
jack
+1  A: 

You could do more of your work in compiled C code if you convert this to a list comprehension:

data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))

to:

data = [';'.join(items) for items in GetData(url)]

Since GetData is a generator that yields 3-tuples, this comprehension builds the same list of ';'-joined strings as your loop; the difference is that the iteration and appending are done in compiled code rather than Python bytecode, and you skip the repeated data.append lookups on every pass.

This may also help somewhat with your original memory problem, since fewer temporary objects are created per iteration. Remember that a value in Python has much more overhead than one in a language like C: each value is itself an object, and each name reference to it adds more, so the theoretical storage requirement can easily expand several-fold. In your case, 15MB x 15 threads = 225MB of raw data, plus the per-object overhead of every string held in a data list, can quickly grow toward the 2GB you observe.

Paul McGuire
+4  A: 

Consider using xrange() instead of range(): xrange() produces its values lazily, one at a time, whereas range() builds the whole list in memory up front.

I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.

Currently you keep both in memory at the same time, and that is going to be quite big. So you've got at least two copies of your data in memory, plus some metadata.

Also the final line

    f.write(os.linesep.join(data))

may actually mean you temporarily have a third copy in memory (a big string containing the entire output file).

So I'd say you're doing it in quite an inefficient way, keeping the entire input file, entire output file and a fair amount of intermediate data in memory at once.

Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.

Likewise, reading the response could be done in chunks. As they're fixed-length records, this should be reasonably easy.
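
A minimal sketch of the batching idea, reusing GetData, url and filename from the question (the batch size of 100 is just the figure mentioned above):

    import os

    f = open(filename, 'w')
    batch = []
    for item in GetData(url):
        batch.append(';'.join(item))
        if len(batch) >= 100:
            # flush a small batch instead of accumulating the whole file
            f.write(os.linesep.join(batch) + os.linesep)
            batch = []
    if batch:
        f.write(os.linesep.join(batch) + os.linesep)
    f.close()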

MarkR
+5  A: 

The major culprit here is, as mentioned above, the range() call. It will create a list with 15 million members, and that will eat up 200MB of your memory; with 15 threads, that's 3GB.

Also, don't read the whole 15MB response into data; read it bit by bit from the response. Keeping those 15MB in a variable uses 15MB more memory than reading piece by piece from the response.

You might want to consider simply extracting records until you run out of input data, and comparing the count of records you extracted with what the first four bytes said it should be. Then you need neither range() nor xrange(). Seems more pythonic to me. :)
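
Roughly something like this, again assuming 30-byte fixed-length records and leaving the actual unpacking as a placeholder:

    import struct
    import urllib2

    RECORD_SIZE = 30  # assumption

    def GetData(url):
        response = urllib2.urlopen(urllib2.Request(url))
        expected = struct.unpack("!I", response.read(4))[0]
        seen = 0
        while True:
            record = response.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break  # ran out of input data
            seen += 1
            # UNPACK FIXED LENGTH OF BINARY DATA HERE
            yield record  # replace with (field1, field2, field3)
        response.close()
        if seen != expected:
            raise ValueError("expected %d records, got %d" % (expected, seen))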

Lennart Regebro
Actually, the variable count here is nowhere near 15 million, because every chunk of binary data is 30 bytes, so it won't create a list with 15 million elements.
jack
Ah, I see. Still, it will help about as much as not reading all the data into `data`, then.
Lennart Regebro
A: 

There are two obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run() - together these will take about 500MB), and there are probably others in the skipped code. Both are easy to make memory efficient. Use response.read(4) instead of reading the whole response at once, and do the same in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Change data.append(...) in MyThread.run() to:

if not first:
    f.write(os.linesep)
f.write(';'.join(item))

These changes will save you a lot of memory.
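
Put together, the run() body might then look roughly like this (assuming GetData has also been changed to read the response incrementally, as described above):

    import os

    f = open(filename, 'w')
    first = True
    for item in GetData(url):
        if not first:
            f.write(os.linesep)
        f.write(';'.join(item))
        first = False
    f.close()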

Denis Otkidach
A: 

Make sure you delete the threads after they are stopped (using del).
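
For example, assuming the threads are kept in a list called threads (the question doesn't show how they are started), a minimal sketch:

    # join finished threads and drop the references so the thread objects
    # (and anything they still reference) can be garbage collected
    for t in threads:
        t.join()
    del threads[:]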

Kalmi
+2  A: 

Like others have said, you need at least the following two changes:

  1. Do not create a huge list of integers with range

    # use xrange
    for i in xrange(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)
    
  2. Do not create a huge string as the full file body to be written at once

    # use writelines
    f = open(filename, 'w')
    f.writelines((datum + os.linesep) for datum in data)
    f.close()
    

Even better, you could write the file as:

    items = GetData(url)
    f = open(filename, 'w')
    for item in items:
        f.write(';'.join(item) + os.linesep)
    f.close()
ΤΖΩΤΖΙΟΥ