views: 639

answers: 8

It's said that Python automatically manages memory. I'm confused because I have a Python program that consistently uses more than 2GB of memory.

It's a simple multi-threaded binary data downloader and unpacker.

def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count = struct.unpack("!I", data[:4])[0]
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)

class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()

There are 15 threads running. Each request fetches about 15MB of data, unpacks it, and saves it to a local text file. How can this program consume more than 2GB of memory? Do I need to do any memory recycling in this case? How can I see how much memory each object or function uses?

I would appreciate any advice or tips on how to keep a Python program running in a memory-efficient mode.

Edit: Here is the output of "cat /proc/meminfo"

MemTotal:        7975216 kB
MemFree:          732368 kB
Buffers:           38032 kB
Cached:          4365664 kB
SwapCached:        14016 kB
Active:          2182264 kB
Inactive:        4836612 kB
+1  A: 

You can make this program more memory efficient by not reading all 15MB from the TCP connection, but instead processing each record as it is read. This will make the remote servers wait for you, of course, but that's okay.
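
A rough sketch of that idea, assuming 30-byte fixed-length records (the record size and the actual unpacking are placeholders here, since they aren't shown in the question):

    import struct
    import urllib2

    RECORD_SIZE = 30  # assumption: the poster mentions 30-byte chunks

    def GetData(url):
        response = urllib2.urlopen(urllib2.Request(url))
        # read only the 4-byte header first, then one record at a time
        count = struct.unpack("!I", response.read(4))[0]
        for i in xrange(count):
            record = response.read(RECORD_SIZE)
            # UNPACK FIXED LENGTH OF BINARY DATA HERE
            yield record  # replace with the unpacked (field1, field2, field3)
        response.close()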

Python is just not very memory efficient. It wasn't built for that.

vy32
+4  A: 

The last line should surely be f.close()? Those trailing parens are kinda important.

cjrh
Yes, it's f.close(). It was a typo in the original post.
jack
+1  A: 

You could do more of your work in compiled C code if you convert this to a list comprehension:

data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))

to:

data = [';'.join(items) for items in GetData(url)]

Since GetData is a generator that yields 3-tuples, this comprehension builds the same list of ';'-joined strings as your loop; the difference is that the iteration and appending are done in compiled code rather than Python bytecode, and you skip the repeated data.append lookups on every pass.

This may also help somewhat with your original memory problem, since fewer temporary objects are created per iteration. Remember that a value in Python has much more overhead than one in a language like C: each value is itself an object, and each name reference to it adds more, so the theoretical storage requirement can easily expand several-fold. In your case, 15MB x 15 threads = 225MB of raw data, plus the per-object overhead of every string held in a data list, can quickly grow toward the 2GB you observe.

Paul McGuire
+4  A: 

Consider using xrange() instead of range(): xrange() produces its values lazily, one at a time, whereas range() builds the whole list in memory up front.

I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.

Currently you keep both in memory at the same time, and that is going to be quite big. So you've got at least two copies of your data in memory, plus some metadata.

Also the final line

    f.write(os.linesep.join(data))

may actually mean you temporarily have a third copy in memory (a big string containing the entire output file).

So I'd say you're doing it in quite an inefficient way, keeping the entire input file, entire output file and a fair amount of intermediate data in memory at once.

Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.

Likewise, reading the response could be done in chunks. As they're fixed-length records, this should be reasonably easy.
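
A minimal sketch of the batching idea, reusing GetData, url and filename from the question (the batch size of 100 is just the figure mentioned above):

    import os

    f = open(filename, 'w')
    batch = []
    for item in GetData(url):
        batch.append(';'.join(item))
        if len(batch) >= 100:
            # flush a small batch instead of accumulating the whole file
            f.write(os.linesep.join(batch) + os.linesep)
            batch = []
    if batch:
        f.write(os.linesep.join(batch) + os.linesep)
    f.close()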

MarkR
+5  A: 

The major culprit here is, as mentioned above, the range() call. It will create a list with 15 million members, and that will eat up 200MB of your memory; with 15 threads, that's 3GB.

Also, don't read the whole 15MB response into data; read it bit by bit from the response. Keeping those 15MB in a variable uses 15MB more memory than reading piece by piece from the response.

You might want to consider simply extracting records until you run out of input data, and comparing the count of records you extracted with what the first four bytes said it should be. Then you need neither range() nor xrange(). Seems more pythonic to me. :)
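
Roughly something like this, again assuming 30-byte fixed-length records and leaving the actual unpacking as a placeholder:

    import struct
    import urllib2

    RECORD_SIZE = 30  # assumption

    def GetData(url):
        response = urllib2.urlopen(urllib2.Request(url))
        expected = struct.unpack("!I", response.read(4))[0]
        seen = 0
        while True:
            record = response.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break  # ran out of input data
            seen += 1
            # UNPACK FIXED LENGTH OF BINARY DATA HERE
            yield record  # replace with (field1, field2, field3)
        response.close()
        if seen != expected:
            raise ValueError("expected %d records, got %d" % (expected, seen))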

Lennart Regebro
Actually, the variable count here is nowhere near 15 million, because every chunk of binary data is 30 bytes, so it won't create a list with 15 million elements.
jack
Ah, I see. Still, it will help about as much as not reading all the data into `data`, then.
Lennart Regebro
A: 

There are two obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run() - together these will take about 500MB), and there are probably others in the skipped code. Both are easy to make memory efficient. Use response.read(4) instead of reading the whole response at once, and do the same in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Change data.append(...) in MyThread.run() to:

if not first:
    f.write(os.linesep)
f.write(';'.join(item))

These changes will save you a lot of memory.
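
Put together, the run() body might then look roughly like this (assuming GetData has also been changed to read the response incrementally, as described above):

    import os

    f = open(filename, 'w')
    first = True
    for item in GetData(url):
        if not first:
            f.write(os.linesep)
        f.write(';'.join(item))
        first = False
    f.close()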

Denis Otkidach
A: 

Make sure you delete the threads after they are stopped (using del).
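
For example, assuming the threads are kept in a list called threads (the question doesn't show how they are started), a minimal sketch:

    # join finished threads and drop the references so the thread objects
    # (and anything they still reference) can be garbage collected
    for t in threads:
        t.join()
    del threads[:]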

Kalmi
+2  A: 

Like others have said, you need at least the following two changes:

  1. Do not create a huge list of integers with range

    # use xrange
    for i in xrange(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)
    
  2. Do not create a huge string as the full file body to be written at once

    # use writelines
    f = open(filename, 'w')
    f.writelines((datum + os.linesep) for datum in data)
    f.close()
    

Even better, you could write the file as:

    items = GetData(url)
    f = open(filename, 'w')
    for item in items:
        f.write(';'.join(item) + os.linesep)
    f.close()
ΤΖΩΤΖΙΟΥ