I'm trying to jury-rig the Amazon S3 python library to allow chunked handling of large files. Right now it does a "self.body = http_response.read()", so if you have a 3G file you're going to read the entire thing into memory before getting any control over it.

My current approach is to try to keep the interface for the library the same but provide a callback after reading each chunk of data. Something like the following:

data = []
while True:
    chunk = http_response.read(CHUNKSIZE)
    if not chunk:
        break  # end of the response body
    if callback:
        callback(chunk)  # give the caller each chunk as it arrives
    data.append(chunk)

Now I need to do something like:

self.body = ''.join(data)

Is join the right way to do this or is there another (better) way of putting all the chunks together?

A: 

In Python 3, bytes objects are distinct from str, but I don't know of any reason there would be anything wrong with this approach.
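
Joining works the same way for bytes as for str; only the separator changes. A quick illustrative sketch (the chunks here are made up):

# Hypothetical chunks standing in for what http_response.read(CHUNKSIZE) returns.
chunks = [b"first ", b"second ", b"third"]

# Python 2: str chunks joined with a str separator, e.g. ''.join(chunks).
# Python 3: the chunks are bytes, so the separator must be bytes too:
body = b"".join(chunks)
assert body == b"first second third"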

recursive
A: 

join seems fine if you really do need to put the entire string together, but then you just wind up storing the whole thing in RAM anyway. In a situation like this, I would try to see if there's a way to process each part of the string and then discard the processed part, so you only need to hold a fixed number of bytes in memory at a time. That's usually the point of the callback approach. (If you can only process part of a chunk at a time, use a buffer as a queue to store the unprocessed data.)
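
A minimal sketch of that buffering idea, assuming a hypothetical per-record handler and newline-terminated records:

class ChunkProcessor(object):
    """Callback object: processes complete records and keeps only the
    unprocessed tail of the stream in memory."""

    def __init__(self, handle_record):
        self.buffer = b""               # bytes not yet processed
        self.handle_record = handle_record

    def __call__(self, chunk):
        self.buffer += chunk
        # Hand off complete newline-terminated records, keep the partial tail.
        while b"\n" in self.buffer:
            record, self.buffer = self.buffer.split(b"\n", 1)
            self.handle_record(record)

# With the loop from the question: callback = ChunkProcessor(save_record),
# where save_record is whatever per-record processing you need (hypothetical).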

David Zaslavsky
Agreed, but I'm attempting to preserve the existing API and that requires the whole thing in memory. Ideally the body would be a generator instead of being a chunk of bytes, letting the user deal with it as they want...
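
Roughly this kind of thing (iter_body is just a sketch, not the library's API):

def iter_body(http_response, chunksize=64 * 1024):
    # Yield the response body one chunk at a time instead of
    # materializing the whole thing in memory.
    while True:
        chunk = http_response.read(chunksize)
        if not chunk:
            break
        yield chunk

# The caller could then stream it wherever they like:
# for chunk in iter_body(http_response):
#     outfile.write(chunk)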
Parand
+3  A: 

''.join() is the best method for joining chunks of data. The alternative boils down to repeated concatenation, which is O(n**2) because strings are immutable and a new string has to be created at every concatenation. Granted, recent versions of CPython optimize repeated concatenation with += to be O(n), but that optimization only makes it roughly equivalent to ''.join() anyway, which is explicitly O(n) over the number of bytes.
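
To make the difference concrete, a rough comparison of the two approaches (function names are illustrative only):

def build_by_concatenation(chunks):
    # Naive approach: each += may copy everything accumulated so far,
    # which is O(n**2) in the worst case; CPython's in-place += optimization
    # only sometimes avoids the copies.
    body = ''
    for chunk in chunks:
        body += chunk
    return body

def build_by_join(chunks):
    # ''.join() computes the total size once and copies each byte exactly
    # once, so it is O(n) in the total number of bytes.
    return ''.join(chunks)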

Devin Jeanpierre
+1  A: 

hm - what problem are you trying to solve? I suspect the answer depends on what you are trying to do with the data.

Since in general you don't want a whole 3 GB file in memory, I'd not store the chunks in a list, but iterate over the http_response and write it straight to disk, into a temporary or persistent file, using the normal write() method on an appropriate file handle.
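
For instance, something along these lines (the file name and chunk size are just placeholders):

CHUNKSIZE = 64 * 1024  # placeholder chunk size

with open('/tmp/s3_download.tmp', 'wb') as outfile:
    while True:
        chunk = http_response.read(CHUNKSIZE)
        if not chunk:
            break
        outfile.write(chunk)  # stream to disk; never hold more than one chunk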

If you do want two copies of the data in memory, your method will require at least 6 GB for your hypothetical 3 GB file, which is presumably significant for most hardware. I know that list join methods are fast and all that, but since this is a really RAM-constrained process, maybe you want to find some way of doing it better? StringIO (http://docs.python.org/library/stringio.html) creates string objects that can be appended to in memory; the pure Python one, since it has to work with immutable strings, just uses your list-join trick internally, but the C-based cStringIO might actually append to a memory buffer internally. I don't have its source code to hand, so that would bear checking.
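
If you go that route, the usage would be roughly the following (cStringIO on Python 2, io.BytesIO as the modern equivalent):

try:
    from cStringIO import StringIO        # C implementation, Python 2
except ImportError:
    from io import BytesIO as StringIO    # Python 3 equivalent

buf = StringIO()
while True:
    chunk = http_response.read(CHUNKSIZE)
    if not chunk:
        break
    if callback:
        callback(chunk)
    buf.write(chunk)            # append to the in-memory buffer
self.body = buf.getvalue()      # one final copy of the accumulated data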

If you do wish to do some kind of analysis on the data and really wish to keep it in memory with minimal overhead, you might want to consider some of the byte array objects from Numeric/NumPy as an alternative to StringIO. They are high-performance code optimised for large arrays and might be what you need.
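
A sketch of that idea, assuming the response carries a Content-Length header and NumPy is available:

import numpy as np

content_length = int(http_response.getheader('Content-Length'))
body = np.empty(content_length, dtype=np.uint8)   # preallocated byte array

offset = 0
while True:
    chunk = http_response.read(CHUNKSIZE)
    if not chunk:
        break
    body[offset:offset + len(chunk)] = np.frombuffer(chunk, dtype=np.uint8)
    offset += len(chunk)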

As a useful example of a general-purpose file-handling object with a memory-efficient, iterator-friendly approach, you might want to check out the django File object chunk handling code: http://code.djangoproject.com/browser/django/trunk/django/core/files/base.py.
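
The pattern there is essentially a generator over read() calls, roughly like this simplified version of what Django's File.chunks() does:

def chunks(file_obj, chunk_size=64 * 1024):
    # Rewind if possible, then yield fixed-size pieces until EOF.
    if hasattr(file_obj, 'seek'):
        file_obj.seek(0)
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        yield data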

dan mackinlay
Excellent point regarding the need for 6 GB instead of 3 GB with my method above. I want to process the chunks and get rid of them (just write them to disk in this case), but I also wanted to preserve the existing semantics that provide access to the data in memory. I might have to forgo the latter.
Parand