I'm trying to jury-rig the Amazon S3 python library to allow chunked handling of large files. Right now it does a "self.body = http_response.read()", so if you have a 3G file you're going to read the entire thing into memory before getting any control over it.

My current approach is to try to keep the interface for the library the same but provide a callback after reading each chunk of data. Something like the following:

data = []
while True:
    chunk = http_response.read(CHUNKSIZE)
    if not chunk:
        break  # end of the response body
    if callback:
        callback(chunk)  # give the caller each chunk as it arrives
    data.append(chunk)

Now I need to do something like:

self.body = ''.join(data)

Is join the right way to do this or is there another (better) way of putting all the chunks together?

A: 

In Python 3, bytes objects are distinct from str, but I don't know of any reason there would be anything wrong with this approach.
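
Joining works the same way for bytes as for str; only the separator changes. A quick illustrative sketch (the chunks here are made up):

# Hypothetical chunks standing in for what http_response.read(CHUNKSIZE) returns.
chunks = [b"first ", b"second ", b"third"]

# Python 2: str chunks joined with a str separator, e.g. ''.join(chunks).
# Python 3: the chunks are bytes, so the separator must be bytes too:
body = b"".join(chunks)
assert body == b"first second third"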

recursive
A: 

join seems fine if you really do need to put the entire string together, but then you just wind up storing the whole thing in RAM anyway. In a situation like this, I would try to see if there's a way to process each part of the string and then discard the processed part, so you only need to hold a fixed number of bytes in memory at a time. That's usually the point of the callback approach. (If you can only process part of a chunk at a time, use a buffer as a queue to store the unprocessed data.)
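
A minimal sketch of that buffering idea, assuming a hypothetical per-record handler and newline-terminated records:

class ChunkProcessor(object):
    """Callback object: processes complete records and keeps only the
    unprocessed tail of the stream in memory."""

    def __init__(self, handle_record):
        self.buffer = b""               # bytes not yet processed
        self.handle_record = handle_record

    def __call__(self, chunk):
        self.buffer += chunk
        # Hand off complete newline-terminated records, keep the partial tail.
        while b"\n" in self.buffer:
            record, self.buffer = self.buffer.split(b"\n", 1)
            self.handle_record(record)

# With the loop from the question: callback = ChunkProcessor(save_record),
# where save_record is whatever per-record processing you need (hypothetical).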

David Zaslavsky
Agreed, but I'm attempting to preserve the existing API and that requires the whole thing in memory. Ideally the body would be a generator instead of being a chunk of bytes, letting the user deal with it as they want...
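
Roughly this kind of thing (iter_body is just a sketch, not the library's API):

def iter_body(http_response, chunksize=64 * 1024):
    # Yield the response body one chunk at a time instead of
    # materializing the whole thing in memory.
    while True:
        chunk = http_response.read(chunksize)
        if not chunk:
            break
        yield chunk

# The caller could then stream it wherever they like:
# for chunk in iter_body(http_response):
#     outfile.write(chunk)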
Parand
+3  A: 

''.join() is the best method for joining chunks of data. The alternative boils down to repeated concatenation, which is O(n**2) because strings are immutable and a new string has to be created at every concatenation. Granted, recent versions of CPython optimize repeated concatenation with += to be O(n), but that optimization only makes it roughly equivalent to ''.join() anyway, which is explicitly O(n) over the number of bytes.
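
To make the difference concrete, a rough comparison of the two approaches (function names are illustrative only):

def build_by_concatenation(chunks):
    # Naive approach: each += may copy everything accumulated so far,
    # which is O(n**2) in the worst case; CPython's in-place += optimization
    # only sometimes avoids the copies.
    body = ''
    for chunk in chunks:
        body += chunk
    return body

def build_by_join(chunks):
    # ''.join() computes the total size once and copies each byte exactly
    # once, so it is O(n) in the total number of bytes.
    return ''.join(chunks)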

Devin Jeanpierre
+1  A: 

hm - what problem are you trying to solve? I suspect the answer depends on what you are trying to do with the data.

Since in general you don't want a whole 3 GB file in memory, I'd not store the chunks in a list, but iterate over the http_response and write it straight to disk, into a temporary or persistent file, using the normal write() method on an appropriate file handle.
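
For instance, something along these lines (the file name and chunk size are just placeholders):

CHUNKSIZE = 64 * 1024  # placeholder chunk size

with open('/tmp/s3_download.tmp', 'wb') as outfile:
    while True:
        chunk = http_response.read(CHUNKSIZE)
        if not chunk:
            break
        outfile.write(chunk)  # stream to disk; never hold more than one chunk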

If you do want two copies of the data in memory, your method will require at least 6 GB for your hypothetical 3 GB file, which is presumably significant for most hardware. I know that list join methods are fast and all that, but since this is a really RAM-constrained process, maybe you want to find some way of doing it better? StringIO (http://docs.python.org/library/stringio.html) creates string objects that can be appended to in memory; the pure Python one, since it has to work with immutable strings, just uses your list-join trick internally, but the C-based cStringIO might actually append to a memory buffer internally. I don't have its source code to hand, so that would bear checking.
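
If you go that route, the usage would be roughly the following (cStringIO on Python 2, io.BytesIO as the modern equivalent):

try:
    from cStringIO import StringIO        # C implementation, Python 2
except ImportError:
    from io import BytesIO as StringIO    # Python 3 equivalent

buf = StringIO()
while True:
    chunk = http_response.read(CHUNKSIZE)
    if not chunk:
        break
    if callback:
        callback(chunk)
    buf.write(chunk)            # append to the in-memory buffer
self.body = buf.getvalue()      # one final copy of the accumulated data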

If you do wish to do some kind of analysis on the data and really wish to keep it in memory with minimal overhead, you might want to consider some of the byte array objects from Numeric/NumPy as an alternative to StringIO. They are high-performance code optimised for large arrays and might be what you need.
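
A sketch of that idea, assuming the response carries a Content-Length header and NumPy is available:

import numpy as np

content_length = int(http_response.getheader('Content-Length'))
body = np.empty(content_length, dtype=np.uint8)   # preallocated byte array

offset = 0
while True:
    chunk = http_response.read(CHUNKSIZE)
    if not chunk:
        break
    body[offset:offset + len(chunk)] = np.frombuffer(chunk, dtype=np.uint8)
    offset += len(chunk)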

As a useful example of a general-purpose file-handling object with a memory-efficient, iterator-friendly approach, you might want to check out the django File object chunk handling code: http://code.djangoproject.com/browser/django/trunk/django/core/files/base.py.
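
The pattern there is essentially a generator over read() calls, roughly like this simplified version of what Django's File.chunks() does:

def chunks(file_obj, chunk_size=64 * 1024):
    # Rewind if possible, then yield fixed-size pieces until EOF.
    if hasattr(file_obj, 'seek'):
        file_obj.seek(0)
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        yield data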

dan mackinlay
Excellent point regarding the need for 6 GB instead of 3 GB with my method above. I want to process the chunks and get rid of them (just write them to disk in this case), but I also wanted to preserve the existing semantics that provide access to the data in memory. I might have to forgo the latter.
Parand