views:

406

answers:

2

Right now, I am buffering bytes using strings, StringIO, or cStringIO. But, I often need to remove bytes from the left side of the buffer. A naive approach would rebuild the entire buffer. Is there an optimal way to do this, if left-truncating is a very common operation? Python's garbage collector should actually GC the truncated bytes.

Any sort of algorithm for this (keep the buffer in small pieces?), or an existing implementation, would really help.

Edit:

I tried to use Python 2.7's memoryview for this, but sadly, the data outside the "view" isn't GCed when the original reference is deleted:

# (This will use ~2GB of memory, not 50MB)

memoryview # Requires Python 2.7+

smalls = []

for i in xrange(10):
    big = memoryview('z'*(200*1000*1000))
    small = big[195*1000*1000:]
    del big
    smalls.append(small)
    print '.',
+1  A: 

Build your buffer as a list of characters or lines and slice the list. Only join as string on output. This is pretty efficient for most types of 'mutable string' behaviour.

The GC will collect the truncated bytes because they are no longer referenced in the list.

UPDATE: For modifying the list head you can simply reverse the list. This sounds like an inefficient thing to do however python's list implementation optimises this internally.

from http://effbot.org/zone/python-list.htm :

Reversing is fast, so temporarily reversing the list can often speed things up if you need to remove and insert a bunch of items at the beginning of the list:

L.reverse()
# append/insert/pop/delete at far end
L.reverse()
SpliFF
If you are concerned about efficiency here, you might want to use an array from the array module instead.
bayer
lists will have the same problem as buffer for left deletions however (except for the memory allocation) - they have to copy every byte to fill in the hole created by the removal.
Brian
it isn't copying bytes, it's copying references. if you are adding and removing line-by-line the overhead would be minimal. If not you can reverse the list (see answer above)
SpliFF
Copying references is even worse when dealing than characters - you need to copy 4 bytes per entry instead of 1. Reversing is a decent solution though - the onetime reverse cost prevents a cost on each removal, so long as you don't need to append more data after the reverse.
Brian
+2  A: 

A deque will be efficient if left-removal operations are frequent (Unlike using a list, string or buffer, it's amortised O(1) for either-end removal). It will be more costly memory-wise than a string however, as you'll be storing each character as its own string object, rather than a packed sequence.

Alternatively, you could create your own implementation (eg. a linked list of string / buffer objects of fixed size), which may store the data more compactly.

Brian
More recent link for deque: http://docs.python.org/library/collections.html#deque-objects.
Bastien Léonard
Thanks - I've updated the link. I just used the first google result, and hadn't noticed it was the 2.5.2 docs.
Brian