Say I have a binary file of 12GB and I want to slice 8GB out of the middle of it. I know the position indices I want to cut between.

How do I do this? Obviously 12GB won't fit into memory, and that's fine, but 8GB won't either. I thought that would be fine too, but binary data doesn't seem to like being copied in chunks: I was appending 10MB at a time to a new binary file, and there are discontinuities at the edges of each 10MB chunk in the new file.

Is there a Pythonic way of doing this easily?

A: 

http://thatslinux.blog.co.uk/2009/10/10/slicing-and-dicing-files-with-python-7137358/

Might help

anijhaw
I agree with the general seek-and-loop approach -- but the standard library contains methods which handle the loop-and-copy work for you, so doing it yourself (as that post encourages) is arguably suboptimal.
Charles Duffy
@Charles, nothing in the standard library handles the "copy only up to an offset" spec of the OP. I think you're mistaking the purpose of the `length=` parameter to `shutil.copyfileobj`, as the OP tells you in his comment to your answer.
Alex Martelli
@Alex - Quite right. Oops.
Charles Duffy
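For anyone wanting to check the point about `length=`: a minimal sketch, assuming Python 3 and using in-memory `io.BytesIO` streams as stand-ins for real files (the helper name `copy_with_length` is made up for illustration). `shutil.copyfileobj` treats `length` as the read buffer size and keeps copying until the source is exhausted, so it does not stop at an offset.

```python
import io
import shutil

def copy_with_length(data, start, length):
    """Seek to `start`, call shutil.copyfileobj with `length`, return what was copied."""
    src = io.BytesIO(data)
    dst = io.BytesIO()
    src.seek(start)
    # `length` only controls how many bytes are read per iteration;
    # the copy still runs to the end of `src`.
    shutil.copyfileobj(src, dst, length)
    return dst.getvalue()

if __name__ == "__main__":
    copied = copy_with_length(bytes(range(100)), 10, 20)
    print(len(copied))  # 90, not 20
```

So a "copy only N bytes" requirement still needs an explicit loop that counts down the remaining length.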
+4  A: 

Here's a quick example. Adapt as needed:

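def copypart(src, dest, start, length, bufsize=1024*1024):
    """Copy `length` bytes of `src`, starting at byte offset `start`, into `dest`."""
    # 'rb'/'wb' keep the bytes untouched; 'with' closes both files even on error.
    with open(src, 'rb') as f1, open(dest, 'wb') as f2:
        f1.seek(start)
        while length:
            chunk = min(bufsize, length)
            data = f1.read(chunk)
            if not data:
                break  # reached end of file before `length` bytes were copied
            f2.write(data)
            # Count what was actually read, in case of a short read.
            length -= len(data)

if __name__ == '__main__':
    GIG = 2**30
    # Copy 8GB starting 1GB into the file, using the default 1MB buffer.
    copypart('test.bin', 'test2.bin', 1*GIG, 8*GIG)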
Mark Tolonen
I did something very similar to this and it didn't seem to like it. With binary data, if you extract a block out of the middle, can't it sometimes get messed up at the edges because it needs its surrounding bytes to make sense? Hmm. I'll try your code though, cheers. Also, did you get your length and buffer the wrong way round in the last line of your code?
Duncan Tait
That is start and length...the last line uses the default for bufsize. I'm not sure what you mean by "messes up at the edges". If you need surrounding bytes maybe your start and length are incorrect?
Mark Tolonen
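One way to rule out a wrong start or length is to compare the output against the same byte range of the source. A rough sketch, assuming both files are ordinary files on disk; `slice_matches` is a hypothetical helper name, not something from the standard library.

```python
def slice_matches(src, dest, start, length, bufsize=1024 * 1024):
    """Return True if `dest` holds exactly the bytes src[start:start+length]."""
    with open(src, "rb") as f1, open(dest, "rb") as f2:
        f1.seek(start)
        remaining = length
        while remaining:
            n = min(bufsize, remaining)
            a = f1.read(n)
            b = f2.read(n)
            # A mismatch, or running out of data early, means the slice is wrong.
            if a != b or not a:
                return False
            remaining -= len(a)
        # `dest` must not contain anything beyond the requested range.
        return f2.read(1) == b""
```

For example, `slice_matches('test.bin', 'test2.bin', 1 * 2**30, 8 * 2**30)` should come back True after a correct copy. If it returns False, the offsets (or the file modes) are the first thing to check.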
You are correct! I had messed up myself, it all works now :) What's an optimum buffer size for file transfer then? 1MB good?
Duncan Tait
A different size may be faster or slower...the only way to know for sure is to profile. Don't forget to accept an answer :^)
Mark Tolonen
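A rough way to do that profiling, as a sketch only: time the same range copy with a few candidate buffer sizes. The names `copy_range` and `time_bufsizes` are made up here, and the candidate sizes are arbitrary assumptions, not recommendations.

```python
import time

def copy_range(src, dest, start, length, bufsize):
    """Copy up to `length` bytes of `src` from offset `start` into `dest`."""
    copied = 0
    with open(src, "rb") as f1, open(dest, "wb") as f2:
        f1.seek(start)
        while copied < length:
            data = f1.read(min(bufsize, length - copied))
            if not data:
                break  # source ended early
            f2.write(data)
            copied += len(data)
    return copied

def time_bufsizes(src, dest, start, length, sizes):
    """Return {bufsize: seconds} for copying the same range with each buffer size."""
    timings = {}
    for bufsize in sizes:
        t0 = time.perf_counter()
        copy_range(src, dest, start, length, bufsize)
        timings[bufsize] = time.perf_counter() - t0
    return timings
```

Something like `time_bufsizes('test.bin', 'test2.bin', 0, 256 * 2**20, [64 * 1024, 2**20, 16 * 2**20])` gives one timing per size. The OS file cache can make later runs look faster than the first, so it's worth repeating the measurements and varying the order before drawing conclusions.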
Sorry hadn't checked in for awhile!
Duncan Tait
I would think that `chunk = min(bufsize, length)` would be less elaborate.
ΤΖΩΤΖΙΟΥ
@ΤΖΩΤΖΙΟΥ, and you'd be right. I tend to forget about those min/max/any/all builtins since they are easy to write. Thanks.
Mark Tolonen