367 views · 3 answers
This is very odd

I'm reading some (admittedly very large: ~2 GB each) binary files using the NumPy library in Python. I'm using the

thingy = np.fromfile(fileObject, np.int16, 1)

method. This sits right in the middle of a nested loop: I run this read 4096 times per 'channel', the 'channel' loop 9 times for every 'receiver', and the 'receiver' loop 4 times (there are 9 channels per receiver, and 4 receivers). This is repeated for every 'block', of which there are ~3600 per file.

So as you can see, this is very iterative and I knew it would take a long time, but it was taking a LOT longer than I expected - on average 8.5 seconds per 'block'.

I ran some benchmarks using time.clock() etc. and found everything going as fast as it should, except for approximately 1 or 2 samples per 'block' (so 1 or 2 in 4096*9*4) where it would seem to get 'stuck' for a few seconds. Now this should be a case of returning a simple int16 from binary, hardly something that should take seconds... why is it sticking?
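For reference, the kind of per-sample benchmarking described above can be sketched like this (the function and dimension names are illustrative, not my actual code; I'm using time.perf_counter() here rather than time.clock()):

```python
import time
import numpy as np

def read_block(fileObject, n_receivers=4, n_channels=9, n_samples=4096,
               slow_threshold=0.5):
    """Read one 'block' sample by sample, recording any reads that
    take longer than slow_threshold seconds."""
    slow = []  # (receiver, channel, sample, seconds) for outliers
    for rx in range(n_receivers):
        for ch in range(n_channels):
            for s in range(n_samples):
                t0 = time.perf_counter()
                thingy = np.fromfile(fileObject, np.int16, 1)
                dt = time.perf_counter() - t0
                if dt > slow_threshold:
                    slow.append((rx, ch, s, dt))
    return slow
```

Usage would be something like `slow = read_block(open('data.bin', 'rb'))`, then inspecting `slow` to see which samples stalled.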

From the benchmarking I found it was sticking in the SAME place every time (block 2, receiver 8, channel 3, sample 1085 was one of them, for the record!), and it would get stuck there for approximately the same amount of time on each run.

Any ideas?!

Thanks,

Duncan

+2  A: 

Although it's hard to say without some kind of reproducible sample, this sounds like a buffering problem. The first part is buffered, and reads are fast until you reach the end of the buffer; then it slows down while the next buffer is filled, and so on.

Max Shawabkeh
Yes, this sounds likely. Do you know of a way I could test this? Or what the likely buffer size is?
Duncan Tait
Well, one thing to at least determine whether gnibbler or I am closer to the solution is to run it and instantly throw away the results. If the slowdown still occurs, it's more likely to be a buffering problem. Then perhaps see if reading manually instead of through `numpy` changes anything.
Max Shawabkeh
Sure, sorry to keep asking new questions, but how do you 'throw away' an object in Python? I've been trying to find out about dispose etc. for ages but can't find it anywhere.
Duncan Tait
You can use `gc.collect()` (http://docs.python.org/library/gc.html) to force a garbage collection, but what I meant was simply reading and not assigning the result to anything.
Max Shawabkeh
Another way to test whether it is caused by file buffering is to change the buffer size in the open() call. The default is the OS's default size. See if changing it changes where the pause happens. N.B. Setting the buffer size does not work on every OS - see the docs.
Dave Kirby
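A minimal sketch of that experiment (the file name and buffer sizes here are illustrative; on the real 2 GB file you would time the reads and watch whether the stall moves):

```python
import os
import tempfile
import numpy as np

# Write a small int16 file to stand in for the real data.
path = os.path.join(tempfile.gettempdir(), 'bufsize_demo.bin')
np.arange(1000, dtype=np.int16).tofile(path)

# Re-read it with different buffer sizes passed to open().
for bufsize in (4096, 65536, 1024 * 1024):
    with open(path, 'rb', buffering=bufsize) as fileObject:
        first = np.fromfile(fileObject, np.int16, 1)
        # ...time the per-sample reads here, as in the question...
os.remove(path)
```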
Max S: You may be right - the problem pretty much disappears when I don't assign the instance (created to contain all the data) to anything, even though it still does all the processing. Is there any way to see how big an object/list is in memory? I swear it shouldn't be that large, 2 MB max, probably under 1 MB; that really shouldn't be an issue, should it?
Duncan Tait
Python has quite a large memory overhead when you have lots of small objects. On my 64-bit machine, an empty string takes up 40 bytes, but each extra character takes up only one more byte. See this answer for details: http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory/2212005#2212005
Max Shawabkeh
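You can check this yourself with sys.getsizeof() (the exact numbers vary by Python version and platform, so none are assumed here):

```python
import sys

# Per-object overhead: even an empty str or list costs tens of bytes,
# and a structure of many small objects pays that cost per element.
print(sys.getsizeof(''))         # empty string: overhead only
print(sys.getsizeof('a' * 100))  # ~100 bytes of payload plus overhead
print(sys.getsizeof([]))         # empty list

# A rough estimate for nested lists: sum getsizeof over the containers.
# This ignores shared objects, so it is only an approximation.
data = [[i for i in range(100)] for _ in range(100)]
total = sys.getsizeof(data) + sum(sys.getsizeof(row) for row in data)
print(total)
```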
+2  A: 

Where are you storing the results? When lists/dicts/whatever get very large there can be a noticeable delay when they need to be reallocated and resized.

gnibbler
Well, essentially they're all stored in lists within lists, and the entire set of data (per 'block') is stored in an instance of a class, along with header info. This shouldn't be more than a megabyte really, though... unless Python isn't disposing of lists? Can I force it to do this?
Duncan Tait
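One way to sidestep repeated list reallocation entirely (a sketch, not from the thread; the shape and dtype are assumed from the question) is to preallocate a single NumPy array per block and fill it in place:

```python
import numpy as np

# Preallocate one fixed-size array per block instead of growing
# nested Python lists sample by sample. int16 matches the file dtype.
n_receivers, n_channels, n_samples = 4, 9, 4096
block = np.empty((n_receivers, n_channels, n_samples), dtype=np.int16)

# ...inside the loops: block[rx, ch, s] = value...

print(block.nbytes)  # 4 * 9 * 4096 * 2 bytes = 294912
```

Because the array never resizes, there are no reallocation pauses, and the whole block is a contiguous ~288 KB rather than thousands of small Python objects.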
+1  A: 

Could it be that garbage collection is kicking in for the lists?

Added: is it the data itself, or the block number? What happens if you read the blocks in random order, along the lines of:

import random

r = list(range(4096))  # list() so shuffle works in Python 3
random.shuffle(r)      # in place
for blockno in r:
    file.seek(blockno * ...)
    ...
Denis