367 views · 3 answers
This is very odd

I'm reading some (admittedly very large: ~2 GB each) binary files using the NumPy library in Python. I'm using the

thingy = np.fromfile(fileObject, np.int16, 1)

method. This sits right in the middle of a nested loop: I run this read 4096 times per 'channel', the 'channel' loop 9 times for every 'receiver', and the 'receiver' loop 4 times (there are 9 channels per receiver, and 4 receivers). This is repeated for every 'block', of which there are ~3600 per file.

So as you can see, this is very iterative and I knew it would take a long time, but it was taking a LOT longer than I expected - on average 8.5 seconds per 'block'.

I ran some benchmarks using time.clock() etc. and found everything going as fast as it should, except for approximately 1 or 2 samples per 'block' (so 1 or 2 in 4096*9*4) where it would seem to get 'stuck' for a few seconds. Now this should be a case of returning a simple int16 from binary, hardly something that should take seconds... why is it sticking?
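For reference, the kind of per-sample benchmarking described above can be sketched like this (the function and dimension names are illustrative, not my actual code; I'm using time.perf_counter() here rather than time.clock()):

```python
import time
import numpy as np

def read_block(fileObject, n_receivers=4, n_channels=9, n_samples=4096,
               slow_threshold=0.5):
    """Read one 'block' sample by sample, recording any reads that
    take longer than slow_threshold seconds."""
    slow = []  # (receiver, channel, sample, seconds) for outliers
    for rx in range(n_receivers):
        for ch in range(n_channels):
            for s in range(n_samples):
                t0 = time.perf_counter()
                thingy = np.fromfile(fileObject, np.int16, 1)
                dt = time.perf_counter() - t0
                if dt > slow_threshold:
                    slow.append((rx, ch, s, dt))
    return slow
```

Usage would be something like `slow = read_block(open('data.bin', 'rb'))`, then inspecting `slow` to see which samples stalled.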

From the benchmarking I found it was sticking in the SAME place every time (block 2, receiver 8, channel 3, sample 1085 was one of them, for the record!), and it would get stuck there for approximately the same amount of time on each run.

Any ideas?!

Thanks,

Duncan

+2  A: 

Although it's hard to say without some kind of reproducible sample, this sounds like a buffering problem. The first part is buffered, and reads are fast until you reach the end of the buffer; then it slows down while the next buffer is filled, and so on.

Max Shawabkeh
Yes, this sounds likely. Do you know of a way I could test this? Or what the likely buffer size is?
Duncan Tait
Well, one thing to at least determine whether gnibbler or I am closer to the solution is to run it and instantly throw away the results. If the slowdown still occurs, it's more likely to be a buffering problem. Then perhaps see if reading manually instead of through `numpy` changes anything.
Max Shawabkeh
Sure, sorry to keep asking new questions, but how do you 'throw away' an object in Python? I've been trying to find out about dispose etc. for ages but can't find it anywhere.
Duncan Tait
You can use `gc.collect()` (http://docs.python.org/library/gc.html) to force a garbage collection, but what I meant was simply reading and not assigning the result to anything.
Max Shawabkeh
Another way to test whether it is caused by file buffering is to change the buffer size in the open() call. The default is the OS's default size. See if changing it changes where the pause happens. N.B. Setting the buffer size does not work on every OS - see the docs.
Dave Kirby
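A minimal sketch of that experiment (the file name and buffer sizes here are illustrative; on the real 2 GB file you would time the reads and watch whether the stall moves):

```python
import os
import tempfile
import numpy as np

# Write a small int16 file to stand in for the real data.
path = os.path.join(tempfile.gettempdir(), 'bufsize_demo.bin')
np.arange(1000, dtype=np.int16).tofile(path)

# Re-read it with different buffer sizes passed to open().
for bufsize in (4096, 65536, 1024 * 1024):
    with open(path, 'rb', buffering=bufsize) as fileObject:
        first = np.fromfile(fileObject, np.int16, 1)
        # ...time the per-sample reads here, as in the question...
os.remove(path)
```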
Max S: You may be right - the problem pretty much disappears when I don't assign the instance (created to contain all the data) to anything, even though it still does all the processing. Is there any way to see how big an object/list is in memory? I swear it shouldn't be that large, 2 MB max, probably under 1 MB; that really shouldn't be an issue, should it?
Duncan Tait
Python has quite a large memory overhead when you have lots of small objects. On my 64-bit machine, an empty string takes up 40 bytes, but each extra character takes up only one more byte. See this answer for details: http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory/2212005#2212005
Max Shawabkeh
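You can check this yourself with sys.getsizeof() (the exact numbers vary by Python version and platform, so none are assumed here):

```python
import sys

# Per-object overhead: even an empty str or list costs tens of bytes,
# and a structure of many small objects pays that cost per element.
print(sys.getsizeof(''))         # empty string: overhead only
print(sys.getsizeof('a' * 100))  # ~100 bytes of payload plus overhead
print(sys.getsizeof([]))         # empty list

# A rough estimate for nested lists: sum getsizeof over the containers.
# This ignores shared objects, so it is only an approximation.
data = [[i for i in range(100)] for _ in range(100)]
total = sys.getsizeof(data) + sum(sys.getsizeof(row) for row in data)
print(total)
```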
+2  A: 

Where are you storing the results? When lists/dicts/whatever get very large there can be a noticeable delay when they need to be reallocated and resized.

gnibbler
Well, essentially they're all stored in lists within lists, and the entire set of data (per 'block') is stored in an instance of a class, along with header info. This shouldn't be more than a megabyte really, though... unless Python isn't disposing of lists? Can I force it to do this?
Duncan Tait
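One way to sidestep repeated list reallocation entirely (a sketch, not from the thread; the shape and dtype are assumed from the question) is to preallocate a single NumPy array per block and fill it in place:

```python
import numpy as np

# Preallocate one fixed-size array per block instead of growing
# nested Python lists sample by sample. int16 matches the file dtype.
n_receivers, n_channels, n_samples = 4, 9, 4096
block = np.empty((n_receivers, n_channels, n_samples), dtype=np.int16)

# ...inside the loops: block[rx, ch, s] = value...

print(block.nbytes)  # 4 * 9 * 4096 * 2 bytes = 294912
```

Because the array never resizes, there are no reallocation pauses, and the whole block is a contiguous ~288 KB rather than thousands of small Python objects.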
+1  A: 

Could it be that garbage collection is kicking in for the lists?

Added: is it the data itself, or the block number? What happens if you read the blocks in random order, along the lines of:

import random

r = list(range(4096))  # list() so shuffle works in Python 3
random.shuffle(r)      # in place
for blockno in r:
    file.seek(blockno * ...)
    ...
Denis