Right, I'm iterating through a large binary file

I need to minimise the time of this loop:

import math
import numpy as np

def NB2(self, ID_LEN):
    # read the 24-byte block header (see dTypes.NB_HDR below)
    r1=np.fromfile(ReadFile.fid,dTypes.NB_HDR,1)
    num_receivers=r1[0][0]
    num_channels=r1[0][1]
    num_samples=r1[0][5]

    blockReturn = np.zeros((num_samples,num_receivers,num_channels))

    # the hot loop: one two-value read from disk per sample
    for rec in range(0,num_receivers):
        for chl in range(0,num_channels):
            for smpl in range(0,num_samples):
                r2_iq=np.fromfile(ReadFile.fid,np.int16,2)
                blockReturn[smpl,rec,chl] = np.sqrt(math.fabs(r2_iq[0])*math.fabs(r2_iq[0]) + math.fabs(r2_iq[1])*math.fabs(r2_iq[1]))

    return blockReturn

So, what's going on is as follows: r1 is the header of the block; dTypes.NB_HDR is a numpy dtype I made:

NB_HDR = np.dtype([('f3', np.uint32), ('f4', np.uint32), ('f5', np.uint32), ('f6', np.int32), ('f7', np.int32), ('f8', np.uint32)])

That gets all the information about the forthcoming data block, and nicely puts us in the right position within the file (the start of the data block!).
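
As an aside, because NB_HDR is a structured dtype, the same fields can be pulled out by name rather than positional index, which makes the unpacking self-documenting (a minimal sketch; the field names are the ones defined above):

r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
num_receivers = int(r1['f3'][0])   # was r1[0][0]
num_channels = int(r1['f4'][0])    # was r1[0][1]
num_samples = int(r1['f8'][0])     # was r1[0][5]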

In this data block there are 4096 samples per channel, 4 channels per receiver, and 9 receivers.

So num_receivers, num_channels, num_samples will always be the same (at the moment anyway), but as you can see this is a fairly large amount of data. Each 'sample' is a pair of int16 values that I want to find the magnitude of (hence Pythagoras).
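
For scale, the per-block byte count works out as follows (a quick sanity check; the 24-byte header follows from the six 4-byte fields in NB_HDR):

HEADER_BYTES = 6 * 4                        # six 32-bit fields in NB_HDR
SAMPLE_BYTES = 2 * 2                        # one I/Q pair of int16 values
data_bytes = 9 * 4 * 4096 * SAMPLE_BYTES    # receivers * channels * samples
block_bytes = HEADER_BYTES + data_bytes     # 589,848 bytes per block
print(block_bytes * 20900 / 1e9)            # ~12.3 GB, matching the file size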

This NB2 code is executed for each 'Block' in the file. For a 12GB file (which is how big they are) there are about 20,900 Blocks, and I've got to iterate through 1000 of these files (so, 12TB overall). Any speed advantage, even if it's milliseconds, would be massively appreciated.

EDIT: Actually it might be of help to know how I'm moving around inside the file. I have a function as follows:

def navigateTo(self, blockNum, indexNum):
    # seek to a pre-computed absolute offset (whence=0 means from file start)
    ReadFile.fid.seek(ReadFile.fileIndex[blockNum][indexNum], 0)
    ReadFile.currentBlock = blockNum
    ReadFile.index = indexNum

Before I run all this code I scan the file and build a list of index locations in ReadFile.fileIndex, which I browse using this function before 'seeking' to the absolute location - is this efficient?
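
Roughly, the scanning pass does something like this (a simplified sketch, not my exact code - the one-offset-per-block layout here is just illustrative):

def buildIndex(self):
    # record the absolute offset of each block header, then skip the
    # fixed-size I/Q data that follows it (4 bytes per sample pair)
    ReadFile.fileIndex = []
    while True:
        offset = ReadFile.fid.tell()
        hdr = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
        if hdr.size == 0:
            break                                  # end of file
        n_rec, n_chl, n_smp = hdr[0][0], hdr[0][1], hdr[0][5]
        ReadFile.fileIndex.append([offset])
        ReadFile.fid.seek(int(n_rec * n_chl * n_smp) * 4, 1)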

Cheers

A: 

This is more of an observation than a solution, but porting that function to C++ and loading it in with the Python C API would get you a lot of speed gain to begin with, before any loop optimization.

Xorlev
I don't know C++ at all I'm afraid (embarrassing, I know). Any ideas on how you would do it? I imagine 3-dimensional arrays, alongside binary extraction directly to int16s without bit-swapping and all that low-level nastiness, aren't too easy?
Duncan Tait
I don't think it's as bad as you'd think. That being said, I'm having a hard time envisioning your data structure, as I'm not the most fluent in Python. With the ifstream::seekg() function you can grab your numbers in sequence, depending on the number of bytes needed, and cast and store them in vectors.
Xorlev
@Duncan Tait: You have absolutely no reason to be embarrassed.
John Machin
+1  A: 

I'd try to use as few loops and as many constants as possible. Everything that can be done in a linear fashion should be done that way. If values don't change, use constants to reduce lookups and such, because those eat up CPU cycles.

This is from a theoretical point of view ;-)
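
In the code above, that would mean binding the attribute lookups to locals once, outside the hot loop (a minimal sketch of the idea; the gain is small next to vectorizing, but it is free):

# each ReadFile.fid / np.fromfile lookup inside the loop costs cycles,
# so resolve them once before the loop starts
fid = ReadFile.fid
fromfile = np.fromfile
int16 = np.int16
for rec in range(num_receivers):
    for chl in range(num_channels):
        for smpl in range(num_samples):
            r2_iq = fromfile(fid, int16, 2)
            ...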

If possible, use highly optimised libraries. I don't exactly know what you are trying to achieve, but I'd rather use an existing FFT lib than write it myself :>

One more thing: http://en.wikipedia.org/wiki/Big_O_notation (can be an eye-opener)

alex
+3  A: 
import numpy as np
def NB2(self, ID_LEN):
    r1=np.fromfile(ReadFile.fid,dTypes.NB_HDR,1)
    num_receivers=r1[0][0]
    num_channels=r1[0][1]
    num_samples=r1[0][5]

    # first, match your array bounds to the way you are walking the file
    blockReturn = np.zeros((num_receivers,num_channels,num_samples))

    for rec in range(0,num_receivers):
        for chl in range(0,num_channels):
            # second, read in all the samples at once if you have enough memory
            r2_iq=np.fromfile(ReadFile.fid,np.int16,2*num_samples)
            r2_iq.shape = (-1,2) # tell numpy that it is an array of two values

            # widen before squaring: int16*int16 overflows for values over 181
            r2_iq = r2_iq.astype(np.float64)

            # create dot product vector by squaring data elementwise, and then
            # adding those elements together.  Result is of length num_samples
            r2_iq = r2_iq * r2_iq
            r2_iq = r2_iq[:,0] + r2_iq[:,1]
            # get the distance by performing the square root "into" blockReturn
            np.sqrt(r2_iq, out=blockReturn[rec,chl,:])

    return blockReturn

This should help your performance. Two main numpy ideas are at work here. First, your result array's dimensions should match the order in which your loops walk the file, for memory locality.
Second, numpy is FAST. I've beaten hand-coded C with numpy, simply because it uses LAPACK and vector acceleration. However, to get that power you have to let it manipulate more data at a time. That is why your sample loop has been collapsed into one large read of all the samples for a given receiver and channel. Then use the supreme vector powers of numpy to calculate your magnitude by dot product.

There is a little more optimization to be had in the magnitude calculation, but numpy recycles buffers for you, making it less important than you might think. I hope this helps!
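
If you want to chase that last bit anyway, it would look something like this (a sketch using in-place ufunc calls to avoid the temporary arrays; r2_iq here is the freshly read int16 array, and blockReturn and the loop variables are as in the code above):

# square in place and accumulate into the first column, so no extra
# temporaries are allocated per receiver/channel pass
iq = r2_iq.astype(np.float64)          # widen once (int16 overflows when squared)
np.multiply(iq, iq, out=iq)            # elementwise square, in place
mag_sq = iq[:, 0]                      # a view, not a copy
np.add(mag_sq, iq[:, 1], out=mag_sq)   # i^2 + q^2
np.sqrt(mag_sq, out=blockReturn[rec, chl, :])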

Shane Holloway
Thanks a lot for this in-depth answer - I tried out both this and the one below, and that one was slightly faster.
Duncan Tait
+1  A: 

Most importantly, you shouldn't do file access at the lowest level of a triple nested loop, whether you do this in C or Python. You've got to read in large chunks of data at a time.

So to speed this up, read in large chunks of data at a time, and process that data using numpy indexing (that is, vectorize your code). This is particularly easy in your case since all your data is int16. Just read in big chunks of data, reshape the data into an array that reflects the (receiver, channel, sample) structure, and then use the appropriate indexing to multiply and add things for Pythagoras, and the 'sum' command to add up the terms in the resulting array.
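
A minimal sketch of that approach (assuming the whole block fits in memory, and widening to float before squaring so the int16 values don't overflow):

# one read for the whole block, then pure numpy from there on
count = num_receivers * num_channels * num_samples * 2
chunk = np.fromfile(ReadFile.fid, np.int16, count).astype(np.float64)
chunk = chunk.reshape(num_receivers, num_channels, num_samples, 2)
block = np.sqrt((chunk ** 2).sum(axis=-1))   # Pythagoras over each I/Q pair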

tom10
+2  A: 

Because you know the length of a block after you read the header, read the whole block at once. Then reshape the array (very fast, it only affects metadata) and use the np.hypot ufunc:

blockData = np.fromfile(ReadFile.fid, np.int16, num_receivers*num_channels*num_samples*2)
blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
# np.hypot computes sqrt(i**2 + q**2) in floating point, so no int16 overflow
return np.hypot(blockData[:,:,:,0], blockData[:,:,:,1])

On my machine it runs in 11ms per block.
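
Dropped into the original method, the whole thing looks something like this (a sketch under the question's assumptions - dTypes.NB_HDR and ReadFile.fid as defined there):

def NB2(self, ID_LEN):
    # the header read leaves the file positioned at the start of the data
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = int(r1[0][0])
    num_channels = int(r1[0][1])
    num_samples = int(r1[0][5])

    blockData = np.fromfile(ReadFile.fid, np.int16,
                            num_receivers*num_channels*num_samples*2)
    blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
    return np.hypot(blockData[:,:,:,0], blockData[:,:,:,1])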

Ants Aasma
This is an awesome solution, if you have enough memory to load all the receivers' channels into memory at the same time.
Shane Holloway
Awesome, literally, thanks! Doing it in under 10ms per block, that's a factor of 10 improvement!
Duncan Tait