views: 499

answers: 6

Hi,

I need to handle tens of gigabytes of data in one binary file. Each record in the data file is of variable length.

So the file is like:

<len1><data1><len2><data2>..........<lenN><dataN>

The data contains integers, pointers, double values, and so on.

I found that Python can hardly handle this situation. There is no problem reading the whole file into memory, and that part is fast, but the struct package seems to have poor performance: it practically gets stuck unpacking the bytes.

Any help is appreciated.

Thanks.

+2  A: 

Have a look at the array module, specifically at the array.fromfile method. This bit:

Each record in the data file is variable length.

is rather unfortunate, but you could handle it with a try-except clause.
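For instance, a rough sketch along those lines, assuming each record is a 4-byte native-byte-order length prefix followed by that many raw bytes (the header format here is only a guess):

import array

def read_records(path):
    # Yields each record's payload as bytes.
    with open(path, 'rb') as f:
        while True:
            header = array.array('I')        # one unsigned 32-bit length (4 bytes on most platforms)
            try:
                header.fromfile(f, 1)
            except EOFError:
                break                        # clean end of file
            payload = array.array('B')       # raw bytes of the record body
            try:
                payload.fromfile(f, header[0])
            except EOFError:
                break                        # truncated final record
            yield payload.tobytes()

array.fromfile raises EOFError when fewer items than requested are available, which is what the try-except catches here.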

SilentGhost
+1  A: 

For a similar task, I defined a class like this:

from ctypes import Structure, c_uint32, sizeof, memmove, addressof

class foo(Structure):
    _fields_ = [("myint", c_uint32)]

created an instance

bar = foo()

and did,

block = file.read(sizeof(bar))                 # 'file' is the open binary file object; read exactly sizeof(bar) bytes
memmove(addressof(bar), block, sizeof(bar))    # copy them straight into the ctypes instance

In the event of variable-size records, you can use a similar method to retrieve lenN and then read the corresponding data entries. It seems trivial to implement; however, I have no idea how fast this method is compared to using pack() and unpack(). Perhaps someone else has profiled both methods.
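A rough sketch of the variable-length case, assuming a 4-byte length prefix per record (the layout is illustrative, not the actual file format):

from ctypes import Structure, c_uint32, sizeof, memmove, addressof

class Header(Structure):
    _fields_ = [("length", c_uint32)]

def read_into(f, obj):
    # Read sizeof(obj) bytes from f straight into the ctypes object.
    block = f.read(sizeof(obj))
    if len(block) < sizeof(obj):
        return False                         # end of file (or truncated record)
    memmove(addressof(obj), block, sizeof(obj))
    return True

with open('your_file', 'rb') as f:
    header = Header()
    while read_into(f, header):
        data = f.read(header.length)         # the record body, header.length bytes
        # ...then memmove or cast 'data' into whatever Structure matches this record type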

Michael Foukarakis
+1  A: 

What if you dump the data file into an in-memory sqlite3 database?

import sqlite3
conn = sqlite3.connect(":memory:")

You can then use SQL to process the data.
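For example, a minimal sketch, assuming the records have already been parsed into (length, bytes) pairs by some reader; parse_records below is a hypothetical generator standing in for that step:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, length INTEGER, data BLOB)")
for length, data in parse_records('your_file'):    # hypothetical generator of (length, bytes) pairs
    conn.execute("INSERT INTO records (length, data) VALUES (?, ?)",
                 (length, sqlite3.Binary(data)))
conn.commit()

# plain SQL now works on the data
for row in conn.execute("SELECT COUNT(*), AVG(length) FROM records"):
    print(row)

Keep in mind that an in-memory database needs roughly as much RAM as the data itself; pointing connect() at a file instead works the same way.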

Besides, you might want to look at generators and iterators.

Selinap
Presumably, if the data is variable length, then the rows would not be normalized for use in a database.
hughdbrown
+3  A: 

struct and array, which other answers recommend, are fine for the details of the implementation, and might be all you need if your needs are always to sequentially read all of the file or a prefix of it. Other options include buffer, mmap, even ctypes, depending on many details you don't mention regarding your exact needs. Maybe a little specialized Cython-coded helper can offer all the extra performance you need, if no suitable and accessible library (in C, C++, Fortran, ...) already exists that can be interfaced for the purpose of handling this humongous file as you need to.

But clearly there are peculiar issues here -- how can a data file contain pointers, for example, which are intrinsically a concept related to addressing memory? Are they maybe "offsets" instead, and, if so, how exactly are they based and coded? Are your needs at all more advanced than simply sequential reading (e.g., random access), and if so, can you do a first "indexing" pass to get all the offsets from start of file to start of record into a more usable, compact, handily-formatted auxiliary file? (That binary file of offsets would be a natural for array -- unless the offsets need to be longer than array supports on your machine!). What is the distribution of record lengths and compositions and number of records to make up the "tens of gigabytes"? Etc, etc.
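For concreteness, a minimal sketch of such an indexing pass, assuming 4-byte little-endian length prefixes and using array for the offsets (with exactly the caveat above: the 'L' typecode may be too narrow for offsets into a tens-of-gigabytes file on some platforms):

import array
import struct

def build_index(data_path, index_path):
    offsets = array.array('L')              # start-of-record offsets; 'L' may be only 32 bits on some platforms
    with open(data_path, 'rb') as f:
        while True:
            pos = f.tell()
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack('<I', header)   # assumed: 4-byte little-endian length prefix
            offsets.append(pos)
            f.seek(length, 1)               # skip over the record body
    with open(index_path, 'wb') as out:
        offsets.tofile(out)                 # compact binary auxiliary file of offsets
    return offsets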

You have a very large scale problem (and no doubt very large scale hardware to support it; since you mention that you can easily read all of the file into memory, that implies a 64-bit box with many tens of GB of RAM, wow!), so it's well worth the detailed care needed to optimize its handling. But we can't help much with such detailed care unless we know enough details to do so!-)

Alex Martelli
+2  A: 

For help with parsing the file without reading it into memory, you can use the bitstring module.

Internally this uses the struct module and a bytearray, but an immutable Bits object can be initialised with a filename, so it won't read the whole file into memory.

For example:

from bitstring import Bits

s = Bits(filename='your_file')
while s.bytepos != s.length // 8:
    # Read a byte and interpret as an unsigned integer
    length = s.read('uint:8')
    # Read 'length' bytes and convert to a Python string
    data = s.read(length*8).bytes
    # Now do whatever you want with the data

Of course you can parse the data however you want.

You can also use slice notation to read the file contents, although note that the indices will be in bits rather than bytes, so for example s[-800:] would be the final 100 bytes.

Scott Griffiths
+1  A: 

PyTables is a very good library for handling HDF5, a binary format used in astronomy and meteorology to handle very big datasets.

It works more or less like a hierarchical database, in which you can store multiple tables, each with typed columns. Have a look at it.
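As a rough sketch of what that could look like, assuming each record boils down to an int and a double per row (the table layout, file name, and parse_records generator are illustrative only):

import tables   # PyTables 3.x naming; older versions used openFile/createTable

class Record(tables.IsDescription):
    myint = tables.Int32Col()
    mydouble = tables.Float64Col()

h5file = tables.open_file("records.h5", mode="w")
table = h5file.create_table("/", "records", Record, "parsed records")

row = table.row
for myint, mydouble in parse_records('your_file'):   # hypothetical parser generator
    row['myint'] = myint
    row['mydouble'] = mydouble
    row.append()
table.flush()

# queries run out-of-core, so the table can be far larger than RAM
selected = [r['mydouble'] for r in table.where('myint > 1000')]
h5file.close()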

dalloliogm