views: 283
answers: 6

I have been tasked with setting up a flat-file SKU database for use on embedded devices with limited storage and processor speed.

Basically the data I need to store consists of the following:

SKU Description Location Price Qty

The file will consist of several million records.

The most important considerations are storage space and retrieval time. Records will only need to be retrieved by SKU and it will be read-only, so the file can be sorted by SKU.

I would like to access this data with Python. So my question comes down to this:

Are there existing Python libraries that can provide this functionality for me, or do I need to roll my own?

If the answer comes down to rolling my own, does anyone have any suggestions or good references for doing so?

+4  A: 

How about SQLite with Python bindings? It has a little more than you need, but it's standard software and well-tested.
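
For illustration, a minimal read-only lookup with Python's standard sqlite3 module might look something like this (the file name, table, and column names here are assumptions, not anything prescribed):

import sqlite3

# Open the database read-only; "sku.db", the "items" table and its columns
# are hypothetical names used purely for illustration.
conn = sqlite3.connect("file:sku.db?mode=ro", uri=True)
row = conn.execute(
    "SELECT description, location, price, qty FROM items WHERE sku = ?",
    ("123-56-89",),
).fetchone()
print(row)
conn.close()

With an index (or primary key) on the sku column, each lookup is a B-tree search rather than a scan.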

Mark Byers
SQLite has features far beyond what I need. Also, I'm not sure how compactly it can store the data... Perhaps someone can shed some light on that?
Steven Potter
@Steven Potter, SQLite is native, suitably compact, hard to corrupt, standard, scalable, and fast. It is likely to be a lot more robust than flat files, and more efficient and easier than anything you roll yourself. SQLite performs well space-wise.
Mike Graham
I find that I had a bit of a bias against SQLite; I'm not really sure why, but I am pleasantly surprised by the results of some testing I have been doing with it. I still need to do a little more real-world testing, but it looks like a viable option.
Steven Potter
+1  A: 

How about HDF? If you don't need SQL and require fast access to your data, there's nothing faster... in Python... for numerical or structured data.
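
If HDF is of interest, a read might look roughly like this with the h5py bindings (the file and dataset names are assumptions; PyTables is an alternative binding):

import h5py

# Hypothetical HDF5 file holding one structured dataset named "skus";
# rows are read from disk on demand rather than loaded all at once.
with h5py.File("skus.h5", "r") as f:
    dset = f["skus"]
    print(dset[0])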

Take a look at the DatabaseInterfaces section on the Python wiki. It's comprehensive. There are a couple of "pure" Python options listed (like SnakeSQL), which are a tad nicer to deploy. And, of course, there's always Berkeley DB and the like, which are super lean & raw.

Honestly, SQLite will probably work fine for you. If you really need to eke out more performance, then you'd be looking at a record-based format like BDB.

Pestilence
I will have to do some further reading on this. However, I was really thinking of something quite a bit simpler. Perhaps sorted fixed-length records with some sort of data compression for everything but the key, and a very simple index to make jumping to the appropriate place in the file faster.
Steven Potter
If BerkeleyDB isn't too raw for you, I'd suggest it. Take a look at the database interfaces registry on the Python Wiki. I've edited my answer with the URL.
Pestilence
Pestilence, I think you are correct: SQLite will probably work for me. It does seem to use a little more storage space than I would like, which may have to do with high structural overhead versus very small records. I still have to do a little more testing. If it doesn't work, I am looking at BDB as a second option.
Steven Potter
One oddity you might want to keep in mind with SQLite: dynamic column types. If you try to put a string into an integer column, it'll let you. This can sometimes become a source of bloat. If you specify all of your column types and pass "sane" values, then it shouldn't be a problem. It's just something to look out for.
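
A quick illustration of that behaviour with an in-memory database:

import sqlite3

# SQLite column types are affinities, not constraints: a value that can't be
# coerced to INTEGER is simply stored as TEXT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (qty INTEGER)")
conn.execute("INSERT INTO t VALUES (?)", ("not a number",))
print(conn.execute("SELECT qty, typeof(qty) FROM t").fetchone())
# -> ('not a number', 'text')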
Pestilence
A: 

A simple solution is cPickle. You can also find similar questions on SO.
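
For example, something along these lines (note that the whole dict has to fit in memory, which is the concern raised below):

import pickle  # cPickle in Python 2

# Hypothetical table: a dict keyed by SKU, pickled to disk in one piece.
table = {"123-56-89": ("Widget", "Head office", 1000, 1000000)}
with open("sku.pickle", "wb") as f:
    pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("sku.pickle", "rb") as f:
    table = pickle.load(f)
print(table["123-56-89"])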

Casey
The data would be much larger than what I could fit in RAM.
Steven Potter
How much RAM have you budgeted for this?
Casey
+4  A: 

The old way would be to use a simple key/value data table like the gdbm module. Python ships a module for it, but it's not built into the default Python installation on my machine.
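
Something like this, roughly (the file name and packed value format are just assumptions; the generic dbm module will use gdbm where it is available):

import dbm  # picks gdbm on systems where it is installed

# Build the key/value file once, keyed by SKU.
with dbm.open("sku", "c") as db:
    db[b"123-56-89"] = b"Widget|Head office|1000|1000000"

# Look a record up later, read-only.
with dbm.open("sku", "r") as db:
    print(db[b"123-56-89"])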

In general, use SQLite. As others wrote, it comes standard with Python, and it's used in a lot of embedded systems already.

If the records are fixed length, then you can use the bisect module. The file size divided by the record size gives the number of records in the file, and a bisect search does an O(log n) lookup in the file; you'll still need to write an adapter to test for equality. While I haven't tested it, here's a sketch:

import bisect

RECORD_SIZE = 50  # fixed number of bytes per record

class MatchFirst10Chars(object):
    """Adapter so bisect can compare a bare SKU against full records."""
    def __init__(self, word):
        self.word = word
    def __lt__(self, other):
        # Compare the SKU against the first 10 bytes of a record.
        return self.word < other[:10]

class FileLookup(object):
    """Present a file of fixed-length records as a random-access sequence."""
    def __init__(self, f):
        self.f = f
        f.seek(0, 2)                          # seek to end of file
        self.size = f.tell() // RECORD_SIZE   # number of records

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        self.f.seek(i * RECORD_SIZE)
        return self.f.read(RECORD_SIZE)


SKU = b"123-56-89 "              # 10-byte key, padded to fixed width
f = open("data_file", "rb")      # binary mode keeps the byte offsets exact
fl = FileLookup(f)
i = bisect.bisect(fl, MatchFirst10Chars(SKU))
# i is the insertion point; check fl[i - 1] for an exact key match.

You could additionally gzip the file and seek within the gzipped file, but that's a space vs. time tradeoff you'll have to test.

Andrew Dalke
+1 This is a pretty good way to do it if there is enough space to make all the records fixed length.
gnibbler
A: 

A variation of Andrew Dalke's answer (so you can still use binary search to locate the SKU quickly) that may reduce the space requirements would be to have fixed-size records at the start of the file (one per SKU), followed by all the Descriptions and Locations (as null-terminated strings, say).

You save space by not having to pad the locations and descriptions to a fixed length. You can also save space if there are lots of duplicate locations.

Here is an example: say you have

SKU         16 bytes
Description Variable length
Location    Variable length
Price       4 bytes (up to $42949672.95)
Quantity    4 bytes (up to 4294967295)



 offset          SKU        desc_off   loc_off      Price      Quantity
0x00000000 SKU0000000000001 0x01f78a40 0x01f78a47  0x000003e8  0x000f4240
0x00000020 SKU0000000000002 0x01f78a53 0x01f78a59    ...
...
... # 999998 more records
...
0x01f78a40 Widget\x00
0x01f78a47 Head office\x00
0x01f78a53 Table\x00
0x01f78a59 Warehouse\x00
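
For concreteness, here is one way such a record might be unpacked with the struct module (the format string and helper below are assumptions based on the layout above, with prices stored in cents):

import struct

# 16-byte SKU, two 4-byte offsets, 4-byte price (cents), 4-byte quantity = 32 bytes
RECORD = struct.Struct(">16sIIII")

def read_cstring(f, offset):
    # Read a null-terminated string starting at the given file offset.
    f.seek(offset)
    chunk = b""
    while True:
        c = f.read(1)
        if not c or c == b"\x00":
            return chunk
        chunk += c

def read_record(f, index):
    f.seek(index * RECORD.size)
    sku, desc_off, loc_off, price, qty = RECORD.unpack(f.read(RECORD.size))
    return {
        "sku": sku.rstrip(b" "),
        "description": read_cstring(f, desc_off),
        "location": read_cstring(f, loc_off),
        "price": price / 100.0,
        "qty": qty,
    }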
gnibbler
+1  A: 

May I suggest cdb? (Python bindings: python-cdb.)

It's a format used for read-only data, like yours; it's basically 256 giant hash tables, each able to have a different number of buckets. The cool thing about cdb is that the file doesn't need to be loaded into memory; it's structured so that you can do lookups by mmapping in just the bits you need.

The cdb spec is a good read, not least because the lines are formatted to create a uniform right margin. :-D

Chris Jester-Young