views:

194

answers:

8

I need a memory-efficient data structure for storing about a million key--value pairs, where keys are strings of about 80 bytes and values are strings of about 200 bytes, the total key and value size being about 280MB. I also need efficient lookup of a value by key, preferably a hash map. The memory overhead should be as little as possible, e.g. for 280MB of useful data, the data structure shouldn't use more than 300MB of virtual memory (including malloc() overhead and everything else). The usage pattern is the following: we start with an empty data structure, and we populate it gradually, never changing keys, and never changing the length of values. As a plus, the data structure may support changing the length of values, at the expense of a 100% value overhead (meaning that for x value bytes, x bytes might be temporarily wasted in unused buffer space).

I need a pure Python module, or a built-in Python module, or a C implementation preferably with (C)Python bindings. I'd prefer if it was possible to serialize the whole data structure to disk, and to read it back very quickly.

Just to prove that such a small overhead is possible, I've created a simple design with open addressing: a hash table of 1.25 million elements containing 4-byte pointers into 1MB data blocks, with the data blocks storing the key and value lengths as base-128 varints. This design has an important limitation: it doesn't allow removing or changing pairs without wasting their memory area. According to my calculations, with 1 million key--value pairs of 280 bytes each, the overhead is less than 3.6% (10 080 000 bytes). The limits above are more generous: they allow 20 000 000 bytes of overhead.
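
To make the design concrete, here is a rough, untested Python sketch of it; PackedDict, encode_varint and the single bytearray standing in for the chain of 1MB data blocks are illustrative only, and the real thing would be written in C:

from array import array

TABLE_SIZE = 1250000   # ~1.25 million slots for ~1 million pairs
EMPTY = 0xFFFFFFFF     # sentinel offset meaning "free slot"

def encode_varint(n):
    # Base-128 varint: 7 data bits per byte, high bit set on all but the last.
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return out

class PackedDict(object):
    # Append-only: no deletion, no value resizing, no table growth.
    def __init__(self):
        self.table = array('I', [EMPTY]) * TABLE_SIZE  # 4 bytes per slot
        self.data = bytearray()  # stands in for the 1MB data blocks

    def __setitem__(self, key, value):
        # Assumes the key is not already present (keys are never changed).
        i = hash(key) % TABLE_SIZE
        while self.table[i] != EMPTY:  # open addressing, linear probing
            i = (i + 1) % TABLE_SIZE
        self.table[i] = len(self.data)
        self.data += encode_varint(len(key)) + encode_varint(len(value))
        self.data += key
        self.data += value

    def _read_varint(self, off):
        n = shift = 0
        while True:
            b = self.data[off]
            off += 1
            n |= (b & 0x7F) << shift
            if not b & 0x80:
                return n, off
            shift += 7

    def __getitem__(self, key):
        i = hash(key) % TABLE_SIZE
        while self.table[i] != EMPTY:
            off = self.table[i]
            klen, off = self._read_varint(off)
            vlen, off = self._read_varint(off)
            if str(self.data[off:off + klen]) == key:
                return str(self.data[off + klen:off + klen + vlen])
            i = (i + 1) % TABLE_SIZE
        raise KeyError(key)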

I've just found http://www.pytables.org/ , which provides fast access and memory-efficient packing of data. I have to examine it more closely to check if it suits my needs.

+1  A: 

You could use the struct module to pack binary data and unpack it when needed. You can implement memory-efficient storage using this approach. I guess access would be a pain.
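
For example, a minimal sketch of such packing (the two-length header layout here is just one possible choice):

import struct

# Layout per pair: 2-byte key length, 2-byte value length, then the raw bytes.
def pack_pair(key, value):
    return struct.pack('<HH', len(key), len(value)) + key + value

def unpack_pair(buf, offset=0):
    klen, vlen = struct.unpack_from('<HH', buf, offset)
    start = offset + struct.calcsize('<HH')
    return buf[start:start + klen], buf[start + klen:start + klen + vlen]

packed = pack_pair('an-80-byte-key', 'a-200-byte-value')
print unpack_pair(packed)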

pyfunc
Not only would it be a pain, but all those byte manipulations would be terribly slow. I would also have to use the `bytearray` type, which is mutable -- but still slow if implemented in pure Python.
pts
+6  A: 

Martijn mentioned this in a comment (I'm not sure why people post answers as comments), but I agree: use SQLite. You should give it a try and see if it will meet your needs.
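
For example, a minimal key--value table using the built-in sqlite3 module might look like this (schema and file name are just a sketch):

import sqlite3

conn = sqlite3.connect('pairs.db')   # or ':memory:' to keep it in RAM
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

def put(key, value):
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', (key, value))

def get(key):
    row = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    return row[0] if row else None

put('some-key', 'some-value')
print get('some-key')
conn.commit()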

Ned Batchelder
Because I'm not really sure whether it would meet the demands, and it's more of a loose comment than an actual well-thought-through answer, I used a comment.
Martijn
+5  A: 

Have you tried using a straightforward dict? Most of your data is in strings, so the overhead might fit within your requirements.

Ned Batchelder
For an expansion on this, see my answer or hughdbrown's answer below.
Steven Rumbalski
+2 for stating the obvious!
bstpierre
+1  A: 

Apache Portable Runtime (aka APR) has a C-based hash table. You can see the documentation at http://apr.apache.org/docs/apr/0.9/group_apr_hash.html

With apr_hash_t, all you store is a void*, so it gives you full control over the values. So, if you want, you can store a pointer to a 100-byte block instead of the actual length of the string.

F Yaqoob
Thanks for the idea. This answer needs verification to check whether apr_hash stays within the memory limits allowed in the question. Also, it's not a complete implementation, because I would have to write a considerable amount of C code.
pts
+6  A: 

If you don't plan to have a large amount of deletes, then this isn't that hard. Deletes lead to fragmentation.

You also need to commit to a fixed-length key. You mentioned 80 bytes. Are your keys allowed to duplicate? If not, it's even easier.

So, here is what you do.

You create an array of:

struct {
    char value[80];
    char *data;
} key;

And you keep this array sorted.

If your keys can duplicate, then you need:

struct link {
    char *data;
    struct link *next;
};

struct {
    char value[80];
    link *data;
} key;

(My C is rusty, but this is the gist of it.) In the latter, each key points to a linked list of values.

Then a lookup is a simple binary search. The "pain" is in maintaining this array and inserting/deleting keys. It's not as painful as it sounds, but it saves a LOT of memory, especially on 64bit systems.
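
A rough Python analogue of the lookup side (just to show the binary search, not the C memory layout) could use the bisect module on parallel sorted lists:

import bisect

keys, values = [], []   # keys kept sorted; values[i] belongs to keys[i]

def insert(key, value):
    i = bisect.bisect_left(keys, key)
    keys.insert(i, key)       # O(n) shift, like compacting the C array
    values.insert(i, value)

def lookup(key):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    raise KeyError(key)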

What you want to reduce is the number of pointers. Pointers are expensive when you have lots of structures filled with pointers. On a 64bit system, a pointer is 8 bytes, so with a million entries a single pointer per entry already costs 8MB of your memory budget.

So, the expense is in building the array, and in copying and compacting memory (if you "know" you will have a million rows and can commit to that, then malloc(1000000 * sizeof(key)) right away; it'll save you some copying during expansion).

But don't be afraid of it: once it's up and running, performance is quite good. Modern CPUs are actually pretty good at copying 100MB blocks of memory around.

Just as an aside, I just did something much like this in Java. On a 64bit JVM, a Map with 25M entries takes 2G of RAM. My solution (using similar techniques to this) comes in at around 600M. Java uses more pointers than C, but the premise is the same.

Will Hartung
I gave a +1 for this answer for the many insights it provides for implementing the dictionary I need. However, I'd prefer an existing, tested and optimized implementation, if available.
pts
+2  A: 

You can use the sha1 of the key instead of the key itself. If the keys are unique, then the sha1 hashes of the keys are likely unique, too. It provides a memory saving to try to squeak in under your limit.

from random import choice
from string import letters
from hashlib import sha1

def keygen(length):
    return "".join(choice(letters) for _ in xrange(length))

def gentestdata(n=1000*1000):
    # return dict((sha1(keygen(80)).digest(), keygen(200)) for _ in xrange(n))
    d = {}
    for _ in xrange(n):
        key = sha1(keygen(80)).digest()
        assert key not in d
        value = keygen(200)
        d[key] = value
    return d

if __name__ == '__main__':
    d = gentestdata()

On my ubuntu box, this tops out at 304 MB of memory:

2010-10-26 14:26:02 hbrown@hbrown-ubuntu-wks:~$ ps aux | grep python
[...]
hbrown   12082 78.2  7.5 307420 303128 pts/1   S+   14:20   4:47 python

Close enough? It's python, not C.


Later: also, if your data is somewhat redundant, you can gzip the values. It's a time versus space trade-off.
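
For example (a sketch with the zlib module; this only pays off if the values really are redundant):

import zlib

value = "a fairly redundant 200-byte value " * 6
stored = zlib.compress(value)
assert zlib.decompress(stored) == value
print len(value), '->', len(stored)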

hughdbrown
Appears close enough, but it's not storing the keys: it stores a 20 byte digest instead of the 80 byte key -- so the real memory use would be 307420 * 1024 + (80 - 20) * 1000000 == 374798080 bytes, which is way too much when compared to the 280000000 bytes in the question.
pts
If you put a (key, value) pair into the dictionary, you can retrieve the value later by taking the key and sha1-hashing it. If there is data for that key, there will be data for the sha1-hashed key, and you will retrieve the value. But if you need the unhashed key (like, "Tell me all the keys I've used"), then no, it doesn't do what you want. Perhaps you could update your question to cover this point. On a related point, can the keys be gzipped -- is that against your question's rules? Could the keys *profitably* be gzipped? And why 300MB?
hughdbrown
SHA1 is a hash, and hashes are lousy keys, as they are lossy. Despite what "the odds are" of an overlap, the fact is that there are odds at all. If your data is < 160 bits, then there's no reason to use SHA1. If it's more than 160 bits, then you are losing data and risk overlap. If your system can handle overlap, then fine. Most systems do not handle that problem elegantly, as they cannot detect when it happens.
Will Hartung
@Will Hartung: I'll defer to the experts on this, but my understanding was that "a database with 10^18 (a million million million) entries has a chance of about 1 in 0.0000000000003 of a collision." See: http://stackoverflow.com/questions/297960/hash-collision-what-are-the-chances
hughdbrown
@hughdbrown: as I said, if the system can handle overlap, then fine. If not, then they need to worry about that. There are some lotteries I choose not to want to win. As they say "You can't win if you don't play", so I don't play. The ODDS may be low, but Murphy uses loaded dice. That's just aged experience butting in. So, they just need to decide what the ramifications of failure are and go from there.
Will Hartung
@Will Hartung: I play Powerball when the stakes are over $100 million USD, but I would never play the "Draw my SHA1 hash" lottery. I don't really understand your concern about collisions. Your data center is more likely to be hit by a meteorite over the life of the project than you are to get a collision. The odds-savvy development manager uses SHA1 hashes but budgets for a meteorite-proof roof to sleep at night.
hughdbrown
@hughdbrown: I don't know what kind of entropy to expect from the plain text data to be hashed. It is likely NOT random, evenly distributed 0-255. More likely it's character data, perhaps even English text, or limited to some other small subset (family names in Cincinnati, for example). I don't know. But all of those reductions in the key space can potentially either increase the chance of collision, or even eliminate it. Who knows. Do you know? Have you done the math on their data? He could be playing Powerball or Roulette. So, it comes down to handling the failure condition.
Will Hartung
@Will Hartung: Okay, a hypothetical for you then. Distributed version control systems like git and mercurial use SHA1 hashes as if they were identity. If you have a file that has a certain hash and another file is found that hashes to that number, the files are treated as if they are identical. Do you consider git and mercurial bad software engineering because they do not take account of collisions? Is it irresponsible not to examine all the data that will be source controlled or is it sufficient to know it is a 160-bit SHA1 hash?
hughdbrown
+6  A: 

Ok, the dirt-simple approach.

Use a python dictionary for the data structure. I filled a python dictionary with 1 million random key-value pairs where the key was 80 characters and the value 200 characters. It took 360,844 Kb on my computer, which is outside of your specification of no more than 300 MB, but I offer it up as a solution anyway because it's still pretty memory efficient.

This also fails your requirement of having a C API. I'm not sure why you need C, but as the question is tagged Python and lacks a C tag, I'll offer the pure Python to see if it just might fit the bill.

Regarding persistence. Use the cPickle module. It's very fast and, again, dirt-simple. To save your dictionary:

with open("myfile.pkl", "wb") as f:
    cPickle.dump(mydict, f)

To reload your dictionary:

with open("myfile.pkl", "rb") as f:
    mydict = cPickle.load(f)

A second dirt-simple idea is to use the shelve module, which is basically a disk-based python dictionary. Memory overhead is very low (it's all on disk), but it's also much slower.
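
A minimal shelve sketch (the file name is arbitrary):

import shelve

db = shelve.open('pairs.shelf')   # backed on disk by dbm/gdbm/bsddb
db['some-key'] = 'some-value'
print db['some-key']
db.close()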

Steven Rumbalski
360844 Kb has 360844*1024-280000000 == 89504256 bytes of overhead, which is way more than the 20000000 bytes allowed in the question. But its simplicity is compelling; it might be worth relaxing the requirements. I gave a +1 for this answer.
pts
"Allowed in the question"? Like this is homework?
hughdbrown
The `shelve` built-in module uses dbm, gdbm or bsddb for key--value storage, none of which meets the speed and tight memory-packing requirements posed in the question.
pts
@pts my memory measurement was very basic (half-assed): 360844 Kb was the amount of memory for the entire python process. Most of that was the dictionary. But still, of course, outside of your memory requirement. I'm curious, where are the memory requirements coming from?
Steven Rumbalski
@hughdbrown: No, it's not homework. @Steven Rumbalski: The memory requirements are quite ad hoc and not strict; there is no scientific reasoning behind them. I just want to make my software use less memory if it's easy to implement. Some systems (with little free memory) would benefit, but I can't assess exactly how many of them.
pts
FYI On my 280000000 byte (280MB) test input, marshal is about 3.83 times faster to write and 6.90 times faster to read than cPickle -- about 10 seconds to read and 20 seconds to write for cPickle. So if I need very fast (de)serialization, I can find or write a C extension which would do it much faster than cPickle.
pts
@pts It makes sense that marshal would be faster, but I do want to make sure you tested using cPickle.HIGHEST_PROTOCOL. Also beware that marshal can break across Python versions.
Steven Rumbalski
@Steven Rumbalski: cPickle.HIGHEST_PROTOCOL makes writing 10% faster than the default cPickle, and it makes reading 4.173 times faster. Thanks for the suggestion.
pts
+2  A: 

Using SQLite is a good idea. A quick implementation will tell you, with little effort, whether it is fast enough.


If you determine you have to roll your own, I'd recommend the following:

How well can you predict the number of pairs, or an upper limit for that?
How well can you predict the total data size, or an upper limit for that?

Arena allocator for strings and nodes. (Usually, you'd work on a list of arenas, so you don't have to predict the total size).

Alignment depends on your algorithms, in principle you could pack it byte-tight, and the only overhead is your overallocation, which only minimally affects your working set.

However, if you have to run any cmp/copy etc. operations on these strings, remember that with the following guarantees, you can squeeze a little or a lot from these string operations:

  • all elements are CPU word aligned
  • all pad bytes are (e.g.) 0
  • you can safely read "beyond" a string end as long as you don't cross a CPU border

Hash table for the index. A dictionary would work, too, but that makes sense only if potential degradation / rehashing would be a serious issue. I don't know of any "stock" hash table implementation for C, but there should be one, right? right? Just replace the allocations with calls to the arena allocator.
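
A very rough Python rendering of the arena idea (a plain dict as the index and one bytearray as the arena; it shows the shape, not the C-level memory savings):

class ArenaMap(object):
    def __init__(self):
        self.arena = bytearray()   # all value bytes, packed back to back
        self.index = {}            # key -> (offset, length) into the arena

    def add(self, key, value):
        self.index[key] = (len(self.arena), len(value))
        self.arena += value

    def get(self, key):
        offset, length = self.index[key]
        return str(self.arena[offset:offset + length])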


Memory Locality

If you can guarantee that lookup will never request a string that is not in the map, you should store the keys in a separate arena, as they are needed only on hash collisions. That can improve memory locality significantly. (In that case, if you ever have a "final" table, you could even copy the colliding keys to a new arena, and throw away all the others. The benefits of that are probably marginal, though.)

Separation can help or hurt, depending on your access patterns. If you typically use the value once after each lookup, having them pair-wise in the same arena is great. If you e.g. look up a few keys, then use their values repeatedly, separate arenas make sense.


If you have to support "funny characters" / Unicode, normalize your strings before storing them.

peterchen
Thanks for the insights!
pts