Hey. I have a function I want to memoize; however, it has too many possible values. Is there any convenient way to store the values in a text file and have it read from them? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know it's slow to read from a text file, but there's no other option if the amount of data is really huge. Thanks!

+3  A: 

You can use the shelve module to store a dictionary-like structure in a file. From the Python documentation:

import shelve

d = shelve.open(filename) # open -- file may get suffix added by low-level
                          # library

d[key] = data   # store data at key (overwrites old data if
                # using an existing key)
data = d[key]   # retrieve a COPY of data at key (raise KeyError if no
                # such key)
del d[key]      # delete data stored at key (raises KeyError
                # if no such key)
flag = d.has_key(key)   # true if the key exists
klist = d.keys() # a list of all existing keys (slow!)

# as d was opened WITHOUT writeback=True, beware:
d['xx'] = range(4)  # this works as expected, but...
d['xx'].append(5)   # *this doesn't!* -- d['xx'] is STILL range(4)!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']      # extracts the copy
temp.append(5)      # mutates the copy
d['xx'] = temp      # stores the copy right back, to persist it

# or, d=shelve.open(filename,writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.

d.close()       # close it
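
Applied to the original memoization question, a decorator along these lines could back the cache with a shelf. This is just a sketch: the str()-based key and the cache filename are illustrative choices (shelve requires string keys).

import shelve
import functools

def disk_memoize(filename):
    def decorator(func):
        cache = shelve.open(filename)      # persistent dict-like cache
        @functools.wraps(func)
        def wrapper(*args):
            key = str(args)                # shelve keys must be strings
            if key not in cache:
                cache[key] = func(*args)   # compute once, persist to disk
            return cache[key]
        return wrapper
    return decorator

@disk_memoize('expensive_cache')           # illustrative cache filename
def expensive(n):
    return sum(i * i for i in range(n))    # stand-in for the real computation
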
It's not very informative to down-vote without an explanation... Please provide one, given that lutz spent some time trying to come up with a suitable answer.
Steen
I don't think this will work for large datasets.
ilya n.
Though I'm not the one who voted you down.
ilya n.
+1, I don't see what's wrong with this answer. Does anyone actually know how fast this is?
Kiv
A: 

You can also go one step down the ladder and use pickle. Shelve imports from pickle (link), so if you don't need the added functionality of shelve, this may spare you some clock cycles (although they really don't matter to you, as you have chosen Python to do large-number storage).
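
A minimal sketch of that approach (the filename is illustrative; note that, unlike shelve, pickle loads the whole object back into memory in one go):

import pickle

cache = {2: True, 3: True, 4: False}   # stand-in for the precomputed results

with open('cache.pkl', 'wb') as f:     # write the whole dict out once
    pickle.dump(cache, f)

with open('cache.pkl', 'rb') as f:     # later, load it all back at once
    cache = pickle.load(f)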

Steen
+1  A: 

For Project Euler, I stored a precomputed list of primes up to 10**8 in a text file just by writing them in comma-separated format. It worked well for that size, but it doesn't scale well much beyond that.

If your "huge" is not really that huge, I would use something simple like I did; otherwise I would go with shelve, as the others have said.
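
For reference, that scheme is only a couple of lines in each direction (a sketch; primes.txt is an assumed filename and the short list is a stand-in):

primes = [2, 3, 5, 7, 11, 13]          # stand-in for the real precomputed list

with open('primes.txt', 'w') as f:     # write once, comma-separated
    f.write(','.join(map(str, primes)))

with open('primes.txt') as f:          # read it all back
    primes = [int(s) for s in f.read().split(',')]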

Kiv
+9  A: 

For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! Sounds like a perfect opportunity for a simple, straightforward binary file! By the Prime Number Theorem, there are about 10**9/ln(10**9) such primes -- i.e., 50 million or a bit less. At 4 bytes per prime, that's only 200 MB or less -- perfect for an array.array("L") and its methods such as fromfile, etc. (see the docs). In many cases you could actually suck all of the 200 MB into memory, but, worst case, you can get a slice of those (e.g. via mmap and the fromstring method of array.array), do binary searches there (e.g. via bisect), and so on.
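
A sketch of that binary-file scheme (primes.bin is an assumed filename; the short list stands in for the real ~50 million primes):

import array
import bisect

primes = array.array('L', [2, 3, 5, 7, 11, 13])  # stand-in data
with open('primes.bin', 'wb') as f:
    primes.tofile(f)                             # raw fixed-width dump

stored = array.array('L')
with open('primes.bin', 'rb') as f:
    stored.fromfile(f, len(primes))              # bulk-read the values back

# membership test by binary search on the sorted array
i = bisect.bisect_left(stored, 13)
is_prime = i < len(stored) and stored[i] == 13   # True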

When you DO need a huge key-value store -- gigabytes, not a paltry 200 MB!-) -- I used to recommend shelve, but after unpleasant real-life experience with huge shelves (performance, reliability, etc.), I currently recommend a database engine instead -- sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.
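
For the genuinely huge case, here is a minimal key-value sketch on top of the sqlite3 module that ships with Python (the table and file names are illustrative):

import sqlite3

conn = sqlite3.connect('cache.db')   # illustrative filename
conn.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)')

def put(key, value):
    # INSERT OR REPLACE makes repeated stores for the same key idempotent
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', (key, value))
    conn.commit()

def get(key):
    row = conn.execute('SELECT v FROM kv WHERE k = ?', (key,)).fetchone()
    return row[0] if row else None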

Alex Martelli
From what I understand, `array.fromfile` reads the contents into memory, and I assume the OP defines *huge* as "something that doesn't fit into memory".
ilya n.
"Doesn't fit into memory" is a useful definition of "huge", but doesn't match the `primes < 10**9` given use case: that's what I explain in the first paragraph, and what to do for other, completely different cases that ARE indeec huge, in the second paragraph;-).
Alex Martelli
Mine was a good example of replying before reading the whole point. Apologies.
ilya n.
Is there no way to combine `buffer` and an array from a file into an mmap view of the file? *somehow magically*
kaizer.se
@kaizer.se, I don't know of any existing module that offers such an "array view". Maybe one could be made with a little editing to arraymodule.c, depending on the application's exact needs.
Alex Martelli
A: 

Let's see where the bottleneck is. When you read from a file, the hard drive has to seek and spin to the right position before it can read anything; then it reads a big block and caches the results.

So you want a method that determines exactly what position in the file you're going to read from, and then does the read exactly once. I'm fairly sure the standard DB modules will work for you, but you can also do it yourself -- just open the file in binary mode for reading/writing and store your values as, say, 30-digit (= ~100-bit = 13-byte) numbers.

Then use the standard file methods (seek, read).
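
A sketch of that fixed-width idea, here using 30-digit decimal text records so the n-th value sits at a known byte offset (the record layout and helper names are illustrative):

RECORD_SIZE = 31                      # 30 digits plus a newline

def write_values(f, values):
    for v in values:
        f.write('%030d\n' % v)        # zero-padded to a fixed width

def read_value(f, index):
    f.seek(index * RECORD_SIZE)       # jump straight to the record...
    return int(f.read(RECORD_SIZE))   # ...and read it exactly once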

ilya n.
+1  A: 

Just naively sticking a hash table onto disk will result in about 5 orders of magnitude of performance loss compared to an in-memory implementation (or at least 3 if you have an SSD). When dealing with hard disks, you'll want to extract every bit of data locality and caching you can get.

The correct choice will depend on the details of your use case. How much performance do you need? What kinds of operations do you need on the data structure? Do you only need to check whether the table contains a key, or do you need to fetch a value based on the key? Can you precompute the table, or do you need to be able to modify it on the fly? What kind of hit rate are you expecting? Can you filter out a significant fraction of the operations using a Bloom filter (see the sketch just below)? Are the requests uniformly distributed, or do you expect some kind of temporal locality? Can you predict the locality clusters ahead of time?
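
On the Bloom filter point, here is a tiny illustrative version (the bit-array size and the md5-based hashing are arbitrary, untuned choices). A lookup that returns False means the key is definitely absent, so the disk only has to be touched on "maybe":

import hashlib

class BloomFilter(object):
    def __init__(self, num_bits=8000000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (i, key)).encode('ascii')).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        # no false negatives: False really means "not present"
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))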

If you don't need ultimate performance, or you can parallelize and throw hardware at the problem, check out some distributed key-value stores.

Ants Aasma
+3  A: 

You could also just go with the ultimate brute force, and create a Python file with just a single statement in it:

seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ...

and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30.)

from primes_up_to_1e9 import seedprimes
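
Generating that module is itself a tiny script (a sketch; the module name matches the import above, and the short list is a stand-in for the real one):

primes = [2, 3, 5, 7, 11, 13]              # stand-in for the real list
with open('primes_up_to_1e9.py', 'w') as f:
    f.write('seedprimes = %r\n' % primes)  # %r emits a valid Python literal
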
Paul McGuire