I am working with a large set (5-20 million) of String keys (average length 10 chars) which I need to store in an in-memory data structure that supports the following operation in constant or near-constant time:

// Returns true if the input is present in the container, false otherwise
public boolean contains(String input)

Java's HashMap is proving to be more than satisfactory as far as throughput is concerned, but is taking up a lot of memory. I am looking for a solution that is memory-efficient and still supports decent throughput (comparable with, or nearly as good as, hashing).

I don't care about the insertion/deletion times. In my application, I will be performing only insertions (only at startup time) and will subsequently only be querying the data structure using the contains method for the life of the application.

I read that the HAT-Trie data structure comes closest to my needs. I am wondering if there is a library that has an implementation.

Other suggestions with pointers to implementations welcome.

Thank you.

+5  A: 

The trie seems like a very good idea for your constraints.

A "thinking outside the box" alternative:

If you can afford some probability of answering "present" for a string that is absent:

EDIT: if you can afford such false positives, use a Bloom filter, as suggested by WizardOfOdds in the comments.

For k=1, a Bloom filter is like a hash table without the keys: each "bucket" is simply a boolean that tells whether at least one input with the same hash was present. If 1% false positives is acceptable, your hash table can be as small as about 100 * 20 million bits, or roughly 240 MiB. For 1 in 1000 false positives, about 2.3 GiB.

Using several hash functions instead of one can improve the false positive rate for the same number of bits.
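
For concreteness, here is a minimal Java sketch of such a filter with k hash functions. This is illustrative only: the class name is made up, and the second hash is derived from String.hashCode() with a cheap mixing step, where a production filter would use stronger, independent hashes.

import java.util.BitSet;

// Minimal Bloom filter sketch: m bits, k probe positions per key.
// May report "present" for keys never added (false positive),
// but never reports "absent" for a key that was added.
public class BloomFilterSketch {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    public BloomFilterSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    public void add(String s) {
        int h1 = s.hashCode();
        int h2 = mix(h1);
        for (int i = 0; i < k; i++)
            bits.set(index(h1, h2, i));
    }

    public boolean contains(String s) {
        int h1 = s.hashCode();
        int h2 = mix(h1);
        for (int i = 0; i < k; i++)
            if (!bits.get(index(h1, h2, i)))
                return false;
        return true;
    }

    // i-th probe position, double-hashing style: (h1 + i*h2) mod m.
    private int index(int h1, int h2, int i) {
        return ((h1 + i * h2) & 0x7fffffff) % m; // mask forces a non-negative index
    }

    // Cheap avalanche step (murmur-style finalizer) to decorrelate h2 from h1.
    private static int mix(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        return h ^ (h >>> 16);
    }
}

For 20 million keys and a 0.1% false positive rate, the usual formulas give roughly k = 10 and about 14.4 bits per key, i.e. around 34 MB of bits.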

Pascal Cuoq
@Pascal Cuoq: I'm not downvoting you, but you're reinventing a wheel here, probably less efficiently than what exists. I don't know where you're getting your numbers from, but there's a known data structure that allows for a percentage of false positives: it's called a "Bloom filter". A Bloom filter for 200 million entries with a 1% acceptable false positive rate would take 154 MB.
Webinator
Actually, 23MB for 20 million entries as the original poster specified. But of course we haven't been told false positives are OK...
Darius Bacon
@WizardOfOdds Thanks for the pointer. What I was suggesting is indeed a naive (k=1) Bloom filter.
Pascal Cuoq
@All I hadn't heard of Bloom filters before. They seem very suitable for my purpose. I can tolerate a 0.1% false positive rate.
hashable
There's a nice-looking Java Bloom filter implementation with a simple + clean API at http://code.google.com/p/java-bloomfilter/
Cowan
+2  A: 

For space efficiency, O(log n) lookup, and simple code, try binary search over an array of characters. 20 million keys of average length 10 make 200 million characters: 400MB if you need 2 bytes/char; 200MB if you can get away with 1. On top of this you need to somehow represent the boundaries between the keys in the array. If you can reserve a separator character, that's one way; otherwise you might use a parallel array of int offsets.

The simplest variant would use an array of Strings, at a high space cost from per-object overhead. It ought to still beat a hashtable in space efficiency, though not as impressively.
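
Here's a minimal sketch of the int-offsets variant (names are made up, not a library API). It sorts the keys once at startup, packs them into one char[], and does an allocation-free binary search in contains:

import java.util.Arrays;

// All keys concatenated into one sorted char[]; offsets[i]..offsets[i+1]
// delimits key i, with offsets[n] = chars.length as a sentinel.
public class SortedStringSetSketch {
    private final char[] chars;
    private final int[] offsets; // length = number of keys + 1

    public SortedStringSetSketch(String[] keys) {
        String[] sorted = keys.clone();
        Arrays.sort(sorted);
        int total = 0;
        for (String key : sorted) total += key.length();
        chars = new char[total];
        offsets = new int[sorted.length + 1];
        int pos = 0;
        for (int i = 0; i < sorted.length; i++) {
            offsets[i] = pos;
            sorted[i].getChars(0, sorted[i].length(), chars, pos);
            pos += sorted[i].length();
        }
        offsets[sorted.length] = pos;
    }

    public boolean contains(String input) {
        int lo = 0, hi = offsets.length - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(mid, input);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return false;
    }

    // Lexicographically compares stored key `mid` to input, char by char,
    // without materializing a String.
    private int compareAt(int mid, String input) {
        int start = offsets[mid], len = offsets[mid + 1] - start;
        int n = Math.min(len, input.length());
        for (int i = 0; i < n; i++) {
            int d = chars[start + i] - input.charAt(i);
            if (d != 0) return d;
        }
        return len - input.length();
    }
}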

Darius Bacon
@Darius Bacon: entire dictionaries with O(log n) lookup can be stored using less than 10 bits per String (!!!). Really. Less than 10 bits, I've done it. There are also high-compression algorithms for dictionaries using 12 bits per word that also allow for fast suggestions lookup. But the original question explicitly asked about an O(1) contains, not an O(log n) one, so I cannot suggest that kind of "high compression, 10 bits per word" data structure as an answer.
Webinator
Yeah, I've pointed out such compressed dictionaries in my answer to another question. I'd not try for anything so fancy as my first suggestion here -- it'd take considerable work to make it as fast, if it can be done at all, wouldn't it? And the question asked for *near* constant time; whether this is near enough will have to be up to the original poster.
Darius Bacon
(This scenario of a hashtable running into memory limits getting changed to binary search has played out before in my work life, actually. The more junior programmer who'd run into this problem was plotting out a complex solution, but binary search worked fine. Incidentally I introduced Bloom filters into another part of the same project... it's like it was all preparation for commenting on this stackoverflow problem here.)
Darius Bacon
+1  A: 

Google brings up a blog post on HAT tries in Java. But I don't see how this will solve your problem directly: the structure is a shallow trie over prefixes of the keys, with the leaves being hashtables holding the suffixes of all keys with the given prefix. So in total, you have a lot of hashtables storing all of the keys that are in your current one big hashtable (perhaps saving a few bytes per key overall because of the common prefixes). Either way, you need a more space-efficient hashtable than the default Java one, or the per-object overhead will hit you just as badly. So why not start with a specialized hashtable class for string keys only, if you take this route, and worry about the trie part only if it still seems worthwhile then?
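
To make that concrete, here is one possible shape for such a class. This is a sketch under assumptions, not a known library: an open-addressed set that packs every key's characters into one shared char[], so there are no per-key String objects at all.

import java.util.Arrays;

// Build-once, query-forever string set. table maps a hash slot to a key
// index (-1 = empty); offsets[i]..offsets[i+1] delimits key i in chars.
// Assumes the input keys are distinct.
public class CompactStringSetSketch {
    private final char[] chars;
    private final int[] offsets; // length = number of keys + 1
    private final int[] table;   // open addressing with linear probing
    private final int mask;

    public CompactStringSetSketch(String[] keys) {
        int cap = 1;
        while (cap < keys.length * 2) cap <<= 1; // keeps load factor <= 0.5
        table = new int[cap];
        Arrays.fill(table, -1);
        mask = cap - 1;

        int total = 0;
        for (String key : keys) total += key.length();
        chars = new char[total];
        offsets = new int[keys.length + 1];

        int pos = 0;
        for (int i = 0; i < keys.length; i++) {
            offsets[i] = pos;
            keys[i].getChars(0, keys[i].length(), chars, pos);
            pos += keys[i].length();
            int slot = keys[i].hashCode() & mask;
            while (table[slot] != -1) slot = (slot + 1) & mask;
            table[slot] = i;
        }
        offsets[keys.length] = pos;
    }

    public boolean contains(String input) {
        int slot = input.hashCode() & mask;
        while (table[slot] != -1) {
            if (equalsAt(table[slot], input)) return true;
            slot = (slot + 1) & mask;
        }
        return false;
    }

    private boolean equalsAt(int key, String input) {
        int start = offsets[key], len = offsets[key + 1] - start;
        if (len != input.length()) return false;
        for (int i = 0; i < len; i++)
            if (chars[start + i] != input.charAt(i)) return false;
        return true;
    }
}

For 20 million 10-char keys, that is roughly 400 MB of characters plus a few hundred MB of int arrays, with no per-entry object headers.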

Darius Bacon
+1  A: 

Similar to a trie is the ternary search tree, which has the advantage of using less memory. You can read about ternary search trees here, here, and here. One of the main papers on the subject, by Jon Bentley and Robert Sedgewick, is here. It also talks about sorting strings quickly, so don't be put off by that.
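
For reference, the core of a ternary search tree is small; a minimal sketch along the lines of the Bentley-Sedgewick scheme (supporting only insert and contains; the class name is illustrative) looks like this:

// Each node holds one character; left/right lead to smaller/larger
// characters at the same position, mid descends to the next character.
public class TernarySearchTreeSketch {
    private static final class Node {
        final char c;
        boolean isEnd; // true if a key ends at this node
        Node left, mid, right;
        Node(char c) { this.c = c; }
    }

    private Node root;

    public void insert(String key) {
        if (key.isEmpty()) return; // empty keys unsupported in this sketch
        root = insert(root, key, 0);
    }

    private Node insert(Node node, String key, int i) {
        char c = key.charAt(i);
        if (node == null) node = new Node(c);
        if (c < node.c) node.left = insert(node.left, key, i);
        else if (c > node.c) node.right = insert(node.right, key, i);
        else if (i < key.length() - 1) node.mid = insert(node.mid, key, i + 1);
        else node.isEnd = true;
        return node;
    }

    public boolean contains(String key) {
        if (key.isEmpty()) return false;
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = key.charAt(i);
            if (c < node.c) node = node.left;
            else if (c > node.c) node = node.right;
            else if (i == key.length() - 1) return node.isEnd;
            else { node = node.mid; i++; }
        }
        return false;
    }
}

Note that in Java each node is a full object with three references, so a pointer-based TST like this trades per-entry String overhead for per-node overhead; measure before committing to it.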

Justin Peel
"Ternary trees are notably larger than hash maps or most binary tree designs" (http://abc.se/~re/code/tst/tst_docs/perf_notes.html)
ArtemGr