Way to store a large dictionary with low memory footprint + fast lookups (on Android)

views:

636

answers:

+8 Q:

Way to store a large dictionary with low memory footprint + fast lookups (on Android)

I'm developing an android word game app that needs a large (~250,000 word dictionary) available. I need:

reasonably fast look ups e.g. constant time preferable, need to do maybe 200 lookups a second on occasion to solve a word puzzle and maybe 20 lookups within 0.2 second more often to check words the user just spelled.

EDIT: Lookups are typically asking "Is in the dictionary?". I'd like to support up to two wildcards in the word as well, but this is easy enough by just generating all possible letters the wildcards could have been and checking the generated words (i.e. 26 * 26 lookups for a word with two wildcards).

as it's a mobile app, using as little memory as possible and requiring only a small initial download for the dictionary data is top priority.

My first naive attempts used Java's HashMap class, which caused an out of memory exception. I've looked into using the SQL lite databases available on android, but this seems like overkill.

What's a good way to do what I need?

You'll be wanting some sort of trie. Perhaps a ternary search trie would be good I think. They give very fast look-up and low memory usage. This paper gives some more info about TSTs. It also talks about sorting so not all of it will apply. This article might be a little more applicable. As the article says, TSTs

combine the time efficiency of digital tries with the space efficiency of binary search trees.

As this table shows, the look-up times are very comparable to using a hash table.

Justin Peel 2010-02-16 21:56:48

My post original had a comment about tries taking a lot of memory but I deleted it because I wasn't so sure. I'm still not sure after reading your links. From the wiki: "A trie is optimized for speed at the expense of size." and "Ternary search trees are usually rejected in modern practice in favor of hash tables. Also, there are other ways to construct a trie. For example, a radix tree trie in which each node stores a bit string, and has two children for the possible next bits, may be more compact than a ternary search tree without being slow."

BobbyJim 2010-02-16 22:15:02

I found an interesting looking apache license trie implementation for Java:http://code.google.com/p/patricia-trie/wiki/Examples

BobbyJim 2010-02-16 22:23:16

Yes, I'm not sure if they will necessarily be much better space-wise, but there are ways to reduce this such as collapsing nodes that have only one descendant. You might also want to check out HAT-tries

Justin Peel 2010-02-16 22:30:51

Most trie implementations I know of do the collapsing for you. I'll just have to benchmark some I suppose. Thanks, will look into HAT trees.

BobbyJim 2010-02-16 22:40:46

First of all, you need to figure out WHAT questions you need to ask your data structure. Is it a "is this spelling correct, yes/no?" or "which words COULD this incorrect word have been if the user typed it correctly?"

The second structure will require much more than the first, either in term of memory or cpu.

Thorbjørn Ravn Andersen 2010-02-16 22:13:22

Oops, I've updated my post.

BobbyJim 2010-02-16 22:17:32

-1? oh well....

Thorbjørn Ravn Andersen 2010-02-16 23:02:23

Not from me, it was a useful comment! +1

BobbyJim 2010-02-16 23:12:24

+2 A:

I am assuming that you want to check if given word belongs to dictionary.

Have a look at bloom filter.

The bloom filter can do "does X belong to predefined set" type of queries with very small storage requirements. If the answer to query is yes, it has small (and adjustable) probability to be wrong, if the answer to query is no, then the answer guaranteed to be correct.

According the Wikipedia article you could need less than 4 MB space for your dictionary of 250 000 words with 1% error probability.

The bloom filter will correctly answer "is in dictionary" if the word actually is contained in dictionary. If dictionary does not have the word, the bloom filter may falsely give answer "is in dictionary" with some small probability.

Juha Syrjälä 2010-02-16 22:21:00

Looks interesting. It's a word game though, so I need 100% sure positive answers for if the word is in the dictionary or the user is going to get annoyed!

BobbyJim 2010-02-16 22:24:51

Bloom filter may give false positives. Is it a problem if your game sometimes accepts a word that is not in dictionary?

Juha Syrjälä 2010-02-16 22:36:16

Ah, it's "False positives are possible, but false negatives are not". I suppose 95% of the time, the user is going to submit actual words. It depends on how often a false word gets through though as you don't want to make that into part of the gameplay strategy (could be a funny game though where you try to sneak by non-words that look like real words)!

BobbyJim 2010-02-16 22:39:37

You can adjust the false positive rate by changing size of bloom filter's bit array. Larger bit arrays have smaller false negative probability.

Juha Syrjälä 2010-02-16 22:44:30

I was reading some more and the false positives probability is in the order of magnitude of 1 in 1 000 000 and usually higher for typical bloom filters. In other words, massively unlikely to be a problem. Thanks.

BobbyJim 2010-02-16 22:51:30

+7 A:

You can achieve your goals with more lowly approaches also... if it's a word game then I suspect you are handling 27 letters alphabet. So suppose an alphabet of not more than 32 letters, i.e. 5 bits per letter. You can cram then 12 letters (12 x 5 = 60 bits) into a single Java long by using 5 bits/letter trivial encoding.

This means that actually if you don't have longer words than 12 letters / word you can just represent your dictionary as a set of Java longs. If you have 250,000 words a trivial presentation of this set as a single, sorted array of longs should take 250,000 words x 8 bytes / word = 2,000,000 ~ 2MB memory. Lookup is then by binary search, which should be very fast given the small size of the data set (less than 20 comparisons as 2^20 takes you to above one million).

IF you have longer words than 12 letters, then I would store the >12 letters words in another array where 1 word would be represented by 2 concatenated Java longs in an obvious manner.

NOTE: the reason why this works and is likely more space-efficient than a trie and at least very simple to implement is that the dictionary is constant... search trees are good if you need to modify the data set, but if the data set is constant, you can often run a way with simple binary search.

antti.huima 2010-02-16 22:30:45

Wow, cool idea. I will look into this. I wonder what the memory difference is between this and e.g. a trie. I've found lots of free well tested trie implementations I could use but this idea seems more specialized I imagine. I suppose all I need to do is write the "convert string to long" function and use a basic collections class, which shouldn't be too error prone really. To store the dictionary, all I need to do is store the file of longs and then insert them into the search tree on load. I could save and load the actual tree, but then the file will be bigger.

BobbyJim 2010-02-16 22:37:36

You don't need to insert anything in a search tree... you just load the array in memory, and do the binary search in the array. No tree. Only one array. No collection class. No ANYTHING but the array of longs!!!

antti.huima 2010-02-16 22:41:33

BTW if you use tries you end up allocation lots of heap objects, which then strains the Java garbage collector. The single array of longs is only 1 object for the garbage collector and it doesn't contain pointers so the stress to the collector is minimal.

antti.huima 2010-02-16 22:43:50

Doh, of course! Use a sorted array and you're done. Seems you can even look if a prefix is there by just searching for the first part of bit pattern and this can easily be adapted to wildcard searches. I wonder why I've never come across such a simple and elegant representation before. Have you read about this or used it somewhere before? Is it not well known because of the assumptions about the max word size and number of letters? I didn't elaborate in my post but, yes, under android GC calls are concern. You can just build your trie once on startup though so it isn't so bad here.

BobbyJim 2010-02-16 22:55:48

Normally an allocated object stresses the GC also after it has been constructed, although generational garbage collectors reduce the problem... don't know how it works on Android's Java. Anyway, I'm not sure where the solution comes from... to me it was "obvious".

antti.huima 2010-02-16 23:09:03

I've had a look about but I cannot find anything so far on this representation. It's a brilliant idea. I'm maybe being lazy, but it would save testing time if I could find a library to use. My game probably wouldn't need to support more than 10 letter words btw as needing that would be rare. For other uses, you could even allow adding to the dictionary by just maintaining a separate binary tree for user words but use your representation for the core static dictionary. I cannot see how anything is going to beat a sorted list of longs compressed in a file for storage size and loading time.

BobbyJim 2010-02-16 23:18:44

I like your encoding approach. Very compact!

Thorbjørn Ravn Andersen 2010-02-17 06:12:49

@BobbyJim, you WILL need unit tests for your dictionary code. Try using Test Driven Design for this - just write the unit tests _first_ and after each test write enough library code to make the tests pass - you will most likely find then that when you are done you do not need a library as the code will be simple enough to not warrant one.

Thorbjørn Ravn Andersen 2010-02-17 06:15:50

Also note that the 32 bit int can contain 6 letters in your encoding. That would save half the space for small words by having these in a separate int[].

Thorbjørn Ravn Andersen 2010-02-17 06:18:26

@BobbyJim: JDK has java.util.Arrays.sort() and java.util.Arrays.binarySearch() methods that'll be helpful.

Juha Syrjälä 2010-02-17 06:27:01

You could also use the Android NDK and do the structure in C or C++.

Macarse 2010-02-17 02:45:23

What will that let me do that Java cannot though?

BobbyJim 2010-02-17 03:15:22

C/C++ is faster than Java.

Macarse 2010-02-17 09:47:55

+1 A:

A very efficient way to store a directory is a Directed Acyclic Word Graph (DAWG).

Here are some links:

Directed Acyclic Word Graph or DAWG description with sourcecode
Construction of the CDAWG for a Trie
Implementation of directed acyclic word graph

martinus 2010-02-17 13:51:18

The devices that I worked basically worked from a binary compressed file, with a topology that resembled the structure of a binary tree. At the leafs, you would have the Huffmann compressed text. Finding a node would involve having to skip to various locations of the file, and then only load the portion of the data really needed.

Wilfred Springer 2010-03-18 13:19:37

ansaurus

tags:

views:

answers:

Way to store a large dictionary with low memory footprint + fast lookups (on Android)

related questions