views: 1008 · answers: 6

Is there anything better than a Trie for this situation?

  • Storing a list of ~100k English words
  • Needs to use minimal memory
  • Lookups need to be reasonable, but don't have to be lightning fast

I'm working with Java, so my first attempt was to just use a Set<String>. However, I'm targeting a mobile device and already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet to save some memory -- anyone know some other good options?

EDIT - More info: the data structure will be used for two operations:

  • Answering: Is some word XYZ in the list?
  • Generating the neighborhood of words around XYZ with one letter different

Thanks for the good suggestions
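For reference, the second operation can be sketched against a plain Set<String> like this (the `neighbors` helper is a name I'm using here, and it assumes lowercase a-z words):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class Neighborhood {
    // All dictionary words that differ from `word` in exactly one letter.
    static List<String> neighbors(String word, Set<String> dict) {
        List<String> result = new ArrayList<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == original) continue;
                chars[i] = c;                       // try one substitution
                String candidate = new String(chars);
                if (dict.contains(candidate)) result.add(candidate);
            }
            chars[i] = original;                    // restore before next position
        }
        return result;
    }
}
```

Any structure that can answer `contains` works here; the question is which one answers it in the least memory.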

+1  A: 

With a trie, you still have to maintain the tree structure itself. Huffman-encoding the alphabet, or common N-letter sequences (like "tion", "un", "ing"), can take advantage of the occurrence frequencies in your dictionary and compress each entry down to bits.
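A sketch of building per-character Huffman codes in Java (class and method names are mine; extending the symbol set from single characters to common n-grams like "tion" follows the same pattern):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Huffman coding over a word list: frequent letters get short bit codes.
class HuffmanSketch {
    private static class Node implements Comparable<Node> {
        final long freq;
        final char symbol;          // meaningful only for leaves
        final Node left, right;
        Node(long freq, char symbol, Node left, Node right) {
            this.freq = freq; this.symbol = symbol; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null; }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    static Map<Character, String> buildCodes(Iterable<String> words) {
        // count character occurrences across the whole dictionary
        Map<Character, Long> freq = new HashMap<>();
        for (String w : words)
            for (char c : w.toCharArray())
                freq.merge(c, 1L, Long::sum);

        // repeatedly merge the two least frequent subtrees
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : freq.entrySet())
            pq.add(new Node(e.getValue(), e.getKey(), null, null));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, '\0', a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        assign(pq.poll(), "", codes);
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n == null) return;
        if (n.isLeaf()) { codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }
}
```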

eed3si9n
+3  A: 

What are you doing? If it's spell checking, you could use a bloom filter - see this code kata.
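A minimal Bloom filter sketch in Java (the class, the sizing, and the double-hashing scheme here are all illustrative choices, not from the kata):

```java
import java.util.BitSet;

// Bloom filter: k bit positions per word, derived from the String hash
// via double hashing. A clear bit at any position means "definitely absent";
// all bits set means "probably present".
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String word) {
        int h1 = word.hashCode();
        int h2 = (h1 >>> 16) | 1;                // odd second base so strides differ
        for (int i = 0; i < hashes; i++)
            bits.set(Math.floorMod(h1 + i * h2, size));
    }

    // false => definitely not in the set; true => probably in the set
    boolean mightContain(String word) {
        int h1 = word.hashCode();
        int h2 = (h1 >>> 16) | 1;
        for (int i = 0; i < hashes; i++)
            if (!bits.get(Math.floorMod(h1 + i * h2, size))) return false;
        return true;
    }
}
```

The memory cost is a fixed number of bits per word regardless of word length, which is hard to beat; the trade-off is the false-positive rate discussed in the comments below.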

Mike Scott
I was going to suggest a Bloom filter, too, but given his goals, I don't think a Bloom filter would work. Bloom filters won't answer with a definitive yes/no whether a word is in the list, and they won't allow for the generation of a neighborhood.
mipadi
A bloom filter *will* answer a definitive no if the word *isn't* in the list. Yeah, the neighbourhood requirement was added later :)
Mike Scott
+6  A: 

One structure I saw for minimizing space in a spelling dictionary was to encode each word as:

  • the number of characters (a byte) in common with the last; and
  • the new ending.

So the word list

HERE            would encode as    THIS
sanctimonious                      0,sanctimonious
sanction                           6,on
sanguine                           3,guine
trivial                            0,trivial

You're saving 7 bytes straight up there (19%), I suspect the saving would be similar for a 20,000 word dictionary just due to the minimum distances between (common prefixes of) adjacent words.

To speed lookup, there was a 26-entry table in memory which held the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as the first byte as they had no letters in common with the previous word.

This seems to be sort of a trie but without the pointers, which would surely get space-expensive if every character in the tree had a 4-byte pointer associated with it.
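The encoding described above can be sketched in Java like this (`Entry` and the method names are mine; a real implementation would pack the shared-prefix count into a single byte, as described):

```java
import java.util.ArrayList;
import java.util.List;

// Front coding: each word in a sorted list is stored as
// (characters shared with the previous word, new suffix).
class FrontCoding {
    record Entry(int shared, String suffix) {}

    static List<Entry> encode(List<String> sortedWords) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String w : sortedWords) {
            int n = Math.min(prev.length(), w.length());
            int shared = 0;
            while (shared < n && prev.charAt(shared) == w.charAt(shared)) shared++;
            out.add(new Entry(shared, w.substring(shared)));
            prev = w;
        }
        return out;
    }

    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String w = prev.substring(0, e.shared()) + e.suffix();
            out.add(w);
            prev = w;
        }
        return out;
    }
}
```

Decoding is sequential, which is why the 26-entry offset table matters: lookup only has to decode forward from the nearest letter boundary, not from the start of the list.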

Mind you, this was from my CP/M days where memory was much scarcer than it is now.

paxdiablo
+1 - thanks for sharing a clever algorithm. BTW: back then, my memory's reliability more than compensated for scarcity.... :-)
Adam Liss
+6  A: 

A Patricia trie may be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me they were used in some of the early full-text search engines ...
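A minimal Patricia/radix-trie sketch in Java, to make the idea concrete (edge labels are stored as strings and matched by linear scan purely for illustration; a real implementation would index edges by first character):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Radix trie: edges carry multi-character labels, so runs of
// single-child nodes collapse into one edge.
class RadixTrie {
    private static class Node {
        final Map<String, Node> edges = new HashMap<>(); // label -> child
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        String rest = word;
        outer:
        while (!rest.isEmpty()) {
            // iterate over a snapshot so the map can be modified when splitting
            for (Map.Entry<String, Node> e : new ArrayList<>(node.edges.entrySet())) {
                String label = e.getKey();
                int common = commonPrefixLength(label, rest);
                if (common == 0) continue;
                if (common == label.length()) {          // consume the whole edge
                    node = e.getValue();
                    rest = rest.substring(common);
                    continue outer;
                }
                // partial match: split the edge at the common prefix
                Node mid = new Node();
                node.edges.remove(label);
                node.edges.put(label.substring(0, common), mid);
                mid.edges.put(label.substring(common), e.getValue());
                node = mid;
                rest = rest.substring(common);
                continue outer;
            }
            // no edge shares a prefix: add a leaf holding the remainder
            Node leaf = new Node();
            leaf.isWord = true;
            node.edges.put(rest, leaf);
            return;
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        String rest = word;
        outer:
        while (!rest.isEmpty()) {
            for (Map.Entry<String, Node> e : node.edges.entrySet()) {
                if (rest.startsWith(e.getKey())) {
                    node = e.getValue();
                    rest = rest.substring(e.getKey().length());
                    continue outer;
                }
            }
            return false;
        }
        return node.isWord;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

The memory win over a plain trie comes from paying node overhead only at branch points, not at every character.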

Paul.

Paul W Homer
I was just thinking of this...
Rich
Java implementation here: http://code.google.com/p/radixtree/
Pete
A: 

Completely wild idea... (i.e. most likely very wrong)

How about storing the words as a tree of all possible letter combinations?

Then each "word" only costs a single char and two pointers (one to the char and one to a terminator). This way, the more letters words have in common, the less each additional word costs.

c-a-r-+-.
      +-p-+-.
      |   +-s-.
      +-s-.
      +-t-+-.
          +-s-.

car carp carps cars cart carts

So for 9 chars and 14 pointers we get 6 "words" totalling 25 letters.

Searches would be quick (pointer lookups instead of char comparisons) and you could do some stemming optimisations to save even more space...?
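A minimal Java sketch of the letter tree described above, assuming lowercase a-z words (a boolean word-end flag stands in for the separate terminator pointer):

```java
// Trie: one node per character position; shared prefixes share nodes.
class Trie {
    private final Trie[] children = new Trie[26];
    private boolean isWord;

    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Trie();
            node = node.children[i];
        }
        node.isWord = true;                     // mark end of a complete word
    }

    boolean contains(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) return false;
            node = node.children[i];
        }
        return node.isWord;
    }
}
```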

EDIT: Looks like I reinvented the wheel. ;-)

Chris Nava
+1  A: 

Related to Paul's post:

Any reason why you can't consider a trie in your case? If it's just an implementation issue, here is a tight implementation of Patricia trie insert and search in C (from NIST):

Patricia Insert in C

Patricia Search in C

Rich