views: 1008 · answers: 6

Is there anything better than a Trie for this situation?

  • Storing a list of ~100k English words
  • Needs to use minimal memory
  • Lookups need to be reasonable, but don't have to be lightning fast

I'm working with Java, so my first attempt was to just use a Set<String>. However, I'm targeting a mobile device and already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet to save some memory -- anyone know some other good options?

EDIT - More info: the data structure will be used for two operations:

  • Answering: Is some word XYZ in the list?
  • Generating the neighborhood of words around XYZ with one letter different

Thanks for the good suggestions
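For reference, the second operation can be sketched against a plain Set<String> like this (the `neighbors` helper is a name I'm using here, and it assumes lowercase a-z words):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class Neighborhood {
    // All dictionary words that differ from `word` in exactly one letter.
    static List<String> neighbors(String word, Set<String> dict) {
        List<String> result = new ArrayList<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == original) continue;
                chars[i] = c;                       // try one substitution
                String candidate = new String(chars);
                if (dict.contains(candidate)) result.add(candidate);
            }
            chars[i] = original;                    // restore before next position
        }
        return result;
    }
}
```

Any structure that can answer `contains` works here; the question is which one answers it in the least memory.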

+1  A: 

With a trie, you still have to maintain the tree structure itself. Huffman-encoding the alphabet, or common N-letter sequences (like "tion", "un", "ing"), can take advantage of the occurrence frequencies in your dictionary and compress each entry down to bits.
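A sketch of building per-character Huffman codes in Java (class and method names are mine; extending the symbol set from single characters to common n-grams like "tion" follows the same pattern):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Huffman coding over a word list: frequent letters get short bit codes.
class HuffmanSketch {
    private static class Node implements Comparable<Node> {
        final long freq;
        final char symbol;          // meaningful only for leaves
        final Node left, right;
        Node(long freq, char symbol, Node left, Node right) {
            this.freq = freq; this.symbol = symbol; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null; }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    static Map<Character, String> buildCodes(Iterable<String> words) {
        // count character occurrences across the whole dictionary
        Map<Character, Long> freq = new HashMap<>();
        for (String w : words)
            for (char c : w.toCharArray())
                freq.merge(c, 1L, Long::sum);

        // repeatedly merge the two least frequent subtrees
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : freq.entrySet())
            pq.add(new Node(e.getValue(), e.getKey(), null, null));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, '\0', a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        assign(pq.poll(), "", codes);
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n == null) return;
        if (n.isLeaf()) { codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }
}
```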

eed3si9n
+3  A: 

What are you doing? If it's spell checking, you could use a bloom filter - see this code kata.
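A minimal Bloom filter sketch in Java (the class, the sizing, and the double-hashing scheme here are all illustrative choices, not from the kata):

```java
import java.util.BitSet;

// Bloom filter: k bit positions per word, derived from the String hash
// via double hashing. A clear bit at any position means "definitely absent";
// all bits set means "probably present".
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String word) {
        int h1 = word.hashCode();
        int h2 = (h1 >>> 16) | 1;                // odd second base so strides differ
        for (int i = 0; i < hashes; i++)
            bits.set(Math.floorMod(h1 + i * h2, size));
    }

    // false => definitely not in the set; true => probably in the set
    boolean mightContain(String word) {
        int h1 = word.hashCode();
        int h2 = (h1 >>> 16) | 1;
        for (int i = 0; i < hashes; i++)
            if (!bits.get(Math.floorMod(h1 + i * h2, size))) return false;
        return true;
    }
}
```

The memory cost is a fixed number of bits per word regardless of word length, which is hard to beat; the trade-off is the false-positive rate discussed in the comments below.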

Mike Scott
I was going to suggest a Bloom filter, too, but given his goals, I don't think a Bloom filter would work. Bloom filters won't answer with a definitive yes/no whether a word is in the list, and they won't allow for the generation of a neighborhood.
mipadi
A bloom filter *will* answer a definitive no if the word *isn't* in the list. Yeah, the neighbourhood requirement was added later :)
Mike Scott
+6  A: 

One structure I saw for minimizing space in a spelling dictionary was to encode each word as:

  • the number of characters (a byte) in common with the last; and
  • the new ending.

So the word list

HERE            would encode as    THIS
sanctimonious                      0,sanctimonious
sanction                           6,on
sanguine                           3,guine
trivial                            0,trivial

You're saving 7 bytes straight up there (19%), I suspect the saving would be similar for a 20,000 word dictionary just due to the minimum distances between (common prefixes of) adjacent words.

To speed lookup, there was a 26-entry table in memory which held the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as the first byte as they had no letters in common with the previous word.

This seems to be sort of a trie but without the pointers, which would surely get space-expensive if every character in the tree had a 4-byte pointer associated with it.
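The encoding described above can be sketched in Java like this (`Entry` and the method names are mine; a real implementation would pack the shared-prefix count into a single byte, as described):

```java
import java.util.ArrayList;
import java.util.List;

// Front coding: each word in a sorted list is stored as
// (characters shared with the previous word, new suffix).
class FrontCoding {
    record Entry(int shared, String suffix) {}

    static List<Entry> encode(List<String> sortedWords) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String w : sortedWords) {
            int n = Math.min(prev.length(), w.length());
            int shared = 0;
            while (shared < n && prev.charAt(shared) == w.charAt(shared)) shared++;
            out.add(new Entry(shared, w.substring(shared)));
            prev = w;
        }
        return out;
    }

    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String w = prev.substring(0, e.shared()) + e.suffix();
            out.add(w);
            prev = w;
        }
        return out;
    }
}
```

Decoding is sequential, which is why the 26-entry offset table matters: lookup only has to decode forward from the nearest letter boundary, not from the start of the list.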

Mind you, this was from my CP/M days where memory was much scarcer than it is now.

paxdiablo
+1 - thanks for sharing a clever algorithm. BTW: back then, my memory's reliability more than compensated for scarcity.... :-)
Adam Liss
+6  A: 

A Patricia trie may be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me they were used in some of the early full-text search engines ...
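A minimal Patricia/radix-trie sketch in Java, to make the idea concrete (edge labels are stored as strings and matched by linear scan purely for illustration; a real implementation would index edges by first character):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Radix trie: edges carry multi-character labels, so runs of
// single-child nodes collapse into one edge.
class RadixTrie {
    private static class Node {
        final Map<String, Node> edges = new HashMap<>(); // label -> child
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        String rest = word;
        outer:
        while (!rest.isEmpty()) {
            // iterate over a snapshot so the map can be modified when splitting
            for (Map.Entry<String, Node> e : new ArrayList<>(node.edges.entrySet())) {
                String label = e.getKey();
                int common = commonPrefixLength(label, rest);
                if (common == 0) continue;
                if (common == label.length()) {          // consume the whole edge
                    node = e.getValue();
                    rest = rest.substring(common);
                    continue outer;
                }
                // partial match: split the edge at the common prefix
                Node mid = new Node();
                node.edges.remove(label);
                node.edges.put(label.substring(0, common), mid);
                mid.edges.put(label.substring(common), e.getValue());
                node = mid;
                rest = rest.substring(common);
                continue outer;
            }
            // no edge shares a prefix: add a leaf holding the remainder
            Node leaf = new Node();
            leaf.isWord = true;
            node.edges.put(rest, leaf);
            return;
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        String rest = word;
        outer:
        while (!rest.isEmpty()) {
            for (Map.Entry<String, Node> e : node.edges.entrySet()) {
                if (rest.startsWith(e.getKey())) {
                    node = e.getValue();
                    rest = rest.substring(e.getKey().length());
                    continue outer;
                }
            }
            return false;
        }
        return node.isWord;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

The memory win over a plain trie comes from paying node overhead only at branch points, not at every character.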

Paul.

Paul W Homer
I was just thinking of this...
Rich
Java implementation here: http://code.google.com/p/radixtree/
Pete
A: 

Completely wild idea... (i.e. most likely very wrong)

How about storing the words as a tree of all possible letter combinations?

Then each "word" only costs a single char and two pointers (one to the char and one to a terminator). This way, the more letters words have in common, the less each additional word costs.

c-a-r-+-.
      +-p-+-.
      |   +-s-.
      +-s-.
      +-t-+-.
          +-s-.

car carp carps cars cart carts

So for 9 chars and 14 pointers we get 6 "words" totalling 25 letters.

Searches would be quick (pointer lookups instead of char comparisons) and you could do some stemming optimisations to save even more space...?
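A minimal Java sketch of the letter tree described above, assuming lowercase a-z words (a boolean word-end flag stands in for the separate terminator pointer):

```java
// Trie: one node per character position; shared prefixes share nodes.
class Trie {
    private final Trie[] children = new Trie[26];
    private boolean isWord;

    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Trie();
            node = node.children[i];
        }
        node.isWord = true;                     // mark end of a complete word
    }

    boolean contains(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) return false;
            node = node.children[i];
        }
        return node.isWord;
    }
}
```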

EDIT: Looks like I reinvented the wheel. ;-)

Chris Nava
+1  A: 

Related to Paul's post:

Any reason why you can't consider a trie in your case? If it's just an implementation issue, here is a tight implementation of Patricia trie insert and search in C (from NIST):

Patricia Insert in C

Patricia Search in C

Rich