I am looking for specific suggestions or references to an algorithm and/or data structures for encoding a list of words into what would effectively turn out to be a spell-checking dictionary. The objective of this scheme is a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000-word dictionary. It is not a requirement for the encoded dictionary to be [easily] convertible back into the original word list form - a binary yes/no result is all that is needed for each word tested against the resulting dictionary.

I am assuming the encoding scheme, to improve compression ratio, would take advantage of known structures in a given language such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text "words".

You can assume the particular application I have in mind is for embedded devices, where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.

EDIT: To sum up the requirements of the dictionary:

  • zero false positives
  • zero false negatives
  • very high compression ratio
  • no need for decompression
+6  A: 

See McIlroy's "Development of a Spelling List" at his pubs page. It's a classic old paper on spellchecking on a minicomputer, whose constraints map surprisingly well onto the ones you listed. It gives a detailed analysis of affix stripping and two different compression methods: Bloom filters and a related scheme that Huffman-codes a sparse bitset. I'd probably go with Bloom filters in preference to the method he picked, which squeezes a few more kB out at significant cost in speed. (Programming Pearls has a short chapter about this paper.)

See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the methods above, these have no false positives.
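For a concrete flavor of those lexicon-compression techniques, here is a minimal front-coding sketch in Python (my own illustration, not code from the book): consecutive words in a sorted list share prefixes, so each entry stores only the length of the shared prefix plus the differing suffix.

    # Front coding: in a sorted word list, store each word as
    # (length of prefix shared with the previous word, remaining suffix).
    def front_encode(sorted_words):
        encoded, prev = [], ""
        for w in sorted_words:
            k = 0
            while k < min(len(prev), len(w)) and prev[k] == w[k]:
                k += 1
            encoded.append((k, w[k:]))
            prev = w
        return encoded

    def front_decode(encoded):
        words, prev = [], ""
        for k, suffix in encoded:
            prev = prev[:k] + suffix
            words.append(prev)
        return words

    words = sorted(["and", "andes", "android", "any"])
    assert front_decode(front_encode(words)) == words

Lookup is typically done by front-coding in small blocks and binary-searching over the block heads, so only one block needs decoding per query.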

Darius Bacon
Thanks in particular for the McIlroy reference; this is a really good starting point.
Tall Jeff
You're welcome! Hope you have fun with this.
Darius Bacon
+3  A: 

A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter and http://www.coolsnap.net/kevin/?p=13) is a data structure used by some spell checkers to store the dictionary words very compactly. There is, however, a risk of false positives.
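A minimal sketch of the idea, assuming SHA-256 as a convenient hash source and the standard double-hashing trick to derive the k bit positions (all parameters here are illustrative, not tuned):

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.m, self.k = num_bits, num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, word):
            # Derive k bit positions from one digest via double hashing.
            d = hashlib.sha256(word.encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, word):
            for p in self._positions(word):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, word):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(word))

    # ~10 bits per word with 7 hashes gives roughly a 1% false-positive rate.
    bf = BloomFilter(num_bits=1_000_000, num_hashes=7)
    bf.add("android")
    assert "android" in bf  # inserted words are always found
    # absent words are usually rejected, but false positives are possible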

ahy1
+2  A: 

To sum up:

  • zero false positives
  • zero false negatives
  • high compression ratio
  • no need for the inverse (i.e. no decompression necessary)

I was going to suggest Bloom filters, but these have non-zero false positives.

Instead, Programming Pearls talks of a similar set of requirements (/usr/share/dict/words in 41K).

That took the approach of contracting words to stems plus affixes. For example, sent was the root, so prefixes and suffixes could be added (a sketch follows the list):

  • present
  • represent
  • representation
  • misrepresentation
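Here is a rough sketch of that stem-plus-affix idea, with a hypothetical hand-written affix table (a real spell checker would derive these rules from the word list itself; as the comments below note, loose affix rules can admit false positives):

    # Hypothetical tables, for illustration only.
    STEMS = {"sent"}
    PREFIXES = ["", "pre", "re", "repre", "misrepre"]
    SUFFIXES = ["", "ation"]

    def check(word):
        # Accept the word if some prefix + stem + suffix decomposition matches.
        for p in PREFIXES:
            if not word.startswith(p):
                continue
            rest = word[len(p):]
            for s in SUFFIXES:
                if s and not rest.endswith(s):
                    continue
                stem = rest[:len(rest) - len(s)] if s else rest
                if stem in STEMS:
                    return True
        return False

    assert check("present") and check("misrepresentation")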
jamesh
Affix stripping introduces false positives. I didn't see any 'zero false positives' requirement here though -- spellchecking is inherently imprecise.
Darius Bacon
spellchecking, as defined, is perfectly precise. the definition simply does not cover context.
Sparr
It's imprecise because human language is open-ended. It's possible Jeff does have the precise problem of dictionary lookup, not the imprecise one of spellchecking (I'm not sure from his problem statement) -- but if it's spellchecking, it's just a question of how many errors and of what sort.
Darius Bacon
A: 

Knuth mentions a "Patricia trie" in The Art of Computer Programming vol. 3. I've never used it for any real work but maybe that would be helpful.

edit: what's your RAM constraint? If you have lots more RAM than ROM available, perhaps data compression in ROM (requiring decompression into RAM) is the right way to go. I suppose if you have a medium but not large amount of RAM, you could also store portions of the data structure as compressed blobs in memory, keep several of them around in a least-recently-used cache, and dynamically decompress the appropriate blob when it's not in the cache.
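A toy sketch of that medium-RAM scheme, using zlib-compressed blocks to stand in for the ROM area and a small LRU cache of decompressed blocks (deciding which block to consult would need a separate index of block boundaries, omitted here for brevity):

    import zlib
    from collections import OrderedDict

    # Stand-in for ROM: compressed blocks of sorted words.
    ROM_BLOCKS = [zlib.compress(b"and\nandes\nandroid"),
                  zlib.compress(b"any\napple\nas")]

    class BlockCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.cache = OrderedDict()

        def get(self, idx):
            if idx in self.cache:
                self.cache.move_to_end(idx)         # mark most recently used
            else:
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)  # evict the LRU block
                self.cache[idx] = zlib.decompress(ROM_BLOCKS[idx]).split(b"\n")
            return self.cache[idx]

    cache = BlockCache(capacity=1)
    assert b"android" in cache.get(0)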

Jason S
+2  A: 

You can get a 30%+ compression ratio by storing words as successive suffixes in 7-bit format. I'm not sure what this is called, but it translates pretty effectively into a tree structure.

ex.: a+n+d+s|an+d+y|and+es+roid

is 26 characters, compared to:

a an ad as and any andes android

which is 33.

Factoring in the 12.5% saving from storing the result as 7-bit content, that's about 31% compression total. The compression ratio depends, of course, on the size and content of your word list.

Turning this into a 26-root tree structure would probably result in lookups that are faster than a plaintext substring comparison against a flat file.

Come to think of it, if you're only using 26 characters plus two for delimiters, you can do everything in 5 bits, which is 37.5% compression in and of itself, bringing the above example to over a 50% compression rate.
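Here's a sketch of that 5-bit packing, assuming a 28-symbol alphabet of a-z plus the two delimiters used above:

    ALPHABET = "abcdefghijklmnopqrstuvwxyz+|"
    CODE = {ch: i for i, ch in enumerate(ALPHABET)}

    def pack(text):
        # Pack 5-bit symbol codes into bytes, most significant bits first.
        acc, nbits, out = 0, 0, bytearray()
        for ch in text:
            acc = (acc << 5) | CODE[ch]
            nbits += 5
            while nbits >= 8:
                nbits -= 8
                out.append((acc >> nbits) & 0xFF)
        if nbits:
            out.append((acc << (8 - nbits)) & 0xFF)  # pad the final byte
        return bytes(out)

    packed = pack("a+n+d+s|an+d+y|and+es+roid")
    print(len(packed))  # 17 bytes vs. 26 characters -- about 35% smaller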

Max
You could even use a special character to choose the next suffix: a+n!+d+s|d!+y|es+roid - to make it 21 characters. Or you always write the next suffix first (19 characters). You can then use a newline for a completely new suffix.
schnaader
+1  A: 

I'm not an expert on this, but isn't a prefix tree pretty much the standard solution to this? It stores common prefixes of words only once.
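A minimal prefix-tree sketch using nested dicts, just to make the idea concrete:

    def trie_insert(root, word):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def trie_contains(root, word):
        node = root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return "$" in node

    root = {}
    for w in ("and", "andes", "android", "any"):
        trie_insert(root, w)
    assert trie_contains(root, "android") and not trie_contains(root, "an")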

mattiast
+3  A: 

I'd suggest a padded suffix tree. Good compression on wordlists, and excellent lookup times.

http://en.wikipedia.org/wiki/Suffix_tree

+1  A: 

For pure compression, the Maximum Compression site offers some results for a 4 MB English word list; the best program compresses it to around 400 KB. Some other resources for text/word compression are the Hutter Prize page and the Large Text Compression Benchmark.

schnaader
+2  A: 

I think your best bet is a Compressed Suffix Tree / Compressed Suffix Array. You can find a wealth of information on both topics. This is an ongoing research area, and a very interesting one indeed.

martinus