views:

348

answers:

3

[EDIT]In Short: How would you write an automatic spell checker? The idea is that the checker builds a list of words from a known good source (a dictionary) and automatically adds new words when they are used often enough. Words which haven't been used a while should be phased out. So if I delete part of a scene which contains "Mungrohyperiofier", the checker should remember it for a while and when I type "Mung<Ctrl+Space>" in another scene, it should offer it again. If I don't use the word for, say, a few days, it should forget about it.

At the same time, I'd like to avoid adding typos to the dictionary.[/EDIT]

I want to write a text editor for SciFi stories. The editor should offer word completion for any word used anywhere in the current story. It will only offer a single scene of the story for editing (so you can easily move scenes around).

This means I have three sets:

  1. The set of all words in all other scenes
  2. The set of word in the current scene before I started editing it
  3. The set of words in the current editor

I need to store the sets somewhere as it would be too expensive to build the list from scratch every time. I think a simple plain text file with one-word-per-line is enough for that.

As the user edits the scene, we have these situations:

  1. She deletes a word. This word is not used anywhere else in the current scene.
  2. She types a word which is new
  3. She types a word which already exists
  4. She types a word which already exists but makes a typo
  5. She corrects a typo in a word which is in set #2.
  6. She corrects a typo in a word which is in set #1 (i.e. the typo is elsewhere, too).
  7. She deletes a word which she plans to use again. After the deletion, the word is no longer in the sets #1 and #3, though.

The obvious strategy would be to rebuilt the word sets when a scene is saved and build the set #1 from a word-list file per scene.

So my question is: Is there a clever strategy to keep words which aren't used anywhere anymore but still be able to phase out typos? If possible, this strategy should work in the background without the user even noticing what is going on (i.e. I want to avoid to have to grab the mouse to select "add word to dictionary" from the menu).

[EDIT] Based on a comment from grieve

A: 

The structure you should use is a trie. Tail/suffix compression will help with memory. You can use a pseudo reference counting GC for keeping track of usage.

For the actual nodes, you would probably need no more than a 32-bit integer, 21-bits for unicode, and the rest for various other tags and information.

leppie
That's a good answer how to store the information but how do I phase out typos? And I'm unsure how to efficiently merge the tries for each scene to get set #1. I also probably don't want to build this structure with every keypress.
Aaron Digulla
If you go the extra mile and add spell checking to your list of features you can mitigate the typo problem as well.
grieve
I've edited my question to make this more clear.
Aaron Digulla
A: 

Reminds me of what I have been told about garbage collecting in modern LISP implementations :

data when created is put in "pool 1",

when there is a need to garbage collect the garbage collector look in pool 1 for unused entries and remove them.

Then any remaining entry is moved to pool 2.

Pool 2 is examined only when there is a need to more memory than pool 1 can release.

Data from pool 2 that survive a garbage collection is put in pool 3 and ... so on.

The idea is to put dynamically the data in a pool corresponding to its lifetime...

siukurnin
Any ideas how I could avoid moving typos to the next pool?
Aaron Digulla
+1  A: 

So you want to write a spelling checker. Here's Peter Norvig's paper about writing a spelling corrector. It describes a simple and robust spelling corrector. You can use the already-written part of the book, plus a reference list (say from a free dictionary) for the language model. I would also go to existing open-source spelling checkers, such as aspell and hunspell, to get some ideas.

Yuval F
Actually, I want a spell checker that works automatic in the background. Spell checking is a tedious and error-prone process. You have to concentrate really hard and after a typo has been committed to the dict, it's there forever.
Aaron Digulla
I'm dreaming of a "community" spell checker where the code tries to figure out with some heuristics whether a word is probably correct or not.
Aaron Digulla
That paper is very good, thanks! I had a look at aspell but the code is very ... complex and only halfway at what I want.
Aaron Digulla
The aspell code is indeed complex (spent a month or so trying to understand parts of it). Good luck.
Yuval F