efficient algorithm to perform spell check on HTML document

I have an HTML document, a list of common spelling mistakes, and the correct spelling for each case. The HTML documents will be up to ~50 pages long, and there are ~30K spelling correction entries.

What is an efficient way to correct all spelling mistakes in this HTML document?
(Note: my implementation will be in Python, in case you know of any relevant libraries.)

I have thought of two possible approaches:

build hashtable of the spelling data
parse text from HTML
split text by whitespace into tokens
if token in spelling hashtable replace with correction
build new HTML document with updated text
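
The steps above could be sketched roughly as follows. This is only an illustration: it uses the standard-library `html.parser` module, the names `CorrectingParser` and `correct_html` are made up for this example, and the parser re-emits only the most common node types (tags, text, entities, comments, declarations), so it should not be taken as a complete HTML round-tripper.

```python
import re
from html.parser import HTMLParser


class CorrectingParser(HTMLParser):
    """Hypothetical helper: re-emits the HTML, correcting text nodes only."""

    def __init__(self, corrections):
        # convert_charrefs=False keeps entities such as &amp; intact.
        super().__init__(convert_charrefs=False)
        self.corrections = corrections
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)  # leave code and CSS untouched
            return
        # Split on whitespace but keep the separators so spacing survives.
        parts = re.split(r"(\s+)", data)
        self.out.append("".join(self.corrections.get(p, p) for p in parts))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def correct_html(html, corrections):
    parser = CorrectingParser(corrections)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


corrections = {"recieve": "receive", "teh": "the"}  # sample data only
print(correct_html('<p class="x">I recieve teh mail</p>', corrections))
```

Because each token costs one dictionary lookup, the running time grows with the size of the document rather than with the number of correction entries.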

This approach will fail for multi-word spelling corrections, which will exist. The following is a simpler, though seemingly less efficient, approach that will also work for multi-word entries:

iterate spelling data
search for word in HTML document
if word exists replace with correction
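
A minimal sketch of this second approach, applied to text that has already been extracted from the HTML. The function name `correct_by_scanning` and the sample entries are assumptions for illustration. Word boundaries (`\b`) are added so that an entry does not match inside a longer word.

```python
import re


def correct_by_scanning(text, corrections):
    """Scan the whole text once per correction entry."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function replacement avoids backslash handling in `right`.
        text = re.sub(pattern, lambda m, right=right: right, text)
    return text


corrections = {"alot": "a lot", "in stead": "instead"}  # sample data only
print(correct_by_scanning("I read alot in stead of sleeping", corrections))
```

With ~30K entries this means ~30K passes over the document, which is where the efficiency concern comes from.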

+3 A:

You are correct that the first approach will be MUCH faster than the second. Additionally, I would recommend looking into tries instead of a plain hash table; the space savings can be quite dramatic for 30K words.

To still be able to handle the multi-word cases, you could keep track of the previous token and check your hash for a combined string such as "prev cur".
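
As a rough sketch of the "prev cur" idea (the function name `correct_with_bigrams` is made up, and it only covers two-word entries):

```python
def correct_with_bigrams(tokens, corrections):
    """Check each "prev cur" pair first, then fall back to the single token."""
    out = []
    prev = None  # previous original token, or None if already consumed
    for tok in tokens:
        pair = f"{prev} {tok}"
        if prev is not None and pair in corrections:
            out[-1] = corrections[pair]  # replace what was emitted for prev
            prev = None
            continue
        out.append(corrections.get(tok, tok))
        prev = tok
    return out


corrections = {"could of": "could have", "alot": "a lot"}  # sample data only
print(correct_with_bigrams("I could of gone alot".split(), corrections))
```

Longer entries would need a longer window of previous tokens, which is one place where a trie keyed on words becomes convenient.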

Alternatively, you could leave the multi-word corrections out of the hash and combine your two approaches: first use the hash for single words, then do a scan for the multi-word combos (or vice versa). This could still be relatively fast if the number of multi-word corrections is small.
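
One way the combined approach might look, assuming multi-word entries are exactly those containing a space. The name `correct_two_pass` is hypothetical, and here the multi-word scan runs first (the "vice versa" order) so that single-word fixes cannot break up a phrase before it is matched.

```python
import re


def correct_two_pass(text, corrections):
    single = {k: v for k, v in corrections.items() if " " not in k}
    multi = {k: v for k, v in corrections.items() if " " in k}

    # Pass 1: one combined scan for all multi-word entries (longest first).
    if multi:
        keys = sorted(multi, key=len, reverse=True)
        alternation = "|".join(re.escape(k) for k in keys)
        multi_re = re.compile(r"\b(?:" + alternation + r")\b")
        text = multi_re.sub(lambda m: multi[m.group(0)], text)

    # Pass 2: hash lookup for every remaining single word.
    return re.sub(r"\w+", lambda m: single.get(m.group(0), m.group(0)), text)


corrections = {"could of": "could have", "recieve": "receive", "alot": "a lot"}
print(correct_two_pass("You could of told me, I recieve alot.", corrections))
```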

Be careful, though: pulling out word tokens is trickier than just splitting on whitespace. You don't want to miss a correction simply because the token 'instence,' (with its trailing comma) is not in your hash.
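
To illustrate the pitfall, here is a small sketch that matches words with a regular expression instead of splitting on whitespace. The pattern `WORD_RE` is an assumption that only covers ASCII letters and internal apostrophes; real documents may need a broader definition of a word.

```python
import re

# Letters, optionally joined by apostrophes (e.g. "don't"); ASCII only.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def correct_words(text, corrections):
    """Replace each matched word; punctuation and spacing stay untouched."""
    return WORD_RE.sub(lambda m: corrections.get(m.group(0), m.group(0)), text)


corrections = {"instence": "instance"}  # sample data only
text = "For instence, this works."

# Whitespace splitting yields 'instence,' which is not a key in the hash.
print([tok in corrections for tok in text.split()])
print(correct_words(text, corrections))
```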

Rob Van Dam 2009-12-24 08:39:35

A 30K hashtable is TINY, no need for a trie.

Keith Randall 2009-12-24 21:36:44

You are correct that space is probably not a major issue here. But a trie structure is still very convenient for this type of problem.

Rob Van Dam 2009-12-25 19:47:55

+2 A:

I agree with Rob's suggestion of using a trie, based on characters, because ages ago I programmed a spelling correction algorithm based on a dictionary of valid words stored as a trie. By using branch-and-bound I was able to suggest possibly correct spellings of misspelled words (by Levenshtein distance). In addition, since a trie is just a big finite-state machine, it is fairly easy to add common prefixes and suffixes, so it could handle "words" like "postnationalizationalism's".
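
For readers unfamiliar with the technique, the following is a sketch of a bounded Levenshtein search over a character trie. It is not the original program described above, only a common way to implement the general idea; the names `TrieNode`, `build_trie` and `suggest` are made up, and prefix/suffix handling is not covered.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a dictionary word


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def suggest(root, target, max_dist):
    """Return (word, distance) pairs within max_dist edits of target."""
    results = []
    first_row = list(range(len(target) + 1))

    def walk(node, ch, prev_row):
        # Compute one row of the edit-distance table for this trie edge.
        row = [prev_row[0] + 1]
        for col in range(1, len(target) + 1):
            insert_cost = row[col - 1] + 1
            delete_cost = prev_row[col] + 1
            replace_cost = prev_row[col - 1] + (target[col - 1] != ch)
            row.append(min(insert_cost, delete_cost, replace_cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Bound: descend only if some cell can still stay within max_dist.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results, key=lambda pair: (pair[1], pair[0]))


root = build_trie(["instance", "instant", "insistence", "distance"])
print(suggest(root, "instence", 1))
```

Words that share a prefix share the rows computed for that prefix, and whole subtrees are skipped once every cell in a row exceeds the bound.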

Mike Dunlavey 2009-12-25 18:37:56
