I have an HTML document, a list of common spelling mistakes, and the correct spelling for each mistake. The HTML documents will be up to ~50 pages long, and there are ~30K spelling correction entries.
What is an efficient way to correct all spelling mistakes in this HTML document?
(Note: my implementation will be in Python, in case you know of any relevant libraries.)
I have thought of 2 possible approaches:
- build hashtable of the spelling data
- parse text from HTML
- split text by whitespace into tokens
- if token in spelling hashtable replace with correction
- build new HTML document with updated text
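
The steps above can be sketched roughly as follows, using only the standard library's `html.parser`. This is a minimal sketch, not a definitive implementation: `CORRECTIONS`, `TokenCorrector`, and `correct_tokens` are hypothetical names, the sample entries are made up, and tokens with attached punctuation (e.g. `teh,`) are not matched.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample data; the real table would hold ~30K entries.
CORRECTIONS = {"teh": "the", "recieve": "receive"}


class TokenCorrector(HTMLParser):
    """Re-emit the document unchanged except for corrected text tokens.

    Caveat: text inside <script>/<style> is also passed to handle_data,
    so a real implementation would probably want to skip those elements.
    """

    def __init__(self, corrections):
        # Keep entities such as &amp; as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.corrections = corrections
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Reuse the original tag text so attributes are preserved verbatim.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Split on whitespace but keep the separators so spacing survives.
        tokens = re.split(r"(\s+)", data)
        self.out.append("".join(self.corrections.get(t, t) for t in tokens))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def correct_tokens(html, corrections):
    """Return the HTML with single-token misspellings replaced."""
    parser = TokenCorrector(corrections)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `correct_tokens('<p>I recieve teh mail</p>', CORRECTIONS)` looks up each token once in the dict, so the cost grows with the document length rather than with the number of spelling entries.
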
The first approach will fail for multi-word spelling corrections, which will exist. The following is a simpler, though seemingly less efficient, approach that will work for multi-word entries:
- iterate over the spelling data
- search for the misspelling in the HTML document
- if it exists, replace it with the correction
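
A minimal sketch of this second approach, assuming the search runs over text already extracted from the HTML (running it over raw markup could also alter tag or attribute names). `CORRECTIONS` and `correct_by_scanning` are hypothetical names, and the sample entries are made up. Word boundaries (`\b`) are an added assumption so that `teh` does not match inside a longer word.

```python
import re

# Hypothetical sample data, including multi-word entries.
CORRECTIONS = {"teh": "the", "alot": "a lot", "could of": "could have"}


def correct_by_scanning(text, corrections):
    """Apply every correction in turn; one pass over the text per entry."""
    for wrong, right in corrections.items():
        # Cheap substring check first, so the regex only runs on candidates.
        if wrong in text:
            pattern = r"\b" + re.escape(wrong) + r"\b"
            # A function replacement avoids backslash handling in `right`.
            text = re.sub(pattern, lambda _m: right, text)
    return text
```

Each entry triggers a scan of the whole text, so with ~30K entries the work is roughly the number of entries times the document length, which is the inefficiency suspected above.
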