If the focus is on performance, I would implement an algorithm based on a trie structure. Tries work well for finding words in a text or for helping to correct a word; in your case, a trie lets you quickly find all words that contain a given word, or that match it in all but one letter, for instance.
Please follow the Wikipedia link above first. A trie is one of the fastest structures for looking up words: for n words of average length a, building the trie takes O(an) and searching for s takes O(len(s)), regardless of n. If you treat word length as a bounded constant, that is O(n) to create the trie and O(1) to search s.
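To make the complexity claim concrete, here is a minimal sketch of a plain trie in C++. The names TrieNode, insert and contains are made up for illustration, and std::map is only one possible choice for the child table (it adds a log factor in the alphabet size; an array indexed by letter would avoid it).

```cpp
#include <map>
#include <memory>
#include <string>

// Illustrative trie node: one child per next letter.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool is_word = false;  // true if a stored word ends at this node
};

// Walks or creates one node per character, so the cost is O(len(word)).
void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char ch : word) {
        auto& child = node->children[ch];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_word = true;
}

// Follows one edge per character of s, so the cost depends on len(s),
// not on how many words are stored.
bool contains(const TrieNode& root, const std::string& s) {
    const TrieNode* node = &root;
    for (char ch : s) {
        auto it = node->children.find(ch);
        if (it == node->children.end()) return false;
        node = it->second.get();
    }
    return node->is_word;
}
```

After inserting car and vars, contains(root, "car") is true while contains(root, "ca") is false, because no stored word ends at the node for a.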
A fast and easy implementation (to be optimized) for your problem (similar words) consists of:
- Build the trie from the list of words, with every letter indexed both forward and backward (see example below).
- To search for s, start from s[0] and try to find the word in the trie, then start from s[1], etc.
- In the trie, if the number of letters found is at least len(s)-k, the word is displayed, where k is the tolerance (1 letter missing, 2...).
- The algorithm may be extended to the words in the list (see below).
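One way to get the "words containing s" behaviour without the forward/backward pointers is to insert every suffix of every word into an ordinary trie. This is a swapped-in, simpler technique and not exactly the structure described above; the sketch below assumes it, and SuffixNode, SubstringIndex and words_containing are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct SuffixNode {
    std::map<char, std::unique_ptr<SuffixNode>> children;
    std::set<std::size_t> word_ids;  // words with a suffix passing through this node
};

class SubstringIndex {
public:
    explicit SubstringIndex(std::vector<std::string> words)
        : words_(std::move(words)) {
        // Index every suffix of every word, so any substring of a word
        // is a path starting at the root.
        for (std::size_t id = 0; id < words_.size(); ++id)
            for (std::size_t start = 0; start < words_[id].size(); ++start)
                add_suffix(words_[id], start, id);
    }

    // Returns every stored word that contains s as a contiguous substring,
    // in the order the words were given. An empty s returns nothing.
    std::vector<std::string> words_containing(const std::string& s) const {
        const SuffixNode* node = &root_;
        for (char ch : s) {
            auto it = node->children.find(ch);
            if (it == node->children.end()) return {};
            node = it->second.get();
        }
        std::vector<std::string> result;
        for (std::size_t id : node->word_ids) result.push_back(words_[id]);
        return result;
    }

private:
    void add_suffix(const std::string& word, std::size_t start, std::size_t id) {
        SuffixNode* node = &root_;
        for (std::size_t i = start; i < word.size(); ++i) {
            auto& child = node->children[word[i]];
            if (!child) child = std::make_unique<SuffixNode>();
            node = child.get();
            node->word_ids.insert(id);
        }
    }

    std::vector<std::string> words_;
    SuffixNode root_;
};
```

With the words car, vars and vicar, words_containing("car") returns car and vicar. The price is memory: each word of length a adds up to a(a+1)/2 node visits, which is the kind of trade-off mentioned above.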
Example, with the words car and vars.
Building the trie (a capital letter means a word ends here, while another word may continue). The > is the post-index (go forward) and < is the pre-index (go backward). Another example might also need to indicate the starting letter; that is not shown here for clarity.
In C++, for instance, < and > would be Mystruct *previous, *next, meaning that from a > c < r you can go directly from a to c and back, and also from a to R.
1. c < a < R
2. a > c < R
3. > v < r < S
4. R > a > c
5. > v < S
6. v < a < r < S
7. S > r > a > v
Looking strictly for car, the trie gives you access from entry 1, and you find car. You would also have found everything starting with car, as well as anything with car inside. That case is not in the example, but vicar, for instance, would have been found from c > i > v < a < R.
To search while tolerating one wrong or missing letter, you start from each letter of s in turn and count how many letters of s you can follow in the trie, either consecutively or by skipping one letter.
Looking for car:
- c: search the trie for c < a and c < r (missing letter in s). To accept a wrong letter in a word w, try at each iteration to jump over the wrong letter and see whether ar is behind it; this is O(w). With two letters it is O(w²), etc., but another level of index could be added to the trie to account for the jumps over letters, at the cost of a more complex and memory-hungry trie.
- a, then r: same as above, but searching backwards as well.
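The tolerance idea can be sketched as a depth-first walk over a plain trie with an error budget k. This is a simplified stand-in and not the forward/backward structure from the example: it matches whole words only, and it treats a wrong letter, a letter missing from s, and a letter missing from the word as one error each. FuzzyNode, add_word, collect and fuzzy_search are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>

struct FuzzyNode {
    std::map<char, std::unique_ptr<FuzzyNode>> children;
    bool is_word = false;
};

void add_word(FuzzyNode& root, const std::string& word) {
    FuzzyNode* node = &root;
    for (char ch : word) {
        auto& child = node->children[ch];
        if (!child) child = std::make_unique<FuzzyNode>();
        node = child.get();
    }
    node->is_word = true;
}

// pos is the next unmatched index in s; budget is the number of errors left.
void collect(const FuzzyNode& node, const std::string& s, std::size_t pos,
             int budget, std::string& prefix, std::set<std::string>& out) {
    if (pos == s.size() && node.is_word) out.insert(prefix);
    for (const auto& entry : node.children) {
        const char ch = entry.first;
        const FuzzyNode& next = *entry.second;
        prefix.push_back(ch);
        if (pos < s.size() && ch == s[pos]) {
            collect(next, s, pos + 1, budget, prefix, out);      // exact match
        } else if (budget > 0 && pos < s.size()) {
            collect(next, s, pos + 1, budget - 1, prefix, out);  // wrong letter
        }
        if (budget > 0) {
            collect(next, s, pos, budget - 1, prefix, out);      // letter missing in s
        }
        prefix.pop_back();
    }
    if (budget > 0 && pos < s.size()) {
        collect(node, s, pos + 1, budget - 1, prefix, out);      // letter missing in word
    }
}

// Returns all stored words within k single-letter errors of s.
std::set<std::string> fuzzy_search(const FuzzyNode& root, const std::string& s, int k) {
    std::set<std::string> out;
    std::string prefix;
    collect(root, s, 0, k, prefix, out);
    return out;
}
```

With the words car, vars, vicar, cat, cars, bar and ar stored, fuzzy_search(root, "car", 1) yields ar, bar, car, cars and cat; vars and vicar each need two errors and are left out. The work grows quickly with k, which matches the remark above that each extra tolerated letter multiplies the cost.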
This is just to provide an idea about the principle - the example above may have some glitches (I'll check again tomorrow).