I've been looking into fuzzy string matching lately, so even at the risk of answering to an abandoned question, here goes. Hope you find this useful.
I suppose you're only interested in the strings for which the edit distance is smaller than a given value. And your q-grams (or n-grams) look like this
2-grams for "foobar": {"fo","oo","ob","ba","ar"}
You could use positional q-grams:
"foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}
The positional information can be used to determine if a matching
q-gram is really a "good match".
For example, if you're searching for
"foobar" with maximum edit distance
of 2, this means that you're only
interested in words where
2-gram "fo" exists in with position from 1 to 3 or
2-gram "oo" exists in with position from 2 to 4 or
... and so on
String "barfoo" doesn't get any
matches because the positions of the
otherwise matching 2-grams differ by
3.
Also, it might be useful to use
the relation between edit distance
and the count of matching q-grams.
The intution is that since
a string s has len(s)-q+1 q-grams
and
a single edit operation can affect at most q q-grams,
we can deduce that
strings s1 and s2 within edit distance of d have at least
max(len(s1),len(s2))-q+1-qk matching non-positional q-grams.
If you're searching for "foobar"
with an maximum edit distance of 2, a matching
7-character string (such as
"fotocar") should contain at least
two common 2-grams.
- Finally, the obvious thing to do is
to filter by lenght. The edit
distance between two strings is at
least the difference of the lengths
of the strings. For example if your
threshold is 2 and you search for
"foobar", "foobarbar" cannot
obviously match.
See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more and some pseudo SQL.