Is Levenshtein distance supposed to be
used as an absolute value?
It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than in distance per se.)
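To illustrate the difference between the raw distance and a length-aware measure, here is a minimal sketch. It assumes plain Levenshtein (no transpositions) and normalizes by the length of the longer string, which is one common choice, not the only one. The names `levenshtein` and `similarity` are just illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Both pairs have a raw distance of 1, but the longer pair is "more similar".
print(levenshtein("cat", "cut"), similarity("cat", "cut"))
print(levenshtein("international", "internationel"),
      similarity("international", "internationel"))
```

A single edit in a three-letter word changes a third of it, while the same edit in a thirteen-letter word barely matters, which is why the raw value alone can mislead.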
I am using both Daitch-Mokotoff
soundexing and Damerau-Levenshtein to
find out if a user entry and a value
in the application are "the same".
It sounds like you're trying to determine whether the user intended their entry to be the same as a given data value.
Are you doing spell-checking, or conforming invalid input to a known set of values?
What are your priorities?
- Minimize false positives (try to make sure all suggested words are very "similar", and that the list of suggestions is short)
- Minimize false negatives (try to make sure that the string the user intended is in the list of suggestions, even if that makes the list long)
- Maximize average matching accuracy
You might end up using the Levenshtein distance in two different ways: one to decide whether a word should be offered in the suggestion list at all, and another to decide how to order that list.
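As a rough sketch of that two-step idea: the function below filters candidates by raw distance and then orders the survivors by distance relative to word length. The cutoff `max_distance=2`, the case folding, and the names `edit_distance` and `suggest` are all assumptions for illustration; your own thresholds would depend on the priorities listed above.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed with a single rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev_diag, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            prev_diag, row[j] = row[j], min(row[j] + 1,              # deletion
                                            row[j - 1] + 1,          # insertion
                                            prev_diag + (ca != cb))  # substitution
    return row[-1]


def suggest(entry: str, vocabulary: list[str], max_distance: int = 2) -> list[str]:
    """Filter by absolute distance, then rank by length-relative distance."""
    scored = []
    for word in vocabulary:
        d = edit_distance(entry.lower(), word.lower())
        if d <= max_distance:  # step 1: should this word be offered at all?
            relative = d / max(len(entry), len(word), 1)
            scored.append((relative, word))
    scored.sort()  # step 2: closest relative match first, ties alphabetical
    return [word for _, word in scored]


print(suggest("aple", ["apple", "maple", "ape", "banana", "apply"]))
```

A lower `max_distance` favours fewer false positives; a higher one favours fewer false negatives at the cost of a longer list.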
It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, which take into account the length of the strings and the number of characters in common. Despite the name, both are similarity scores: 1 means an exact match and 0 means no similarity.
The Jaro distance dj of two given
strings s1 and s2 is
(m / |s1| + m / |s2| + (m - t) / m) / 3
where:
- m is the number of matching characters
- t is the number of transpositions
Jaro–Winkler distance uses a prefix
scale p which gives more favourable
ratings to strings that match from the
beginning for a set prefix length l.
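Here is a minimal sketch of both measures, following the formula quoted above. A few things in it are taken from the usual definition rather than from the quote, so treat them as assumptions: characters only count as matching if they are no further apart than half the longer length minus one; t is half the number of matching characters that appear in a different order; and the Winkler adjustment is dj + l * p * (1 - dj) with p = 0.1 and l capped at 4.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Matching characters may be at most this many positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count
    # the positions where they disagree; t is half that count.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

For "MARTHA" and "MARHTA", all six characters match and one pair (T/H) is swapped, so m = 6 and t = 1; the shared prefix "MAR" then nudges the Jaro-Winkler score up a little.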