views:

44

answers:

2

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of 4 is not so bad. If the word has 4 letters...

What I am now doing is taking the distance / length to get a distance that better reflects what percentage of the word has been changed.

Is that a valid/proven approach? Or is it plain stupid?

A: 

The levenshtein distance is a relative value between two words. Comparing the LD to the length is not relevant eg

cat -> scat = 1 (75% similar??)

difference -> differences = 1 (90% similar??)

Both these words have lev distances of 1 ie they differ by one character, but when compared to their lengths the second set would appear to be 'more' similar.

I use soundexing to rank words that have the same lev distance eg

cat and fat both have a LD of 1 relative to kat, but the word is more likely to be kat than fat when using soundex (assuming the word is incrrectly spelt, not incorrectly typed!)

So the short answer is just use the lev distance to determine the similarity.

James Westgate
I don't see how your example is demonstrating your point that "Comparing the LD to the length is not relevant." "cat" and "scat" are more different than "difference" and "differences" even though they have the same LD
Davy8
I think that in my case it does make a difference. Especially because I use soundexing... (see my comment to LarsH's answer below).
Joseph Tura
+3  A: 

Is Levenshtein distance supposed to be used as an absolute value?

It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful as for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than distance per se.)

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".

Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?

Are you doing spell-checking? or conforming invalid input to a known set of values? What are your priorities?

  • Minimize false positives (try to make sure all suggested words are very "similar", and list of suggestions is short)
  • Minimize false negatives (try to make sure that the string the user intended is in the list of suggestions, even if it makes the list long)
  • Maximize average matching accuracy

You might end up using the Levenshtein distance in one way to determine whether a word should be offered in a suggestion list; and another way to determine how to order the suggestion list.

It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, which takes into account the length of the strings and the number of characters in common:

The Jaro distance dj of two given strings s1 and s2 is

(m / |s1| + m / |s2| + (m - t) / m) / 3

where:

  • m is the number of matching characters
  • t is the number of transpositions

Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.

LarsH
As I want to find out how similar two words are (speed is not an issue), Jaro Winkler seems like a good suggestion.
Joseph Tura
@Joseph: It sounds like a good application for Jaro-Winkler, which has the nice property that it goes from 0 (no similarity) to 1 (exact match), so you can say e.g. anything over 0.9 similarity is close enough. You could then tweak that threshold based on user testing.
LarsH