Levenshtein distance based methods Vs Soundex

+5 Q:

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex.

+4 A:

Soundex is rather primitive - it was originally developed to be calculated by hand. It reduces each name to a short key, and two names are treated as similar when their keys are equal.

Soundex works well with western names, as it was originally developed for US census data. It's intended for phonetic comparison.

Levenshtein distance compares two strings and returns the minimum number of single-character insertions, deletions or substitutions needed to turn one into the other. It's looking for missing, extra or substituted letters.

Basically Soundex is better for finding that "Schmidt" and "Smith" might be the same surname.

Levenshtein distance is better for spotting that the user has mistyped "Levnshtein" ;-)
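To make the typo case concrete, here is a minimal Python sketch of the usual dynamic-programming edit distance. The function name `levenshtein` is only illustrative (it isn't taken from any particular library), and it assumes every insertion, deletion and substitution costs 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b (unit cost for each edit)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Levnshtein", "Levenshtein"))
    print(levenshtein("Schmidt", "Smith"))
```

With this sketch, "Levnshtein" is 1 edit away from "Levenshtein" (one missing letter), while "Schmidt" and "Smith" are 4 edits apart even though they sound alike.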

Keith 2008-09-03 16:09:41

@Keith:

As I posted on the other question, Daitch-Mokotoff is better for us Europeans (and I'd argue the US).

I've also read the Wiki on Levenshtein. But I don't see why (in real life) it's better for the user than Soundex.

ColinYounger 2008-09-03 16:15:12

+3 A:

erickson 2008-09-03 16:18:45

And I'd go for Double Metaphone: it returns 2 codes, one for western-sounding names and another for 'foreign' (more Slavic, IIRC) sounds.

gbjbaanb 2009-01-01 15:51:18

+2 A:

I agree with you on Daitch-Mokotoff. Soundex is biased because the original US census takers wanted 'Americanized' names.

Maybe an example of the difference would help:

Soundex puts additional weight on the start of a word - in fact the key is just the first letter plus codes for at most the next three consonant sounds. So while "Schmidt" and "Smith" will match, "Smith" and "Wmith" won't.
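A rough Python sketch of that key, assuming the common American Soundex rules (several variants exist) and plain ASCII names. The names `CODES` and `soundex` are illustrative, not from a specific library.

```python
# Letter groups that share a Soundex digit; vowels, H, W and Y get no digit.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """First letter plus up to three digits, padded with zeros."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    digits = []
    previous = CODES.get(first, "")  # the first letter's code also counts
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ("Schmidt", "Smith", "Wmith"):
        print(name, soundex(name))
```

Here "Schmidt" and "Smith" both come out as S530, while "Wmith" becomes W530, so a one-letter typo in the first position is enough to break the match.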

Levenshtein's algorithm would be better for finding typos - one or two missing or replaced letters still give a small distance (a close match), while the phonetic impact of those letters is not taken into account.

I don't think either is better, and I'd consider both a distance algorithm and a phonetic one for helping users correct typed input.

Keith 2008-09-03 16:24:08

With Levenshtein I'm trying to find spelling mistakes by checking words against a txt file filled with words. I've got to say that most of the time, even if a word is spelled correctly, I'll always get a difference.

hadith 2010-02-22 20:00:23
