views:

144

answers:

3

Will the Levenshtein distance algorithm work well for non-English language strings too?

Update: Would this work automatically in a language like Java when comparing Asian characters?

+1  A: 

Yes. But you have to treat the non-english characters as "1 character", not as multiple characters (for example with utf-8). For example, in python you would use the unicode class to represent the string (and characters).

ondra
+1  A: 

Levenshtein doesn't care about languages, it just tells you how many characters need to be changed (added, removed, exchanged) to get from one string to the other.

So: yes, but you'll have to check your charset, some foreign "single" characters my otherwise be treated as two (or more) characters.

Select0r
updated question: what if my programming language supports unicode strings?
Ryan Fernandes
+2  A: 

Only if language is letter based. For example Russian, German,... but hieroglyph (China for example) or syllable (like Laos) - not.

Dewfy
updated question: what if my programming language supports unicode strings?
Ryan Fernandes
@Ryan Fernandes Then you use instead of matrix 256 x 256 the matrix 65536 x 65536
Dewfy
@Dewfy: what is this matrix 256 x 256 that you mention???
John Machin
@John Machin - could you open wiki link from question above. It is not necessary to implement algorithm with a matrix, but from mathematical point of view Levenshtein Distance is defined at matrix of available letters. So In fact I didn't mean 256, but number of letters in expected language. The same story with unicode - you don't need to declare 65536 entries, just subset of used letters.
Dewfy
@Dewfy: "from mathematical point of view ... is defined at matrix of available letters"????. I have never seen a reference to the alphabet size ("number of letters in expected language"), it's totally irrelevant, and there certainly is no such reference in that Wikipedia article. In an implementation, you don't need to declare "subset of used letters". In fact in a language like Python, you can write a Levenshtein function that will work with sequences of any objects that can be compared for equality. BTW Unicode has more than 65536 codepoints.
John Machin
@John Machin - fully agree with you. Lucene gives good implementation of what you talking about. About unicode - yes, I know, once again - you are absolutely right. Some time it is easy to make rough assumption than describe in many words the precise thing.
Dewfy