Levenshtein distance on non-English strings

views:

144

+2 Q:

Will the Levenshtein distance algorithm work well for non-English language strings too?

Update: Would this work automatically in a language like Java when comparing Asian characters?

+1 A:

Yes, but you have to treat each non-English character as one character, not as several (which is what happens if you compare raw UTF-8 bytes, for example). In Python, for example, you would use the unicode class (str in Python 3) to represent the string and its characters.
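
To make the "1 character" point concrete, here is a minimal sketch in Python 3 (not part of the original answer). The helper name `levenshtein` and the sample strings are only illustrative. It compares the same pair of strings once as code points and once as UTF-8 bytes:

```python
def levenshtein(a, b):
    """Edit distance between two sequences, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # delete x
                               current[j - 1] + 1,       # insert y
                               previous[j - 1] + cost))  # substitute x -> y
        previous = current
    return previous[-1]

s1 = "日本語"
s2 = "日本人"

# Python 3 str is a sequence of code points, so one changed character costs 1.
by_code_point = levenshtein(s1, s2)

# The same text as UTF-8 bytes: the last character is three bytes in each
# string, and all three differ, so the distance is inflated.
by_utf8_byte = levenshtein(s1.encode("utf-8"), s2.encode("utf-8"))

print(by_code_point, by_utf8_byte)  # 1 3
```

The algorithm is identical in both calls; only the unit being compared changes.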

ondra 2010-02-17 11:08:38

+1 A:

Levenshtein doesn't care about languages; it just tells you how many characters need to be changed (added, removed, or exchanged) to get from one string to the other.

So: yes, but you'll have to check your charset; some foreign "single" characters may otherwise be treated as two (or more) characters.
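
One related case, sketched here as an illustration rather than something the answer spells out: even with Unicode strings, the same visible character can be stored either as one code point or as a base letter plus a combining mark. Normalizing both inputs first (NFC is assumed here; NFD would work as long as both sides use the same form) keeps the two spellings from counting as an edit:

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))  # 4 5
print(composed == decomposed)          # False

# After NFC normalization the two spellings become the same string.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)          # True
```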

Select0r 2010-02-17 11:10:28

updated question: what if my programming language supports unicode strings?

Ryan Fernandes 2010-02-17 11:31:03

+2 A:

Only if the language is letter-based, for example Russian or German. For logographic scripts (Chinese, for example) or syllabic ones (like Lao), no.

Dewfy 2010-02-17 11:11:10

updated question: what if my programming language supports unicode strings?

Ryan Fernandes 2010-02-17 11:30:29

@Ryan Fernandes Then instead of a 256 x 256 matrix you use a 65536 x 65536 matrix.

Dewfy 2010-02-17 12:33:15

@Dewfy: what is this matrix 256 x 256 that you mention???

John Machin 2010-07-29 08:01:57

@John Machin - could you open the wiki link from the question above? It is not necessary to implement the algorithm with a matrix, but from a mathematical point of view Levenshtein distance is defined on a matrix of available letters. So in fact I didn't mean 256, but the number of letters in the expected language. The same goes for Unicode: you don't need to declare 65536 entries, just the subset of letters actually used.

Dewfy 2010-07-29 10:26:10

@Dewfy: "from mathematical point of view ... is defined at matrix of available letters"????. I have never seen a reference to the alphabet size ("number of letters in expected language"), it's totally irrelevant, and there certainly is no such reference in that Wikipedia article. In an implementation, you don't need to declare "subset of used letters". In fact in a language like Python, you can write a Levenshtein function that will work with sequences of any objects that can be compared for equality. BTW Unicode has more than 65536 codepoints.
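
A rough sketch of the kind of function described above (the name `edit_distance` and the inputs are illustrative, not from any library). It never refers to an alphabet or its size; it only compares elements with `==`:

```python
from typing import Sequence, TypeVar

T = TypeVar("T")

def edit_distance(a: Sequence[T], b: Sequence[T]) -> int:
    """Levenshtein distance over any two sequences of equality-comparable items."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # delete x
                               current[j - 1] + 1,           # insert y
                               previous[j - 1] + (x != y)))  # substitute if different
        previous = current
    return previous[-1]

# Lists of words: one substitution (brown -> red) plus one insertion (jumps).
print(edit_distance("the quick brown fox".split(),
                    "the quick red fox jumps".split()))  # 2

# Lists of tuples work the same way.
print(edit_distance([(1, 2), (3, 4)], [(1, 2)]))         # 1

# A code point above U+FFFF is still a single element of a Python 3 str.
print(edit_distance("a\U0001F600b", "ab"))               # 1
```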

John Machin 2010-07-29 10:50:16

@John Machin - I fully agree with you. Lucene gives a good implementation of what you are talking about. About Unicode: yes, I know, and once again you are absolutely right. Sometimes it is easier to make a rough assumption than to describe the precise thing in many words.

Dewfy 2010-07-29 13:09:48
