views:

277

answers:

2

I need to implement a string-matching algorithm to determine which strings most closely match. I see the the Hamming distance is a good matching algorithm when this fixed-length is obtainable.

Is there any advantage in the quality of matching if I were to use the Levenshtein distance formula instead? I know this method is less efficient, given that it accounts for variable-length strings, but what I'm really concerned with here are the quality of the matches. Also, are there any better algorithms out there I may want to consider? I work in Java if that makes any difference.

http://en.wikipedia.org/wiki/Levenshtein_distance

http://en.wikipedia.org/wiki/Hamming_distance

Much thanks

+3  A: 

Consider the strings: "abcdefg" and "bcdefgh".

The Levenshtein distance is 2. The Hamming distance (operating on characters rather than bits) is 7.

So it really depends whether you want to treat those strings as being similar, or not. Hamming distance has its appropriate uses, but "will these strings look similar to a human being?" is not one of them.

Steve Jessop
I see. Sounds like Hamming distance is out then, as your illustration makes Levenshtein appear more appropriate. Are there any other algorithms you know of that I may want to consider?
Cuga
You say in another comment that you want d("aaaaa","kaaaa") > d("aaaaa","baaaa"). Levenshtein doesn't do that, it's 1 in both cases. I'm afraid I don't know any other algorithms that are relevant. But maybe by breaking each character up into (say) 2-bit chunks, and calculating Levenshtein distance over those chunks instead of over chars, you might get something a little closer to what you want. Still not right, though, since the change O -> P flips more bits than the change P -> Q.
Steve Jessop
Ah, how about this: you could calculate the root mean square distance between corresponding characters. ('a' - 'k')^2 is 100, ('a'-'b')^2 is only 1. This would mean that "abc" is much closer to "cba" than "amz" is to "zam", though, so you still might not get results you like.
Steve Jessop
Thanks for pointing this out! I'll use your latest comment!
Cuga
A: 

You may find interesting the Bitap algorithm.

The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates-Gonnet algorithm) is a fuzzy string searching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

Nick D