views:

120

answers:

1

I have two strings which must be compared for similarity. The algorithm must be designed to find the maximal similarity. In this instance, the ordering matters, but intervening (or missing) characters do not. Edit distance cannot be used in this case for various reasons.

The situation is basically as follows: string 1: ABCDEFG string 2: AFENBCDGRDLFG

the resulting algorithm would find the substrings A, BCD, FG

I currently have a recursive solution, but because this must be run on massive amounts of data, any improvements would be greatly appreciated

+4  A: 

Looking at your sole example it looks like you want to find longest common subsequence. Take a look at LCS

Is it just me, or is this NP-hard? – David Titarenco (from comment)

If you want LCS of arbitrary number of strings its NP-hard. But it the number of input strings is constant ( as in this case, 2) this can be done in polynomial time.

codaddict
+1 good answer :)
David Titarenco
This isn't right. given OP's example input, it would return 'ABCDFG'
aaronasterling
@Aaron: You'll have to modify the algorithm a bit to find the actual substring that contributed to the answer. But the basic idea remains very much the same.
codaddict
Thanks, I'll look into modifying an LCS algo
Peter Chang