
I have two very large strings and I am trying to find out their Longest Common Substring.

One way is using suffix trees (supposed to have very good complexity, though a complex implementation), and the other is the dynamic programming method (both are mentioned on the Wikipedia page linked above).

Using dynamic programming: [figure: the dynamic programming recurrence for the longest common substring]

The problem is that the dynamic programming method has a huge running time (complexity is O(n*m), where n and m are the lengths of the two strings).
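For reference, the dynamic programming method from the Wikipedia article can be sketched like this in Python (names here are my own; `table[i][j]` holds the length of the longest common suffix of `a[:i]` and `b[:j]`, i.e. the `LCSuff` table):

```python
def longest_common_substring(a: str, b: str) -> str:
    """O(n*m) time and O(n*m) space dynamic programming."""
    n, m = len(a), len(b)
    # table[i][j] = length of the longest common suffix of a[:i] and b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    best_len, best_end = 0, 0  # best_end is the end index of the match in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    best_len, best_end = table[i][j], i
    return a[best_end - best_len:best_end]
```

Note that the table only stores lengths; the substring itself is recovered at the end with one slice.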

What I want to know (before jumping to implement suffix trees): Is it possible to speed up the algorithm if I only want to know the length of the common substring (and not the common substring itself)?

+1  A: 

Will it be faster in practice? Yes. Will it be faster regarding Big-Oh? No. The dynamic programming solution is always O(n*m).

The problem that you might run into with suffix trees is that you trade the suffix tree's linear time scan for a huge penalty in space. Suffix trees are generally much larger than the table you'd need to implement for a dynamic programming version of the algorithm. Depending on the length of your strings, it's entirely possible that dynamic programming will be faster.

Good luck :)

Billy ONeal
@Billy ONeal: are you comparing suffix trees and dynamic programming? I am not asking for that. `What I want to know is: is there a way to make the dynamic programming algorithm faster if I only want to know the length of the common substring?`
Lazer
@eSKay: I believe the first part of my answer answers that question.
Billy ONeal
@Billy ONeal: okay, *how* can I make it faster in practice?
Lazer
@eSKay: The algorithm for finding the length and for finding the actual string are the same. The only difference is that the algorithm finding the actual string stores the actual string along with that length in the table. By not storing the string, you can go faster because you spend less time allocating strings. Therefore, it's faster, but not in terms of Big-Oh.
Billy ONeal
+2  A: 

Here are a couple of optimizations that will make it run faster in practice, though it'll still be O(nm).

One optimization is in space (which might save you a little allocation time): noticing that LCSuff only depends on the previous row, if you only care about the length you can cut the O(nm) space down to O(min(n,m)).

The idea is to keep only two rows -- the current row that you are processing, and the previous row that you just processed, and throw away the rest.
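A sketch of that two-row idea in Python (length only; the rows are sized by the shorter string, so the extra space is O(min(n, m))):

```python
def lcs_length(a: str, b: str) -> int:
    """Longest common substring length with two-row DP."""
    if len(b) > len(a):
        a, b = b, a  # make b the shorter string; rows have len(b) + 1 entries
    prev = [0] * (len(b) + 1)  # row just processed
    best = 0
    for ch in a:
        curr = [0] * (len(b) + 1)  # row being processed
        for j, cb in enumerate(b, 1):
            if ch == cb:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr  # throw away everything older than the previous row
    return best
```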

Larry
@Larry: thanks! I had already implemented this one, though. Any others that occur to you?
Lazer
The other is to implement both top-down and bottom-up. You can apply some branch and bound techniques with top-down to speed things up, and possibly skip states that will never be needed.
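Larry doesn't spell out the pruning, but one concrete instance of skipping states that can never matter (a sketch of my own, not necessarily what he has in mind) uses the fact that any common substring lies along a single diagonal of the table (constant offset i - j): scan each diagonal's runs of matching characters, and skip any diagonal that is already too short to beat the best run found so far.

```python
def lcs_length_pruned(a: str, b: str) -> int:
    """Diagonal scan with a simple bound; still O(nm) worst case."""
    n, m = len(a), len(b)
    best = 0
    for d in range(-(m - 1), n):  # d = i - j indexes the diagonals
        i, j = max(d, 0), max(-d, 0)
        if min(n - i, m - j) <= best:
            continue  # bound: this diagonal cannot hold a longer match
        run = 0
        while i < n and j < m:
            run = run + 1 if a[i] == b[j] else 0
            best = max(best, run)
            i += 1
            j += 1
    return best
```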
Larry