If I have two sequences (for example, strings)

//   01234567890123456789012  
a = "AAACDDFFFEE1122VV1VAADD"

//   0123456789012345678901
b = "DDFFAA11221DHHVV1VAAFE"

I want to know the best substring matching (unordered) from b to a, for instance:

optimal (6 matched parts, 19 characters of a matched)
b         a
DDFF   -> DDFF     (4,7)
AA     -> AA       (0,1)
1122   -> 1122     (11,14)
1     
D      -> D        (21)
HH
VV1VAA -> VV1VAA   (15,20)
FE     -> FE       (8,9)

there is another solution, but not optimal:

not optimal (8 parts matched, 19 characters of a matched)
b        a
DDFF  -> DDFF  (4,7)
AA    -> AA    (0,1)
1122  -> 1122  (11,14)
1     -> 1     (17)
D     -> D     (21)
HH
VV    -> VV    (15,16)
1     
VAA   -> VAA   (18,20)
FE    -> FE    (8,9)

What kind of algorithm is best for this problem? (I need the optimal result, and performance is critical.)

Thanks.

+1  A: 

Interesting problem, you could solve it in O(|a|.|b| + |b|^2) using Boyer-Moore ( http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm ) or KMP ( http://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm ) algorithms or any other linear time string search algorithm.

  • For each b[0..i], try to find it in the string a (in O(|a| + i)) until you can't find it anymore
  • You know you can find b[0..i] but not b[0..i+1], so you have a match for b[0..i] and you continue with b[i+1..i+1], b[i+1..i+2], ...
  • At the end every part of b has been matched or not, and if it was matched, it was as big as possible.

Total complexity is at most sum( O(|a| + i) , i=0..|b|) = O(|a|.|b| + |b|^2) but it can be much smaller if only small substrings of b can be found in a.
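As a sketch, the greedy procedure above might look like this in Python (using Python's built-in substring search in place of KMP/Boyer-Moore, so only the constants differ; the function name is mine):

```python
def greedy_parts(a, b):
    """Greedily match maximal prefixes of the remaining part of b inside a.

    Note: like the description above, this does not track which positions
    of a are consumed, so two parts may map to overlapping places in a.
    """
    parts = []
    i = 0
    while i < len(b):
        j = i + 1
        # grow b[i:j] while it still occurs somewhere in a
        while j <= len(b) and b[i:j] in a:
            j += 1
        if j == i + 1:
            i += 1                  # b[i] does not occur in a at all: skip it
        else:
            parts.append(b[i:j - 1])
            i = j - 1
    return parts

print(greedy_parts("AAACDDFFFEE1122VV1VAADD", "DDFFAA11221DHHVV1VAAFE"))
# ['DDFF', 'AA', '1122', '1', 'D', 'VV1VAA', 'FE']
```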

EDIT :

The above approach is greedy and will not minimize the number of parts, but I think it will maximize the total number of characters matched.


Thoughts on an optimal solution :

  • for each of the O(|b|^2) substrings of b, determine whether it can be found in a, and keep only those that can
  • among these strings, we need to find a subset with:
    • no overlap between any two of them
    • maximum total length
    • at equal total length, a minimum number of strings

The sum is easy to bound, because a very simple solution would be to match only substrings of size 1: the total length is then the number of letters common to a and b.

So if we add the size-1 substrings of b (even the letters that are not in a) to the set of matching strings above, we need to find a minimal set cover of b with no overlapping.

The general set-cover problem is NP-complete, but here we have the no-overlap constraint; does that help?

I'm looking into it.

Indeed, it is NP-complete: http://www.springerlink.com/content/n73561q050w54pn6/

You might want to look for approximation algorithms.
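For very short strings, the exact objective above (maximize matched characters, then minimize the number of parts, with no overlap in either string) can still be brute-forced with memoization over a bitmask of a's used positions. This is only a sketch to make the formulation concrete; it is exponential, consistent with the NP-completeness result, and all names are mine:

```python
from functools import lru_cache

def exact_match(a, b):
    """Return (matched_chars, parts) of an optimal non-overlapping matching.

    Maximizes the number of characters of b matched to disjoint
    occurrences in a; among those, minimizes the number of parts.
    Exponential-time sketch: only usable for tiny inputs.
    """
    @lru_cache(maxsize=None)
    def best(i, used):
        # best (chars, -parts) achievable for b[i:], where 'used' is a
        # bitmask of already-consumed positions of a
        if i == len(b):
            return (0, 0)
        res = best(i + 1, used)          # option: leave b[i] unmatched
        for j in range(i + 1, len(b) + 1):
            seg = b[i:j]
            p = a.find(seg)
            if p == -1:
                break                    # longer segments can't occur either
            length = j - i
            while p != -1:               # try every occurrence in a
                mask = ((1 << length) - 1) << p
                if used & mask == 0:
                    sub = best(j, used | mask)
                    cand = (sub[0] + length, sub[1] - 1)
                    if cand > res:
                        res = cand
                p = a.find(seg, p + 1)
        return res

    chars, neg_parts = best(0, 0)
    return chars, -neg_parts

# On the counterexample from the comments (optimal is 3 parts of size 3):
print(exact_match("ABAACAADA", "ADAACAABA"))
```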

Loïc Février
Thank you. I need the optimal solution for this problem (I edited my question to describe that more clearly); repeating a substring search algorithm won't give the optimal solution.
Indeed, the question of optimality wasn't clear in the first example; I understand better now. My (greedy) method will be optimal in the total number of matched characters but might have more parts than the "optimal" one.
Loïc Février
I still see only two examples in the problem statement. Is it too much to ask for a definition of a total order on the solutions? Is the number of characters matched always more important than the total number of parts? There are trivial solutions if you don't define these things clearly (one is to regard the strings as multisets of characters and output the intersection; the other extreme is the longest common substring).
piccolbo
@piccolbo I think what he wants is to maximize the number of matched characters (and the multiset would give that number), and among all solutions with that number he wants the one which minimizes the number of matched parts.
Loïc Février
Thanks, then this is a counterexample to a greedy approach on the size of the parts: ABAACAADA, ADAACAABA. The optimal solution has 3 parts of size 3, the greedy one 4 parts of sizes 4, 2, 2, 1. Nobody had suggested that this heuristic is optimal, but now we know for sure.
piccolbo
Indeed greedy is not optimal: greedy will produce the second example in the question and not the first one.
Loïc Février
A: 

If I understand your problem, you want to find a set of non-overlapping common substrings of two given strings that maximizes the total length of the common substrings and, among those, minimizes the number of common substrings. I will propose the following heuristic: find the longest common substring (LCS) of the two strings, remove it, repeat. I can't prove this is optimal, but I have a very efficient algorithm for it.

So in your example AAACDDFFFEE1122VV1VAADD DDFFAA11221DHHVV1VAAFE

LCS = VV1VAA

AAACDDFFFEE1122DD DDFFAA11221DHHFE

LCS = DDFF

AAACFEE1122DD AA11221DHHFE

LCS = 1122

AAACFEEDD AA1DHHFE

and so forth
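A quadratic sketch of this iterated-LCS heuristic (using the classic dynamic-programming longest common substring rather than the suffix-tree formulation in this answer, so it costs O(|a|·|b|) per iteration; function names are mine):

```python
def longest_common_substring(a, b):
    # classic DP: cur[j] = length of the longest common suffix
    # of a[:i] and b[:j]; track the best cell seen so far
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def iterated_lcs(a, b):
    parts = []
    while True:
        s = longest_common_substring(a, b)
        if not s:
            return parts
        parts.append(s)
        a = a.replace(s, "", 1)   # remove one occurrence from each string
        b = b.replace(s, "", 1)

print(iterated_lcs("AAACDDFFFEE1122VV1VAADD", "DDFFAA11221DHHVV1VAAFE"))
```

Note that removing a substring splices its flanking characters together, which can create adjacencies that were not in the original strings; the suffix-tree version below works on the tree instead of rebuilding strings.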

The algorithm is as follows:

1) Use the standard LCS algorithm based on suffix trees, which is:
  1.1) build the suffix tree of the two strings concatenated, with unique terminators
  1.2) mark every node with 1, 2 or both, depending on whether the subtree rooted there has leaves from one or both strings
  1.3) compute the string-depth of every node
  1.4) find the string-deepest node that is labeled both 1 and 2
2) Remove the subtree rooted at that node, and update the labels of the nodes above it
3) Repeat from 1.4

The algorithm terminates when the tree has no nodes that are labeled both 1 and 2. Step 1.1 can be done in time proportional to the sum of the lengths of the two strings; 1.2, 1.3 and 1.4 are little more than tree traversals; the removal in 2 is constant time if the tree is implemented right, and the label update is bounded by the length of the LCS; 3 is again a tree traversal, but of a smaller tree.

So this is one optimization, to avoid repeated tree traversals let's add step 1.35: sort internal nodes that have both labels by string depth (in a separate data structure, the tree is still there). Now you can scan that sorted list of nodes, perform 2) and repeat. With this optimization and if you can use radix sorting, it looks like the algorithm is linear time, and you can't beat that in an asymptotic sense.

I hope this is correct and clear enough; I am sure you will need to familiarize yourself with the suffix tree literature a little bit before it sounds obvious. I recommend Dan Gusfield's book Algorithms on Strings, Trees and Sequences, section 7.4 in particular. Let me know if you have questions.

piccolbo
This is a greedy algorithm, but it should give good results, as this kind of algorithm is used for approximating the set-cover problem (see my answer; the problem is in fact a variation of set cover).
Loïc Février
Thank you, I am not familiar with suffix trees, but I will try it later and then report the result. BTW, I am very curious about the big-O of this algorithm compared to Février's algorithm.
O(n + m) where n and m are the string lengths, but it is just a heuristic and, as Loïc said, one that gives acceptable results in many combinatorial optimization problems. The relation with set cover is not clear to me, though.
piccolbo
This is a counterexample to the greedy algorithm being optimal: ABAACAADA, ADAACAABA. Optimal should be 3,3,3, greedy is 4,2,2,1
piccolbo
How about its space complexity? I must deal with sequences whose lengths range from 3000 to 100000. A suffix tree's memory consumption seems large (I am not sure about that). Or is this not a problem?
There has been a lot of work on improving the space consumption of suffix trees. If you are OK with a log n speed penalty, you can switch to a suffix array and save a factor of 3. There are also more sophisticated approaches like compressed suffix arrays; pretty neat stuff. I am not sure what the current record holder is, but they should be down to 4X or something like that.
piccolbo