views:

292

answers:

4

While developing part of a simulator I came across the following problem. Consider a string of length N, and M substrings of this string with a non-negative score assigned to each of them. Of particular interest are the sets of substrings that meet the following requirements:

  1. They do not overlap.
  2. Their total score (by sum, for simplicity) is maximum.
  3. They span the entire string.

I understand the naive brute-force solution is of O(M*N^2) complexity. While the implementation of this algorithm would probably not impose a lot on the performance of the whole project (nowhere near the critical path, can be precomputed, etc.), it really doesn't sit well with me. I'd like to know if there are any better solutions to this problem and if so, which are they? Pointers to relevant code are always appreciated, but just algorithm description will do too.

Thanks in advance.

A: 

O(N+M) solution:

Set f[1..N]=-1
Set f[0]=0
for a = 0 to N-1
    if f[a] >= 0
        For each substring beginning at a
            Let b be the last index of the substring, and c its score
            If f[a]+c > f[b+1]
                Set f[b+1] = f[a]+c
                Set g[b+1] = [substring number]
Now f[N] contains the answer, or -1 if no set of substrings spans the string.
To get the substrings:
b = N
while b > 0
    Get substring number from g[N]
    Output substring number
    b = b - (length of substring)
jcd
That only provides the maximum score, not with the desired set of non-overlapping substrings. Maybe I should edit the question to make this point clearer.
Michael Foukarakis
It's easy to track the set by maintaining a 'substring ending here giving the current maximum' array in parallel.
wrang-wrang
Which for M substrings gives a total complexity of O(N*M + M^2). That's generally slightly better than a naive solution and worse than the O(N*logN) solution I linked, so what I'm really looking for is a yes/no answer to the question, is there something better than O(N*logN)?
Michael Foukarakis
I think you misunderstood the O(N*lg N) algorithm you linked to. It's trying to find the single subsequence in an array of ints that has the greatest sum. Since your values are all nonnegative, that algorithm would always return the whole array.
wrang-wrang
Well, not only do you have nonnegative values, but they're for intervals, not for positions. I don't even see how it makes sense to apply http://stackoverflow.com/questions/624540/algorithm-to-return-the-maximum-possible-sum-of-subsequences-in-a-sequence
wrang-wrang
Oh I see, you want the substrings as well. Edited the code to do that; it's still O(N+M).
jcd
@wrang-wrang: You're right, that post's title was very misleading.
Michael Foukarakis
There's no doubt in my mind that this (and my essentially identical version) is optimal in general. Now, if you happen to have some special properties of the substring scores, maybe you can do better.
wrang-wrang
A: 

It is not clear whether M substrings are given as sequences of characters or indeces in the input string, but the problem doesn't change much because of that.

Let us have input string S of length N, and M input strings Tj. Let Lj be the length of Tj, and Pj - score given for string Sj. We say that string

This is called Dynamic Programming, or DP. You keep an array res of ints of length N, where the i-th element represents the score one can get if he has only the substring starting from the i-th element (for example, if input is "abcd", then res[2] will represent the best score you can get of "cd").

Then, you iterate through this array from end to the beginning, and check whether you can start string Sj from the i-th character. If you can, then result of (res[i + Lj] + Pj) is clearly achievable. Iterating over all Sj, res[i] = max(res[i + Lj] + Pj) for all Sj which can be applied to the i-th character.

res[0] will be your final asnwer.

Olexiy
This algorithm is very similar to solution I have implemented, without the additional code to solve the constraint of non-overlapping strings (I used recursion, but that's off-topic). The complexity of this, if it were to be extended also, is still O(N^2 * M). I'm looking for a better solution than that. Thanks, though!
Michael Foukarakis
You should precompute which strings can be applied to which positions to get rid of N^2 and make your algo O(N*M) (you should use some faster string search algorithms).
Olexiy
I'm not currently interested in the potential for optimization, which is already large. I'm interested in an efficient algorithm. Also, this has little to do with string searching..
Michael Foukarakis
I just dont understand where did you get N^2. The algo is N*M - you check each of M strings for each of N positions. Could you please explain why do you think it is N^2*M?
Olexiy
A: 

inputs:

N, the number of chars in a string
e[0..N-1]: (b,c) an element of set e[a] means [a,b) is a substring with score c.

(If all substrings are possible, then you could just have c(a,b).)

By e.g. [1,2) we mean the substring covering the 2nd letter of the string (half open interval).

(empty substrings are not allowed; if they were, then you could handle them properly only if you allow them to be "taken" at most k times)

Outputs:

s[i] is the score of the best substring covering of [0,i)
a[i]: [a[i],i) is the last substring used to cover [0,i); else NULL

Algorithm - O(N^2) if the intervals e are not sparse; O(N+E) where e is the total number of allowed intervals. This is effectively finding the best path through an acyclic graph:

for i = 0 to N:
    a[i] <- NULL
    s[i] <- 0
a[0] <- 0
for i = 0 to N-1
    if a[i] != NULL
        for (b,c) in e[i]:
            sib <- s[i]+c
            if sib>s[b]:
                a[b] <- i
                s[b] <- sib

To yield the best covering triples (a,b,c) where cost of [a,b) is c:

i <- N
if (a[i]==NULL):
    error "no covering"
while (a[i]!=0):
    from <- a[i]
    yield (from,i,s[i]-s[from]
    i <- from

Of course, you could store the pair (sib,c) in s[b] and save the subtraction.

wrang-wrang
+2  A: 

This can be thought of as finding the longest path through a DAG. Each position in the string is a node and each substring match is an edge. You can trivially prove through induction that for any node on the optimal path the concatenation of the optimal path from the beginning to that node and from that node to the end is the same as the optimal path. Thanks to that you can just keep track of the optimal paths for each node and make sure you have visited all edges that end in a node before you start to consider paths containing it.

Then you just have the issue to find all edges that start from a node, or all substring that match at a given position. If you already know where the substring matches are, then it's as trivial as building a hash table. If you don't you can still build a hashtable if you use Rabin-Karp.

Note that with this you'll still visit all the edges in the DAG for O(e) complexity. Or in other words, you'll have to consider once each substring match that's possible in a sequence of connected substrings from start to the end. You could get better than this by doing preprocessing the substrings to find ways to rule out some matches. I have my doubts if any general case complexity improvements can come for this and any practical improvements depend heavily on your data distribution.

Ants Aasma
This actually helped a lot. I started the implementation for your solution and found out about a linear solution in the meantime. Thanks a lot though!
Michael Foukarakis