ansaurus

Question

Optimal solution for non-overlapping maximum scoring sequences

Answer 1

A:

O(N+M) solution:

Set f[1..N]=-1
Set f[0]=0
for a = 0 to N-1
    if f[a] >= 0
        For each substring beginning at a
            Let b be the last index of the substring, and c its score
            If f[a]+c > f[b+1]
                Set f[b+1] = f[a]+c
                Set g[b+1] = [substring number]
Now f[N] contains the answer, or -1 if no set of substrings spans the string.
To get the substrings:
b = N
while b > 0
    Get substring number from g[N]
    Output substring number
    b = b - (length of substring)

jcd 2009-09-11 07:06:05

That only provides the maximum score, not with the desired set of non-overlapping substrings. Maybe I should edit the question to make this point clearer.

Michael Foukarakis 2009-09-11 07:31:15

It's easy to track the set by maintaining a 'substring ending here giving the current maximum' array in parallel.

wrang-wrang 2009-09-11 07:54:37

Which for M substrings gives a total complexity of O(N*M + M^2). That's generally slightly better than a naive solution and worse than the O(N*logN) solution I linked, so what I'm really looking for is a yes/no answer to the question, is there something better than O(N*logN)?

Michael Foukarakis 2009-09-11 08:03:16

I think you misunderstood the O(N*lg N) algorithm you linked to. It's trying to find the single subsequence in an array of ints that has the greatest sum. Since your values are all nonnegative, that algorithm would always return the whole array.

wrang-wrang 2009-09-11 08:14:52

Well, not only do you have nonnegative values, but they're for intervals, not for positions. I don't even see how it makes sense to apply http://stackoverflow.com/questions/624540/algorithm-to-return-the-maximum-possible-sum-of-subsequences-in-a-sequence

wrang-wrang 2009-09-11 08:20:19

Oh I see, you want the substrings as well. Edited the code to do that; it's still O(N+M).

jcd 2009-09-11 08:44:42

@wrang-wrang: You're right, that post's title was very misleading.

Michael Foukarakis 2009-09-11 08:59:49

There's no doubt in my mind that this (and my essentially identical version) is optimal in general. Now, if you happen to have some special properties of the substring scores, maybe you can do better.

wrang-wrang 2009-09-11 18:28:09

Answer 2

A:

It is not clear whether M substrings are given as sequences of characters or indeces in the input string, but the problem doesn't change much because of that.

Let us have input string S of length N, and M input strings Tj. Let Lj be the length of Tj, and Pj - score given for string Sj. We say that string

This is called Dynamic Programming, or DP. You keep an array res of ints of length N, where the i-th element represents the score one can get if he has only the substring starting from the i-th element (for example, if input is "abcd", then res[2] will represent the best score you can get of "cd").

Then, you iterate through this array from end to the beginning, and check whether you can start string Sj from the i-th character. If you can, then result of (res[i + Lj] + Pj) is clearly achievable. Iterating over all Sj, res[i] = max(res[i + Lj] + Pj) for all Sj which can be applied to the i-th character.

res[0] will be your final asnwer.

Olexiy 2009-09-11 07:41:14

This algorithm is very similar to solution I have implemented, without the additional code to solve the constraint of non-overlapping strings (I used recursion, but that's off-topic). The complexity of this, if it were to be extended also, is still O(N^2 * M). I'm looking for a better solution than that. Thanks, though!

Michael Foukarakis 2009-09-11 07:53:34

You should precompute which strings can be applied to which positions to get rid of N^2 and make your algo O(N*M) (you should use some faster string search algorithms).

Olexiy 2009-09-11 10:50:23

I'm not currently interested in the potential for optimization, which is already large. I'm interested in an efficient algorithm. Also, this has little to do with string searching..

Michael Foukarakis 2009-09-12 07:49:03

I just dont understand where did you get N^2. The algo is N*M - you check each of M strings for each of N positions. Could you please explain why do you think it is N^2*M?

Olexiy 2009-09-12 20:57:03

Answer 3

A:

inputs:

N, the number of chars in a string
e[0..N-1]: (b,c) an element of set e[a] means [a,b) is a substring with score c.

(If all substrings are possible, then you could just have c(a,b).)

By e.g. [1,2) we mean the substring covering the 2nd letter of the string (half open interval).

(empty substrings are not allowed; if they were, then you could handle them properly only if you allow them to be "taken" at most k times)

Outputs:

s[i] is the score of the best substring covering of [0,i)
a[i]: [a[i],i) is the last substring used to cover [0,i); else NULL

Algorithm - O(N^2) if the intervals e are not sparse; O(N+E) where e is the total number of allowed intervals. This is effectively finding the best path through an acyclic graph:

for i = 0 to N:
    a[i] <- NULL
    s[i] <- 0
a[0] <- 0
for i = 0 to N-1
    if a[i] != NULL
        for (b,c) in e[i]:
            sib <- s[i]+c
            if sib>s[b]:
                a[b] <- i
                s[b] <- sib

To yield the best covering triples (a,b,c) where cost of [a,b) is c:

i <- N
if (a[i]==NULL):
    error "no covering"
while (a[i]!=0):
    from <- a[i]
    yield (from,i,s[i]-s[from]
    i <- from

Of course, you could store the pair (sib,c) in s[b] and save the subtraction.

wrang-wrang 2009-09-11 08:09:56

Answer 4

+2 A:

This can be thought of as finding the longest path through a DAG. Each position in the string is a node and each substring match is an edge. You can trivially prove through induction that for any node on the optimal path the concatenation of the optimal path from the beginning to that node and from that node to the end is the same as the optimal path. Thanks to that you can just keep track of the optimal paths for each node and make sure you have visited all edges that end in a node before you start to consider paths containing it.

Then you just have the issue to find all edges that start from a node, or all substring that match at a given position. If you already know where the substring matches are, then it's as trivial as building a hash table. If you don't you can still build a hashtable if you use Rabin-Karp.

Note that with this you'll still visit all the edges in the DAG for O(e) complexity. Or in other words, you'll have to consider once each substring match that's possible in a sequence of connected substrings from start to the end. You could get better than this by doing preprocessing the substrings to find ways to rule out some matches. I have my doubts if any general case complexity improvements can come for this and any practical improvements depend heavily on your data distribution.

Ants Aasma 2009-09-11 20:22:32

This actually helped a lot. I started the implementation for your solution and found out about a linear solution in the meantime. Thanks a lot though!

Michael Foukarakis 2009-09-12 07:56:07

ansaurus

tags:

views:

answers:

Optimal solution for non-overlapping maximum scoring sequences

related questions