ansaurus

Question

Answer 1

A:

The first thing to ask is if you want to find the tilling of {CDB, CDA}? There is no single tilling.

2009-09-17 20:48:53

or ABC + CDE + CFG

Lasse V. Karlsen 2009-09-17 20:51:10

No, I require a full overlap of one of the strings. Using my algorithm, that pair of strings would return the EMPTY string.

JG 2009-09-17 20:51:31

A simple approximate algorithm would be to build a de bruijn graph. I am thinking others.

2009-09-17 20:52:59

@Lasse V. Karlsen: That's an interesting case, as there are two possible combinations ("ABC"+"CDE" or "ABC"+"CFG"). But consider a single pairwise string tiling.

JG 2009-09-17 20:57:08

What is output of this pair {ABC, CBA}? Also empty due to ambiguity? What about this {ABC, BCA}? Prefer ABCA because overlap is longer? Do you allow imperfect matches? It is hard to go further without a complete picture of your problem.

2009-09-17 20:57:51

@lh3: "ABCBA" and "ABCA", respectively. Basically, I first try to tile at the "right" and then at the "left". I didn't consider those cases, because I'm actually tiling words (e.g. "the king" and "king is" => "the king is"), but I simplified my examples because I thought it'd be easier to explain. What do you mean by imperfect matches? Thanks for the interest, btw.

JG 2009-09-17 21:04:18

@JG: If you are mainly thinking about two strings, you may build a suffix tree for the first string and scan through the second string against the suffix tree. The time complexity is O(L) where L is the total length of the two strings.

2009-09-17 21:11:17

@JG: I realize that you may also use KMP which is easier to implement. The time complexity should also be linear.

2009-09-17 21:14:12

@lh3: I've looked into KMP, and that would indeed improve the running time of the algorithm, by removing the repeated `substring` call. Now, regarding the brujin graph, care to explain how to utilize it in this situation?

JG 2009-09-17 22:29:40

Answer 2

+1 A:

I think this should work for the tiling of two strings, and be more efficient than your current implementation using substring and contains. Conceptually I loop across the characters in the 'left' string and compare them to a character in the 'right' string. If the two characters match, I move to the next character in the right string. Depending on which string the end is first reached of, and if the last compared characters match or not, one of the possible tiling cases is identified.

I haven't thought of anything to improve the time complexity of tiling more than two strings. As a small note for multiple strings, this algorithm below is easily extended to checking the tiling of a single 'left' string with multiple 'right' strings at once, which might prevent extra looping over the strings a bit if you're trying to find out whether to do ("ABC", "BCX", "XYZ") or ("ABC", "XYZ", BCX") by just trying all the possibilities. A bit.

string Tile(string a, string b)
{
    // Try both orderings of a and b,
    // since TileLeftToRight is not commutative.

    string ab = TileLeftToRight(a, b);

    if (ab != "")
        return ab;

    return TileLeftToRight(b, a);

    // Alternatively you could return whichever
    // of the two results is longest, for cases
    // like ("ABC" "BCABC").
}

string TileLeftToRight(string left, string right)
{
    int i = 0;
    int j = 0;

    while (true)
    {
        if (left[i] != right[j])
        {
            i++;

            if (i >= left.Length)
                return "";
        }
        else
        {
            i++;
            j++;

            if (i >= left.Length)
                return left + right.Substring(j);

            if (j >= right.Length)
                return left;
        }
    }
}

Joren 2009-09-18 02:06:07

Yes, it's definitely faster, thanks.

JG 2009-09-18 11:05:12

Answer 3

+3 A:

Order the strings by the first character, then length (smallest to largest), and then apply the adaptation to KMP found in this question about concatenating overlapping strings.

Daniel 2009-09-18 02:27:49

Thanks, I was searching for tiling and alignment and couldn't find that question.

JG 2009-09-18 11:03:06

It *was* tough finding it. Fortunately, I had answered it, so it narrowed down a bit the search.

Daniel 2009-09-18 23:05:52

Answer 4

A:

Interesting problem. You need some kind of backtracking. For example if you have:

ABC, BCD, DBC

Combining DBC with BCD results in:

ABC, DBCD

Which is not solvable. But combining ABC with BCD results in:

ABCD, DBC

Which can be combined to:

ABCDBC.

Gamecat 2009-09-18 12:03:16

Yes, I need to delve into that. The alternative is to generate all `n!` permutations of the strings, and then proceed from left to right for each possible permutation, but this is obviously uber-slow.

JG 2009-09-20 00:03:25

Answer 5

+1 A:

If Open Source code is acceptable, then you should check the genome benchmarks in Stanford's STAMP benchmark suite: it does pretty much exactly what you're looking for. Starting with a bunch of strings ("genes"), it looks for the shortest string that incorporates all the genes. So for example if you have ATGC and GCAA, it'll find ATGCAA. There's nothing about the algorithm that limits it to a 4-character alphabet, so this should be able to help you.

redtuna 2009-09-21 13:21:03

Yes, it's perfectly acceptable. Thanks a lot!

JG 2009-09-21 18:44:41

ansaurus

tags:

views:

answers:

String Tiling Algorithm

related questions