views:

471

answers:

5

I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD, CDE, ABC, A, and the resulting tiled string should be ABCDE, because BCD aligns with CDE yielding BCDE, which is then aligned with ABC yielding the final ABCDE.

Currently, I'm using a slightly naïve algorithm, that works as follows. Starting with a random pair of strings, say BCD and CDE, I use the following (in Java):

public static String tile(String first, String second) {
  for (int i = 0; i < first.length() || i < second.length(); i++) {
    // "right" tile (e.g., "BCD" and "CDE")
    String firstTile = first.substring(i);
    // "left" tile (e.g., "CDE" and "BCD")  
    String secondTile = second.substring(i);
    if (second.contains(firstTile)) {
      return first.substring(0, i) + second;
    } else if (first.contains(secondTile)) {
      return second.substring(0, i) + first;
    }
  }
  return EMPTY;
}

System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ

Although this works, it's not very efficient, as it iterates over the same characters over and over again.

So, does anybody know a better (more efficient) algorithm to do this ? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) are very much welcome. Also note that I'm not looking for an alignment, but a tiling, because I require a full overlap of one of the strings over the other.

I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.

Thanks in advance.


For situations where there is ambiguity -- e.g., {ABC, CBA} which could result in ABCBA or CBABC --, any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}, which are manipulated so that the aforementioned algorithm works.

Similar question: http://stackoverflow.com/questions/1285434/efficient-algorithm-for-string-concatenation-with-overlap

A: 

The first thing to ask is if you want to find the tilling of {CDB, CDA}? There is no single tilling.

or ABC + CDE + CFG
Lasse V. Karlsen
No, I require a full overlap of one of the strings. Using my algorithm, that pair of strings would return the EMPTY string.
JG
A simple approximate algorithm would be to build a de bruijn graph. I am thinking others.
@Lasse V. Karlsen: That's an interesting case, as there are two possible combinations ("ABC"+"CDE" or "ABC"+"CFG"). But consider a single pairwise string tiling.
JG
What is output of this pair {ABC, CBA}? Also empty due to ambiguity? What about this {ABC, BCA}? Prefer ABCA because overlap is longer? Do you allow imperfect matches? It is hard to go further without a complete picture of your problem.
@lh3: "ABCBA" and "ABCA", respectively. Basically, I first try to tile at the "right" and then at the "left". I didn't consider those cases, because I'm actually tiling words (e.g. "the king" and "king is" => "the king is"), but I simplified my examples because I thought it'd be easier to explain. What do you mean by imperfect matches? Thanks for the interest, btw.
JG
@JG: If you are mainly thinking about two strings, you may build a suffix tree for the first string and scan through the second string against the suffix tree. The time complexity is O(L) where L is the total length of the two strings.
@JG: I realize that you may also use KMP which is easier to implement. The time complexity should also be linear.
@lh3: I've looked into KMP, and that would indeed improve the running time of the algorithm, by removing the repeated `substring` call. Now, regarding the brujin graph, care to explain how to utilize it in this situation?
JG
+1  A: 

I think this should work for the tiling of two strings, and be more efficient than your current implementation using substring and contains. Conceptually I loop across the characters in the 'left' string and compare them to a character in the 'right' string. If the two characters match, I move to the next character in the right string. Depending on which string the end is first reached of, and if the last compared characters match or not, one of the possible tiling cases is identified.

I haven't thought of anything to improve the time complexity of tiling more than two strings. As a small note for multiple strings, this algorithm below is easily extended to checking the tiling of a single 'left' string with multiple 'right' strings at once, which might prevent extra looping over the strings a bit if you're trying to find out whether to do ("ABC", "BCX", "XYZ") or ("ABC", "XYZ", BCX") by just trying all the possibilities. A bit.

string Tile(string a, string b)
{
    // Try both orderings of a and b,
    // since TileLeftToRight is not commutative.

    string ab = TileLeftToRight(a, b);

    if (ab != "")
        return ab;

    return TileLeftToRight(b, a);

    // Alternatively you could return whichever
    // of the two results is longest, for cases
    // like ("ABC" "BCABC").
}

string TileLeftToRight(string left, string right)
{
    int i = 0;
    int j = 0;

    while (true)
    {
        if (left[i] != right[j])
        {
            i++;

            if (i >= left.Length)
                return "";
        }
        else
        {
            i++;
            j++;

            if (i >= left.Length)
                return left + right.Substring(j);

            if (j >= right.Length)
                return left;
        }
    }
}
Joren
Yes, it's definitely faster, thanks.
JG
+3  A: 

Order the strings by the first character, then length (smallest to largest), and then apply the adaptation to KMP found in this question about concatenating overlapping strings.

Daniel
Thanks, I was searching for tiling and alignment and couldn't find that question.
JG
It *was* tough finding it. Fortunately, I had answered it, so it narrowed down a bit the search.
Daniel
A: 

Interesting problem. You need some kind of backtracking. For example if you have:

ABC, BCD, DBC

Combining DBC with BCD results in:

ABC, DBCD

Which is not solvable. But combining ABC with BCD results in:

ABCD, DBC

Which can be combined to:

ABCDBC.
Gamecat
Yes, I need to delve into that. The alternative is to generate all `n!` permutations of the strings, and then proceed from left to right for each possible permutation, but this is obviously uber-slow.
JG
+1  A: 

If Open Source code is acceptable, then you should check the genome benchmarks in Stanford's STAMP benchmark suite: it does pretty much exactly what you're looking for. Starting with a bunch of strings ("genes"), it looks for the shortest string that incorporates all the genes. So for example if you have ATGC and GCAA, it'll find ATGCAA. There's nothing about the algorithm that limits it to a 4-character alphabet, so this should be able to help you.

redtuna
Yes, it's perfectly acceptable. Thanks a lot!
JG