I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD
, CDE
, ABC
, A
, and the resulting tiled string should be ABCDE
, because BCD
aligns with CDE
yielding BCDE
, which is then aligned with ABC
yielding the final ABCDE
.
Currently, I'm using a slightly naïve algorithm, that works as follows. Starting with a random pair of strings, say BCD
and CDE
, I use the following (in Java):
public static String tile(String first, String second) {
for (int i = 0; i < first.length() || i < second.length(); i++) {
// "right" tile (e.g., "BCD" and "CDE")
String firstTile = first.substring(i);
// "left" tile (e.g., "CDE" and "BCD")
String secondTile = second.substring(i);
if (second.contains(firstTile)) {
return first.substring(0, i) + second;
} else if (first.contains(secondTile)) {
return second.substring(0, i) + first;
}
}
return EMPTY;
}
System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ
Although this works, it's not very efficient, as it iterates over the same characters over and over again.
So, does anybody know a better (more efficient) algorithm to do this ? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) are very much welcome. Also note that I'm not looking for an alignment, but a tiling, because I require a full overlap of one of the strings over the other.
I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.
Thanks in advance.
For situations where there is ambiguity -- e.g., {ABC, CBA}
which could result in ABCBA
or CBABC
--, any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}
, which are manipulated so that the aforementioned algorithm works.
Similar question: http://stackoverflow.com/questions/1285434/efficient-algorithm-for-string-concatenation-with-overlap