views:

988

answers:

10

We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"

Our current algorithm is pretty slow because it uses brute-force to identify the overlapping part between 2 strings. Does any one know an efficient algorithm to do this?

Say we have 2 strings A and B. The algorithm needs to find the longest common substring S so that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference,

public static String concat(String s1, String s2) {
 if (s1 == null)
  return s2;
 if (s2 == null)
  return s1;
 int len = Math.min(s1.length(), s2.length());

 // Find the index for the end of overlapping part
 int index = -1;
 for (int i = len; i > 0; i--) {
  String substring = s2.substring(0, i);
  if (s1.endsWith(substring)) {
   index = i;
   break;
  }
 }
 StringBuilder sb = new StringBuilder(s1);
 if (index < 0) 
  sb.append(s2);
 else if (index <= s2.length())
  sb.append(s2.substring(index));
 return sb.toString();
}
A: 

Why not just do something like this. First get the first character or word (which is going to signify the overlap) in the three columns.

Then, start to add the first string to a stringbuffer, one character at a time.

Each time look to see if you reached a part that is overlapped with the second or third string.

If so then start concatenating the string that also contains what is in the first string.

When done start, if no overlap, start with the second string and then the third string.

So in the second example in the question I will keep d and g in two variables.

Then, as I add the first string abc come from the first string, then I see that d is also in the second string so I shift to adding from the second string def are added from string 2, then I move on and finish with string 3.

If you are doing this in a database why not just use a stored procedure to do this?

James Black
Our DBA has decided that this operation is too complicated to do in stored procedure. Besides, we are moving from Sybase to MySQL. So this will be done by a batch job, which can be written in any language.
ZZ Coder
A: 

If you're doing it outside the database, try perl:

sub concat {
  my($x,$y) = @_;

  return $x if $y eq '';
  return $y if $x eq '';

  my($i) = length($x) < length($y) ?  length($x) : length($y);
  while($i > 0) {
      if( substr($x,-$i) eq substr($y,0,$i) )  {
          return $x . substr($y,$i);
      }
      $i--;
  }
  return $x . $y;
}

It's exactly the same algorithms as yours, I'm just curios if java or perl is faster ;-)

bjelli
+2  A: 

You may use a DFA. For example, a string XYZ should be read by the regular expression ^((A)?B)?C. That regular expression will match the longest prefix which matches a suffix of the XYZ string. With such a regular expression you can either match and get the match result, or generate a DFA, on which you can use the state to indicate the proper position for the "cut".

In Scala, the first implementation -- using regex directly -- might go like this:

def toRegex(s1: String) = "^" + s1.map(_.toString).reduceLeft((a, b) => "("+a+")?"+b) r
def concatWithoutMatch(s1 : String, s2: String) = {
  val regex = toRegex(s1)
  val prefix = regex findFirstIn s2 getOrElse ""
  s1 + s2.drop(prefix length)
}

For example:

scala> concatWithoutMatch("abXabXabXac", "XabXacd")
res9: java.lang.String = abXabXabXacd

scala> concatWithoutMatch("abc", "def")
res10: java.lang.String = abcdef

scala> concatWithoutMatch(concatWithoutMatch("abcde", "defgh"), "ghlmn")
res11: java.lang.String = abcdefghlmn
Daniel
Nice :) ... still, it misses out on the (potentially significant) optimisation in my implementation that discards most typical mismatches based on the first and last character of the overlap under consideration.
jerryjvl
Not really. Consider that we are matching against the second string. For each character of that second string, the match can either be done or not. Once the first non-matching character is found, you know what the maximum possible match is. This *never* has to compare a character more than once. Now, the overhead of creating and compiling the regex might be excessive.
Daniel
I may be missing something, but although it never compares a character in 's1' more than once, it may have to check the same character in 's2' twice, surely? ... for example in your first example, where 's1' contains 'XabXabXac'? ... it will have to check the first three characters of 's2' at least twice.
jerryjvl
Conversion of NFA to DFA can result in exponential increase in states. You've just moved the complexity from brute-force scanning to building DFA states.
Barry Kelly
Yes, it can result in exponential increase in states, if loops and alternatives are used. The number of states for this regular expression is equal to the size of the first string, plus the initial and the non-accepting terminal state.
Daniel
@jerryjvl: no, I won't. That's why regex is so popular.
Daniel
+1  A: 

Or you could do it in mysql with the following stored function:

DELIMITER //

DROP FUNCTION IF EXISTS concat_with_overlap //

CREATE FUNCTION concat_with_overlap(a VARCHAR(100), b VARCHAR(100))
  RETURNS VARCHAR(200) DETERMINISTIC
BEGIN 
  DECLARE i INT;
  DECLARE al INT;
  DECLARE bl INT;
  SET al = LENGTH(a);
  SET bl = LENGTH(a);
  IF al=0 THEN 
    RETURN b;
  END IF;
  IF bl=0 THEN 
    RETURN a;
  END IF;
  IF al < bl THEN
     SET i = al;
  ELSE
     SET i = bl;
  END IF;

  search: WHILE i > 0 DO
     IF RIGHT(a,i) = LEFT(b,i) THEN
    RETURN CONCAT(a, SUBSTR(b,i+1));
     END IF;
     SET i = i - 1;
  END WHILE search;

  RETURN CONCAT(a,b);
END//

I tried it with your test data:

mysql> select a,b,c,
    -> concat_with_overlap( concat_with_overlap( a, b ), c ) as result 
    -> from testing //
+-------------+---------+--------+-------------+
| a           | b       | c      | result      |
+-------------+---------+--------+-------------+
| a           | b       | c      | abc         |
| abcde       | defgh   | ghlmn  | abcdefghlmn |
| abcdede     | dedefgh |        | abcdedefgh  |
| abcde       | d       | ghlmn  | abcdedghlmn |
| abcdef      |         | defghl | abcdefghl   |
| abXabXabXac | XabXac  |        | abXabXabXac |
+-------------+---------+--------+-------------+
6 rows in set (0.00 sec)
bjelli
This is a very interesting one. I would definitely use it if we use the same database but we are moving from Sybase to MySQL.
ZZ Coder
so, do it in MySQL after you arrive ;-)
bjelli
Our new schema only has the combined column. I have to create some temporary columns to do this. It takes us long time to remove columns (days) in MySQL because the database is fairly big so we try to avoid it.
ZZ Coder
+1  A: 

How about (pardon the C#):

public static string OverlapConcat(string s1, string s2)
{
    // Handle nulls... never return a null
    if (string.IsNullOrEmpty(s1))
    {
        if (string.IsNullOrEmpty(s2))
            return string.Empty;
        else
            return s2;
    }
    if (string.IsNullOrEmpty(s2))
        return s1;

    // Checks above guarantee both strings have at least one character
    int len1 = s1.Length - 1;
    char last1 = s1[len1];
    char first2 = s2[0];

    // Find the first potential match, bounded by the length of s1
    int indexOfLast2 = s2.LastIndexOf(last1, Math.Min(len1, s2.Length - 1));
    while (indexOfLast2 != -1)
    {
        if (s1[len1 - indexOfLast2] == first2)
        {
            // After the quick check, do a full check
            int ix = indexOfLast2;
            while ((ix != -1) && (s1[len1 - indexOfLast2 + ix] == s2[ix]))
                ix--;
            if (ix == -1)
                return s1 + s2.Substring(indexOfLast2 + 1);
        }

        // Search for the next possible match
        indexOfLast2 = s2.LastIndexOf(last1, indexOfLast2 - 1);
    }

    // No match found, so concatenate the full strings
    return s1 + s2;
}

This implementation does not make any string copies (partial or otherwise) until it has established what needs copying, which should help performance a lot.

Also, the match check first tests the extremeties of the potentially matched area (2 single characters) which in normal english text should give a good chance of avoiding checking any other characters for mismatches.

Only once it establishes the longest match it can make, or that no match is possible at all, will two strings be concatenated. I have used simple '+' here, because I think the optimisation of the rest of the algorithm has already removed most of the inefficiencies in your original. Give this a try and let me know if it is good enough for your purposes.

jerryjvl
Also note that due to the 'LastIndexOf' it does not even consider all possible overlap offsets, only the ones that could potentially match, which means it could do significantly less than O(n) iterations of the `while` loop.
jerryjvl
I think it's always dangerous to use built-in functions (like LastIndexOf) and assume the implementation is super-efficient. It would be worth re-writing this so it works in any language (given that the question is language agnostic) which would mean no use of String.LastIndexOf in the implementation.That said, the basic premise of the algorithm looks sound.
Phil
I wanted the answer to stay relatively brief ;) ... implementing a 'LastIndexOf' that does an efficient character scan within the assumptions that can be made in this implementation should not be very complicated (it does not need to do any bounds checks on the second parameter).
jerryjvl
Also note that if even more performance needs to be squeezed out the character scan done on `s2` through repeated use of `LastIndexOf` may be sub-optimal if `s2` is significantly longer than `s1`, in which case you could make it special case based on the relative lengths of `s1` and `s2`, where the opposite version obviously would need to use `IndexOf` instead, etc. ... this goes a little beyond the scope of the question though.
jerryjvl
Thinking a bit more... the separate 'single-character check' to quickly eliminate mismatches can be folded into the comparison loop by iterating 'ix' in the opposite direction.
jerryjvl
+1  A: 

Here's a solution in Python. It should be faster just by not needing to build substrings in memory all the time. The work is done in the _concat function, which concatenates two strings. The concat function is a helper that concatenates any number of strings.

def concat(*args):
    result = ''
    for arg in args:
        result = _concat(result, arg)
    return result

def _concat(a, b):
    la = len(a)
    lb = len(b)
    for i in range(la):
        j = i
        k = 0
        while j < la and k < lb and a[j] == b[k]:
            j += 1
            k += 1
        if j == la:
            n = k
            break
    else:
        n = 0
    return a + b[n:]

if __name__ == '__main__':
    assert concat('a', 'b', 'c') == 'abc'
    assert concat('abcde', 'defgh', 'ghlmn') == 'abcdefghlmn'
    assert concat('abcdede', 'dedefgh', '') == 'abcdedefgh'
    assert concat('abcde', 'd', 'ghlmn') == 'abcdedghlmn'
    assert concat('abcdef', '', 'defghl') == 'abcdefghl'
FogleBird
You mean this:def _concat(a, b): la = len(a) for i in range(la): if a[i:] == b[:la-i]: return a + b[la-i:] return a + bdef _concat(a, b): la = len(a) for i in range(la): if a[i:] == b[:la-i]: return a + b[la-i:] return a + b
hughdbrown
Or run this: print "\ndef _concat(a, b):\n\tla = len(a)\n\tfor i in range(la):\n\t\tif a[i:] == b[:la-i]:\n\t\t\treturn a + b[la-i:]\n\treturn a + b\n".replace(r'\n', '\n').replace(r'\t', '\t')
hughdbrown
We need a better way to paste code in comments.
FogleBird
Yeah, but that was pretty good, right? You run it and it formats itself.
hughdbrown
Anyway, your version creates substrings which might take more time. But since a lot of that happens down in C-land, it might still run faster, I haven't checked. But the same might not happen when ported to C#.
FogleBird
I did it for the clarity, not the speed. Body of _concat: 5 lines versus 14.
hughdbrown
A: 

I think this will be pretty quick:

You have two strings, string1 and string2. Look backwards (right to left) through string1 for the first character of string2. Once you have that position, determine if there is overlap. If there isn't, you need to keep searching. If there is you need to determine if there is any possibility for another match.

To do that, simply explore the shorter of the two strings for a recurrence of the overlapping characters. ie: If the location of the match in string1 leaves a short string1 remaining, repeat the initial search from the new starting point in string1. Conversely, if the unmatched portion of string2 is shorter, search it for a repeat of the overlapping characters.

Repeat as required.

Job done!

This doesn't require much in terms of memory allocation (all searching done in place, just need to allocate the resultant string buffer) and only requires (at most) one pass of one of the strings being overlapped.

Phil
It is better to start with the longest possible overlap and work backwards from there... see my solution ;)
jerryjvl
I disagree. Your algorithm is slightly shorter, perhaps, but the implementation of String.LastIndexOf is probably implementing my algorithm.
Phil
My use of `LastIndexOf` scans for a single character... I don't see how that corresponds to your algorithm at all...
jerryjvl
There's some homework for you, then. Implement both algorithms, run them against the same set of a few million randomly generated strings and measure the relative performance of each.
Phil
I'll be keen to see the results if you do.
Phil
Okay... in trying to implement your algorithm: once the initial overlap is found, matching the recurring part will require looking in *both* string1 and string2 for the overlapped part... only if it occurs in both have you found a longer match. (Also, your solution does not use a short-cut to quickly eliminate likely mismatches) ... I'd still happily perform a benchmark, but you will have to be a little more explicit about how to efficiently handle the 'find a longer match' portion of your answer first.
jerryjvl
Not quite right. If string2 is not self-similar then you don't need to keep searching either string. If it is self-similar then you have the length of the self-similarity, and therefore can do one very direct comparison for a match - no searching required at all. There's probably an optimisation that can be made around single character overlaps (don't look for self-similarities in that instance), etc.You seem cluey, you'll work it out.PS: There are also some global optimisations to apply in the context of the originally stated problem. eg: passing an output stream. String op+ is bad.
Phil
Why look from Right to Left? Look from Left to Right, and the first match you find will be the one with the largest overlap.
Kirk Broadhurst
Okay, it appears there is a further problem... when comparing "hilarious gag" and "gag reflex" for instance, this method will find "g" as the first match, try to look further and decide that "a" does not match, so no point looking further, resulting in "hilarious gagag reflex", which is not right... I think you'll have to go with 'longest-match-first' like Kirk suggests to avoid this problem.
jerryjvl
We go right to left on the assumption that the overlap is small and that self-similarity is unlikely to be present in many string2's
Phil
Actually (working r-l), "g" will match, "a" won't match, "g" will match, then "gag" will match. No self-similarity in string2 means we're done.
Phil
Maybe I'll dust off VS.NET and see if I can produce an implementation....
Phil
OK, have done the experiment. There's nothing in it, and depends upon the data set. If string2 is self-similar jerryjvl's code is faster, if string2 is not, my code is faster - but it's only +/- 20% in either case. I haven't gone all the way with extensive testing, just far enough to get to the point where it isn't worth continuing any further.Kirk's code is at least 2x slower than either jerrjvl's or mine and has some boundary problems that can introduce an infinite loop, so debug that before you use it.
Phil
+1  A: 

I'm trying to make this C# as pleasant to read as possible.

    public static string Concatenate(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1)) return s2;
        if (string.IsNullOrEmpty(s2)) return s1;
        if (s1.Contains(s2)) return s1;
        if (s2.Contains(s1)) return s2;

        char endChar = s1.ToCharArray().Last();
        char startChar = s2.ToCharArray().First();

        int s1FirstIndexOfStartChar = s1.IndexOf(startChar);
        int overlapLength = s1.Length - s1FirstIndexOfStartChar;

        while (overlapLength >= 0 && s1FirstIndexOfStartChar >=0)
        {
            if (CheckOverlap(s1, s2, overlapLength))
            {
                return s1 + s2.Substring(overlapLength);
            }

            s1FirstIndexOfStartChar = 
                s1.IndexOf(startChar, s1FirstIndexOfStartChar);
            overlapLength = s1.Length - s1FirstIndexOfStartChar;

        }

        return s1 + s2;
    }

    private static bool CheckOverlap(string s1, string s2, int overlapLength)
    {
        if (overlapLength <= 0)
            return false;

        if (s1.Substring(s1.Length - overlapLength) == 
            s2.Substring(0, overlapLength))
            return true;

        return false;            
    }

EDIT: I see that this is almost the same as jerryjvl's solution. The only difference is that this will work with the "abcde", "d" case.

Kirk Broadhurst
This still makes string copies for the `CheckOverlap`, ... I would at the least re-implement this helper method with an in-place comparison loop.
jerryjvl
I am not sure I understand your "the only difference is ..." ... my algorithm gives the correct result for that example from the original problem...
jerryjvl
+9  A: 

All these people tinkering around with micro-optimizations are wasting time. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!

Consider Knuth Morris Pratt. It keeps track of the maximum amount of substring we have matched so far throughout. That means it knows how much of S1 has been matched at the end of S2, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.

That gives you an O(n) algorithm. Nice!

    int OverlappedStringLength(string s1, string s2) {
        //Trim s1 so it isn't longer than s2
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2); //O(n)

        int m = 0;
        int i = 0;
        while (m + i < s1.Length) {
            if (s2[i] == s1[m + i]) {
                i += 1;
                //<-- removed the return case here, because |s1| <= |s2|
            } else {
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }

        return i; //<-- changed the return here to return characters matched
    }

    int[] ComputeBackTrackTable(string s) {
        var T = new int[s.Length];
        int cnd = 0;
        T[0] = -1;
        T[1] = 0;
        int pos = 2;
        while (pos < s.Length) {
            if (s[pos - 1] == s[cnd]) {
                T[pos] = cnd + 1;
                pos += 1;
                cnd += 1;
            } else if (cnd > 0) {
                cnd = T[cnd];
            } else {
                T[pos] = 0;
                pos += 1;
            }
        }

        return T;
    }

OverlappedStringLength("abcdef", "defghl") returns 3

Strilanc
Running this against my implementation over a million randomly generated strings (from alphabet a-z, length 8-23, using the same collection of strings for both implementations) yours takes twice as long. Whether this implementation is actually better will depend significantly on the nature of the overlap in the actual strings being concatenated, so as always... ZZ Coder will need to measure against his dataset before deciding what *actually* works best.
jerryjvl
Even so, +1 for use of Knuth ;)
jerryjvl
Yeah, looks like a lot of constant time in this implementation. If the strings are the size of those in the OP, might not be worth it. Gotta profile.
FogleBird
I agree that your search will be faster for small random strings. In fact, if the input is random, you can achieve an expected-linear-time algorithm by progressively filtering the possible starting positions by checking more and more characters. But relying on the input for anything is a bad idea. What if you get long strings, repetitive strings, strings with only 0s and 1s, etc? You're algorithm's performance relies on a reasonably random input, which may or may not apply in practice (or might stop applying in the future). eg. English character frequencies hurt your performance slightly.
Strilanc
I wasn't intending my measurement or rebuttal to be generic though; keep in mind that we know very little about the actual data ZZ Coder will be running the algorithm against; without further knowledge like that we cannot say anything about which implementation is best. As I said... he'll need to measure with his actual real strings to be able to make an informed decision.
jerryjvl
My goal was to find an O(n) solution and I think this is it. Thanks!
ZZ Coder
A: 

This problem seems like a variation of the longest common sub-sequence problem, which can be solved via dynamic programming.

http://www.algorithmist.com/index.php/Longest%5FCommon%5FSubsequence

Jonathan