views:

2169

answers:

5

I need to match up two almost-the-same long freetext strings; i.e., to find index-to-index correspondences wherever possible.

Because this is freetext, the comparison should not be line-based as in code diffing.

Any suggestions for Java libraries?

A simple example (In real life , of course, there would not be extra whitespace to line things up, and there may be more complex challenges like entire clauses moved around.)

The quick brown  fox jumped over the  lazy     dog.
||||||||||      |||||||||||||||||||||         |||||
The quick yellow fox jumped over the well-bred dog.
+3  A: 

This one might be good Diff Match Patch. Others?

Joshua Fox
+2  A: 

Depending on your exact requirements, the StringUtils class of the Apache Commons Lang component might be helpful, e.g.:

  • StringUtils#difference: Compares two Strings, and returns the portion where they differ
  • StringUtils#getLevenshteinDistance: Find the Levenshtein distance between two Strings
Fabian Steeg
A: 

Here's a (lightly-tested) version of code that does what you asked. You can easily traverse the result in parallel with the inputs to locate insertions and deletions.

public class StringDiff {

    private static int   length(String s) { return s == null ? 0 : s.length(); }
    private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); }

    private final String left;
    private final String right;

    private final char[] lccs;
    private final String lcs;

    public StringDiff(String left, String right) {
        this.left = left;
        this.right = right;
        lccs = init();
        lcs = new String(lccs);
    }

    public String getLcs()  { return lcs; }
    public char[] getLccs() { return lccs.clone(); }

    private char[] init() {
        int lLength = length(left);
        int rLength = length(right);
        char[] lChars = chars(left);
        char[] rChars = chars(right);
        int [][] t = new int [lLength + 1][rLength + 1];
        for (int i = lLength - 1; i >= 0; --i) {
            for (int j = rLength - 1; j >= 0; --j) {
                if (lChars[i] == rChars[j]) {
                    t[i][j] = t[i + 1][j + 1] + 1;
                } else {
                    t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]);
                }
            }
        }
        char[] result = new char[t[0][0]];
        int l = 0, r = 0, p = 0;
        while (l < lLength && r < rLength) {
            if (lChars[l] == rChars[r]) {
                result[p++] = lChars[l++];
                r++;
            } else {
                if (t[l + 1][r] > t[l][r + 1]) {
                    ++l;
                } else {
                    ++r;
                }
            }
        }
        return result;
    }

}

According to it, the actual longest subsequence of your original inputs:

The quick brown  fox jumped over the  lazy     dog.
The quick yellow fox jumped over the well-bred dog.

is:

The quick ow fox jumped over the l dog.

(because "brown" and "yellow" have "ow" in common, etc.)

It's relatively straightforward to modify the above to split on whitespace (instead of into char arrays) and substitute String#equals for == to get a version that finds the longest common subsequence of words instead of characters. For your example above that change would produce the obvious result:

found 7 words
    'The'
    'quick'
    'fox'
    'jumped'
    'over'
    'the'
    'dog.'

(Your question implied character comparisons, as you matched the spaces between words.)

joel.neely
That is not what he asked for is it? It should be index to index based and thus not return a match for ow and yellow and brown since they are not on the same index?
willcodejavaforfood
A: 

See the pseudo-code of the Needleman-Wunsch algorithm ( http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm) used in bioinformatics to align two DNA sequence.

Pierre
A: 

If you're example is really what you want to do - ie subsequences only match if they start at the same index (which is different from how diffs normally operate) - this is all you need to do:

import java.util.*;

class StringDiff {
    public static List<int[]> from(String s1, String s2) {
     int start = -1;
     int pos = 0;
     LinkedList<int[]> list = new LinkedList<int[]>();

     for(; pos < s1.length() && pos < s2.length(); ++pos) {
      if(s1.charAt(pos) == s2.charAt(pos)) {
       if(start < 0) start = pos;
      }
      else {
       if(start >= 0) list.add(new int[] { start, pos });
       start = -1;
      }
     }

     if(start >= 0) list.add(new int[] { start, pos });

     return list;
    }

    public static void main(String[] args) {
     for(int[] idx : from(args[0], args[1]))
      System.out.println(args[0].substring(idx[0], idx[1]));
    }
}

An actual diff implementation will be far more sophisticated.

Christoph