I have been using the Python difflib library to find where 2 documents differ. The Differ().compare() method does this, but it is very slow: at least 100x slower than the diff command for large HTML documents.

How can I efficiently determine where 2 documents differ in Python? (Ideally I am after the positions of the differences rather than the actual text; positions are what SequenceMatcher().get_opcodes() returns.)

+1  A: 

An ugly and stupid solution: if diff is faster, use it. Call it from Python via subprocess and parse the command output for the information you need. This won't be as fast as running diff alone, but it may be faster than difflib.
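
A minimal sketch of that idea, assuming a `diff` executable is on the PATH and that it emits its default ("normal") output format, where each hunk starts with a header such as `3c3`, `5,6d4` or `7a8,9`. The names `parse_normal_diff` and `diff_positions` are made up for illustration; only the hunk headers are parsed, since the question asks for positions rather than text.

```python
import re
import subprocess

# Hunk headers in diff's normal output: a line range in file 1,
# an action (a = add, c = change, d = delete), and a line range in file 2.
HUNK_RE = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_normal_diff(output):
    """Return (action, (start1, end1), (start2, end2)) for each hunk header."""
    hunks = []
    for line in output.splitlines():
        m = HUNK_RE.match(line)
        if m:  # content lines start with "<", ">", "-" or "\", so they never match
            s1, e1, action, s2, e2 = m.groups()
            hunks.append((action,
                          (int(s1), int(e1 or s1)),
                          (int(s2), int(e2 or s2))))
    return hunks


def diff_positions(path1, path2):
    """Run the external diff command and return the parsed hunk positions."""
    # diff exits with 0 (identical), 1 (different) or 2 (trouble),
    # so check=True would wrongly raise on an ordinary difference.
    result = subprocess.run(["diff", path1, path2],
                            capture_output=True, text=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return parse_normal_diff(result.stdout)
```

The line numbers are 1-based, as diff prints them. For an add or delete hunk, the single number on the other side is the line after which the change applies.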

The MYYN
+2  A: 
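# Compare two files line by line and report every position where they differ.
# Note: this is a positional comparison, not a real diff; one inserted line
# makes every later line count as different.

# Binary mode keeps the reported offset a true byte offset.
with open("file1.txt", "rb") as fa, open("file2.txt", "rb") as fb:
    a = fa.readlines()
    b = fb.readlines()

pos = 0  # byte offset of the current line within file1.txt
for count, (al, bl) in enumerate(zip(a, b), start=1):
    if al != bl:
        print("files differ on line %d, byte %d" % (count, pos))
    pos += len(al)

# zip() stops at the shorter file, so report a length mismatch separately.
if len(a) != len(b):
    print("files differ in length: %d vs %d lines" % (len(a), len(b)))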
Kimvais
the 2 documents may differ at multiple locations...
Plumo
good point, fixed now.
Kimvais
Comparing by lines instead of by characters, which is what I was doing, is a good idea. When I changed Differ to compare lines instead of characters, the efficiency became comparable to the diff command!
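
For anyone after positions rather than text, here is a rough sketch of the line-based approach. It uses SequenceMatcher instead of Differ, because get_opcodes() returns index ranges directly; the function name `changed_line_ranges` is only illustrative.

```python
import difflib


def changed_line_ranges(text_a, text_b):
    """Return (tag, i1, i2, j1, j2) for every non-equal block of lines.

    Lines a[i1:i2] of the first text correspond to lines b[j1:j2] of the
    second; indices are 0-based and end-exclusive, as in get_opcodes().
    """
    # Passing lists of lines makes the matcher compare whole lines,
    # which is far cheaper than comparing character by character.
    a = text_a.splitlines(keepends=True)
    b = text_b.splitlines(keepends=True)
    # autojunk=False disables the "popular element" heuristic, which can
    # otherwise treat frequently repeated lines as junk in long inputs.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]
```

Calling it on the contents of the two documents gives one tuple per changed region, tagged "replace", "delete" or "insert".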
Plumo
+1  A: 

Google has a diff library for plain text with a Python API, which should apply to the HTML documents you want to work with. I am not sure whether it suits your particular use case, where you are specifically interested in the locations of the differences, but it is worth a look.
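
On the question of locations: as far as I know, the library's diff_main() returns a list of (op, text) pairs, where op is -1 for a deletion, 0 for equal text and 1 for an insertion. Assuming that shape, character offsets can be recovered by keeping running totals. The helper name `change_offsets` is hypothetical and not part of the library.

```python
def change_offsets(diffs):
    """Convert (op, text) pairs into (kind, offset1, offset2, length) tuples.

    offset1 and offset2 are character offsets into the first and second
    text at the point where the change starts.
    """
    pos1 = pos2 = 0
    changes = []
    for op, text in diffs:
        n = len(text)
        if op == 0:        # equal text: advance in both documents
            pos1 += n
            pos2 += n
        elif op == -1:     # deletion: text exists only in the first document
            changes.append(("delete", pos1, pos2, n))
            pos1 += n
        else:              # insertion: text exists only in the second document
            changes.append(("insert", pos1, pos2, n))
            pos2 += n
    return changes
```

With the library installed, the input would come from something like diff_match_patch().diff_main(text1, text2); the helper itself only depends on the list of pairs.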

Raja
Looks promising, but their wiki page does warn that "The diff, match and patch algorithms in this library are plain text only. Attempting to feed HTML, XML or some other structured content through them may result in problems."
Plumo