ansaurus

Question

Determine where documents differ with Python

Answer 1

+1 A:

An ugly and stupid solution: if diff is faster, use it. Call it from Python via subprocess and parse the command output for the information you need. This won't be as fast as running diff alone, but it may be faster than difflib.

The MYYN 2010-01-04 12:18:52
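A minimal sketch of the subprocess approach described above, assuming a POSIX-style diff command is on the PATH and produces its default ("normal") output format. The helper names parse_diff_output and diff_locations are hypothetical, chosen here for illustration, and the file contents are assumed to be decodable as text.

```python
import re
import subprocess

# Hunk headers in diff's default ("normal") output look like 3c3, 5,6d4 or 7a8,9.
HUNK = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_diff_output(text):
    """Return (op, a_start, a_end, b_start, b_end) for each hunk header.

    Line numbers are 1-based and inclusive, exactly as diff prints them.
    """
    hunks = []
    for line in text.splitlines():
        m = HUNK.match(line)
        if m:
            a1, a2, op, b1, b2 = m.groups()
            hunks.append((op, int(a1), int(a2 or a1), int(b1), int(b2 or b1)))
    return hunks


def diff_locations(path_a, path_b):
    """Run the external diff command (assumed installed) and parse its output."""
    proc = subprocess.run(["diff", path_a, path_b], capture_output=True, text=True)
    # Exit status: 0 = files identical, 1 = differences found, >1 = trouble.
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr.strip())
    return parse_diff_output(proc.stdout)
```

Calling diff_locations("file1.txt", "file2.txt") would return one tuple per changed region, where op is "a" (added), "c" (changed) or "d" (deleted). Content lines in this format start with "<", ">" or "---", so they never match the header pattern.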

Answer 2

+2 A:

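from itertools import zip_longest

# Read both files as bytes so that len() counts bytes, not characters.
with open("file1.txt", "rb") as fa, open("file2.txt", "rb") as fb:
    pos = 0  # byte offset in file1.txt where the current line starts
    # zip_longest pads the shorter file with None, so extra trailing
    # lines in either file are reported as differences too.
    for count, (al, bl) in enumerate(zip_longest(fa, fb), start=1):
        if al != bl:
            print("files differ on line %d, byte %d" % (count, pos))
        pos += len(al or b"")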

Kimvais 2010-01-04 12:30:02

The two documents may differ at multiple locations...

Plumo 2010-01-04 21:53:08

Good point, fixed now.

Kimvais 2010-01-05 08:50:25

Comparing by lines instead of by characters, which is what I was doing, is a good idea. Once I changed Differ to work on lines instead of characters, its speed became comparable to the diff command!

Plumo 2010-01-05 09:46:29
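To illustrate the line-based comparison mentioned in the comment above, here is a rough sketch using the standard library's difflib. It assumes "Differ" refers to difflib, and it uses difflib.SequenceMatcher rather than difflib.Differ because get_opcodes() exposes the positions directly. The function name differing_line_ranges and the sample data are made up for the example.

```python
import difflib


def differing_line_ranges(a_lines, b_lines):
    """Return (tag, a_start, a_end, b_start, b_end) for each non-equal block.

    Indices are 0-based and half-open, as returned by get_opcodes().
    """
    # autojunk=False turns off the popularity heuristic for long inputs.
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Lists of lines, e.g. as returned by readlines().
a = ["one\n", "two\n", "three\n"]
b = ["one\n", "2\n", "three\n", "four\n"]
for tag, i1, i2, j1, j2 in differing_line_ranges(a, b):
    print(tag, "a[%d:%d]" % (i1, i2), "b[%d:%d]" % (j1, j2))
```

Each item compared is a whole line, so the matcher handles far fewer elements than it would when comparing character by character.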

Answer 3

+1 A:

Google has a diff library for plain text with a Python API, which should apply to the HTML documents you want to work with. I am not sure whether it suits your particular use case, where you are specifically interested in the location of the differences, but it is worth a look.

Raja 2010-01-04 13:13:00

Looks promising, but their wiki page does warn that "The diff, match and patch algorithms in this library are plain text only. Attempting to feed HTML, XML or some other structured content through them may result in problems."

Plumo 2010-01-04 22:07:06
