views:

77

answers:

2

I'm trying to compare the data within two files, and retrieve a list of offsets of where the differences are. I tried it on some text files and it worked quite well.. However on non-text files (that still contain ascii text), I call them binary data files. (executables, so on..)

It seems to think some bytes are the same, even though when I look at it in hex editor, they are obviously not. I tried printing out this binary data that it thinks is the same and I get blank lines where it should be printed. Thus, I think this is the source of the problem.

So what is the best way to compare bytes of data that could be both binary and contain ascii text? I thought using the struct module by be a starting point...

As you can see below, I compare the bytes with the == operator

Here's the code:

import os
import math


#file1 = 'file1.txt'
#file2 = 'file2.txt'
file1 = 'file1.exe'
file2 = 'file2.exe'
file1size = os.path.getsize(file1)
file2size = os.path.getsize(file2)
a = file1size - file2size
end = file1size  #if they are both same size
if a > 0:
    #file 2 is smallest
    end = file2size
    big = file1size

elif a < 0:
    #file 1 is smallest
    end = file1size
    big = file2size


f1 = open(file1, 'rb')
f2 = open(file2, 'rb')



readSize = 500
r = readSize
off = 0
data = []
looking = False
d = open('data.txt', 'w')


while off < end:
    f1.seek(off)
    f2.seek(off)
    b1, b2 = f1.read(r), f2.read(r)
    same = b1 == b2
    print ''
    if same:
        print 'Same at: '+str(off)
        print 'readSize: '+str(r)
        print b1
        print b2
        print ''
        #save offsets of the section of "different" bytes
        #data.append([diffOff, diffOff+off-1])  #[begin diff off, end diff off]
        if looking:
            d.write(str(diffOff)+" => "+str(diffOff+off-2)+"\n")
            looking = False
            r = readSize
            off = off + 1
        else:
            off = off + r

    else:
        if r == 1:
            looking = True
            diffOff = off
            off = off + 1 #continue reading 1 at a time, until u find a same reading
        r = 1  #it will shoot back to the last off, since we didn't increment it here



d.close()
f1.close()
f2.close()          

#add the diff ending portion to diff data offs, if 1 file is longer than the other
a = int(math.fabs(a))  #get abs val of diff
if a:
    data.append([big-a, big-1])


print data
+4  A: 

Did you try difflib and filecmp modules?

This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.

The filecmp module defines functions to compare files and directories, with various optional time/correctness trade-offs. For comparing files, see also the difflib module

.

Ashish
Did you try it?
Ashish
A: 

You are probably encountering encoding/decoding problems. Someone may suggest a better solution, but you could try reading the file into a bytearray so you're reading raw bytes instead of decoded characters:

Here's a crude example:

$ od -Ax -tx1 /tmp/aa
000000 e0 b2 aa 0a
$ od -Ax -tx1 /tmp/bb
000000 e0 b2 bb 0a

$ cat /tmp/diff.py 
a = bytearray(open('/tmp/aa', 'rb').read())
b = bytearray(open('/tmp/bb', 'rb').read())
print "%02x, %02x" % (a[2], a[3])
print "%02x, %02x" % (b[2], b[3])

$ python /tmp/diff.py 
aa, 0a
bb, 0a
bstpierre