Hi,

I'm diffing a bunch of binary files, recursively.

Basically, I'm running:

diff --recursive --brief dir_a dir_b

And this tells me which files differ, and which are only present in one of the locations.

I'd like a bit more information: roughly how different the files are from one another. A percentage would do.

Is there a simple, unixy, relatively fast way to do this?

Regarding the metric

Most responders are wondering how I want to calculate the percentage, and the answer is that I really don't care. I'm thinking something along the lines of diff size over the combined size of both files would do. But if there's something else out there that uses a different metric, I'll take it. I just need a rough value.

git tends to show some sort of diff percentage for commits. Any idea what metric it uses?

A: 

I am not sure how you would want to measure percentages. You could however cook up a script which reads the output of your diff command and somehow calculates percentages. But first you need to know which metric you want to use.

Robert Klemme
A: 

Look up the program "diffstat". It will give you a better idea.

daed
+1  A: 

Since you're diffing binaries, diff or diffstat are not very useful. The notion of "difference" is also not as clear as with line-oriented text files.

One idea is to use a binary diff tool such as bsdiff or xdelta to generate a binary patch with zero compression and then compare the size of the patch to the size of the original.
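
A rough sketch of that idea with xdelta3, in case it helps. The function name delta_percent is made up for illustration, and I'm assuming that compression level -0 is close enough to "zero compression" for a rough number; check your xdelta3 version before trusting it.

```shell
#!/bin/sh
# delta_percent OLD NEW  (hypothetical helper name)
# Prints the size of the xdelta3 patch from OLD to NEW as a percentage of OLD's size.
# Assumes xdelta3 and mktemp are installed.
delta_percent() {
    old=$1
    new=$2
    patch=$(mktemp) || return 1

    # -e encode, -f overwrite the output file, -0 lowest compression level,
    # -s use OLD as the source to diff against.
    if ! xdelta3 -e -f -0 -s "$old" "$new" "$patch"; then
        rm -f "$patch"
        return 1
    fi

    # $(( ... )) strips the padding some wc implementations add.
    oldsize=$(( $(wc -c < "$old") ))
    patchsize=$(( $(wc -c < "$patch") ))
    rm -f "$patch"

    if [ "$oldsize" -eq 0 ]; then
        echo "delta_percent: $old is empty" >&2
        return 1
    fi

    awk -v p="$patchsize" -v o="$oldsize" 'BEGIN { printf "%.1f\n", 100 * p / o }'
}
```

Call it as delta_percent dir_a/some.bin dir_b/some.bin. The patch carries some header overhead, so identical files will not come out as exactly 0, and very different files can exceed 100.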

laalto
A: 

I'm not exactly sure how you want to define "how much different", but you can count the number of items in each directory and divide by the total to get a percentage:

# diff -r /tmp /home | awk -F":" '{_[$1]++}END{for(i in _) print _[i],i}'
74 Only in /tmp
29 Only in /home

The above just prints the counts. Define a metric yourself.

ghostdog74
Just how different each file is from its equivalent on the second path. I actually don't care about the files that are only in one of them at all.
kch
A: 

I guess this script prints some kind of percentage: the number of differing bytes divided by the size of the smaller file.

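#!/bin/sh
# Usage: script FILE1 FILE2
# Prints the percentage of bytes that differ, relative to the smaller file.

file1="$1"
file2="$2"

# Byte counts; redirecting into wc avoids the extra cat.
file1size=$( wc -c < "$file1" )
file2size=$( wc -c < "$file2" )

# Only the first $size bytes (the length of the smaller file) are compared.
if [ $file1size -lt $file2size ]; then
    size=$file1size
else
    size=$file2size
fi

# Guard against dividing by zero when one of the files is empty.
if [ $size -eq 0 ]; then
    echo "cannot compare: one of the files is empty" >&2
    exit 1
fi

# cmp -l prints one line per differing byte; -n (not in POSIX cmp) limits the byte count.
dc -e "
3k
$( cmp -n $size -l "$file1" "$file2" | wc -l )
$size
/
100*
p"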
Cirno de Bergerac
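
To run a byte-level comparison like this over two whole trees, and only for files present on both sides as kch asked, something along these lines might do. It is a sketch, not a drop-in tool: percent_diff and compare_trees are names invented here, and the metric is a slightly different one from the script above (differing bytes plus the length difference, over the size of the larger file), so that a size mismatch also counts.

```shell
#!/bin/sh
# percent_diff FILE1 FILE2  (hypothetical helper)
# Differing bytes in the common prefix plus the length difference,
# as a percentage of the larger file's size.
percent_diff() {
    s1=$(( $(wc -c < "$1") ))
    s2=$(( $(wc -c < "$2") ))
    if [ "$s1" -gt "$s2" ]; then max=$s1; min=$s2; else max=$s2; min=$s1; fi

    # Two empty files are identical.
    if [ "$max" -eq 0 ]; then
        echo "0.0"
        return
    fi

    # cmp -l prints one line per differing byte; the EOF notice goes to stderr.
    changed=$(( $(cmp -l "$1" "$2" 2>/dev/null | wc -l) ))
    awk -v c="$changed" -v mn="$min" -v mx="$max" \
        'BEGIN { printf "%.1f\n", 100 * (c + mx - mn) / mx }'
}

# compare_trees DIR_A DIR_B  (hypothetical helper)
# Prints "PERCENT<TAB>RELATIVE/PATH" for every regular file found in both trees.
# File names containing newlines are not handled.
compare_trees() {
    ( cd "$1" && find . -type f ) | while IFS= read -r rel; do
        # Skip files that exist only in DIR_A.
        [ -f "$2/$rel" ] || continue
        printf '%s\t%s\n' "$(percent_diff "$1/$rel" "$2/$rel")" "${rel#./}"
    done
}
```

Usage would be compare_trees dir_a dir_b | sort -n to get the least-changed files first. Note that cmp -l compares byte positions, so a single inserted byte near the start of a file makes nearly everything after it count as different.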