Hi, I want to tell whether two tarballs contain identical files, in terms of file names and file contents, ignoring metadata such as date, user, and group.

However, there are some restrictions. First, I have no control over whether metadata is included when the tar file is made; in fact, the tar files always contain metadata, so directly diffing the two tar files doesn't work. Second, some of the tar files are so large that I cannot afford to untar them into a temp directory and diff the contained files one by one. (I know that if I could untar file1.tar into file1/, I could compare them by invoking 'tar -dvf file2.tar' in file1/. But usually I cannot afford to untar even one of them.)

Any idea how I can compare the two tar files? It would be better if it could be done within shell scripts. Alternatively, is there any way to get each sub-file's checksum without actually untarring a tarball?

Thanks,

+1  A: 

Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."

Evan Hanson
Looking at the implementation, it untars the contents of the files into a temp directory, so it doesn't quite solve his problem :/
Charles Ma
+1  A: 

Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum of the contents and store it in a file within the archive itself. Then, when you want to compare two archives, you just extract these checksum files and compare them.
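
For anyone who does control archive creation, here is a minimal Python sketch of that idea using the standard tarfile and hashlib modules. Everything named here is hypothetical: MANIFEST.md5 is just a name I picked (it is not a tar convention), and create_with_manifest / read_manifest are illustrative helpers, not part of any existing tool.

```python
import hashlib
import io
import os
import tarfile

MANIFEST_NAME = "MANIFEST.md5"  # hypothetical name, chosen for this sketch


def file_md5(path, chunk_size=1 << 20):
    """Hash a file on disk in chunks so large files never sit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def create_with_manifest(tar_path, root, names):
    """Archive `names` (paths relative to `root`) plus a checksum manifest."""
    names = sorted(names)
    manifest = "".join(
        f"{file_md5(os.path.join(root, n))}  {n}\n" for n in names
    ).encode()
    info = tarfile.TarInfo(MANIFEST_NAME)
    info.size = len(manifest)
    with tarfile.open(tar_path, "w") as tar:
        # The manifest goes first so a reader can stop after one member.
        tar.addfile(info, io.BytesIO(manifest))
        for n in names:
            tar.add(os.path.join(root, n), arcname=n)


def read_manifest(tar_path):
    """Return the manifest text, or None if the archive has no manifest."""
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if member.name == MANIFEST_NAME:
                return tar.extractfile(member).read().decode()
    return None
```

Two archives built this way can then be compared with read_manifest(a) == read_manifest(b), which only touches the first member of each archive. Because the manifest is sorted and lists only names and digests, differences in timestamps or ownership do not affect the result.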


If you can afford to extract just one of the tar files, you can use the --diff option of tar to look for differences against the contents of the other tar file.


One more crude trick, if you are fine with comparing just the filenames and their sizes.
Remember, this does not guarantee that the files themselves are the same!

Execute tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything except the filename and size columns, and preferably sort the two lists too. Finally, just diff the two lists.

Just remember that this last scheme does not actually compute any checksums.

Sample tar and output (all files are zero size in this example).

$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/

Command to generate sorted name/size list

$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
0 dir1/
0 dir1/file1
0 dir1/file2
0 dir2/
0 dir2/file1
0 dir2/file3
0 dir3/

You can take two such sorted lists and diff them.
You can also use the date and time columns if that works for you.
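
If parsing the tar tvf columns feels fragile (names with spaces will confuse the awk step), the same name/size heuristic can be sketched in Python with the standard tarfile module. This is only a sketch under my own assumptions: name_size_listing and quick_diff are made-up names, and unlike the awk pipeline above it looks at regular files only and skips directory entries. Nothing is extracted to disk.

```python
import tarfile


def name_size_listing(tar_path):
    """Return {member name: size in bytes} for the regular files in a tarball."""
    listing = {}
    # "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                listing[member.name] = member.size
    return listing


def quick_diff(path_a, path_b):
    """Return (only_in_a, only_in_b, size_differs) as sorted lists of names."""
    a = name_size_listing(path_a)
    b = name_size_listing(path_b)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    size_differs = sorted(n for n in a.keys() & b.keys() if a[n] != b[n])
    return only_a, only_b, size_differs
```

Three empty lists mean the names and sizes match, which, as said above, still does not prove the contents are identical.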

nik
Thanks a lot, but I have no control of the creation of the tarballs:(
myjpa
That's unfortunate. But you do have a Python solution, and it saves you the disk space that extraction would use. My other two solutions would be useful as heuristics to try when you want speed.
nik
In fact, if you suspect with high likelihood that the two archives differ, you could use the last solution suggested in my answer for fast results. It will always catch files that were added or removed, and if a file changes, its size typically changes too.
nik
Yes, I agree. This is a quick way to tell whether the number of files or their sizes have changed.
myjpa
A: 

tarsum is almost what you need. Take its output, run it through sort to get the ordering identical on each, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
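
To give a rough idea of what the whole job might look like in one program, here is a minimal sketch using only the standard tarfile and hashlib modules. To be clear, this is not tarsum's actual code; member_digests and compare_tarballs are hypothetical names, and the sketch assumes member names are unique within each archive (if a name repeats, the last entry wins). File contents are streamed through the hash in chunks, so nothing is written to disk.

```python
import hashlib
import tarfile


def member_digests(tar_path, chunk_size=1 << 20):
    """Map each regular file's name to the MD5 hex digest of its content."""
    digests = {}
    # "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, devices
            stream = tar.extractfile(member)
            md5 = hashlib.md5()
            # Read in chunks so a huge member never has to fit in memory.
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                md5.update(chunk)
            digests[member.name] = md5.hexdigest()
    return digests


def compare_tarballs(path_a, path_b):
    """Compare two tarballs by name and content, ignoring all metadata."""
    a = member_digests(path_a)
    b = member_digests(path_b)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "content_differs": sorted(
            n for n in a.keys() & b.keys() if a[n] != b[n]
        ),
    }
```

If all three lists in the result are empty, the two tarballs hold the same file names with the same contents, regardless of member order, dates, users, or groups.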

Greg Smith
Yes, I think it is helpful; the code is very straightforward. The only catch is that I have to use Python.
myjpa
Doing a comparison between two tarballs requires creating a pair of lists of (file, md5) entries and computing the difference between the two lists. That's really painful to write in straight shell, while it is trivial to do in Python or Perl. That's why you're unlikely to find a straight shell answer here: it's exactly the kind of problem that motivated creating those languages. If you don't want to go completely crazy writing this thing, you'd be far better off starting with tarsum (or the tardiff Perl code) and customizing it for your specific needs than using straight shell.
Greg Smith