How do you compare the content of two archive files programmatically?

+3 Q:

How do you compare the content of two archive files programmatically?

Hi there, I'm doing some testing to ensure that the all-in-one zip file I create with a script has the same content as several zip files that I must create manually by clicking through a web interface. The zips will therefore have different folder structures.

Of course I can manually extract them and use my powerful eyeball technique to scan them, or, even lazier, write a script to do that. But before I invest more time and get accused by my boss of robbing company time, I'm asking whether there's a better way to do this.

I'm using Perl on a LAMP stack, by the way. Thanks.

+1 A:

I can wholeheartedly recommend Beyond Compare. Unless you're really getting underpaid, it's the biggest bang for your (boss's) buck.

[Edit] I seem to have skimmed over the different folder structure, sorry about that. Beyond Compare can compare all files in folders with the same folder structure. It does not have (I believe) the intelligence to go searching for matching files in different folders.

Regards,
Lieven

Lieven 2009-02-12 09:23:19

@Lieven Does it do archive comparison? And how do I link it up to my Perl script? Thanks.

melaos 2009-02-12 09:26:36

It does archive comparison. You can drive BC from the command line. I assume that will be doable in Perl (I don't know Perl). The problem will be your different folder structure...

Lieven 2009-02-12 09:29:58

@Lieven, yeah, I think the different folder structure is the killer here :(

melaos 2009-02-12 09:39:11

@melaos I believe that flattening the hierarchy as SDX2000 mentioned is the best way to go then.

Lieven 2009-02-12 09:53:16

+1 A:

Taking a cue from Carra's answer: if A.zip is your single big archive and B.zip is the archive generated through the web, then use the following algorithm:

1. Extract all files from A.zip. Recursively (with respect to folders) compute the checksum of every file under the folder where the contents were extracted (using cksum, md5sum, etc.), sort the output (pipe it through sort), and save it to a file (say A.txt).
2. Do the same for B.zip and generate B.txt.
3. Compare A.txt with B.txt; they should be exactly the same.
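The steps above could be sketched roughly as follows in Python (chosen here only because it is mentioned elsewhere in this thread; the same idea works in Perl or shell). This is a minimal sketch, not the answerer's own code. It assumes both archives have already been extracted, and the function names and directory names are hypothetical. Only the base name of each file is kept, which is an added assumption so that differing folder layouts do not cause mismatches.

```python
import hashlib
import os


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root):
    """Walk an extracted tree and return sorted 'digest  basename' lines.

    Only the base name is kept, so differing folder layouts do not matter.
    """
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lines.append(f"{file_md5(os.path.join(dirpath, name))}  {name}")
    return sorted(lines)


# Hypothetical directories where A.zip and B.zip were extracted:
# same = checksum_manifest("A_extracted") == checksum_manifest("B_extracted")
```

The two returned lists play the role of A.txt and B.txt; writing them to files and running diff on them would show which entries differ.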

Use unzip -l to get the file/directory lists for both zip archives, then flatten the hierarchy of the user-generated zip file and compare it with the contents of your script-generated zip file using something like diff. By flattening the hierarchy I mean you may need to do some kind of preprocessing on one or both lists before you can do a meaningful comparison with diff.
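As a rough illustration of the listing-and-flattening idea, here is a sketch that uses Python's zipfile module in place of parsing `unzip -l` output, and difflib in place of the diff command. Both substitutions are assumptions made for the example, and the function names are hypothetical. "Flattening" is interpreted here as dropping the directory part of each entry name.

```python
import difflib
import posixpath
import zipfile


def flat_names(zip_path):
    """Return the sorted base names of all file entries in an archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            posixpath.basename(info.filename)
            for info in archive.infolist()
            if not info.is_dir()
        )


def diff_listings(script_zip, web_zip):
    """Return unified-diff lines between the two flattened listings."""
    return list(
        difflib.unified_diff(
            flat_names(script_zip),
            flat_names(web_zip),
            fromfile=script_zip,
            tofile=web_zip,
            lineterm="",
        )
    )
```

An empty result means both archives contain the same set of file names. Note that this compares names only, not file contents.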

SDX2000 2009-02-12 09:26:13

@SDX2K Yeah, I thought about that too, but I was looking for some simple hack before I write my own. Thanks :)

melaos 2009-02-12 09:27:43

You are welcome :)

SDX2000 2009-02-12 09:30:50

+1 A:

Create a CRC checksum for your files.

If the checksum is the same for the original files and the unzipped files, you can be practically sure the files are the same. This even works for non-text data.

A checksum can easily be created with an external program such as "SFV Checker" or programmatically (.NET and Java, for example, include libraries to do this).
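For illustration, computing a CRC programmatically might look like the sketch below. It uses Python's standard zlib module rather than the .NET or Java libraries mentioned above, and the function name is hypothetical. The file is read in chunks so that large files do not have to fit in memory.

```python
import zlib


def file_crc32(path, chunk_size=65536):
    """Return the CRC-32 of a file as an unsigned integer."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in to continue the checksum.
            crc = zlib.crc32(chunk, crc)
    return crc
```

Two files with different CRC-32 values are certainly different; equal values make identical content very likely but, as with any checksum, not strictly guaranteed.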

Carra 2009-02-12 09:29:27

@Carra So in my case, let's say there are three original zip files, and using my script I now have one big zip file. How do I do it using checksums? Thanks.

melaos 2009-02-12 09:31:21

@melaos I think he meant that you need to extract all your constituent files and then compute a checksum on them, maybe keyed on file names or without them.

SDX2000 2009-02-12 09:34:53

On Linux you may try `cksum` or `md5sum` to generate checksums.

SDX2000 2009-02-12 09:35:51

But you should be able to get the checksums without extracting, too: `unzip -l` does not show them, but `unzip -v` lists the CRC-32 of each entry.

SDX2000 2009-02-12 09:37:18

@SDX2K Well, if I have to extract them, it means I need to loop through each dir and each file to compare them one by one, right? Thanks.

melaos 2009-02-12 09:37:34

I have updated my answer. Please see that.

SDX2000 2009-02-12 09:46:03

Well, extract all files from your big zip file into *one folder*. Extract the three small zip files and put all their files into another folder. Now create checksums for all the files in the first dir and for all the files in the second dir. If the checksums match, your files are the same.

Carra 2009-02-12 09:51:28

+2 A:

You can use Perl's Archive::Zip or Python's zipfile to extract the filenames, sizes and CRC checksums of the files in the archives. Create a file which contains the results sorted by file name (ignore the path).

For your smaller ZIPs, merge the results of the script (cat list1 list2 list3 | sort).

Now, you can use diff to compare the results.
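A minimal sketch of this approach with Python's zipfile, under the assumption that base name, uncompressed size and stored CRC-32 together are enough to identify a file. The function names are hypothetical, and the sorted lists are compared in memory instead of being written to files and passed to diff. Nothing is extracted; the values come from the archive's own directory.

```python
import posixpath
import zipfile


def entry_records(zip_path):
    """Return (basename, size, crc) tuples for every file entry, path ignored."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (posixpath.basename(info.filename), info.file_size, info.CRC)
            for info in archive.infolist()
            if not info.is_dir()
        ]


def same_content(big_zip, small_zips):
    """Compare the big archive against the merged records of the small ones."""
    merged = []
    for path in small_zips:
        merged.extend(entry_records(path))
    return sorted(entry_records(big_zip)) == sorted(merged)
```

A call such as `same_content("all.zip", ["part1.zip", "part2.zip", "part3.zip"])` (hypothetical file names) would then report whether the big archive matches the three small ones. To see which entries differ, the two sorted lists can be written out one record per line and compared with diff.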

Aaron Digulla 2009-02-12 10:38:06

http://search.cpan.org/perldoc?Archive::Zip

Brad Gilbert 2009-02-12 14:58:42
