I never really looked into it, but I just realized that I can't easily build two identical .jar files.

I mean, if I build twice, without changing anything, I get the exact same size but different checksums for the .jar.

So I quickly ran some tests (basically unzipping, sort -n -k 5'ing and then diff'ing) and saw that all the files inside the .jar files were identical, yet the .jar files themselves were different.

So I did a test with a plain .zip file and found this:

... $ zip 1.zip a.txt
... $ zip 2.zip a.txt
... $ ls -l ?.zip
-rw-rw-r-- 1 webinator webinator 147 2010-07-21 13:09 1.zip
-rw-rw-r-- 1 webinator webinator 147 2010-07-21 13:09 2.zip

(exact same .zip file size)

... $ sha1sum ?.zip
db99f6ad5733c25c0ef1695ac3ca3baf5d5245cf  1.zip
eaf9f0f92eb2ac3e6ac33b44ef45b170f7984a91  2.zip

(different SHA-1 sums, let's see why)

$ hexdump 1.zip -C > 1.txt

$ hexdump 2.zip -C > 2.txt

$ diff 1.txt 2.txt 
3c3
< 00000020  74 78 74 55 54 09 00 03  ab d4 46 4c 4e d5 46 4c  |txtUT.....FLN.FL|
---
> 00000020  74 78 74 55 54 09 00 03  ab d4 46 4c 5d d5 46 4c  |txtUT.....FL].FL|

Unzipping either zip file gives back the same original file.

Question: why is that? (I'll answer myself)

+2  A: 

(Answering my own question) It is because the .zip file format saves per-file timestamps in its headers: the modification time and, in the extra fields written by zip, the access time as well.
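To see which timestamp actually changed, here is a small sketch that decodes the bytes from the hexdump in the question. It assumes the 13 bytes starting at offset 0x23 (`55 54 09 00 03 ...`) are an Info-ZIP "UT" extended-timestamp extra field, which is what the `UT` tag and the length of 9 suggest. The helper name `parse_ut_field` is made up for this example.

```python
import struct
from datetime import datetime, timezone


def parse_ut_field(data: bytes) -> dict:
    """Parse an Info-ZIP 'UT' (0x5455) extended-timestamp extra field.

    Layout assumed here: 2-byte tag, 2-byte little-endian payload size,
    1 flag byte, then one 4-byte Unix time per flag bit that is set.
    """
    header_id, size = struct.unpack_from("<2sH", data, 0)
    if header_id != b"UT":
        raise ValueError("not a UT extra field")
    flags = data[4]
    offset = 5
    end = 4 + size  # first byte after the payload
    times = {}
    for bit, name in ((0x01, "mtime"), (0x02, "atime"), (0x04, "ctime")):
        # Only read a value if the flag is set and the bytes are really there.
        if flags & bit and offset + 4 <= end:
            (times[name],) = struct.unpack_from("<I", data, offset)
            offset += 4
    return times


# Bytes copied from offset 0x23 of the two hexdumps in the question.
ut_1 = bytes.fromhex("5554090003abd4464c4ed5464c")
ut_2 = bytes.fromhex("5554090003abd4464c5dd5464c")

for label, raw in (("1.zip", ut_1), ("2.zip", ut_2)):
    for name, value in parse_ut_field(raw).items():
        stamp = datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
        print(label, name, value, stamp)
```

If that reading of the bytes is right, the modification time is the same in both archives and only the access time differs, by 15 seconds. That would fit with a.txt being read again when the second archive was made.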

If you really do want to create two identical .zip (or .jar) files, you have to make sure the second one records exactly the same timestamps as the first one.
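One way to pin the timestamps is to write the archive yourself instead of letting the tool read them from the filesystem. The following is only a sketch using Python's standard `zipfile` module, not what `zip` or `jar` do internally. `FIXED_TIME` and `build_zip` are names invented for this example, and the pinned date is arbitrary.

```python
import hashlib
import io
import zipfile

# Arbitrary pinned timestamp (assumption for this sketch); any fixed value
# works as long as every build uses the same one.
FIXED_TIME = (2010, 7, 21, 13, 9, 0)


def build_zip(files: dict) -> bytes:
    """Build a zip in memory whose bytes depend only on names and contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Sort the names so the entry order does not depend on dict order.
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.external_attr = 0o644 << 16  # fixed permissions as well
            zf.writestr(info, files[name])
    return buf.getvalue()


first = build_zip({"a.txt": b"hello\n"})
second = build_zip({"a.txt": b"hello\n"})
print(hashlib.sha1(first).hexdigest())
print(hashlib.sha1(second).hexdigest())
print("identical:", first == second)
```

Both builds should give the same SHA-1 on the same machine and Python version, since nothing from the clock or the filesystem goes into the archive. With the command-line tool, the `-X` option (no extra file attributes) should leave out the extra timestamp field, but the regular modification time in the header is still stored, so the input files would also need a fixed mtime.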

Webinator
Then it _IS_ deterministically created...
Thorbjørn Ravn Andersen
@Thorbjørn Ravn Andersen: sure, if you can precisely predict at which second all your classes will have been compiled and zipped together ;)
Webinator
@Webinator I think you're confusing deterministic with identical... they aren't the same. Deterministic means constructed in the same fashion every time, not necessarily identical bytes. You could easily do a binary diff on the files and see that all that has changed is the timestamps (that's something one of our major clients has to do to get new dependencies checked into their dep-repo... it is a pain for them, but they do it because they need to guarantee that these files with different hashes are identical).
glowcoder
@Webinator, prediction is not the issue here. If you can create a set of input data that is guaranteed to produce your output data (this includes the value of the system clock), then it is deterministic.
Thorbjørn Ravn Andersen