views: 366
answers: 10

I would like to know if compression algorithms always generate unique output for two different sets of files.

Say I have two files A and B, and say I apply a compression algorithm (for instance PKZIP - this could be any compression algorithm) to each of these files to get A.zip and B.zip respectively. Is it possible for A.zip to be exactly identical to B.zip at the binary level for some combination of A and B? If this is not possible, can we safely assume compression to be equivalent to cryptographic hashing when it comes to guaranteeing uniqueness? On the other hand, if it is possible, could you please provide a sample A and B file, along with the compression algorithm to use, so I can verify this duplication?

+12  A: 

It is not possible. If the compressed files were identical, how could they generate different results when you decompress them?

Mark Ransom
Clear and simple: +1. Note this applies for lossless compression only (which the OP suggests by talking about PKZIP, but doesn't explicitly mention).
j_random_hacker
When I wrote the answer I didn't even consider the possibility of lossy compression, due to the way the question was worded. Thanks for the clarification.
Mark Ransom
+1  A: 

It should be obvious: if the compressed files are identical, how could the decompressor know whether to make A or B out of them?

This doesn't make a usable hash, though, as the length will be variable.

Loren Pechtel
+21  A: 

Lossless compression (as used in ZIP files) will always produce different outputs for different files - otherwise you would be unable to reliably recover the original data. However, the output data may be of any size - and for some inputs, it will be larger than the original input. As such, this isn't usually very useful as a hash, which generally requires a fixed-size output.

Lossy compression (eg, MP3, JPEG, etc) can produce the same output for different inputs - as such, you cannot recover the original data, but instead get something similar to it. Because of this, the pigeonhole principle isn't a problem, so a lossy compressor can guarantee a reduction in size, often even hitting a requested output size. However, because similar but slightly different inputs will often produce the same output, this isn't useful for hashing either, as a cryptographic hash requires small changes in the input to produce large changes in the output.
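
A quick way to see both points, sketched here with Python's built-in zlib module (any lossless compressor would behave the same way):

    import os
    import zlib

    a = b"the quick brown fox jumps over the lazy dog " * 100
    b = a[:-1] + b"!"                      # differs from a in a single byte

    ca, cb = zlib.compress(a), zlib.compress(b)
    print(ca == cb)                        # False: a lossless compressor cannot merge two inputs
    print(len(ca), len(cb))                # output length varies with the input, unlike a hash

    r = os.urandom(1000)                   # incompressible random data
    print(len(zlib.compress(r)) > len(r))  # usually True: the output can exceed the input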

bdonlan
+1 for pigeonhole principle because I'm a sucker for math. However, does this address the cryptographic hash question?
Jeremy Powell
Sure. Lossless doesn't work because it's variable-size, lossy because small changes don't result in large hash changes (avalanche effect).
bdonlan
@bdonlan, what's the requirement that hashes have a fixed length? Also, the idea of 'losing' information (ie, lossy) doesn't stop an algorithm from being a good hash. MD5 or SHA-1 are lossy compression algorithms, are they not? I think the important thing to note here is that all crypto hash functions are compression algorithms, but not the other way around (crypto hash functions must be "hard" to invert). And, after saying that, I do note that this somewhat contradicts my answer below :P
Jeremy Powell
I never said losing information prevents something from being a good hash. Indeed, any good hash loses ALL information (ie, you cannot recover any information about the original message at all). Also, generally hashes are smaller than the input message, which cannot be ensured with a lossless compression algorithm.
bdonlan
+1  A: 

Compression functions are required to be injective, that is, each input maps to a unique output. If this weren't true, how would the algorithm know whether to decompress back to A or B?

Note that this is only true for lossless (data) compression. It's possible to compress 2 images, for instance, and get the same result, but only if the images were very close to begin with.

rlbond
+1  A: 

Well, your question is kinda general, but since you indicate file-based compression algorithms (your pkzip tag, for one thing), then no. There is no way a lossless compression algorithm can produce the same output from two different inputs.

However, for lossy compression algorithms, like JPEG, then sure, that's of course a possibility, but then the files would be nearly identical to begin with.

For instance, take a .PNG file, save it as .JPEG, change one pixel to make it one level brighter or darker in one of the channels, resave it as a .JPEG, and there's a chance you end up with two identical files, even though the input was different, albeit slightly.

So lossless algorithms, no, that can't happen. For lossy algorithms, yes.
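
As a rough sketch of that experiment, assuming the Pillow library and some existing input.png (hypothetical filename) - whether the two JPEGs actually come out byte-identical depends on the image and the quality setting:

    from PIL import Image

    img = Image.open("input.png").convert("RGB")
    img.save("a.jpg", quality=75)          # first JPEG

    px = img.load()
    r, g, b = px[0, 0]
    px[0, 0] = (min(r + 1, 255), g, b)     # nudge one channel of one pixel by one level
    img.save("b.jpg", quality=75)          # second JPEG, from a slightly different image

    with open("a.jpg", "rb") as fa, open("b.jpg", "rb") as fb:
        print(fa.read() == fb.read())      # True would mean the lossy outputs collided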

Lasse V. Karlsen
A: 

It is only possible for lossy compression algorithms (as opposed to lossless data compression). Theoretically, they could give the same result for similar (but still different) input data.

Kirill V. Lyadvinsky
+2  A: 

Let f be a compression algorithm. If compressing A and B yields the same file, then f(A) = f(B) = C, for some C. Now, let f' be the decompression algorithm. Then f'(f(A)) = f'(C) = f'(f(B)). Hence, f' uncompresses A.zip and B.zip to the same file.

So, either f is a worthless compression algorithm (because it is not a bijection), or A and B are in fact the same file. (When I say worthless, I mean worthless for lossless compression!)

As for your other question, note that a lossless compression algorithm is by definition not a hashing algorithm, since a hash function h maps a domain A onto a (generally) smaller domain B. Hence h cannot be a bijection, while we just asserted that our lossless compression function f is a bijection.
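
The same argument can be sketched in Python, with zlib standing in for f and decompression playing the role of f':

    import zlib

    def f(data):          # the compression function from the argument above
        return zlib.compress(data)

    def f_prime(data):    # f', the decompression function
        return zlib.decompress(data)

    a = b"contents of file A"
    b = b"contents of file B"

    assert f_prime(f(a)) == a and f_prime(f(b)) == b   # f' undoes f exactly
    assert f(a) != f(b)                                # so f cannot map A and B to the same C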

Stephan202
Worthless is a bit strong; lossy (ie, non-bijective) algorithms are used for audio and imagery all the time.
bdonlan
@bdonlan: you're right. I updated the answer to clarify what I mean by 'worthless' :)
Stephan202
+3  A: 

Certainly, lossy compression can give the same output as already noted.

But I think a very important point that hasn't been mentioned is that cryptographic hashes should be very hard to reverse (and it should be very hard to reproduce the same hash from two different inputs). For this reason, lossless and hence reversible compression algorithms like ZIP would be unsuitable as a cryptographic hash.

Junier
+1 for pointing out the uselessness of compression as a security measure, but I think the OP was mainly interested in just using compressed outputs to guarantee uniqueness -- and guaranteeing uniqueness is something that lossless compression does *even better than* cryptographic hashes (though with the obvious downside of producing a variable-length output).
j_random_hacker
+1  A: 

Cryptographic hash functions have a very specific requirement: to make it very difficult to reverse them. Compression, by definition, is easy to invert, so it's a very poor choice for a crypto hash.

EDIT:

Note that when I say 'by definition' above, I mean by conventional definition. Strictly speaking, MD5, SHA-1, etc could be considered compression algorithms as well.

Jeremy Powell
A: 

For an algorithm to be a decent cryptographic hash, a small localised change in the input should cause a large, dispersed change in the output. Also, a hash function is a mapping from an arbitrarily sized input to a fixed-size output.
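
A small illustration of both requirements, using Python's hashlib for the hash and zlib as the would-be compressor (a sketch, not a proof):

    import hashlib
    import zlib

    a = b"x" * 1000
    b = b"y" + a[1:]                       # change a single byte of the input

    # Cryptographic hash: fixed-size output, and a one-byte change scrambles it completely.
    print(hashlib.sha256(a).hexdigest())
    print(hashlib.sha256(b).hexdigest())

    # Compression: output size varies with the input and still reflects its structure.
    print(len(zlib.compress(a)), len(zlib.compress(b)))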

Draemon