tags:
views: 435
answers: 11

Say I have a blob of text, 5000 characters. I run it through a hashing program and it generates a 40-character hash. Now I run another blob of text, 10000 characters; it still generates a 40-character hash. That's true for text of any length.

My question is: if the hashes are all unique, wouldn't I be able to compress anything into a 40-character string?

+5  A: 

Hashes (such as SHA-1, which produces a 40-hex-character digest) are not unique. There's an extremely tiny chance (on the order of 1 in 16^40, since each of the 40 hex characters encodes 4 bits) that two different pieces of cleartext will produce the same hash. For most applications the chance is considered small enough to ignore, since on average it would take millions of years to see an accidental collision.
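To put rough numbers on that, here's a back-of-the-envelope birthday-bound calculation in Python (assuming a 40-hex-character digest such as SHA-1, i.e. 160 bits):

```python
import math

# A 40-hex-character hash encodes 160 bits, so there are 2**160 possible values.
space = 2 ** 160

# Birthday bound: you need roughly sqrt(2 * ln(2) * space) random inputs
# before there's a ~50% chance that *any* two of them collide.
n_for_half = math.sqrt(2 * math.log(2) * space)
print(f"{n_for_half:.2e}")  # on the order of 10**24 inputs
```

Hashing a billion documents per second, that still works out to tens of millions of years before an even-odds collision.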

Jekke
+18  A: 

Hashing is not unique.

Hashing is a technique that attempts to generate a distinct hash for each value fed to it, but uniqueness is not guaranteed.

Good hash algorithms produce duplicate hash values far less often than bad ones. Also, hashing is one-directional: you can't go from a hash back to the original, so it's not meant for compression.

And: a hash doesn't need to be unique. The same input must be transformed into the same hash by the algorithm. You shouldn't use a hash as a unique identifier!
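Both points are easy to see with any off-the-shelf hash; a small Python sketch using SHA-1 (which produces a 40-hex-character digest, matching the question):

```python
import hashlib

# Inputs of very different lengths...
short = b"a" * 5_000
big = b"b" * 10_000

# ...both produce a 40-character hex digest, and the same input always
# produces the same digest.
h1 = hashlib.sha1(short).hexdigest()
h2 = hashlib.sha1(big).hexdigest()
assert len(h1) == len(h2) == 40
assert hashlib.sha1(short).hexdigest() == h1  # deterministic

# The digest carries only 160 bits, far less than the 40,000 bits of the
# 5,000-byte input, so the input cannot be recovered from the digest alone.
```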

Reed Copsey
+ clear, correct and to the point
Peter
+8  A: 

Not all hashes are guaranteed to be unique. The wikipedia entry on the topic is pretty good: http://en.wikipedia.org/wiki/Hash_function

pdwetz
To add to this, if you know your entire possible input set in advance, you can generate a perfect hash for it.
GMan
A: 

You can compress the signature of any text into a hash, but you cannot reverse-calculate what text produced that hash. Simply speaking, the only way to find out which text gave you the hash would be to brute-force candidate texts through the hash function and look for a match.

See Wikipedia
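A minimal sketch of that brute-force idea in Python (SHA-1 and the tiny lowercase search space are illustrative assumptions; `brute_force` is a hypothetical helper):

```python
import hashlib
from itertools import product
from string import ascii_lowercase

def brute_force(target_hex, max_len=4):
    """Try every lowercase string up to max_len characters and return the
    first one whose SHA-1 digest matches target_hex, or None."""
    for n in range(1, max_len + 1):
        for chars in product(ascii_lowercase, repeat=n):
            candidate = "".join(chars)
            if hashlib.sha1(candidate.encode()).hexdigest() == target_hex:
                return candidate
    return None

target = hashlib.sha1(b"abc").hexdigest()
print(brute_force(target))  # finds "abc"
```

This only works because the search space here is tiny; the cost grows exponentially with the length of the original input.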

Stevo3000
Again: only if the ranges are equal. Otherwise brute force would not give you a 100% certain solution, and, depending on the ranges, it could be far, far less certain.
Peter
Brute force will always give you the correct answer; it just may also give you false positives. The probability of a false positive for a well-defined hash is statistically very low.
Stevo3000
+3  A: 

Hashing is about spreading values as evenly as possible, not about uniqueness!

Of course, perfect uniqueness would mean 100% spreading, but that's often not possible, no matter how good your hashing algorithm is.

Striking example :

C#, for example, provides an Int32 hash code for each object via GetHashCode(). That includes Int64:

       Int64 a = Int64.MaxValue;
       Int32 myHash = a.GetHashCode();

Conclusion: there are 2^64 different possible Int64 values, but only 2^32 hash codes for them!

So each hash value for an Int64 is shared, on average, with 4,294,967,295 other Int64 values!

So much for uniqueness hey :-)
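The same pigeonhole effect can be sketched outside .NET; a minimal Python illustration (`hash32` is a hypothetical stand-in for a 32-bit hash, not .NET's actual GetHashCode):

```python
# A toy 32-bit "hash" of a 64-bit value: just keep the low 32 bits.
def hash32(x: int) -> int:
    return x & 0xFFFFFFFF

a = 5
b = 5 + 2 ** 64 // 2 ** 32  # a different 64-bit value (5 + 2**32)...
assert a != b
assert hash32(a) == hash32(b)  # ...with exactly the same 32-bit hash
```

Any function from a 2^64-value domain to a 2^32-value range must behave this way; no cleverness in the hash can avoid it.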

Peter
A: 

Don't be confused by .NET's GetHashCode(). It's not a good comparison, as it's only 32 bits versus the 320-bit, 40-character hash in the original question (if each character is 8 bits).

Stevo3000
1. I'm not confused, thanks; it's exactly the same principle, and the blob in the example is also much bigger than the hash. 2. This is not an answer to the question at all! You should make a comment instead.
Peter
Sorry Peter, don't have the required reputation (50) to comment. But I'd recommend that you read Microsoft's comments on their GetHashCode() as "the default implementation of this method must not be used as a unique object identifier for hashing purposes".
Stevo3000
A: 

If you are using a well-defined hash function correctly, you can for practical purposes assume that hash results are unique.

The issue with your question is that a hash is a one-way function: there is no inverse function to take a hash value and go back to your original blob, unless you keep a huge table of all possible original values (a so-called rainbow table).

Szere Dyeri
That's only true if the range of the hash is as big as the range of the original value; see the Int32/Int64 example.
Peter
Your answer is incorrect: the issue with the question is not the one-way property!
Peter
+1  A: 

Consider looking at this from the point of view of the pigeonhole principle: if you stuff n items into a smaller number of buckets k, some bucket must necessarily contain multiple items. So to answer your question: no, hashes are not unique.
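A quick Python illustration of the pigeonhole principle (the 1000 items and 256 buckets are arbitrary choices):

```python
from collections import Counter

# Stuff 1000 items into 256 buckets: by the pigeonhole principle, at
# least one bucket must end up holding more than one item.
buckets = Counter(hash(i) % 256 for i in range(1000))
assert max(buckets.values()) >= 2  # guaranteed, since 1000 > 256
```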

Richard Pistole
+8  A: 

One way to think of a hash is like a human fingerprint (hashes are also sometimes referred to as fingerprints).

You can "compress" any person in to a (pretty much) unique finger-print.. but, you cannot know who someone is by their fingerprint alone.. This is just like a hash, you can easily work out hash("abcdef") -> a1b2c3, but given only a1b2x3, you cannot trivially tell the source data.

To reverse a fingerprint, you need to compare it against a database of known person->fingerprint mappings (if the unknown fingerprint matches Person1, it belongs to them).

With a hash, you must do much the same thing: you have a database of all string->hash mappings (called a rainbow table). Then you look up the row with the hash "a1b2c3" and it shows that "abcdef" was hashed to produce it. The other, more common way is simply to try every combination of characters, hash each one, and compare (a brute-force attack).
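A toy lookup-table version of that in Python (a real rainbow table uses chains of hash/reduction steps to trade space for time, but the lookup idea is the same):

```python
import hashlib

# Precompute hash -> plaintext for a known word list.
words = ["abcdef", "password", "hello"]
table = {hashlib.sha1(w.encode()).hexdigest(): w for w in words}

# "Reversing" a hash is then just a dictionary lookup, and it only works
# because the plaintext happened to be in the table.
unknown = hashlib.sha1(b"abcdef").hexdigest()
print(table.get(unknown))  # prints "abcdef"
```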

Finally, while human fingerprints are "unique", it's possible for two to be the same; it's just incredibly unlikely. It's the same with hashing: some hash algorithms are more susceptible to collisions than others.

My question is: if the hashes are all unique, wouldn't I be able to compress anything into a 40-character string?

Theoretically hashing would be a great compression method, but decompression is utterly impractical beyond (say) 10 ASCII characters of data. You're right that you can "compress" anything to a 40-character string, but you cannot practically decompress it (even theoretically it's a bit of a stretch).

dbr
No, you can't use it as a compression system at all. Given a 40-character hash, there will be millions of (for example) 50-character strings for every hash value. It happens that it is very, very hard to find any of them, but even then you couldn't hope to find the "right" one.
Chris Jefferson
+1  A: 

Hashing is not guaranteed to be unique, but if you need a unique hash, look at gperf. It can generate a perfect hash function for a predetermined set of inputs.
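gperf generates C code, but the core idea can be sketched in a few lines of Python: for a fixed key set, search for a hash parameter that maps every key to a distinct slot (`find_perfect_modulus` and the crude byte-sum hash are illustrative assumptions, far simpler than what gperf actually does):

```python
def find_perfect_modulus(keys):
    """Search for the smallest modulus m such that sum-of-bytes(key) % m
    is distinct for every key, i.e. a collision-free (perfect) hash."""
    n = len(keys)
    m = n  # can't do better than one slot per key
    while True:
        slots = {sum(key.encode()) % m for key in keys}
        if len(slots) == n:  # no two keys share a slot
            return m
        m += 1

keys = ["if", "else", "while", "for", "return"]
m = find_perfect_modulus(keys)
assert len({sum(k.encode()) % m for k in keys}) == len(keys)
```

This only works because the key set is known in advance; add one unanticipated key and the "perfect" property can break.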

Zifre
A: 

They are not unique, but with a high-quality algorithm such as SHA-1 you're much more likely to drop dead of a heart attack before you find two different documents with the same hash.

David Plumpton