calculating which strings will have the same hash

+1 A:

It depends on the hash function. With a simple hash function, it may be possible. For example, if the hash function simply sums the ASCII byte values of a string, then one could enumerate all strings of a given length that produce a given hash value. If the hash function is more complex and "cryptographically strong" (e.g., MD5 or SHA1), then it is theoretically not possible.

Mark Wilkins 2010-06-30 21:54:24

+1 A:

Most hashes are of cryptographic or near-cryptographic strength, so the hash depends on the input in non-obvious ways. The way this is done professionally is with rainbow tables, which are precomputed tables of inputs and their hashes. So brute force checking is basically the only way.

Jeremy Kemball 2010-06-30 21:54:51

+5 A:

The answer is pretty clearly yes, since at the very least you could run through every possible string of the given length, compute the hashes of all of them, and then see which are the same. The more interesting question is how to do it quickly.

Further reading: http://en.wikipedia.org/wiki/Collision_attack

Tyler McHenry 2010-06-30 21:55:40

+18 A:

What you are looking for is the solution to the Collision Problem (See also collision attack). A well-designed and powerful cryptographic hash function is designed with the intent of as much obfuscating mathematics as possible to make this problem as hard as possible.

In fact, one of the measures of a good hash function is the difficulty of finding collisions. (Among the other measures, the difficulty of reversing the hash function)

It should be noted that, in hashes where the input is any length of string and the output is a fixed-length string, the Pigeonhole Principle ensures that there is at least one collision for any given string. However, finding this string is not as easy, as it would require basically blind guess-and-check over a basically infinite collection of strings.

It might be useful to read into the the ideal hash functions. Hash functions are designed to be functions where

Small changes in the input cause radical, chaotic changes in the output
Collisions are reduced to a minimum
It is difficult or, ideally, impossible to reverse
There are no hashed values that are impossible to obtain with any inputs (this one matters significantly less for cryptographic purposes)

The theoretical "perfect" hash algorithm would be a "random oracle" -- that is, for every input, it outputs a perfectly random output, on the condition that for the same input, the output will be identical (this condition is fulfilled with magic, by the hand of Zeus and pixie fairies, or in a way that no human could ever possibly understand or figure out)

Unfortunately, this is pretty much impossible, and ultimately, all hashes are judged as "strong" based on how much of these qualities they possess, and to what degree.

A hash like SHA1 or MD5 is going to be pretty strong, and more or less computationally impossible to find collisions for (within a reasonable time frame). Ultimately, you don't need to find a hash that is impossible to find collisions for. You only practically need one where the difficulty of it is large enough that it'd be too expensive to compute (ie, on the order of a billion or a trillion years to find a collision)

Due to all hashes being imperfect, one could analyze the internal workings of it and see mathematical patterns and heuristics and try to find collisions along that pattern. This is similar to a hash function being %7...Hashing the number 13 would be 13%7 = 6, 89%7 = 5. If you saw a hash of 3, you could use your mathematical understanding of the modulus function to easily find a collision (ie, 10)¹. Fortunately for us, stronger hash functions have much, much, much harder to understand mathematical basis. (Ideally, so hard that no human would ever understand it!)

Some figures:

Finding a collision for a single given SHA-0 hash takes about 13 full days of running computations on the top supercomputers in the world, using the patterns inherent in the math.
According to a helpful commenter, MD5 collisions can be generated "quickly" enough to be less than ideal for sensitive purposes.
No feasible or practical/usable collision finding method for SHA-1 has been found or proven so far, although, as pointed out in the comments, there are some weaknesses that have been discovered.

Here is a similar SO question, which has answers much wiser than mine.

¹_{note that, while this hash function is weak for collisions, it is strong it that it is perfectly impossible to go backwards and find a given key, if your hash is, say, 4. There are an infinite amount (ie, 4, 11, 18, 25...)}

Justin L. 2010-06-30 21:55:57

Read the same answers to the same questions every day, but rarely are they this well written.

BlueRaja - Danny Pflughoeft 2010-07-01 03:49:19

*(it's worth mentioning that due to discovered weaknesses, MD5 collisions can be generated in feasible times; and, while the times are still infeasible, SHA1 has been shown to be less than ideal. A competition to find a replacement for both - SHA3 - is taking place right now, and will end late 2012)*

BlueRaja - Danny Pflughoeft 2010-07-01 03:49:36

@BlueRajaj: Thanks, Mr. Pflughoeft. And I've made due note of that which is worth mentioning =)

Justin L. 2010-07-01 05:15:33

Good answer, except: "In fact, I've been told that people have already pre-computed all collisions for every one of the 2^128 possible MD5 result hashes." - say what? There's an infinite number of values that produce the same hash, so this is impossible. Not to mention, even storing one byte for every one of the 2^128 hash values would require 3e20 exabytes, and computing it at one billion hashes a second would take 1e20 centuries to complete.

Nick Johnson 2010-07-01 08:56:29

@Nick You are right, I definitely misread something >.<

Justin L. 2010-07-01 09:03:22

ansaurus

tags:

views:

answers:

calculating which strings will have the same hash

related questions