views: 1097
answers: 7

Inside my code I'm generating hashes of URLs (which are practically of unbounded length). I'm currently using sha1(), which I know has a tiny chance of a collision, but I have up to 255 bytes to store the hash in, so I feel I might as well use that available space to lower the chance of collision even further.

Is there either:

  1. Another PHP hash function with a longer or customisable hash length?
  2. A way of using a fixed-length hash function like sha1 with a variable length input to generate a longer hash?

Or, is sha1's 20-byte hash good enough for anything and I should stop worrying about it?

+4  A: 

Or, is sha1's 20-byte hash good enough for anything and I should stop worrying about it?

Exactly.

Hashtables, Pigeonholes, and Birthdays
http://www.codinghorror.com/blog/archives/001014.html
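
A rough back-of-envelope, assuming sha1's 160-bit output behaves uniformly: by the birthday bound p ≈ n^2 / 2^(bits+1), the collision probability stays vanishingly small even at a million rows.

    // Birthday-bound sketch; the row count is illustrative
    $n    = 1e6;     // number of stored URLs
    $bits = 160;     // sha1 output size in bits
    $p    = ($n * $n) / (2 * pow(2, $bits));
    printf("%.1e\n", $p);   // ~3.4e-37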

Jeff Atwood
The birthday paradox is exactly what I worry about - I don't want to get to the stage where, once I have a few 100k rows, the odds of a collision become non-negligible. Given that I have the luxury of enough bytes to store a much longer hash, shouldn't I use that if it's available?
Ciaran McNulty
+1  A: 

If you're really worried, pick a 256- or 512-bit hash (32 or 64 bytes).

If you're really, really paranoid, add a salt.

If you're more paranoid than that, concatenate two hashes for a longer one, such as md5 and sha-256.
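
For example (the algorithm names are PHP's built-in hash() identifiers; the URL and salt values are placeholders):

    $url = 'http://example.com/some/long/path';    // placeholder input

    // 256- or 512-bit digest via PHP's hash() extension
    $h256 = hash('sha256', $url);                  // 64 hex chars (32 bytes raw)
    $h512 = hash('sha512', $url);                  // 128 hex chars (64 bytes raw)

    // salted variant (the salt is a placeholder)
    $salted = hash('sha256', 'my-salt:' . $url);

    // two different hashes concatenated for a longer value
    $combined = hash('sha256', $url) . md5($url);  // 96 hex chars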

warren
It sounds like this system is being used to prevent duplicate URLs from being entered... having a salt would defeat the purpose if that is the case.
R. Bemrose
true enough... I think though that if you're actually anticipating a collision, then you're collecting too many URLs :)
warren
A: 

You could always prepend/append a sequential ID (in decimal or hex) to your existing hash?

Sure, you wouldn't have a fixed-length hash, but you'd know the code was a) unique and b) non-guessable (even if someone noticed the sequential part, they wouldn't know the way you were salting/hashing the rest of the code).
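
Something along these lines (the separator and hex formatting are arbitrary choices):

    // Hypothetical sketch: the sequential ID makes the code unique by
    // construction, even in the unlikely event of a hash collision.
    $id   = 42;                              // e.g. an auto-increment row ID
    $code = dechex($id) . '-' . sha1($url);  // "2a-" plus 40 hex chars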

Of course if you're not trying to hide these hashes from anyone then why not simply use a sequential ID in the first place?

Gareth
+3  A: 

Let's see... http://www.cryptography.com/cnews/hash.html

Q: How hard would it be to find collisions in SHA-1?
A: The reported attacks require an estimated work factor of 2^69 (approximately 590 billion billion) hash computations

Looks like the risk is quite low... ^_^

PhiLho
A: 

Since I don't know exactly what you are trying to do, I'll assume that you don't want to enter the same data twice and that you want to detect collisions quickly. In that case I propose the following algorithm in pseudocode:

found = false
hv = hash(urlValue)
if table[hash,url] contains pair (hv,urlValue)
   found = true
endif

if (not found)
   insert table (hv,urlValue)
endif

In your database, create a non-unique index on the hash column to speed up lookups. This will allow the query on (hash, url) to proceed quickly -- in the normal case you only look at one row, since the hash is likely unique, yet you are really deciding to accept or deny based on the actual url. This will allow you to use a shorter hash function. Presumably you are already storing the url for later use, so this will not involve any additional storage.
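
A sketch of that flow in PHP with PDO (the table and column names are invented for illustration):

    // Assumes something like:
    //   CREATE TABLE urls (hash CHAR(40) NOT NULL, url TEXT NOT NULL, KEY (hash));
    $hv   = sha1($urlValue);
    $stmt = $pdo->prepare('SELECT 1 FROM urls WHERE hash = ? AND url = ?');
    $stmt->execute(array($hv, $urlValue));

    if ($stmt->fetchColumn() === false) {    // no (hash, url) pair found
        $ins = $pdo->prepare('INSERT INTO urls (hash, url) VALUES (?, ?)');
        $ins->execute(array($hv, $urlValue));
    }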

tvanfosson
A: 

if you wanted to get really crazy with it, what you could do is combine the hashes of different parts of the URL.

Say the URL is 40 characters long - break it into 5 pieces: get the SHA1 of chars 1-8, concatenate to the SHA1 of chars 9-16, concatenate to SHA1 of 17-24... etc. In theory you'll then have 2^800 possibilities, and will only need to start worrying about collisions after 2^(69*5) = 2^345 ≈ 7.2 * 10^103 rows.
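
In PHP that might look something like this (the chunk size matches the example above; the URL is a placeholder):

    // Split the URL into 8-character chunks and concatenate their SHA-1s.
    $url    = 'http://example.com/a-40-character-string';
    $pieces = str_split($url, 8);                       // 5 chunks of 8 chars
    $hash   = implode('', array_map('sha1', $pieces));  // 5 * 40 = 200 hex chars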

but like I said, we're heading straight into crazy town with methods like this.

nickf
A: 

Well, it only makes sense if you have a short hash key. Otherwise there is a risk of data overflow in the table.