tags:

views:

626

answers:

4

1) For the purpose of really low hash collision, can I get away with just using half of the 128 bits of a sha1 rather than dealing with the sha1 itself? I understand this is not suitable for cryptographic hashes, but I just need the hashes for hash table keys.

2) Computation time isn't a priority, and besides which I'm hashing very small pieces of data. In particular, I'm mostly going to be taking 2 or 3 64-bit hashes and hashing them to get another 64-bit hash. Is there a better option than sha1 for this purpose? Again, collisions should be very unlikely.

3) I'm a sql newb. Is it a good idea to use 64-bit hashes as id's in sql? Will 64-bit id's cause performance problems in sqlite or postgres? I'm going to need to coordinate data across multiple databases (including a Lucene index), so I figured I should deal with hashes directly in the tables rather than bothering with auto-incremented ids (which would only be meaningful in one db, not across all data stores). I figure 64-bit is a good compromise: big enough for unlikely collisions but saves on space (and lookup time?).

4) What about CRC-64? Does that produce a random enough distribution?

A: 

If computation time is not important why not go the whole 128 bits? Is there any real reason to choose 64 bits beside possible storage issues? (and then an extra 8 bytes is not going to kill you with storage so cheap)

64 bits vs 128 bits will cause no speed problems in SQLite, I'm not certain about mySQL.

gbrandt
+1  A: 

For a good comparison of lengths of hashes, have a look through at http://en.wikipedia.org/wiki/List_of_hash_functions

Also, just a note: SHA-1 is 160 bits, not 128.

Smashery
+2  A: 

Your keys will need absolute uniqueness not high probability of uniqueness. I would suggest using GUIDs instead of hashes for your keys for cross-database compatibility. Generate the hash as a quick look up mechanism -- you can have a non-unique index on this -- but in the case of collision you'll have to compare the actual data to make sure they are the same. In synchronizing your databases you can check the hash (quickly using the index) and if you find a collision, then resolve whether the data is the same and, thus the GUIDs needs to be resolve. If there is no collision, then simply update whichever database needs the missing entry and insert using the GUID from the other database.

I, too, see little point in creating your own hash of hashes to save space. If you already have the other hashes, just use them (append, don't rehash). If not, just use a standard hash function like MD5 or SHA1 and store the resulting data.

tvanfosson
But why do I need absolute uniqueness? Aren't we talking about VERY high probability? 1 in 2^128 chance that any two items have same hash, right? Might we not as well worry about being struck by a meteor? Or do MD5 and sha1 not distribute randomly enough?
Jegschemesch
Ah, I think we're talking past each other because I was ignorant of GUID/UUID's while you seemed to assume I wasn't. But GUID's are not ABSOLUTELY unique either, right?
Jegschemesch
Yes. Globally unique (or universally unique) ids are absolutely unique. The generation algorithm ensures that no two machines produce the same ids. My point was that if you are using it as a primary key you can't tolerate even one collision, no matter how rare.
tvanfosson
+1  A: 

If you have few enough records it's almost certain that you'll never have a hash collision in 64 bits. Likely you will fall into this category.

There should be no problem with trimming down a cryptographic hash like sha1, because if there were internal structure in the hash then it wouldn't be good enough to be a crypto hash, and if there's no structure then any subset of the bits should be quite random. Note that I'm only talking about using that for IDs, not for any crypto purposes!

But really, doesn't your SQL have some kind of GUID? And if it does, why not use it?

dwc
I guess GUID/UUID is pretty much what I want. Not sure if sqlite support is adequate though, so I'll investigate that. As I said, I'm a sql newb.
Jegschemesch
Sqlite3 can be easily extended to support UUIDs, and I've successfully done so in an iPhone app before.
Bob Aman