views: 91
answers: 2

I have a log table that is really huge (millions of rows).

LogTable
-------
ID    
DATE   
BASEURL
QUERYSTRING
USER   
REFERRER  
USERAGENT
SERVER

I want to slim this table down by normalizing the data (i.e. reduce its size).

I know! I know! Log inserts should be super fast. On the other hand, the log table is so huge that the maintenance plan is getting ugly, so I'm only concerned with the highly repetitive columns: BASEURL, USER, SERVER, and USERAGENT.

Now, I know logging must still be fast, so I don't want to do string comparisons, which leads to my question:

Can I rely on storing

binary_checksum(COLUMN_VALUE)

in the LogTable, and keep a mapping of COLUMN_VALUE to its checksum in a separate table?

In my application, I would keep a cache of the mappings so I wouldn't need to go back to the database server for every request. (Only when I encounter a new checksum value would I need to insert into the mapping table.)
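
To make this concrete, here's a rough sketch of what I have in mind for one column (all table, column, and variable names below are made up for illustration):

-- Hypothetical mapping table for one repetitive column
create table UserMap
(
   UserChecksum int           not null primary key,  -- binary_checksum(UserValue)
   UserValue    varchar(2000) not null
);

-- Compute the checksum once per value in the application
declare @value varchar(2000), @checksum int;
set @value = 'DOMAIN\someuser';
set @checksum = binary_checksum(@value);

-- Only executed when the application cache sees this checksum for the first time
insert into UserMap (UserChecksum, UserValue)
select @checksum, @value
where not exists (select 1 from UserMap where UserChecksum = @checksum);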

The main goal is to be able to run some simple analytical queries on the table, as well as extract the data without completely grinding the database (and my application) to a halt.

Here's a simple query, for example:

select 
   count(1)
,  [user] /* This is a checksum value, which I can lookup in my cache */
from
   LogTable
where date between @from and @to
group by [user]
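
And if I ever want the report to be readable without going through the application cache, I assume I could join back to the (hypothetical) mapping table instead:

select
   count(1)
,  m.UserValue
from
   LogTable l
   join UserMap m on m.UserChecksum = l.[user]
where l.[date] between @from and @to
group by m.UserValue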

What do you think? Is this checksum approach ok?

Edit:

  • All my columns are varchar(2000) or less.
  • I would assume it also allows me to index the data faster? (I would index an offline/transformational copy)
+1  A: 

What is your hash collision strategy? A checksum that produces a 32-bit digest has a 50% collision probability after only ~65k entries, because of the birthday paradox. For millions of rows, you'll have a very high hash collision probability.
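
(For reference, this figure comes from the standard birthday-bound approximation for n random 32-bit values:

p(n) ≈ 1 − e^(−n² / (2 · 2³²))

which is roughly 0.39 at n = 65,536 and crosses 0.5 around n ≈ 77,000, so 65k is the right order of magnitude.)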

Remus Rusanu
I think you indirectly answered my question. As long as I have unique hashes for each of my values (I'm pretty sure I have less than 10k unique values for each column I want to hash), it should be fine. ... It's becoming apparent that such a simple hash is quite fragile.
Jeff Meatball Yang
You can try MD5 instead; it's quite fast, and at 128 bits it's much less prone to collisions.
Remus Rusanu
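(In T-SQL that would presumably mean the built-in HASHBYTES function; a minimal sketch, with a made-up sample value:)
-- HASHBYTES with 'MD5' returns a 16-byte varbinary, so the mapping column
-- would become varbinary(16) instead of int.
declare @value varchar(2000);
set @value = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)';
select hashbytes('MD5', @value) as ValueHash;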
Thanks - assuming MD5 is what I use, are there any other pitfalls?
Jeff Meatball Yang
Aren't you over-engineering something as simple as a log? Maybe a straight table, relying on SQL 2k8 page compression for your repetitive values, would suffice?
Remus Rusanu
Overengineering? perhaps :) - but I was just throwing this idea around in hopes of finding 1) others who have done it, 2) people who know it's wrong ... SO to the rescue!
Jeff Meatball Yang
I'd recommend trying data page compression first; it can give amazing results on the kind of data you store in the log. The benefits apply to everything, from less IO and faster OLTP to smaller backup/restore maintenance, and not least to mirroring (big time). I'd say compression is the single most compelling reason for any shop to upgrade to 2k8.
Remus Rusanu
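
(For reference, page compression on an existing table is a single statement in SQL Server 2008, assuming an edition that supports it; this is a sketch against the table from the question:)

-- Rebuild the log table with page compression
alter table LogTable rebuild with (data_compression = page);
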
+2  A: 

In addition to the other comments here about overthinking a log storage scenario, you should consider partitioning the table (by date), and if extensive reporting is required, think about transforming the data to another format (either dimensionalized or summarized) for reporting.

For example, USERAGENT is a primary candidate for a (possibly snowflake) dimension, replacing your long string with a surrogate integer.

You could retain minimal information in the log table after it has been archived to whatever permanent storage (potentially transformed) is dictated by your requirements.
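
A minimal sketch of the date-based partitioning piece (boundary dates and filegroup placement are purely illustrative; the sliding-window article linked in the comments below covers the full pattern):

-- Monthly partitions on the DATE column, everything on PRIMARY for illustration
create partition function pfLogByMonth (datetime)
as range right for values ('20090101', '20090201', '20090301');

create partition scheme psLogByMonth
as partition pfLogByMonth all to ([PRIMARY]);

-- A new (or rebuilt) log table would then be created on the scheme, e.g.
-- create table LogTable ( ... ) on psLogByMonth ([DATE]);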

Cade Roux
+1 here is the exact partitioned table sliding window how-to: http://msdn.microsoft.com/en-us/library/aa964122(SQL.90).aspx
Remus Rusanu
Thanks, this is what we can do on the data warehouse side of things, but I was hoping to slim down our transactional database so it can be backed up and mirrored more quickly.
Jeff Meatball Yang