views: 91
answers: 2

I have a log table that is really huge (millions of rows).

LogTable
-------
ID    
DATE   
BASEURL
QUERYSTRING
USER   
REFERRER  
USERAGENT
SERVER

I want to slim this table down by normalizing the data (i.e. reduce its size).

I know! I know! Log inserts should be super fast. On the other hand, the log table is so huge that the maintenance plan is getting ugly, so I'm only concerned with the highly repetitive columns: BASEURL, USER, SERVER, and USERAGENT.

Now, I know logging must still be fast, so I don't want to do string comparisons, which leads to my question:

Can I rely on storing

binary_checksum(COLUMN_VALUE)

in the LogTable, and keep a mapping of COLUMN_VALUE to its checksum in a separate table?

In my application, I would keep a cache of the mappings so I wouldn't need to go back to the database server for every request. (Only when I encounter a new checksum value would I need to insert into the mapping table.)
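
To make this concrete, here's a rough sketch of what I have in mind for one column (all table, column, and variable names below are made up for illustration):

-- Hypothetical mapping table for one repetitive column
create table UserMap
(
   UserChecksum int           not null primary key,  -- binary_checksum(UserValue)
   UserValue    varchar(2000) not null
);

-- Compute the checksum once per value in the application
declare @value varchar(2000), @checksum int;
set @value = 'DOMAIN\someuser';
set @checksum = binary_checksum(@value);

-- Only executed when the application cache sees this checksum for the first time
insert into UserMap (UserChecksum, UserValue)
select @checksum, @value
where not exists (select 1 from UserMap where UserChecksum = @checksum);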

The main goal is to be able to run some simple analytical queries on the table, as well as extract the data without completely grinding the database (and my application) to a halt.

Here's a simple query, for example:

select 
   count(1)
,  [user] /* This is a checksum value, which I can lookup in my cache */
from
   LogTable
where date between @from and @to
group by [user]
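
And if I ever want the report to be readable without going through the application cache, I assume I could join back to the (hypothetical) mapping table instead:

select
   count(1)
,  m.UserValue
from
   LogTable l
   join UserMap m on m.UserChecksum = l.[user]
where l.[date] between @from and @to
group by m.UserValue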

What do you think? Is this checksum approach ok?

Edit:

  • All my columns are varchar(2000) or less.
  • I would assume it also allows me to index the data faster? (I would index an offline/transformational copy)
+1  A: 

What is your hash collision strategy? A checksum that produces a 32-bit digest has a 50% collision probability after only ~65k entries, because of the birthday paradox. For millions of rows, you'll have a very high hash collision probability.
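
(For reference, this figure comes from the standard birthday-bound approximation for n random 32-bit values:

p(n) ≈ 1 − e^(−n² / (2 · 2³²))

which is roughly 0.39 at n = 65,536 and crosses 0.5 around n ≈ 77,000, so 65k is the right order of magnitude.)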

Remus Rusanu
I think you indirectly answered my question. As long as I have unique hashes for each of my values (I'm pretty sure I have less than 10k unique values for each column I want to hash), it should be fine. ... It's becoming apparent that such a simple hash is quite fragile.
Jeff Meatball Yang
You can try MD5 instead; it's quite fast, and at 128 bits it's much less prone to collisions.
Remus Rusanu
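(In T-SQL that would presumably mean the built-in HASHBYTES function; a minimal sketch, with a made-up sample value:)
-- HASHBYTES with 'MD5' returns a 16-byte varbinary, so the mapping column
-- would become varbinary(16) instead of int.
declare @value varchar(2000);
set @value = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)';
select hashbytes('MD5', @value) as ValueHash;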
Thanks - assuming MD5 is what I use, are there any other pitfalls?
Jeff Meatball Yang
Aren't you over-engineering something as simple as a log? Maybe a straight table, relying on SQL 2k8 page compression for your repetitive values, would suffice?
Remus Rusanu
Overengineering? perhaps :) - but I was just throwing this idea around in hopes of finding 1) others who have done it, 2) people who know it's wrong ... SO to the rescue!
Jeff Meatball Yang
I'd recommend trying data page compression first; it can give amazing results on the kind of data you store in the log. The benefits apply to everything, from less IO and faster OLTP to smaller backup/restore maintenance, and not least to mirroring (big time). I'd say compression is the single most compelling reason for any shop to upgrade to 2k8.
Remus Rusanu
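
(For reference, page compression on an existing table is a single statement in SQL Server 2008, assuming an edition that supports it; this is a sketch against the table from the question:)

-- Rebuild the log table with page compression
alter table LogTable rebuild with (data_compression = page);
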
+2  A: 

In addition to the other comments here about overthinking a log storage scenario, you should consider partitioning the table (by date), and if extensive reporting is required, think about transforming the data to another format (either dimensionalized or summarized) for reporting.

For example, USERAGENT is a primary candidate for a (possibly snowflake) dimension, replacing your long string with a surrogate integer.

You could retain minimal information in the log table after it has been archived to whatever permanent storage (potentially transformed) is dictated by your requirements.
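
A minimal sketch of the date-based partitioning piece (boundary dates and filegroup placement are purely illustrative; the sliding-window article linked in the comments below covers the full pattern):

-- Monthly partitions on the DATE column, everything on PRIMARY for illustration
create partition function pfLogByMonth (datetime)
as range right for values ('20090101', '20090201', '20090301');

create partition scheme psLogByMonth
as partition pfLogByMonth all to ([PRIMARY]);

-- A new (or rebuilt) log table would then be created on the scheme, e.g.
-- create table LogTable ( ... ) on psLogByMonth ([DATE]);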

Cade Roux
+1 here is the exact partitioned table sliding window how-to: http://msdn.microsoft.com/en-us/library/aa964122(SQL.90).aspx
Remus Rusanu
Thanks, this is what we can do on the data warehouse side of things, but I was hoping to slim down our transactional database so it can be backed up and mirrored more quickly.
Jeff Meatball Yang