I have a log table that is really huge (millions of rows):
LogTable
-------
ID
DATE
BASEURL
QUERYSTRING
USER
REFERRER
USERAGENT
SERVER
I want to slim this table down (reduce its size) by normalizing the data.
I know! I know! A log should be a super-fast insert. On the other hand, the log table is so huge that the maintenance plan is getting ugly. So I'm only concerned with the highly repetitive columns, like BASEURL, USER, SERVER, and USERAGENT.
Now, I know logging must still be fast, so I don't want to do string comparisons, which leads to my question:
Can I rely on storing
binary_checksum(COLUMN_VALUE)
in the LogTable, and keep a mapping of COLUMN_VALUE to its checksum in a separate table?
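Roughly, I'm picturing something like this (UserMap and the column names are just placeholders; the reworked LogTable would keep its column names, but the repetitive columns would hold the int checksum instead of the string):

-- sketch only: one mapping table per repetitive column, UserMap shown as an example
CREATE TABLE UserMap
(
    [CHECKSUM_VALUE] int           NOT NULL PRIMARY KEY,  -- BINARY_CHECKSUM of the string
    [STRING_VALUE]   varchar(2000) NOT NULL
);

-- the reworked log: repetitive columns now hold the int checksum
CREATE TABLE LogTable
(
    [ID]          bigint        IDENTITY(1,1) PRIMARY KEY,
    [DATE]        datetime      NOT NULL,
    [BASEURL]     int           NOT NULL,  -- BINARY_CHECKSUM of the original BASEURL
    [QUERYSTRING] varchar(2000) NULL,      -- left as-is
    [USER]        int           NOT NULL,  -- BINARY_CHECKSUM of the original USER
    [REFERRER]    varchar(2000) NULL,      -- left as-is
    [USERAGENT]   int           NOT NULL,  -- BINARY_CHECKSUM of the original USERAGENT
    [SERVER]      int           NOT NULL   -- BINARY_CHECKSUM of the original SERVER
);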
In my application, I would keep a cache of the mappings so I wouldn't need to go back to the database server for every request. (I'd only need to insert into the mapping table when I see a new checksum value.)
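On a cache miss, the insert would be something along these lines (UserMap and its columns are the placeholder names from the sketch above, and the example value is made up):

DECLARE @value    varchar(2000) = 'DOMAIN\SomeUser';        -- example value only
DECLARE @checksum int           = BINARY_CHECKSUM(@value);

-- insert the mapping only when this checksum hasn't been seen before
IF NOT EXISTS (SELECT 1 FROM UserMap WHERE [CHECKSUM_VALUE] = @checksum)
    INSERT INTO UserMap ([CHECKSUM_VALUE], [STRING_VALUE])
    VALUES (@checksum, @value);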
The main goal is to be able to run some simple analytical queries on the table, as well as extract the data without completely grinding the database (and my application) to a halt.
Here's a simple query, for example:
select
    count(1)
    , [user] /* This is a checksum value, which I can look up in my cache */
from
    LogTable
where
    date between @from and @to
group by
    [user]
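And when I do need the readable strings on the database side instead of from my cache, I'd expect a join back to the mapping table (UserMap from the sketch above) to handle it, something like:

select
    count(1) as hits
    , m.[STRING_VALUE]
from
    LogTable l
    join UserMap m on m.[CHECKSUM_VALUE] = l.[user]
where
    l.[date] between @from and @to
group by
    m.[STRING_VALUE]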
What do you think? Is this checksum approach ok?
Edit:
- All my columns are varchar(2000) or less.
- I assume it would also let me index the data faster? (I would index an offline/transformed copy; see the sketch below.)
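For that offline copy, I'm thinking of something like the following (the index name is a placeholder, reusing the reworked LogTable layout from above):

-- sketch: a reporting index on the offline copy, keyed on date plus the int checksum
CREATE NONCLUSTERED INDEX IX_LogTable_Date_User
    ON LogTable ([DATE], [USER]);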