I have an ETL process performance problem. I have a table with 4+ billion rows in it. Structure is:
id                  bigint identity(1,1)
raw_url             varchar(2000) not null
md5hash             char(32) not null
job_control_number  int not null
There is a unique clustered index on id and a unique nonclustered index on md5hash.
The server is SQL Server 2008 Enterprise, and page-level compression is turned on.
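For reference, the structure described above can be sketched as DDL. The table name `dbo.raw_url_dim` and both index names are placeholders, not the real object names, and applying page compression to the nonclustered index as well is an assumption.

```sql
-- Hypothetical names: dbo.raw_url_dim and the index names are placeholders.
CREATE TABLE dbo.raw_url_dim
(
    id                 bigint IDENTITY(1,1) NOT NULL,
    raw_url            varchar(2000)        NOT NULL,
    md5hash            char(32)             NOT NULL,
    job_control_number int                  NOT NULL
);

-- Unique clustered index on the ever-increasing identity column.
CREATE UNIQUE CLUSTERED INDEX cix_raw_url_dim_id
    ON dbo.raw_url_dim (id)
    WITH (DATA_COMPRESSION = PAGE);

-- Unique nonclustered index on the hash; this is what enforces "no duplicate URLs".
-- Assumption: page compression is enabled here as well.
CREATE UNIQUE NONCLUSTERED INDEX uix_raw_url_dim_md5hash
    ON dbo.raw_url_dim (md5hash)
    WITH (DATA_COMPRESSION = PAGE);
```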
We have to store the raw URLs from our web-server logs as a dimension. Since the raw string can be longer than 900 characters, we cannot put a unique index on that column. Instead, we use an MD5 hash function to create a unique 32-character string for indexing purposes. We cannot allow duplicate raw_url strings in the table.
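A minimal sketch of the hash-then-insert step for a single URL, to show the access pattern the index has to support. It assumes the hash is computed in T-SQL with `HASHBYTES` and converted to hex with `CONVERT` style 2; the actual ETL may compute the hash elsewhere. The table name `dbo.raw_url_dim`, the sample URL, and the variable values are placeholders.

```sql
-- Placeholder inputs for illustration only.
DECLARE @raw_url varchar(2000) = 'http://example.com/some/path?x=1';
DECLARE @job_control_number int = 1;

-- HASHBYTES returns varbinary(16) for MD5; style 2 renders it as 32 hex characters
-- (uppercase, no 0x prefix).
DECLARE @md5hash char(32) = CONVERT(char(32), HASHBYTES('MD5', @raw_url), 2);

-- Insert only when the hash is not already present. Each probe and each insert
-- touches an effectively random location in the md5hash index.
INSERT INTO dbo.raw_url_dim (raw_url, md5hash, job_control_number)
SELECT @raw_url, @md5hash, @job_control_number
WHERE NOT EXISTS (SELECT 1 FROM dbo.raw_url_dim WHERE md5hash = @md5hash);
```

This sketch does not handle concurrent loaders; the unique index on md5hash would still reject a duplicate that slipped past the existence check.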
The problem is poor performance. The md5hash values are effectively random, so new rows land all over the index rather than at the end. Index fragmentation climbs to 50%, which leads to inefficient IO.
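The fragmentation figure can be checked with a query along these lines. It uses the `sys.dm_db_index_physical_stats` function in `LIMITED` mode, which is the cheapest scan mode; the table name `dbo.raw_url_dim` is again a placeholder.

```sql
-- Placeholder table name; run in the database that holds the table.
DECLARE @db_id  int = DB_ID();
DECLARE @obj_id int = OBJECT_ID(N'dbo.raw_url_dim');

SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(@db_id, @obj_id, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;
```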
Looking for advice on how to structure this to allow better insertion and lookup performance as well as less index fragmentation.