Large scale ETL string lookups performance issues

views:

176

answers:

Large scale ETL string lookups performance issues

I have an ETL process performance problem. I have a table with 4+ billion rows in it. Structure is:

id bigint identity(1,1)
raw_url varchar(2000) not null
md5hash char(32) not null
job_control_number int not null

Clustered unique index on the id and non clustered unique index on md5hash

SQL Server 2008 Enterprise Page level compression is turned on

We have to store the raw urls from our web-server logs as a dimension. Since the raw string > 900 characters we cannot put a unique index on that column. We use an md5 hash function to create the unique 32 character string for indexing purposes. We cannot allow duplicate raw_url strings in the table.

The problem is poor performance. The md5hash is of course random by nature so the index fragmentation drives to 50% which leads to inefficient IO.

Looking for advice on how to structure this to allow better insertion and lookup performance as well as less index fragmentation.

+1 A:

I would break up the table into physical files, with the older non-changing data in a read-only file group. Make sure the non-clustered index is also in the filegroup.

Edit (from comment): And while I'm thinking about it, if you turn off page level compression, that'll improve I/O as well.

Randolph Potter 2009-09-14 14:39:26

Incidentally, the MD5 hash in a non-clustered index shouldn't cause that much of a problem with page splits.

Randolph Potter 2009-09-14 14:41:20

We have 24 cpus on the processing machine so the page level compression actually reduces the IO overhead in this case and we only incur a minor increase in CPU utilization. Worth the trade off.Also the md5hash, which is a random string, will drive index fragmentation to 50% so if we use a fill factor of 50% on the index we don't have much in the way of page splitting but we do have 1/2 empty pages which increase IO overhead

Sean 2009-09-14 15:13:31

When you build the index the first time, there will be I/O overhead, but I don't see how in the long run that's an issue? I'd sooner have the empty pages than page-splits on INSERTs. Have you looked at filegroups?

Randolph Potter 2009-09-14 16:47:28

How long does it take to build the MD5 hash index?

Randolph Potter 2009-09-14 16:49:54

We are using a filegroup with a number of data files spread across controllers on a Complellent SAN.I don't remember how long it took to create the original index but is was several hours.We do keep the index in place instead of dropping and recreating due to the time to rebuild it.

Sean 2009-09-14 17:01:04

Have you considered putting the non-clustered index into its own filegroup?

Randolph Potter 2009-09-14 18:15:29

I would argue that it should be a degenerate dimension in the fact table.

And figure some way to do partitioning on the data. Maybe take the first xxx characters and store them as a separate field, and partition by that. Then when you're doing lookups, you're passing the short and long columns, so it's looking in a partition first.

Joe K 2009-11-05 17:51:57

ansaurus

tags:

views:

answers:

Large scale ETL string lookups performance issues

related questions