ansaurus

Question

Sql Server Legacy Database To Clustered index or not

Answer 1

+2 A:

I'm not sure where your GUIDs come from, but if they're being generated during the insert, using the NEWSEQUENTIALID() in SQL Server instead of NEWID() will help you avoid fragmentation issues during insert.

Regarding the choice of a clustered index, as Kimberly L. Tripp states here: "the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits)." A GUID falls short on the narrow requirement when compared to an INT or even BIGINT.

Kimberly also has an excellent article on GUIDs as PRIMARY KEYs and/or the clustering key.

Joe Stefanelli 2010-08-20 21:31:04

The identifiers are generated through .Net Frameworks Guid.NewGuid(); as the architecture of this system was built to have the objects generate their own id. (Again a legacy system.)

Peter 2010-08-20 21:40:51

Answer 2

A:

You don't indicate what your performance issues are. If the worst performing action is an INSERT, then maybe your solution is right. If it's something else, then I'd look at how the clustered index can help that.

You might look at existing indexes on the table and the queries that use them. You may be able to select an index that, while degrades INSERTs slightly, provides a greater benefit to the current performance-problem areas.

bobs 2010-08-20 21:44:31

You're right bob I edited the post. The issues are when we query the more transactional tables (i.e. the ones that change the most not the static tables.) This is compounded if we join multiple transactional tables together.

Peter 2010-08-20 21:47:58

Answer 3

+4 A:

If you don't have a clustered index on the table, it is being stored as a heap rather than a b-tree. Heap data access is absolutely atrocious in SQL Server so you definitely need to add a clustered index.

I agree with your analysis that the GUID column is a poor choice for clustering, especially since you don't have the ability to use NEWSEQUENTIALID(). You could create a new artificial integer key if you like, but if there is another column or combination of columns that would make sense as a clustered index, that is fine as well.

Do you have a field that is used frequently for range scans? Which columns are used for joins? Is there a combination of columns that also uniquely identifies the row aside from the GUID? Posting a sample of the data model would help us to suggest a good candidate for clustering.

bluefooted 2010-08-21 02:13:59

Blue unfortunately I can't post the data model as my company is extremely strict when it comes to this sort of thing.In some tables there are date columns which are used for range scans. Generally the primary keys (guids) and foreign keys (guids) are used for joins, there isn't any joins on plain text columns.I was wondering about using the date created, but I only just added that in the last release and therefore not every row has this value. I think adding the artificial integer key with the identity turned on is the path i will take. The guid is the PK in every table.

Peter 2010-08-21 03:26:13

Date is often a good candidate for a clustering key, especially if queries are frequently restricted by date range. If you just added it, I suppose it's not used too often, but at least with date created you can ensure ordered insertion and therefore minimize page splits and fragmentation. My only concern about the artificial key is that it will probably never be useful unless you change your data model to join on the new key instead of the GUID.

bluefooted 2010-08-21 03:51:44

Whatever you end up choosing, be sure to examine your non-clustered indexes as well. If your GUID index contains only the GUID column, it may not be terribly useful. You might want to investigate adding some columns to the INCLUDE clause to cover your most common queries.

bluefooted 2010-08-21 03:52:29

Oh sorry, I just read your Edit about the covering indexes :)

bluefooted 2010-08-21 03:54:44

Blue, on a table that has a max of 10 columns, with a maximum of 100 records (only edited during setup), would a clustered index be ok on the pk of this table even if it was a guid? save speeds are not important for these types of table.

Peter 2010-08-21 03:56:03

I was thinking the new identity (artificial key) clustered index would include the current primary key as the 2nd column within the index. This way the Guid id's could be referenced quicker by sql server. Is this thinking incorrect?

Peter 2010-08-21 03:58:08

Well, here's the thing with clustered indexes. The clustered index key is used to lookup the data from your non-clustered indexes, so the entire key is added to the leaf level of all your non-clustered indexes. This means that if you have a really wide clustered index key, not only does it bloat your clustered index, it bloats all your non-clustered indexes as well. On a table with only 100 records, it's probably not a big deal since depending on the size of the rows a 100 record table is probably only a few pages anyway, but keep it in mind.

bluefooted 2010-08-21 04:17:05

If you add the GUID to your new clustered indexes as a second column, you won't have the same page split/fragmentation hit, but you'll end up with a really wide key. Also keep in mind that the leaf level of the clustered index contains all the data in the row, so the GUID column is already going to be part of the clustered index on the leaf level, not sure if it would actually be helpful as part of the key unless you use the artificial key for lookups in conjunction with the GUID.

bluefooted 2010-08-21 04:21:34

Answer 4

+1 A:

It's not 100% clear to me: is your number 1 access pattern to query the tables by the GUID or by other columns? And when joining to other tables, what columns (and data types) are most often used?

I can't really give you any solid recommendations until I understand more about how these GUIDs are used. I realize you said they're primary keys, but that doesn't guarantee they are used as the primary conditions on queries or in joins.

UPDATE

Now that I know a little more, I have a crazy suggestion. Do cluster those tables on the GUIDs, but set the fill factor to 60%. This will ameliorate the page split problem and give you better performance querying on those puppies.

As for using Guid.NewGuid(), it seems that you can do sequentialGUIDs in C# after all. I found the following code here on SO:

[DllImport("rpcrt4.dll", SetLastError = true)]
static extern int UuidCreateSequential(out Guid guid);

public static Guid SequentialGuid()
{
    const int RPC_S_OK = 0;
    Guid g;
    if (UuidCreateSequential(out g) != RPC_S_OK)
        return Guid.NewGuid();
    else
        return g;
}

newsequentialID() is actually just a wrapper for UuidCreateSequential. I'm sure if you can't use this directly on the client you can figure out a way to make a quick round-trip to the server to get a new sequential id from there, perhaps even with a "dispenser" table and a stored procedure to do the job.

Emtucifor 2010-08-21 02:57:07

I've edited the original post see edit 2 and 3.

Peter 2010-08-21 03:47:07

The fillfactor is definitely a possibility, although again it's going to bloat your index. Having GUID keys in your clustered index is not ideal, but since you're already stuck with the GUIDs and they are used for the majority of your joins, you may have to just go with it.I think bottom line is you need to do some testing to figure out the best way. Do you have a test system you can use to try out the different methods?

bluefooted 2010-08-21 04:52:29

I agree it sucks to use those huge GUIDs, and lowering the number of rows per page by reducing the fillfactor is unfortunate, but if he wants improved update performance at the cost of read performance, that's the way to go. It's not even as straight a proposition as that because page splits already reduce page row density...

Emtucifor 2010-08-21 07:57:01

Yeah, I hear you, it's definitely not an easy question, but his stated goal is actually to improve read performance:"The goal is to improve the identity query and joined table query performance."Unfortunately, it's tough to make general recommendations without looking at actual execution plans. In this case, I think the best idea is to test a couple different strategies and see what works best.

bluefooted 2010-08-21 14:43:39

ansaurus

tags:

views:

answers:

Sql Server Legacy Database To Clustered index or not

related questions