views: 243

answers: 4

The table in question is part of a database that a vendor's software uses on our network. The table contains metadata about files. The schema of the table is as follows:

Metadata 
ResultID (PK, int, not null) 
MappedFieldname (char(50), not null) 
Fieldname (PK, char(50), not null) 
Fieldvalue (text, null)

There is a clustered index on ResultID and Fieldname. This table typically contains millions of rows (in one case, 500 million). The table is populated by 24 workers running 4 threads each while data is being "processed", which results in many non-sequential inserts. Later, after processing, more data is inserted into this table by some of our in-house software. The fragmentation for a given table is at least 50%; in the case of the largest table, it is at 90%. We do not have a DBA, and I am aware we desperately need a DB maintenance strategy. As for my background: I'm a college student working part time at this company.

My question is this: is a clustered index the best way to go about this? Should another index be considered? Are there any good references for this and similar ad-hoc DBA tasks?

+4  A: 

The indexing strategy entirely depends on how you query the table and how much performance you need to get out of the respective queries.

A clustered index dictates the physical order of rows on disk, so inserts that arrive out of sequence into already-full pages force "page splits". In a large table with no free space on the index pages, this can take some time.

If you are not absolutely required to have a clustered index spanning two fields, then don't. If it is really more of a UNIQUE constraint, then by all means make it a UNIQUE constraint instead; no physical re-sorting is required for those.

Determine what the typical query against the table is, and place indexes accordingly. The more indexes you have, the slower data changes (INSERTs/UPDATEs/DELETEs) become, so don't create indexes you don't need, e.g. on fields that are unlikely to be filtered or sorted on.

As a rule, create combined indexes only on fields that are filtered/sorted on together.
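
For illustration, the two options above might look like this in T-SQL. This is a sketch only: the constraint and index names are invented, and on the actual table the existing clustered PK would have to be dropped before the UNIQUE constraint could take its place.

-- Sketch only: object names are placeholders, not from the question.

-- Option A: keep uniqueness but give up clustering. A nonclustered
-- UNIQUE constraint enforces the same rule without dictating the
-- physical order of the data rows (index pages can still split,
-- but the base table no longer re-orders on insert).
ALTER TABLE Metadata
    ADD CONSTRAINT UQ_Metadata_ResultID_Fieldname
    UNIQUE NONCLUSTERED (ResultID, Fieldname);

-- Option B: a combined index for queries that actually filter on
-- both fields together.
CREATE NONCLUSTERED INDEX IX_Metadata_Fieldname_ResultID
    ON Metadata (Fieldname, ResultID);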

Tomalak
The typical workload will be non-sequential inserts, no updates, no deletes, and selects less often than inserts (write many, read few). I guess I need to read up and watch which queries are running on a regular basis.
llamaoo7
This sounds like a good scenario for removing the clustered index. Also, look at the index fill factor and make sure enough free space is available on each page to reduce the need for page splits. The default fill factor (which packs pages 100% full) is probably too high for your needs; see the sketch after this thread.
Tomalak
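
A sketch of the fill-factor change suggested in this comment thread; 'PK_Metadata' is an assumed name for the clustered primary key (check sys.indexes for the real one), and the right percentage is something to establish by testing:

-- Rebuild the clustered index leaving 30% free space on each leaf
-- page to absorb random inserts without immediate page splits.
ALTER INDEX PK_Metadata ON Metadata
    REBUILD WITH (FILLFACTOR = 70);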
A: 

The clustered index is OK as far as I can see. Regarding other indexes, you would need to provide the typical SQL queries that operate on this table; creating an index out of the blue is never a good idea. You're talking about fragmentation and indexing: do you suspect that query execution is slowing down, or do you simply want to shrink/defragment the database/index?

It is a good idea to have a task that defragments the indexes from time to time during off-hours, though you have to consider that with frequent/random inserts it does not hurt to leave some spare space in the table to prevent page splits (which do affect performance).
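
As an illustration of such a task (not part of this answer; the table name comes from the question, and the 5%/30% thresholds are common rules of thumb rather than hard limits), an off-hours job could look like this on SQL Server 2005 or later:

-- Check fragmentation of the clustered index (index_id = 1) and
-- defragment only when it is worth the cost.
DECLARE @frag float;

SELECT @frag = avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('Metadata'), 1, NULL, 'LIMITED');

IF @frag >= 30
    ALTER INDEX ALL ON Metadata REBUILD;     -- heavy: rebuilds from scratch
ELSE IF @frag >= 5
    ALTER INDEX ALL ON Metadata REORGANIZE;  -- light: online and safely abortable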

liggett78
+1  A: 

Look hard at your queries - the ones that hit the table for data. Will the index serve? If you have an index on (ResultID, FieldName) in that order, but you are querying for the possible ResultID values for a given Fieldname, it is likely that the DBMS will ignore the index. By contrast, if you have an index on (FieldName, ResultID), it will probably use the index - certainly for simple value lookups (WHERE FieldName = 'abc'). In terms of uniqueness, either index works well; in terms of query optimization, there is (at least potentially) a huge difference.

Use EXPLAIN, or your DBMS's equivalent, to see how your queries are being handled.
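
On SQL Server, which is what this question involves, the closest equivalent of EXPLAIN is SHOWPLAN. A minimal sketch (the literal 'abc' is just an example value):

-- Show the estimated plan without executing the query. In the output,
-- an Index Seek is what you want; a Clustered Index Scan over a
-- 500-million-row table is what you are trying to avoid.
SET SHOWPLAN_ALL ON;
GO
SELECT ResultID, Fieldvalue
FROM Metadata
WHERE Fieldname = 'abc';
GO
SET SHOWPLAN_ALL OFF;
GO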

Clustered vs non-clustered indexing is usually a second-order optimization effect in the DBMS. If the index is correct, there is little difference between a clustered and a non-clustered index (with a bigger update penalty for a clustered index as compensation for slightly faster selects). Make sure everything else is optimized before worrying about second-order effects.

Jonathan Leffler
+1 Since he seems to be worried about INSERT/UPDATE performance rather than SELECT performance, clustered vs. non-clustered might be a first-order optimization for him.
Tomalak
A: 

I am aware we desperately need a DB maintenance strategy.

+1 for identifying that need

As far as my background, I'm a college student working part time at this company

Keep studying and gain experience, but bring in an experienced consultant in the meantime.

The table is populated by 24 workers running 4 threads each

I presume this is pretty mission-critical during the working day, and downtime is bad news? If so, don't mess with it.

There is a clustered index on ResultID and Fieldname

Is ResultID the first column in the PK, as you indicate?

If so, I'll bet that it is insufficiently selective and, depending on the needs of the queries, the order of the PK fields should be swapped (notwithstanding that this compound key looks to be a poor choice for the clustered PK).

What's the result of:

SELECT COUNT(*), COUNT(DISTINCT ResultID) FROM MyTable

If the first count is, say, 4x as big as the second, or more, you will most likely be getting scans in preference to seeks because of the low selectivity of ResultID, and some simple changes will give huge performance improvements.

Also, Fieldname is quite wide (50 characters), so any secondary indexes will have 50 + 4 bytes added to every index entry. Are the fields really CHAR rather than VARCHAR?

Personally, I would consider increasing the density of the leaf pages. At 90% you only leave a few gaps, maybe one per page. But with a large table of 500 million rows, the higher packing density may mean fewer levels in the tree, and thus fewer seeks for retrieval. Against that, almost every insert into a given page will then require a page split. This would favour inserts that are clustered, so it may not be appropriate given that your insert data is probably not clustered. Like many things, you'd need to test to establish what index key density works best.

SQL Server has tools to help you analyse how queries are being parsed, whether they are being cached, how many scans of the table they cause, which queries are "slow running", and so on.
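
To put numbers on the packing density and tree depth, something along these lines works on SQL Server 2005+ (a sketch; 'Metadata' is the table name from the question):

-- SAMPLED (or DETAILED) mode populates avg_page_space_used_in_percent;
-- the cheaper LIMITED mode leaves it NULL.
SELECT index_id,
       index_level,
       index_depth,                     -- levels in the B-tree
       avg_page_space_used_in_percent,  -- leaf-page packing density
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('Metadata'), NULL, NULL, 'SAMPLED');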

Get a consultant in to take a look and give you some advice. This isn't a question that answers here are going to give you a safe solution to implement.

You really REALLY need to have some carefully thought-through maintenance policies for tables that have 500 million rows and shed-loads of inserts daily. Sorry, but I have enormous frustration with companies that get themselves into this state.

The table needs defragmenting (your options become fewer if you don't have a clustered index, so keep it until you decide there is a better candidate). "Online" defragmentation methods have a modest impact on performance and can chug away in the background, and can safely be aborted if they overrun time/CPU constraints (although that will most likely take some programming). If you have a "quiet" slot, use it for table defragmentation and for updating the statistics on indexes. Don't wait until the weekend to try to do all tables in one go; do as much/many as you can during any quiet time daily (during the night, presumably).
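
For the statistics half of that quiet-slot work, a one-liner is enough (a sketch; note that index rebuilds refresh their own statistics, but REORGANIZE does not):

-- Refresh optimizer statistics during a quiet slot. FULLSCAN is
-- thorough but slow on 500 million rows; a sampled update may do.
UPDATE STATISTICS Metadata WITH FULLSCAN;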

Defragmenting the tables is likely to lead to a huge increase in transaction log usage, so make sure that any TLogs are backed up frequently (we have a 10-minute TLog backup policy, which we increase to every minute during table defragging so that the defragging process doesn't become the definition of required TLog space!).
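
A sketch of such a log backup (the database name and path are placeholders, not from the question):

-- Back up the transaction log frequently while defragmenting so the
-- log space can be reused instead of growing without bound.
BACKUP LOG VendorDb
TO DISK = N'E:\Backups\VendorDb_log.trn';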

Kristen