Suppose you have a very large database, and to simplify let's say it consists of one major table you will be doing your lookups on, with one (and only one) primary key field, pk.

Given the fact that all lookups are going to be basically SELECT * FROM table_name WHERE pk=someKeyValue, what is the best way to optimize this database for the fastest lookups?

Edit: just a few more details - INSERTs and UPDATEs are going to be very infrequent, so I don't mind sacrificing performance there to achieve better lookup performance.

Also, it seems like clustering is the way to go. Do you have any examples of the kind of performance increase I can achieve with this method? And how exactly is it done (on any kind of DB)?

A: 

If all your queries are going to be based on the PK, you won't get any added benefit from creating another index on the PK, since the table should already be indexed by it.

Edit: The only other possible thing I would suggest is looking at normalizing your table (if that is even an option or a necessity). By splitting off items into other tables, you can narrow what is pulled back in each query and fetch the less-used items only when needed, using joins.

Based on the limited description of "a very large database with a single table", it is hard to identify any easy and obvious optimizations without looking at what kind of data you are actually storing in your fields.

TheTXI
I never said I was thinking of re-indexing by PK. Do you have any other optimization suggestions?
Yuval A
+1  A: 

One thing you could do is make the primary key clustered; this causes the actual data to be physically ordered on disk, resulting in faster queries.

It will also mean slower inserts, but if you select much more frequently than you insert, this should not be a problem.
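
For example, on SQL Server this might look like the sketch below (table and column names are placeholders); note that SQL Server already makes a primary key clustered by default unless you specify otherwise:

    -- Hypothetical T-SQL sketch: declaring the primary key as CLUSTERED
    -- stores the rows physically in pk order.
    CREATE TABLE dbo.big_lookup
    (
        pk      BIGINT       NOT NULL,
        payload VARCHAR(200) NULL,
        CONSTRAINT PK_big_lookup PRIMARY KEY CLUSTERED (pk)
    );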

Patrick McDonald
This will work on MS SQL Server - where the clustering key defines the physical ordering of the data. Not sure how other systems handle this, though.
marc_s
+4  A: 

If the primary key is clustered, then your lookups won't get any quicker than they already are.

If it isn't clustered, and the number of columns in your table is relatively small, then you could in theory create a covering index to speed up the query. But then this negates any insert/update performance enhancements that having the non-clustered primary key would have given you.

If your primary key is an always-increasing field (e.g. a SQL Server identity, or generated from a sequence in Oracle) then the clustered primary key has no drawbacks anyway.
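
For illustration, a covering index on SQL Server might look like the sketch below; the index and column names are placeholders, and the INCLUDE list would have to carry every column the query returns in order to cover a SELECT *:

    -- Hypothetical sketch: a non-clustered index on pk that carries the other
    -- columns as included columns, so the lookup can be answered from the
    -- index without visiting the base table.
    CREATE NONCLUSTERED INDEX IX_big_lookup_covering
        ON dbo.big_lookup (pk)
        INCLUDE (col1, col2, col3);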

David M
A: 
  • If your PK order matches insertion order, i.e. time or id/autoincrement, then make it clustered. This will reduce disk and cache thrashing on inserts, leaving more resources to devote to lookups.
  • Consider tweaking page sizes on the table to be an exact multiple of your record size. This requires intimate knowledge of the particular database software: how page sizes are set, what the per-record and per-index overhead is, and so on.
  • If practical, use fixed-size types for all columns rather than variable-size ones (see the sketch after this list).
  • Consider putting the index and/or transaction log files on a separate volume.
  • Install as much RAM as the software and hardware can use.
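
As a rough illustration of the fixed-size-column point, here is a minimal sketch in generic SQL (hypothetical table and column names); fixed-width types keep every record the same length, which makes page-size tuning and row addressing more predictable:

    -- Hypothetical sketch: fixed-width types (BIGINT, CHAR, DECIMAL) instead
    -- of variable-width ones such as VARCHAR, so every row occupies a
    -- predictable number of bytes.
    CREATE TABLE big_lookup
    (
        pk     BIGINT        NOT NULL PRIMARY KEY,
        code   CHAR(10)      NOT NULL,  -- fixed width rather than VARCHAR(10)
        amount DECIMAL(12,2) NOT NULL
    );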
dwc
A: 

If you're using MySQL, you can do some additional things (beyond tuning your cache values). The table engine can be a factor; for instance, MyISAM is widely held to be faster at SELECTs than InnoDB. If this table is primarily a lookup table, switching it to MyISAM might be a good move. (InnoDB is pretty good on average; it's better at writes than MyISAM, and InnoDB tables never need to be repaired.)
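
If MyISAM did turn out to be the right trade-off for this workload, the engine is chosen per table; a minimal sketch, assuming a hypothetical table name:

    -- Hypothetical MySQL sketch: pick the storage engine at creation time...
    CREATE TABLE big_lookup (
        pk      BIGINT NOT NULL PRIMARY KEY,
        payload VARCHAR(200)
    ) ENGINE = MyISAM;

    -- ...or convert an existing table (this rebuilds the table).
    ALTER TABLE big_lookup ENGINE = MyISAM;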

Peter
A: 

If you were using Oracle then I'd advise benchmarking three approaches:

  1. Heap table with primary key index
  2. Index-organised table
  3. Single table hash cluster

Option 1 is a very vanilla approach: really it's the lowest common denominator, but it could mean 5+ logical reads to get each row, with one of those being a probable physical read of the table if it is not completely cached.

Option 2 saves you one of those logical reads by avoiding the probe to a separate table segment, but it might not save you the physical read, because the IOT segment will be larger and harder to cache than the index alone.

Option 3 will potentially get you the row with a single logical read, but unless you have the entire table cached, that is probably still going to translate into a physical read.

Benchmarking is highly recommended.
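
A rough sketch of the DDL for the three options (Oracle syntax; the names are placeholders, and the HASHKEYS and SIZE values would have to be tuned to your row count and row size):

    -- 1. Plain heap table with a primary key index
    CREATE TABLE lookup_heap (
        pk      NUMBER        NOT NULL PRIMARY KEY,
        payload VARCHAR2(200)
    );

    -- 2. Index-organised table: the whole row is stored in the index structure
    CREATE TABLE lookup_iot (
        pk      NUMBER        NOT NULL PRIMARY KEY,
        payload VARCHAR2(200)
    ) ORGANIZATION INDEX;

    -- 3. Single table hash cluster: pk hashes directly to a block address
    CREATE CLUSTER lookup_hc (pk NUMBER)
        HASHKEYS 1000000 SIZE 256 SINGLE TABLE;

    CREATE TABLE lookup_hash (
        pk      NUMBER        NOT NULL PRIMARY KEY,
        payload VARCHAR2(200)
    ) CLUSTER lookup_hc (pk);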

David Aldridge
A: 

I have to add two more options to what has been proposed above (I like dwc’s answer). You should consider partitioning if your table is really big.

First, horizontal partitioning (especially if I/O is the bottleneck in your DB). Create several filegroups and locate them on different hard drives. Then create a partition function and a partition scheme to divide your table and put parts of it on separate drives (for example, rows 1-499999 on the F: drive, 500000-999999 on the G: drive, and so on).
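
A minimal sketch of that setup on SQL Server 2005/2008, assuming the filegroups FG_DRIVE_F, FG_DRIVE_G and FG_DRIVE_H already exist on separate drives (all names and boundary values are placeholders):

    -- Hypothetical sketch: range-partition the table by pk so that each range
    -- of keys lands on a different filegroup (and therefore a different drive).
    CREATE PARTITION FUNCTION pf_pk_range (BIGINT)
        AS RANGE LEFT FOR VALUES (499999, 999999);

    CREATE PARTITION SCHEME ps_pk_range
        AS PARTITION pf_pk_range TO (FG_DRIVE_F, FG_DRIVE_G, FG_DRIVE_H);

    CREATE TABLE dbo.big_lookup_partitioned
    (
        pk      BIGINT       NOT NULL,
        payload VARCHAR(200) NULL,
        CONSTRAINT PK_big_lookup_part PRIMARY KEY CLUSTERED (pk)
    ) ON ps_pk_range (pk);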

Second, vertical partitioning. This would work if you select specific column sets (not *) in most of your queries. In that case, divide the table's columns into two groups: first, the fields you need in all queries; second, the fields you rarely need. Create two tables with the same primary key, and use JOINs on that key when you need columns from both tables.
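
A minimal sketch of that split (SQL Server syntax; table and column names are hypothetical):

    -- Hypothetical sketch: frequently used columns in a narrow "hot" table,
    -- rarely used wide columns in a second table sharing the same primary key.
    CREATE TABLE dbo.lookup_hot
    (
        pk     BIGINT        NOT NULL PRIMARY KEY,
        code   CHAR(10)      NOT NULL,
        amount DECIMAL(12,2) NOT NULL
    );

    CREATE TABLE dbo.lookup_cold
    (
        pk               BIGINT        NOT NULL PRIMARY KEY
                         REFERENCES dbo.lookup_hot (pk),
        long_description VARCHAR(4000) NULL
    );

    -- Most lookups touch only the narrow table:
    SELECT code, amount FROM dbo.lookup_hot WHERE pk = 12345;

    -- Join on the shared key only when the rarely used columns are needed:
    SELECT h.code, h.amount, c.long_description
    FROM dbo.lookup_hot AS h
    JOIN dbo.lookup_cold AS c ON c.pk = h.pk
    WHERE h.pk = 12345;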

(This answer pertains to SQL Server 2005/2008.)

Irina C