views:

243

answers:

2

Someone asked the question: "INT, BIGINT or UUID/GUID in Oracle, DB2, Derby and HSQLDB?" and I started to think about all the database schemas I've designed and the books that I've read and not one reference gave any real clear advice about creating indexes.

For example; if you have a compound index like

date() ++ foo() ++ bar()

While this index is nice for searching and sorting date range data (read; read performance)... it is terrible for writes. (inserts always happen to the right side of the balanced tree forcing re-balancing which is an expensive operation)

Obviously... a) know your data. b) know your use-case. c) know your database engine.

But what are the common sense general rules for defining a reasonable schema for high performance databases?

+1  A: 

There are only a few rules of thumb to create indexes:

  • Create index on foreign keys
  • Create index on typical search columns, like log-on name and pin of users, product id of products etc.
  • Don't create any because you think it could enhance the performance.

Additional indexes should be added driven by performance problems of the application

  • observe your application and identfy time-consuming queries
  • when you identified a critical query, analyze the execution plan and optimize it using indexes.

In the last sentence you say "defining a reasonable schema". This is much more general than how to design indexes.

Stefan Steinegger
+2  A: 

OK, here's some real clear advice about index generation: It depends.

That's real clear, but it's not specific at all. If you want something that's more specific, you'll have to learn what it depends on.

It depends on your DBMS, and maybe even the version of your DBMS. Here are some buzzwords you ought to learn about, at least superficially. By "superficially" I mean learn what it does for you, and how it can hurt you, but not necessarily how it works. Use a document that's specific to your DBMS, if you can get one.

Avoiding full table scans.

Index only retrieval.

Range retrieval. (and compound or composite indexes)

Merge Join (discussed later).

Hash indexes.

Concurrency control (discussed later).

Primary keys and indexes (discussed later).

The cost of index updates.

Deferred Updates on indexes.

Cost based optimization. If your DBMS doesn't have a CBO, then get another DBMS.

Hints. (How to use them, and how to live without them.)

Database administration and the CBO. Some DBMSes require periodic DBA action to prevent the optimizer from using an obsolete strategy.

It depends on volume: index design is relatively trivial for very small tables. By "relatively trivial", I mean that it's fairly easy, but it's also fairly unimportant. The cost of being wrong is low. If you're building lookup tables, you will surely want a unique index on the code column. You'll get such a table (with most DBMSes) if you declare the code column as a primary key. If you don't create any other indexes, the cost is likely to be a table scan of a small table under unusual circumstances where some delay is tolerable.

The large tables in any schema tend to be the ones that are added to by routine transaction processing. This adds to the benefits of having some indexes, both in terms of speed and in terms of transaction concurrency. It also adds to the cost of having indexes, because the transactions have to update the indexes. The cost-benefit trade off can be very subtle and very important for transaction tables.

If your DBMS supports it, you can use deferred update to good effect with some indexes on transaction tables.

In any schema, at least try to distinguish the reference tables from the transaction tables. I know, I know, this is somewhat subjective. Use your best judgement.

It depends on traffic: not all tables get the same amount of traffic. Indexes speed up joins as well as lookups. At the very least you should learn whether your DBMS has an optimizer that knows how to do merge-joins based on available indexes and table volumes. If you don't know what a merge-join is, learn what it is. But don't waste time learning how to program a merge-join unless that's the kind of thing you do for a living.

It depends on urgency. A query that's performed once a month during beckground batch processing isn't as urgent as a query that holds up a user 1000 times a day, while that user stares at the screen, or context switches her multitasking.

Beware of what product marketing will tell you about urgency. They will tend to tell you that being faster than the competition is of the utmost urgency in every circumstance, even if it means working evenings and weekends while you miss your first child's birth. Marketing often doesn't care whether you get burned out. They are like a jockey that doesn't care whether the horse ever races again. The truth is that some transactions are very urgent, while others are relatively unimportant.

Be prepared to be flexible about index design, and to consider trade offs.

I wish I could point you to a really good book on this subject. I hope somebody else will do so.

Walter Mitty
Nice answer. Thanks!
Richard