views:

53

answers:

3

If you are talking about btrees, I wouldn't imagine that the additional overhead of a non-clustered index (not counting stuff like full text search or other kinds of string indexing) is even measurable, except in an extremely high-volume, high-write scenario.

What kind of overhead are we actually talking about? Why would it be a bad idea to just index everything? Is this implementation specific? (in that case, I am mostly interested in answers around pg)

EDIT: To explain the reasoning behind this a bit more...

We are looking to specifically improve performance right now across the board, and one of the key things we are looking at is query performance. I have read the things mentioned here, that indexes will increase db size on disk and will slow down writes. The question came up today when one pair did some pre-emptive indexing on a new table, since we usually apply indexes in a more reactive way. Their argument was that they weren't indexing string fields, and they weren't doing clustered indexes, so the negative impact of possibly redundant indexes should barely be measurable.

Now, I am far from an expert in such things, and those arguments made a lot of sense to me based on what I understand.

Now, I am sure there are other reasons, or I am misunderstanding something. I know a redundant index will have a negative effect, what I want to know is how bad it will be (because it seems negligible). The whole indexing every field thing is a worst case scenario, but I figured if people could tell me what that will do to my db, it will help me understand the concerns around being conservative with indexing, or just throwing them out there when it has a possibility of helping things.

+3  A: 

Random thoughts

  • Indexes benefit reads of course
  • You should index where you get the most bang for your buck
  • Most DBs are > 95% read (think about updates, FK checks, duplicate checks etc = reads)
  • "Everything" is pointless: most indexes need to be composite, with included columns
  • Define "high volume": we have 15-20 million new rows per day, with indexes
gbn
At 20 million rows per day you probably do want to start paying attention to insert overhead. We are more in the realm of 100k inserts/day, and most of the load is realtime reporting-style queries. I was basically asking whether we really need to worry about insert overhead so much when it comes to indexing in a more read-heavy scenario.
Matt Briggs
We've indexed where we need to, not blanket indexing. We've dropped indexes too, or worked out where we can have a composite and live with a scan not seek. We don't worry about overhead: our worries are disk space and maintenance windows...
gbn
so you are basically saying that outside of db size, or "migration" type scenarios that update every row in a big table, it doesn't _really_ make much of a difference if you have redundant indexes?
Matt Briggs
@Matt Briggs: it can do in some cases. But why allow unused indexes in the first place? See my 4th point. Also note that an index on (col1, col2) is normally good enough for queries on (col1) alone too, for example.
gbn
@gbn: as for why, look at the edit to the post. the motivation was to figure out whether or not we should take a "preemptive" approach to indexing.
Matt Briggs
+1  A: 

Introduction to Indices

In short, an index, whether clustered or non-clustered, adds extra "branches" to the "tree" in which data is stored by most current DBMSes. This makes finding the rows matching a given combination of indexed values a logarithmic-time operation instead of a linear-time one. That reduction in access time speeds up many of the common tasks the DB performs; however, for tasks the index doesn't cover, it can slow things down, because the data must still be accessed through the tree. Filtering on non-indexed columns, for instance, forces the engine to walk the tree anyway, and because the ratio of branch nodes (containing only pointers to somewhere else in the tree) to leaf nodes has been reduced, this will take longer than if the index were not present.
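The scan-versus-search distinction described above shows up directly in a planner's output. As a minimal sketch (using SQLite, which ships with Python, rather than pg, and with table and column names invented for illustration), the same equality filter goes from a full linear scan to a B-tree index search once an index exists:

```python
import sqlite3

# Hypothetical table; names are invented for this sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                ((i, f"user{i}") for i in range(10000)))

# No index yet: the planner has no choice but a full (linear) table scan.
plan_before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = 1234"
).fetchone()[-1]

cur.execute("CREATE INDEX idx_users_id ON users (id)")

# Same filter, now answered with a logarithmic search down the index B-tree.
plan_after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = 1234"
).fetchone()[-1]

print(plan_before)  # a table scan, e.g. "SCAN users"
print(plan_after)   # an index search mentioning idx_users_id
```

Postgres shows the same transition (Seq Scan vs. Index Scan) via its own `EXPLAIN`.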

In addition, non-clustered indices separate data based on column values, but if those values have low cardinality across the table's rows (like a flag indicating "yes" or "no"), then the index adds an extra level of complexity that doesn't actually help the search; in fact, it hinders it, because an extra branch is encountered in navigating from the root to the leaves of the tree.
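The low-cardinality point can be made concrete with a little arithmetic. In this hedged sketch (SQLite again, with an invented "orders" table), an index on a yes/no flag partitions 10,000 rows under only two key values, so an equality search on the flag still qualifies roughly half the table:

```python
import sqlite3

# Hypothetical "orders" table with a yes/no flag; names are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, shipped INTEGER)")
cur.executemany("INSERT INTO orders (shipped) VALUES (?)",
                ((i % 2,) for i in range(10000)))
cur.execute("CREATE INDEX idx_orders_shipped ON orders (shipped)")

# Only two distinct keys in the index, so 'shipped = 1' still has to
# visit about half of the rows despite the extra tree level.
matching = cur.execute(
    "SELECT COUNT(*) FROM orders WHERE shipped = 1").fetchone()[0]
distinct = cur.execute(
    "SELECT COUNT(DISTINCT shipped) FROM orders").fetchone()[0]
print(distinct, matching)  # 2 distinct keys, 5000 matching rows
```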

KeithS
for your last point, if the index isn't going to be helpful, wouldn't the query planner just do a sequence scan instead? would the redundant index still negatively impact the query performance if it wasn't used?
Matt Briggs
+1  A: 

I am sure the exact overhead is probably implementation specific, but off the top of my head, some points:

  • Increased Disk Space requirements.
  • All writes (inserts, updates, deletes) cost more as all indexes must be updated.
  • Increased transaction locking overhead (all indexes must be updated within a transaction, leading to more locks being required, etc.).
  • Potentially increased complexity for the query optimizer (choosing which index is most likely to perform best; also the potential for one index to be chosen when another would actually be better).
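The first point (disk space) is easy to observe: each index is a separate B-tree stored alongside the table, and every write must maintain all of them. A minimal sketch with SQLite (table and column names invented; `PRAGMA page_count` reports the database size in pages):

```python
import sqlite3

# Hypothetical table; PRAGMA page_count reports database size in pages.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (a INTEGER, b INTEGER, c INTEGER, d TEXT)")
cur.executemany("INSERT INTO events VALUES (?, ?, ?, ?)",
                ((i, i * 2, i % 7, f"payload{i}") for i in range(20000)))
conn.commit()
pages_before = cur.execute("PRAGMA page_count").fetchone()[0]

# "Index everything": each additional index is its own B-tree that every
# subsequent insert/update/delete must also maintain within the transaction.
for col in ("a", "b", "c", "d"):
    cur.execute(f"CREATE INDEX idx_events_{col} ON events ({col})")
conn.commit()
pages_after = cur.execute("PRAGMA page_count").fetchone()[0]

print(pages_before, pages_after)  # the file grows once the index trees exist
```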
Chris Shaffer
Transaction locking time is a very good point. What I am looking for is: in a high-read / low-write scenario, are there any non-trivial performance impacts from a btree-style index?
Matt Briggs