views:

621

answers:

9

Hypothetically, in a SQL Server database, if I have a table with two int fields (say a many-to-many relation) that participates in joins between two other tables, at what approximate size does the table become large enough where the performance benefit of indexes on the two int fields overcomes the overhead imposed by said indexes?

Are there differences in architecture between different versions of SQL Server that would substantially change this answer?

Update: Quassnoi gets the selected answer for his detailed insight into query execution plans. For me, I think the most important point is that the SQL Server Optimizer will optimize out the index if necessary, so there is really no penalty for having an index on every foreign key (other than the inserting overhead), and I won't have to think about adding an index later when the table reaches some (undetermined) size.

A: 

I believe as soon as you start doing joins on those int fields your table is big enough. If the table is small enough that it wouldn't benefit from an index then the overhead wouldn't be significant enough that you would want to opt out.

When I think about the overhead due to an index I usually consider how often the table index will be changing--through inserts, deletes and updates to indexed columns.

nvuono
Indexes add overhead on SELECT statements too, not just INSERTs and UPDATEs.
Robert Harvey
+1  A: 

The index will nearly always increase the performance of the query, at the cost of extra memory and performance cost for insert/deletion (since it needs to maintain the index at that point). Profiling will be the only definite way to tell whether or not the index, in your particular case, is beneficial.

In general, you're trading memory for speed when you create an index (other than the additional cost of insertion). If you're doing many queries (selects or updates) relative to the number of inserted/deleted rows, indexes will pretty much always increase your performance.

Reed Copsey
If the case is relatively standard (as in the example of two ints in a table), is the tip-over point relatively level, or will it vary as a function of the number of additional columns in the outer tables, and other unknown factors?
Robert Harvey
The optimizer knows where the tipping point is, so you don't need to worry about it. If the loading penalty is significant because you're loading rows fast enough to notice, you'll reach that point in no time anyway.
le dorfier
The indexing will speed up your queries, nearly across the board. The number of columns will make little difference, since it's a matter of locating the appropriate rows to return (which is somewhat separate from the number of columns). If you have very little data, the index may not help much, but it also is nearly free - I personally always index columns which I will be using frequently for any location queries (including updating rows [as long as you don't change the indexed column), since that will be much faster with an index, even in relatively small cases.
Reed Copsey
+1  A: 

it depends on the selectivity of your data, if your data is not selective enough then the index might not even be used since the cost would be too expensive. If you have only 2 values in the table and these values are evenly distributed then you will get a scan not a seek

I still believe every table should have a Primary Key, if you have that then you also have an index already

SQLMenace
It is true that in my hypothetical (a many-to-many join) the outer tables would already have indexes.
Robert Harvey
A two-column junction table? It would be highly unusual to have low selectivity on primary keys from two other tables.
le dorfier
@Robert: what about your linking table? It should have a primary key as well - if nothing else, a composite primary key on the two foreign key columns. How does your current schema protect you from duplicate records?
GalacticCowboy
I always put a primary key in my tables. In the scenario I have in mind, there should be no duplicate records. The join is created through a UI and some checkboxes, so if there are duplicate records, there is something wrong with the UI. It may sound like heresy, but I dislike composite keys, and until you mentioned the duplicate records scenario, I had never found a use for them. Something to think about.
Robert Harvey
You could also do it with a composite unique or something, rather than the PK.
GalacticCowboy
Robert, I'm not a fan of composite indexes but a two-column joining table is an exception. Usually these are two int columns that are already primary keys elsewhere. You need a way to ensure uniqueness of data in the table, so you need a unique index in any event on these two columns. Adding another primary key on a table like this is really wasted overhead, it is better to just make the composite key in this particular case.
HLGEM
+1  A: 

The penalty for insertion will be negligible until long after the benefit of the indexes will appear. The optimizer is smart enough to ignore the indexes anyway until that point clicks in. So just index the table from the start.

le dorfier
Are you saying that SQL Server will create an execution plan using the indexes only if it determines that the index will provide a benefit?
Robert Harvey
Right. There have been several questions in SO about why indexes aren't being used on small tables, and the answer has been "you don't have enough data yet."
le dorfier
So my question morphs to, "At what number of records does the SQL Server Optimizer kick in the index?" And the answer is, "Don't Care?"
Robert Harvey
I'm assuming the case you're giving, which is a junction table with the primary keys from two other tables. And in all but the most pathological cases (e.g. only one value in one of the columns for all rows), yes. One index on both columns (the most selective first) as your Primary Key, and another index on the other column only (if you are specificying both values, it'll use the first.) Or omit the second index if you're sure you'll only ever query on the first column, which is relatively rare.
le dorfier
+1  A: 

Another thing to think about is the concept of coding performance-- sometimes having an index can streamline the mental overhead of thinking about how to manage the relationship between different pieces of data. sometimes it can complicate it...

larson4
One thing I didn't mention is that I use Linq to SQL, which seems to take field names and indexes as cues for what gets included in the model; specifically, foreign key joins get special preference.
Robert Harvey
+9  A: 

For the queries involving small portions of the table rows, indexes are always beneficial, be there 100 rows or 1,000,000.

See this entry in my blog for examples with plans and performance details:

The queries like this:

SELECT  *
FROM    table1 t1
JOIN    table2 t2
ON      t2.col = t1.col

will most probably use HASH JOIN. A hash table for the smaller table will be built, and the rows from the larger table will be used to probe the hash table.

To do this, no index is needed.

However, this query:

SELECT  *
FROM    table1 t1
JOIN    table2 t2
ON      t2.col = t1.col
WHERE   t1.othercol = @value

will use NESTED LOOPS: the rows from the outer table (table1) will be searched using an index on table1.othercol, and the rows from the inner table (table2) will be searched using an index on table2.col.

If you don't have an index on col1, a HASH JOIN will be used which requires scanning all rows from both tables and some more resources to built a hash table.

Indexes are also useful for the queries like this:

SELECT  t2.col
FROM    table1 t1
JOIN    table2 t2
ON      t2.col = t1.col

, in which case the engine doesn't need to read table2 itself at all: eveything you need for this query can be found in the index, which can be much smaller than the table itself and more efficient to read.

And, of course, if you need your data sorted and have indexes on both table1.col and table2.col, then the following query:

SELECT  *
FROM    table1 t1
JOIN    table2 t2
ON      t2.col = t1.col
ORDER BY
        t2.col

will probably use MERGE JOIN method, which is super fast if both input rowset are sorted, and its output is also sorted, which means that ORDER BY comes free.

Note that even if you don't have an index, an optimizer may choose to Eager Spool your small table, which means building a temporary index for the duration of the query and dropped the index after the query finishes.

If the query is small, it will be very fast, but again, an index won't hurt (for SELECT queries I mean). If the optimizer won't need it, it just will not be used.

Note, though, that creating an index may affect DML performance, but it's other story.

Quassnoi
Actually, the database doesn's sort keys within a single page. So until it gets beyond that point there is no benefit. And probably for several pages beyond that.
le dorfier
@Robert: they also have benefit when you use only the indexed columns in the query or when you need your data to be sorted. And no, they're not always a benefit on WHERE clause, only on very selective ones.
Quassnoi
Thanks for the insight.
Robert Harvey
Quassnoi, I saw your blog post. Just so you know, the final decision on indexing our database (based on additional information at this post:stackoverflow.com/questions/1033796/…) was to index all foreign keys EXCEPT those that participate in joins to lookup tables containing LESS THAN 10 RECORDS.
Robert Harvey
We feel that, at a cardinality of 10, the query optimizer will more likely than not choose a table scan over the index, so the overhead of the additional indexes is unwarranted.
Robert Harvey
@Robert: are you talking about indexing the referencing table (the one on which the FOREIGN KEY id defined) or a referenced table (the one mentioned in the REFERENCES clause)? You just cannot reference a table with a FOREIGN KEY unless it has a UNIQUE key defined. And an index on the referencing table is completely other story.
Quassnoi
+1  A: 

Regardless of size, there is always a performance benefit to using an index when doing a lookup.

Regarding overhead, the question becomes: what overhead do you mean, and how do you relate it to the value of a lookup? The two are separate values, after all.

There are two forms of overhead for an index: space (which is usually negligible, depending on how the index is structured), and re-index on insert (the server must recalculate an index after every insert).

As I mentioned, the space issue probably isn't that big a deal. But re-indexing is. Fortunately, you need to be doing lots of near-continuous inserting before that form of overhead becomes a problem.

So bottom line: You're almost always better off having an index. Start from that position and wait until re-indexing becomes a bottleneck. Then you can look into alternatives.

Randolpho
Incorrect. Create a table with only one row, add an index, and see for yourself.
AlexKuznetsov
Ok, by "regardless of size" I mean "for tables with a rowcount greater than 3". Better?
Randolpho
Is the tipover point really three records? That doesn't seem likely.
Robert Harvey
still not quite right. The following link is useful:link: "The Tipping Point Query Answers" http://www.sqlskills.com/BLOGS/KIMBERLY/post/The-Tipping-Point-Query-Answers.aspx
AlexKuznetsov
@AlexKuznetsov: I don't see how the link applies. The article talks about the point at which a clustered index will be used in favor of an alternative (non-covering) index. A clustered index *is still an index* and therefore provides performance benefit; my reply stands. The original question clearly implies that the table is a heap table without either a clustered or FK index. Finally, keep in mind that the article even suggests that any index that can cover the query will not "tip over" to the clustered index.
Randolpho
Randolpho, the main point of that article is to do benchmarks rather than to gues, and to think in terms of page reads. As long as all your data fits on one page, there is no need for a lookup - lookups need more than one page read.
AlexKuznetsov
+1  A: 

A very useful link: "The Tipping Point Query Answers" http://www.sqlskills.com/BLOGS/KIMBERLY/post/The-Tipping-Point-Query-Answers.aspx

AlexKuznetsov
Thanks for that. It tells me that there is no fixed tipping point, that many people don't have good insight into this, and that it's a good thing we have query optimizers.
Robert Harvey
+1  A: 

The best thing is to let the server itself figure it out. You create index in the columns where it makes sense(Im sure there's entire chapters if not books on how to do this the best way), and let the SQL server figure out when/how to use the index.

In many cases, when optimizing, you'd need to read the docs of your particular DBMS to learn more how it uses indexes, and relate that to the queries the application you're optimizing uses. Then you can fine tune the index usage.

nos