views: 476
answers: 9
Say I have a table with a large number of rows and one of the columns which I want to index can have one of 20 values. If I were to put an index on the column would it be large?

If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.

A: 

Indexes are purely for performance. If an index doesn't boost performance for the queries you're interested in, then it sucks.

As for disk usage, you have to weigh your concerns. Different SQL providers build indexes differently, but as a client, you generally trust that they do the best that can be done. In the case you're describing, a clustered index may be optimal for both size and performance.

harpo
" If an index doesn't boost performance for the queries you're interested in, then it sucks."I beg to differ. I do agree, if the index serves no purpose, it is just extra overhead. But the purpose may be much wider than the query or queries you are currenlty examining.
HLGEM
You're right... I overstated a bit. After posting, I thought it might be designed for future data scenarios.
harpo
+2  A: 

Say I have a table with a large number of rows and one column which I want to index can have one of 20 values. If I were to put an index on the column would it be large?

The index size will be proportional to the number of your rows and the length of the indexed values.

The index keeps not only the indexed value, but also some kind of pointer to the row (ROWID in Oracle, ctid in PostgreSQL, the primary key in InnoDB, etc.).

If you have 10,000 rows and a single distinct value, you will still have 10,000 records in your index.
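
To make this concrete, here is a minimal sketch in generic SQL (all table, column and index names are invented for illustration):

    -- One index entry per row, no matter how few distinct values:
    CREATE TABLE orders (
        id     INT PRIMARY KEY,
        status CHAR(1)            -- one of ~20 possible codes
    );
    CREATE INDEX idx_orders_status ON orders (status);
    -- With 10,000 rows, idx_orders_status holds 10,000 entries,
    -- each pairing a status value with a pointer to its row.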

If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same

In this case, you would end up with 20 indexes that are, in total, the same size as your original one.

In fact, this technique is sometimes used deliberately: it is called a partitioned index. It has its advantages and drawbacks.
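
An Oracle-flavoured sketch of the idea (all names are hypothetical):

    -- List-partition the table on the 20-value column, then build
    -- a LOCAL index: one small per-partition index instead of one
    -- big global one.
    CREATE TABLE facts (
        id       NUMBER PRIMARY KEY,
        category VARCHAR2(10),
        payload  VARCHAR2(100)
    )
    PARTITION BY LIST (category) (
        PARTITION p_a VALUES ('A'),
        PARTITION p_b VALUES ('B')
        -- ... one partition per remaining value
    );
    CREATE INDEX idx_facts_category ON facts (category) LOCAL;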

Quassnoi
In Oracle, the COMPRESS option on index creation can reduce the need to have multiple copies of the same indexed value represented in the index. However, you still need all the rowids.
WW
My point is that if I partition into 20 tables then I wouldn't need any index on the column, as I'd know that within each table every row has the same value.
Mr. Flibble
If you partition into 20 tables you don't even need the column
Quassnoi
A: 

It would be large enough to hold those values for all the rows, in a sorted order.

Say you have 20 different strings of 4 characters each, and 1 million rows; it would take at least 4 million bytes (or 8 million with 16-bit Unicode) to hold those values.

Lasse V. Karlsen
Well, not necessarily. If all the rows on a page had the same column value, for example, a smart indexing engine might be able to use less space by recording that fact instead. IMHO of course, I could easily be wrong...
Mike Woodhouse
+3  A: 

The short answer: do indexes suck? Yes and no.

The longer answer: They don't suck if used properly. Maybe you should start reading about how indexes work, why they can work and why they sometimes don't work.

A good starting point: http://www.sqlservercentral.com/articles/Indexing/

thijs
+6  A: 

It's not the indexes that will suck. It's putting indexes on the wrong columns that will suck.

Seriously though, why would you need a table with a single column? What would the meaning of that data be? What purpose would it serve?

And 20 tables? I suggest you read up on database design first, or otherwise explain to us the context of your question.

Jon Limjap
I've seen a database with a separate table for each attribute of the actual entities. Why: they want version history and time-travel for every attribute. Imagine that database with 300 tables, where most fields are of the type "DateTime"...
thijs
@thijs but you would still require two columns, one as the key and one as the attribute
Nathan Koop
I phrased that badly. There is one column which I want to index, not one column in total. I'll edit my question with more details of the table structure.
Mr. Flibble
Single column tables are extraordinarily useful. Consider a system with 1,000,000 customers, of which about 250 are on credit hold at any given time. What would you prefer — an OnCreditHold column in Customers, or an OnHoldCustomers table with only a CustomerID column and 250 rows?
Larry Lustig
+1  A: 

Sorry, I'm not quite sure what you mean by "large".

  • If your index is clustered, all the data for each record will be on the same leaf page, thereby creating the most efficient index available to your table as long as you write your queries against it properly.

  • If your index is non-clustered, then only the index-related data will be on your leaf pages. Then, depending on such things as how many other indexes you have, coupled with details like your fill factor, your index may or may not be efficient. In general, if you don't have a ton of indexes on your table, you should be safe.

  • The efficiency of your index will also be determined by the data type of the 20 values going into the column. If those are pre-defined values, then their details should probably be in a lookup table with a simple primary key datatype (like Int/Number). Then add that column to your table as a foreign key with an index on the column (see the sketch at the end of this answer).

Ultimately, you could have a perfect index on a column. But its best use will be determined for the most part by the queries you write. So if your queries make use of the indexes, you're golden.
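
A minimal sketch of that lookup-table arrangement, in SQL Server-style syntax (all names are invented):

    CREATE TABLE Status (
        StatusId   INT PRIMARY KEY,      -- small surrogate key
        StatusName VARCHAR(50) NOT NULL  -- one row per allowed value
    );
    CREATE TABLE Orders (
        OrderId  INT PRIMARY KEY,
        StatusId INT NOT NULL REFERENCES Status (StatusId),
        Payload  VARCHAR(100)
    );
    -- Index the narrow integer foreign key rather than a wide string:
    CREATE INDEX IX_Orders_StatusId ON Orders (StatusId);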

Boydski
The table has 600 million rows. There are about 5 columns, all but one of which are used for select filtering, and one which is the data column. But for the sake of this question we could say there are 3 columns: Col1, Col2 and Col3. Say Col1 is the PK, Col2 has 20 possible values, and Col3 is the data column. It seems to me that there is something wrong if the index on Col2 is massive, as I can roll my own index by partitioning into 20 tables, 1 per Col2 value.
Mr. Flibble
At 600M rows, I hope you're talking about an OLAP table, not an OLTP table. That's a lot of rows to be managing! Now you're getting into serious warehouse DB architecture theory that would have to take into consideration many other factors of your database. I'd love to hear of your final decision.
Boydski
+2  A: 

Standard B-tree indexes are best suited to fairly selective columns, which this example would not be. You don't say which DBMS you are using; Oracle has another type of index, called a bitmap index, which is better suited to low-selectivity columns in OLAP environments (these indexes are expensive to maintain, making them unsuitable for OLTP environments).
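
For illustration, an Oracle-only sketch (table and column names invented; bitmap indexes require Enterprise Edition):

    CREATE TABLE facts (
        id       NUMBER PRIMARY KEY,
        category VARCHAR2(10),    -- one of ~20 values
        payload  VARCHAR2(100)
    );
    -- One bitmap per distinct value: 20 values means just 20
    -- compressed bitmaps - compact for low-selectivity columns,
    -- but expensive to maintain under concurrent DML.
    CREATE BITMAP INDEX bix_facts_category ON facts (category);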

The optimiser will decide, based on statistics, whether it thinks the index will help get the data in the fastest time; if it won't, the optimiser won't use it.

Partitioning is another strategy. In Oracle you can define a table as partitioned on some set of columns, and the optimiser can automatically perform "partition elimination" like you suggest.
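
A hedged sketch of partition elimination in Oracle syntax (all names hypothetical):

    CREATE TABLE sales (
        sale_id NUMBER PRIMARY KEY,
        region  VARCHAR2(10),
        amount  NUMBER
    )
    PARTITION BY LIST (region) (
        PARTITION p_east VALUES ('EAST'),
        PARTITION p_west VALUES ('WEST')
    );
    -- The optimiser eliminates p_west entirely; only the
    -- p_east partition is scanned:
    SELECT SUM(amount) FROM sales WHERE region = 'EAST';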

Tony Andrews
FYI: Table partitioning (spread data over files) based on the contents of columns is also possible in MSSQL 2005 and up
thijs
+6  A: 

Indexes (or indices) don't suck. A lot of very smart people have spent a truly remarkable amount of time over the last several decades ensuring that this is so.

Your schema, however, lacking the same amount of expertise and effort, may suck very badly indeed.

Partitioning, in the case described, is equivalent to applying a clustered index. If the table is sorted otherwise (or is in arbitrary order) then the index necessarily has to occupy much more space. Depending on the platform, a non-clustered index may shrink as the sortedness of the rows with respect to the indexed value increases.
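
A SQL Server-style sketch of the equivalence (all names invented):

    -- Clustering on the low-cardinality column physically orders
    -- the rows by it, much as splitting into 20 tables would:
    CREATE TABLE dbo.Facts (
        Id       INT IDENTITY,
        Category TINYINT,        -- the 20-value column
        Payload  VARCHAR(100)
    );
    CREATE CLUSTERED INDEX CIX_Facts_Category
        ON dbo.Facts (Category);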

YMMV.

Mike Woodhouse
Good! I suspected this partitioning was like using a clustered index. This leads me to the question: is there any value in partitioning the table myself over using a clustered index? I think the performance hit would be minimal on inserts if I only need to add a bit of code to choose the correct table to insert into. Would there be a greater performance hit if I used a clustered index? Does data have to get shifted about a lot on each insert where there is a clustered index, or is it smarter than that?
Mr. Flibble
A table with a clustered index is (by definition) sorted on the indexed columns. So inserting across all values is probably going to cost. It could actually be worse with a partitioned table, though - you'd have to suck it and see. Don't forget to try a non-clustered index in the comparison, either!
Mike Woodhouse
+2  A: 

No, indexes don't suck, but you have to pay attention to how you use them or they can backfire on the performance of your queries.

First: Schema / design
Why would you create a table with only one column? That's probably taking normalization one step too far. Database design is one of the most important things to consider in optimizing performance.

Second: Indexes
In a nutshell, an index lets the database perform a tree (B-tree) search for your record. Without an index on a column (or set of columns) the database will often fall back to a table scan. A table scan is very expensive because it involves reading each and every record.

For an index search it doesn't really matter THAT much how many records there are in the table: because the tree is balanced, doubling the number of records adds only one extra step to the search.

Determine the primary key of your table; SQL Server will automatically place a clustered index on that column (or columns). Clustered indexes perform really well. In addition, you can place non-clustered indexes on columns that are used often in SELECT, JOIN, WHERE, GROUP BY and ORDER BY clauses. Do remember that the clustered index key is included in every non-clustered index, so try never to include your clustered index columns in a non-clustered index.
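
A small SQL Server-style sketch (hypothetical names):

    CREATE TABLE dbo.Customer (
        CustomerId INT PRIMARY KEY,   -- gets the clustered index
        LastName   NVARCHAR(50),
        City       NVARCHAR(50)
    );
    -- Support frequent "WHERE City = ..." filters:
    CREATE NONCLUSTERED INDEX IX_Customer_City
        ON dbo.Customer (City);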

Also interesting is the fill factor on your indexes. Do you want to optimize your table for reads (high fill factor: less storage, less IO) or for writes (low fill factor: more storage, less rebuilding of your database pages)?
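
For example, in SQL Server syntax, reusing the hypothetical dbo.Customer table above:

    -- Leave 20% of each leaf page free to absorb inserts and
    -- reduce page splits, at the cost of more storage and IO:
    CREATE NONCLUSTERED INDEX IX_Customer_LastName
        ON dbo.Customer (LastName)
        WITH (FILLFACTOR = 80);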

Third: Partitioning
One of the reasons to use partitioning is to optimize your data access. Let's say you have 1 million records of which 500,000 records are no longer relevant but stored for archiving purposes. In this case you could decide to partition the table and store the 500,000 old records on slow storage and the other 500,000 records on fast storage.
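
A hedged SQL Server sketch of that split (object and filegroup names are hypothetical, and the filegroups are assumed to exist already):

    CREATE PARTITION FUNCTION pfArchive (DATETIME)
        AS RANGE LEFT FOR VALUES ('2008-01-01');
    CREATE PARTITION SCHEME psArchive
        AS PARTITION pfArchive TO (SlowStorageFG, FastStorageFG);
    -- Rows on or before the boundary land on slow storage,
    -- newer rows on fast storage:
    CREATE TABLE dbo.Records (
        RecordId  INT,
        CreatedOn DATETIME,
        Payload   VARCHAR(100)
    ) ON psArchive (CreatedOn);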

To measure is to know
The best way to get insight into what happens is to measure what happens to your CPU and IO. Microsoft SQL Server has tools like the Profiler and the execution plans in Management Studio that will tell you the duration of your query, the number of reads/writes and the CPU usage. The execution plan will also tell you which indexes are being used, if any. You might be surprised to see a table scan where you didn't expect one.
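
A quick way to measure, in SQL Server (the query itself is hypothetical):

    SET STATISTICS IO ON;    -- report logical/physical reads
    SET STATISTICS TIME ON;  -- report CPU and elapsed time
    SELECT City, COUNT(*)
    FROM dbo.Customer
    GROUP BY City;
    -- In Management Studio, enable "Include Actual Execution Plan"
    -- to see whether an index seek, index scan or table scan ran.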

Zyphrax
Heh, I did not mean the table has only one column. I mean it has one column in particular that I want to index. I've edited the question to make this clearer.
Mr. Flibble
Excellent answer. Very detailed.
Yves M.