tags:

views:

283

answers:

4

Hello Experts,

I need to add indexes to my table (columns) and stumbled across this post:

http://stackoverflow.com/questions/141232/how-many-database-indexes-is-too-many

Quote: “Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”

Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)

Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)

Further conditions:

  • The Database contains 300 Million records
  • Status can only be “enabled” and “disabled”
  • 150 Million records have status= enabled and 150 Million records have stauts= disabled

My understanding is, without having an index on status, a select with where status=’enabled’ would result in a full tablescan with 300 Million Records to process?

How efficient is the lookup when I use a BTREE index on status?

Should I index this column or not?

What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look records up by the "where status="enabled" clause in the given example with a very low cardinality/selectivity of the values?

Thank you very much!

Jan

+1  A: 

Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.

Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad to SO community showed me what's up). If however, there is a limiting factor, such as "limit 10" an index may help. Also, remember that indexes are also used in group by and order by optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.

You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!

Mike Sherov
Hmm, why the downvote? Am I mistaken? If so, I'd love to know why, as it would mean I've missed something fundamental about MySql and indices.
Mike Sherov
I think you are. Indexing involves lookups, therefore it's faster to read 300 rows sequentially than jump back and forth 150 times (disclaimer: I didn't downvote :)
stereofrog
@stereofrog: are you sure? What about the fact that indexes are stored in the key buffer, where as data is not? Also, I'm not sure what "jumping back and forth" means.
Mike Sherov
Hello Mike, Thank you very much for your answer, it was very helpful to me.
Jan
+1  A: 

You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)

stereofrog
This answer doesn't address the question as asked. If he added that index, and he now wants to do a search just by last name, it's a table scan. Also, what if wants the last disabled ten records? As asked, he wants "where status=disabled". Adding fullname to the index may be unnecessary overhead.
Mike Sherov
but only if you don't use `where fullname like '%something%'` as an index is not that usefull in like with wildcards on _both_ sides.
extraneon
No, if status comes first in the index, you do NOT have an index on fullname. Ordering of columns matters.
Mike Sherov
+3  A: 

I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the amount of full records searches for MySQL, thereby limiting IO which usually is the bottleneck.

This indexing is not free; you pay for it on inserts/updates when the index has to be updated and in the search itself, as it now needs to load the index file (full text index for 300M records is probably not in memory). So it might well be that you get extra IO in stead of limitting it.

I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.

If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.

It just popped into my head that you can check wether an index is actually used by using the explain statement. That should show you how MySQL is optimizing the query. I don't really know hoe MySQL optimizes queries, but from postgresql I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy on the database, create an index on the table and see wether it's actually used. As I said, I doubt it, but I most definitely don't know everything:)

extraneon
+1 The use of either partitioning or 2 seperate tables is a good suggestion.
ar
This is a good discussion. I agree about partitioning, but only if he doesn't want records regardless of status. If every query involves a where on status, partitioning makes sense to me.
Mike Sherov
Also, the penalty on insert, in my opinion, is outweighed by the boost on select. If status becomes a tinyint, and you have a properly tuned mysql server, the index file for 300M records easily fits in key buffer, after a bit of warm up time.
Mike Sherov
I see, when it's simply where status = disabled, index would be worse. Thanks!
Mike Sherov
+1 for partitioning. OP can also use a merge table for selects that don't involve "status".
stereofrog
Thank you very much for your answer!
Jan
+1  A: 

The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.

The reason for this is related to how a database accesses a table. Tables can be assessed either by a full table scan, where each block is read and processed in turn. Or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.

In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.

Now back to your example, you have a where clause of where status = 'enabled', the index will return 150m rows and the database will have to read each row in turn using separate small reads. Whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.

There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.

Reference: http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html

ar
I see what you're saying, but usually there will be some other limiting factor. For example, let's say he adds limit 10, then an index is better, no? I suppose as asked, you are right
Mike Sherov
nice and clear, +1
stereofrog