ansaurus

Question

MySQL: low cardinality/selectivity columns = how to index ?

Answer 1

+1 A:

Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.

Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad to SO community showed me what's up). If however, there is a limiting factor, such as "limit 10" an index may help. Also, remember that indexes are also used in group by and order by optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.

You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!

Mike Sherov 2010-03-05 13:07:41

Hmm, why the downvote? Am I mistaken? If so, I'd love to know why, as it would mean I've missed something fundamental about MySql and indices.

Mike Sherov 2010-03-05 13:13:15

I think you are. Indexing involves lookups, therefore it's faster to read 300 rows sequentially than jump back and forth 150 times (disclaimer: I didn't downvote :)

stereofrog 2010-03-05 13:16:27

@stereofrog: are you sure? What about the fact that indexes are stored in the key buffer, where as data is not? Also, I'm not sure what "jumping back and forth" means.

Mike Sherov 2010-03-05 13:21:53

Hello Mike, Thank you very much for your answer, it was very helpful to me.

Jan 2010-03-07 21:13:33

Answer 2

+1 A:

You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)

stereofrog 2010-03-05 13:13:22

This answer doesn't address the question as asked. If he added that index, and he now wants to do a search just by last name, it's a table scan. Also, what if wants the last disabled ten records? As asked, he wants "where status=disabled". Adding fullname to the index may be unnecessary overhead.

Mike Sherov 2010-03-05 13:18:28

but only if you don't use `where fullname like '%something%'` as an index is not that usefull in like with wildcards on _both_ sides.

extraneon 2010-03-05 13:19:31

No, if status comes first in the index, you do NOT have an index on fullname. Ordering of columns matters.

Mike Sherov 2010-03-05 13:26:07

Answer 3

+3 A:

I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the amount of full records searches for MySQL, thereby limiting IO which usually is the bottleneck.

This indexing is not free; you pay for it on inserts/updates when the index has to be updated and in the search itself, as it now needs to load the index file (full text index for 300M records is probably not in memory). So it might well be that you get extra IO in stead of limitting it.

I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.

If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.

It just popped into my head that you can check wether an index is actually used by using the explain statement. That should show you how MySQL is optimizing the query. I don't really know hoe MySQL optimizes queries, but from postgresql I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy on the database, create an index on the table and see wether it's actually used. As I said, I doubt it, but I most definitely don't know everything:)

extraneon 2010-03-05 13:18:21

+1 The use of either partitioning or 2 seperate tables is a good suggestion.

ar 2010-03-05 13:20:20

This is a good discussion. I agree about partitioning, but only if he doesn't want records regardless of status. If every query involves a where on status, partitioning makes sense to me.

Mike Sherov 2010-03-05 13:28:17

Also, the penalty on insert, in my opinion, is outweighed by the boost on select. If status becomes a tinyint, and you have a properly tuned mysql server, the index file for 300M records easily fits in key buffer, after a bit of warm up time.

Mike Sherov 2010-03-05 13:31:33

I see, when it's simply where status = disabled, index would be worse. Thanks!

Mike Sherov 2010-03-05 13:58:30

+1 for partitioning. OP can also use a merge table for selects that don't involve "status".

stereofrog 2010-03-05 14:20:54

Thank you very much for your answer!

Jan 2010-03-07 21:12:21

Answer 4

+1 A:

The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.

The reason for this is related to how a database accesses a table. Tables can be assessed either by a full table scan, where each block is read and processed in turn. Or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.

In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.

Now back to your example, you have a where clause of where status = 'enabled', the index will return 150m rows and the database will have to read each row in turn using separate small reads. Whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.

There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.

Reference: http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html

ar 2010-03-05 13:19:32

I see what you're saying, but usually there will be some other limiting factor. For example, let's say he adds limit 10, then an index is better, no? I suppose as asked, you are right

Mike Sherov 2010-03-05 13:40:16

nice and clear, +1

stereofrog 2010-03-05 14:16:20

ansaurus

tags:

views:

answers:

MySQL: low cardinality/selectivity columns = how to index ?

related questions