views:

206

answers:

4

Let's say I have a large database that consists of products in groups. Let's say that there are 5 groups, each of them has 100,000 products. the product ids are random integers (so are the group ids)

I need to find a product in a specific group. My question is which primary key is more efficient:

  1. (sid, pid)
  2. (pid, sid)

sid, pid is intuitive, but when searching in this order, MySQL will have to isolate 100,000 out of the 500,000 rows and then find a single number in 100,000. On the other hand, (pid, sid) sounds more optimal to me since it will force mysql not to create the large 100,000 group in the first stage, but to go directly to the right item (or up to 5 items if there are similar pids in different cids).

Is #2 indeed faster?

UPDATE: OK. I copied a real table to two copies. table0 has primary key sid,pid. table1 has pid,sid.

result of query:

explain select * from items0 where sid = 22746 and pid = 2109418034 1, 'SIMPLE', 'items0', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,const', 14, ''

explain select * from items1 where sid = 22746 and pid = 2109418034

1, 'SIMPLE', 'items1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,const', 11, ''

Yet another update: I also added the two keys to the same table and run explain. got this: (Primary starts with sid_pid1, Index2 starts with pid1,sid)

1, 'SIMPLE', 'items', 'ref', 'PRIMARY,index_2', 'index_2', '8', 'const,const', 13, ''

I'm not sure, what conclusions can I draw from this test?

+8  A: 

Don't guess, test.

Copy the database, try both keys, and find out for yourself. Then post your results, with a disclaimer that the user should try it for themselves with their dataset, but here are your results.

Tom Ritter
+1 No optimization based on intuition
balpha
@balpha: While I upvoted your comment, intuition can often provide a good starting point, so to say it shouldn't be "based" on it is slightly alarmist, I think.
Adam Robinson
@Adam Robinson: Conceded. Should be "No optimization *solely* based on intution."
balpha
"If you are not measuring you are not engineering" - Rico Mariani - CLR Performance Architect (currently Visual Studio Team)
devdimi
I would try all four indexes of 1) sid, 2) pid, 3) sid, pid and 4) pid, sid and see which indexes get used.
Paul Morgan
Please see update above
Nir
+4  A: 

Add both keys as non-primary (or one as primary and one as non-primary), then run your desired query with "EXPLAIN" added to the front. This will make MySQL show you which key it chose.

DreadPirateShawn
More details here: http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
Eli
Please see my updates above. Still not sure whats the answer though.
Nir
Re: update -- Looks like you want "pid,sid" for your key. Table0 (sid,pid) took 14 row lookups to find your result, while table1 (pid,sid) took only 11 row lookups. And when you added both to the same table, MySQL still chose to use index2 (pid,sid). Note: Sometimes the answer changes once the table contains more data, if the distribution changes significantly. After using the table heavily, try running the same EXPLAIN tests to see if the same index still has the advantage.
DreadPirateShawn
+1  A: 

As Tom said, test it and find out, but it will probably depend on the types of queries you will be doing. I assume you'll be using this table to join products to groups?

If your queries are mostly of the "which groups are this product in" type, then (pid, sid) will probably be fastest.

If they are of the "give me all the products in this group" type, then (sid, pid) will probably be faster.

Eric Petroelje
Hi, The goal is to get full details of a known product(s) from a known group.
Nir
@Nir - then the (sid,pid) order would probably be best. But test it first to verify that.
Eric Petroelje
+4  A: 

The performance of a SQL DBMS query depends GREATLY on a large number of factors - how fragmented the table (or index) is, the freshness and amount of data/index statistics, the size of your data caches/how much CPU/memory, how many rows are in the table, the query construction, etc. etc. etc.

Although profiling queries is a necessary part of performance tuning it alone is not sufficient -- it must be part of a larger query optimization strategy. Saying "test it and see" is not very helpful (and in my opinion sometimes dangerous!) in the general case because of the non-deterministic nature of the query optimization process. One day running it can be just fine, the next slow (or vice versa).

Without an understanding of the fundamentals of MySQL index construction, what queries will be used, and how queries will use indexes any ad hoc tests are in the best case lucky guesses and in the worst case ticking time bombs.

In this case there IS a rule of thumb due to the nature of how MySQL B-Trees are constructed. From the MySQL internals page: http://forge.mysql.com/wiki/MySQL_Internals_MyISAM#The_.MYI_file you can see that in the case of a non-unique BTREE index on two columns MySQL will store the concatenated values in the order that you specify. In that specific example they stored ASCII (or UNICODE) but in the case of integer values it will do something similar (open a hex editor and decode the actual values if you are intrepid enough!) ( also ref'd here http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html ).

So, the rule of thumb is to put the most selective ( ref http://www.akadia.com/services/ora_index_selectivity.html ) value first because that gives the query processor the most information to narrow down the # of rows to be processed. Placing a less selective key FIRST will force the optimizer to consider more rows and, unless that is what you EXACTLY want, will be suboptimal by design.

Also to piggy back on what Eric said: MySQL (or other DBMS') can use any/all keys in increasing fashion to help narrow down the search -- e.g. if you place an index on( A, B, C ) then queries that have WHERE A = .. B = can use it (depending), queries that use WHERE A = can use it, but queries that ask for WHERE C = cannot (usually).

So, it also depends on the nature of your queries -- if you always ask for WHERE pid = AND sid = then the most selective one should go first (product ID) but if you often ask for WHERE sid = XXXX by itself, then the sid should go first (OR just create another index for that situation if there's varying amounts). The trade-off here is for time/space -- having an additional index will satisfy a different class of queries at the expense of additional disk space and increased write I/O.

Finally, if you are using INNODB you can specify a "clustered" index that actually sorts rows on disk (MyISAM tables are basically heaps). If you cluster the rows on disk by sid, pid then it will actually group them together so you can fetch entire BLOCKS (or pages) of products at a time which will use vastly less I/O than BTREEs alone (ref http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/ )

So, you can see why "test it and see" is useful but without an understanding of MySQL index fundamentals you miss out on a whole class of optimizations.

Matt Rogish