I have a MySQL 5.1 InnoDB table (customers) with the following structure:

int         record_id (PRIMARY KEY)
int         user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..

There are roughly 7 million rows in the table. Currently, the table is being queried like this:

SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...

In the actual query there are currently over 560 user_ids in the IN clause. With several million records in the table, this query is slow!

There are secondary indexes on the table, the first of which is on user_id itself, which I thought would help.

I know that SELECT * is A Bad Thing and this will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles. There are another 50 of those being returned, but they are needed for the report.

I imagine there's a much better way to access the data for the user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?

I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.

EDIT Ran EXPLAIN, which said:

select_type = SIMPLE 
table = customers 
type = range 
possible_keys = userid_idx 
key = userid_idx 
key_len = 5 
ref = (NULL) 
rows = 637640 
Extra = Using where

does that help?

+3  A: 

First, check if there is an index on USER_ID and make sure it's used.

You can do that by running EXPLAIN.
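
For example (with a few of the sample IDs from the question):

EXPLAIN SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345);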

Second, create a temporary table and use it in a JOIN:

CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL, PRIMARY KEY (user_id));

SELECT  *
FROM    temptable t
JOIN    customers c
ON      c.user_id = t.user_id;
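
Before the join runs, the temporary table has to be populated with the IDs. A minimal sketch, using the sample values from the question (the real statement would list all ~560):

INSERT INTO temptable (user_id)
VALUES  (32343), (45676), (12345), (98765), (66010);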

Third, how many rows does your query return?

If it returns almost all of the rows, it will just be slow, since it has to pump all those millions of rows over the connection channel to begin with.

NULL will not slow your query down, since the IN condition only matches non-NULL values, which are indexed.

Update:

The index is used, the plan is fine except that it returns more than half a million rows.

Do you really need to put all these 638,000 rows into the report?

Hope it's not printed: bad for rainforests, global warming and stuff.

Speaking seriously, you seem to need either aggregation or pagination on your query.
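
For pagination, a minimal sketch (the page size and ordering column are arbitrary here; each subsequent page re-runs the query with a larger offset):

SELECT  *
FROM    customers
WHERE   user_id IN (32343, 45676, 12345 /* ...the rest of the IDs... */)
ORDER BY record_id
LIMIT   1000 OFFSET 0;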

Quassnoi
Thanks for replying. I'll run an EXPLAIN and post back here. The query returns ~638,000 rows at the moment. I'll try putting the user_ids in a temporary table if you think that'll be faster.
Jaymie
EXPLAIN says: select_type = SIMPLE, table = customers, type = range, possible_keys = userid_idx, key = userid_idx, key_len = 5, ref = (NULL), rows = 637640, Extra = Using where. Does that help?
Jaymie
EXPLAIN is OK, the index is being used. There are just lots of rows you don't seem to need. Aggregate or paginate them: no human being is able to browse over 638,000 rows.
Quassnoi
This is true, but Crystal Reports can. Well, saying that.... ;o)
Jaymie
Maybe it's better to use aggregation on the database side and feed CR with aggregated values, then? If you have 50 columns, then each row will take several kilobytes, and your report file will be several GB long. It will take minutes just to SAVE this file onto the HDD, to say nothing of processing it.
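
For instance, a rough sketch of pushing the aggregation into MySQL (the aggregate column here is made up; the real ones depend on what the report actually totals):

SELECT   user_id, COUNT(*) AS customer_rows
FROM     customers
WHERE    user_id IN (32343, 45676, 12345 /* ...the rest of the IDs... */)
GROUP BY user_id;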
Quassnoi
A: 

You can try inserting the ids you need to query on into a temp table and inner joining the two tables. I don't know if that would help.

Eric Hogue
+1  A: 

Are they the same ~560 ids every time? Or is it a different set of ~560 ids on different runs of the query?

You could just insert your 560 UserIDs into a separate table (or even a temp table), stick an index on that table and inner join it to your original table.

Eoin Campbell
Thanks for replying. They're going to change each time. I really like the temp table idea.
Jaymie
+1  A: 

Is this your most important query? Is this a transactional table?

If so, try creating a clustered index on user_id. Your query might be slow because it still must make random disk reads to retrieve the columns (key lookups), even after finding the records that match (index seek on the user_id index).

If you cannot change the clustered index, then you might want to consider an ETL process (simplest is a trigger that inserts into another table with the best indexing). This should yield faster results.
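
A rough sketch of the trigger idea, assuming a reporting table (customers_report here, a made-up name) indexed to suit the report; a real setup would also need UPDATE and DELETE triggers to keep the copy in sync:

CREATE TRIGGER customers_ai AFTER INSERT ON customers
FOR EACH ROW
  INSERT INTO customers_report (record_id, user_id, postcode, region)
  VALUES (NEW.record_id, NEW.user_id, NEW.postcode, NEW.region);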

Also note that such large queries may take some time to parse, so help it out by putting the queried ids into a temp table if possible.

Jeff Meatball Yang
+2  A: 

"Select *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: There is sometimes an exception when you have large BLOBs, but that is an edge-case).

First things first - does your database fit in RAM? If not, get more RAM. No, seriously. Now, supposing your database is too huge to reasonably fit into RAM (say, > 32GB), you should try to reduce the number of random I/Os, as they are probably what's holding things up.

I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID1 (or RAID10 etc.) and at least two spindles. If you're not, go away and get that.

You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are ok, and if you're doing a lot of queries on one criterion (say user_id) it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
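
A sketch of what that could look like, assuming user_id can be made NOT NULL (see the NULL discussion above) and that rebuilding a 7-million-row table during a maintenance window is acceptable:

ALTER TABLE customers
  MODIFY user_id INT NOT NULL,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (user_id, record_id),
  ADD UNIQUE KEY uk_record_id (record_id);  -- keeps record_id lookups fast (and satisfies AUTO_INCREMENT, if used)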

Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
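
For illustration only (and only worthwhile if the report can be narrowed to the indexed columns, which is not the case with ~50 columns today), a covering-index sketch:

ALTER TABLE customers ADD INDEX ix_user_cover (user_id, postcode, region);

SELECT  user_id, postcode, region
FROM    customers
WHERE   user_id IN (32343, 45676, 12345 /* ...the rest of the IDs... */);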

As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.

BUT my biggest tips are:

  • Have a goal in mind, work out what it is, and when you reach it, stop.
  • Don't take anybody's word for it - try it and see
  • Ensure that your performance test system is the same hardware spec as production
  • Ensure that your performance test system has the same data size and kind as production (same schema is not good enough!).
  • Use synthetic data if it is not possible to use production data (copying production data may be logistically difficult (remember your database is > 32GB); it may also violate security policies).
  • If your query is optimal (as it probably already is), try tuning the schema, then the database itself.
MarkR