views:

96

answers:

4

I am trying to use the following query on a table with ~200k records in it. There are all sorts of other fields that can be filtered by, but this is a base example.

SELECT b.isbn FROM books b 
WHERE 
b.price IS NOT NULL AND 
b.deleted = '' AND 
b.publication_date <= '2009-12-04' AND 
(
  b.subject1_id IN ('CAT1','CAT2','CAT3','CAT4','CAT5') OR
  b.subject2_id IN ('CAT1','CAT2','CAT3','CAT4','CAT5') OR
  b.subject3_id IN ('CAT1','CAT2','CAT3','CAT4','CAT5')
)

Currently, I have a separate index on all of these fields and this query takes ~4.5 seconds, which is way too long. EXPLAIN lists NULL under key.

I also tried creating one large index that included all of the fields in the above query, but EXPLAIN shows that this multi-field index is not used.

How can I index these fields to speed up my queries?

EDIT: Here are my current indices (none of which seem to be used by the query):

  • index(price)
  • index(deleted)
  • index(publication_date)
  • index(subject1_id)
  • index(subject2_id)
  • index(subject3_id)
  • index(price, deleted, publication_date, subject1_id, subject2_id, subject3_id)

EDIT2: Per ʞɔıu's answer - after normalizing the tables and using basically his query, it does speed it up some (time is ~3.5 seconds now), but not as much as I'm looking for. I indexed the new table as PRIMARY KEY(isbn, subject_id) and this index is being used for the join.

EDIT3: I added an additional index on the second table (subject_id, isbn), which helps. The addition of the other index that ʞɔıu mentions below helps a small bit, but only gets used when I use "FORCE INDEX" on the query. It's down to about 1.5 seconds now. Is there hope of getting it much lower?

+3  A: 

You need to normalize your schema before indexing will be of any help in this case.

You can create another table that contains (subject, isbn), add indexes on book and subject, then join to that table, like:

select b.isbn from books b
inner join book_subject bs on bs.isbn=b.isbn
where 
    b.price is not null and b.deleted != 'DELETED'
    AND b.publication_date <= '2009-12-04' 
    AND bs.subject in ('CAT1', 'CAT2'...)

Rule #1 (literally) in schema normalization is: "no repeating groups". Having the OR operation in the where clause across the 3 subject columns is going to prevent you from being able to take advantage of an index for that part of the query.

(updated to reflect that isbn is the primary key)

ʞɔıu
Well, the reason it isn't normalized is only because that is the format of the source data. Will having to join a table of 200k records with another table of up to ~600k records really help the query's efficiency?
Wickethewok
Very likely, because it will be able to use an index and the cost will be a function of the *log* of the number of rows examined.
ʞɔıu
Pretty good stuff - you may want to add your comments on my question to your answer, as that additional index on the subject table was pretty crucial.
Wickethewok
A: 

I'd think more about what your query actually means and that may help lead the way to your answer.

Let's desugar it to show you the issue.

SELECT b.isbn FROM books b 
WHERE 
b.price IS NOT NULL AND 
b.deleted != 'DELETED' AND 
b.publication_date <= '2009-12-04' AND 
(
  b.subject1_id = 'CAT1' OR 
  b.subject1_id = 'CAT2' OR 
  b.subject1_id = 'CAT3' OR 
  b.subject1_id = 'CAT4' OR 
  b.subject1_id = 'CAT5' OR
  b.subject2_id = 'CAT1' OR 
  b.subject2_id = 'CAT2' OR 
  b.subject2_id = 'CAT3' OR 
  b.subject2_id = 'CAT4' OR 
  b.subject2_id = 'CAT5' OR
  b.subject3_id = 'CAT1' OR 
  b.subject3_id = 'CAT2' OR 
  b.subject3_id = 'CAT3' OR 
  b.subject3_id = 'CAT4' OR 
  b.subject3_id = 'CAT5'

)

There obviously isn't any index it will use beyond ones for (price,deleted,publication_date,subject1) and so on for the other subjects.

What fields are in the index?

EntropyFails
See my above edit as to the indexed fields.
Wickethewok
A: 

In relation to nick's post :

create another table that contains (subject, book_id), add indexes on book and subject:

wouldn't it be cleaner to have

select b.isbn from books b

where 
#various table b where restrictions

AND b.isbn IN (
    Select isbn
    from book_subject bs 
    where bs.subject IN ('CAT1', 'CAT2' ...)
)
Zak
I'm a little confused as to your answer. "isbn" is the primary key if that helps.
Wickethewok
This query would require a temporary table to store the intermediate results of the subquery; the inner join method would not. mysql's subquery operations are not very well optimized compared to other RDBMS's.
ʞɔıu
updated to use isbn as PK
Zak
A: 

First of all: MySQL can use only one Index per table during a select. It tries to choose the best possible index, but sometimes the server cannot decide for several reasons. Having multiple indexes on only one field each would only help, if you had a lot of statements running with only one where clause at a time.

To optimize here: You need to create an index which does include the fields

price
deleted
publication_date

DO NOT include the categories, since you are using the OR clause.

ALTER TABLE `test`.`books` ADD INDEX `idxPriceDeletedPublication`(`price`, `deleted`, `publication_date`);

This should then give you the following EXPLAIN output:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: b
         type: range
possible_keys: idxPriceDeletedPublication
          key: idxPriceDeletedPublication
      key_len: 5
          ref: NULL
         rows: 1
        Extra: Using where
1 row in set (0.00 sec)
Cassy
it wouldn't be able to an index for the deleted column; you can't index on not equals. A better strategy would be to query on deleted = 'NOTDELETED' or whatever the opposite of the 'DELETED' state is
ʞɔıu
I added this three component index. MySQL won't use it on my query by default and when I use "FORCE INDEX" it actually takes 13 seconds for some reason.The opposite of 'DELETED' is '' and ʞɔıu is right that I should use that.
Wickethewok