views:

682

answers:

2

This MySQL query works just fine

SELECT o.id 
FROM descriptions_programs d, titles_programs t, programs o
WHERE (d.object_id=o.id 
     AND MATCH (d.text) AGAINST ('+china' IN BOOLEAN MODE) 
     AND d.current=1)
AND   (t.object_id=o.id 
     AND MATCH (t.text) AGAINST ('+china' IN BOOLEAN MODE) 
     AND t.current=1)

But if I replace one AND with an OR, the query runs a very long time. (I have to kill it.):

SELECT o.id 
FROM descriptions_programs d, titles_programs t, programs o
WHERE (d.object_id=o.id 
     AND MATCH (d.text) AGAINST ('+china' IN BOOLEAN MODE) 
     AND d.current=1)
OR    (t.object_id=o.id 
     AND MATCH (t.text) AGAINST ('+china' IN BOOLEAN MODE) 
     AND t.current=1)

Why is this? Don't get hung up on the simplicity of +china. I've just simplified this for the sake of debugging. Also, if I run with just one of the MATCH AGAINST tests, it works fine, so both are okay by themselves. I get the sense that I'm inadvertently causing a huge join by USING OR, but I just don't get it. I was previously using a n IN test on a UNION of two subselects an that worked, but this should work, too. Right?

Update: per bobince's request. It's not super slow, but at ~500ms, it's not nearly as fast as using a UNION as discussed here.

mysql> explain SELECT o.id
    -> FROM programs o
    -> JOIN titles_programs t ON t.object_id=o.id
    -> JOIN descriptions_programs d ON d.object_id=o.id
    -> WHERE MATCH (d.text) AGAINST ('+china' IN BOOLEAN MODE) AND d.current=1
    -> OR MATCH (t.text) AGAINST ('+china' IN BOOLEAN MODE) AND t.current=1
    -> ;
+----+-------------+-------+-------+

----------------+----------------+---------+----------------------+--------+-------------+
| id | select_type | table | type  | possible_keys  | key            | key_len | ref                  | rows   | Extra       |
+----+-------------+-------+-------+----------------+----------------+---------+----------------------+--------+-------------+
|  1 | SIMPLE      | o     | index | PRIMARY        | PRIMARY        | 4       | NULL                 | 148666 | Using index | 
|  1 | SIMPLE      | d     | ref   | object_current | object_current | 4       | haystack.o.id        |      1 |             | 
|  1 | SIMPLE      | t     | ref   | object_current | object_current | 4       | haystack.d.object_id |      1 | Using where | 
+----+-------------+-------+-------+----------------+----------------+---------+----------------------+--------+-------------+
+1  A: 

Your problem is that the joins between o and d and t need to happen in ALL cases. That is, you need:

SELECT o.id 
FROM descriptions_programs d, titles_programs t, programs o
WHERE d.object_id=o.id AND t.object_id=o.id AND
(
        MATCH (d.text) AGAINST ('+china' IN BOOLEAN MODE) 
        AND d.current=1
) OR (  MATCH (t.text) AGAINST ('+china' IN BOOLEAN MODE) 
        AND t.current=1
)

Why? Because in your first query you can ignore those parenthesis -- everything is ANDed together and the tables join up fine. In your second query, that's not true.

Consider what the database is really doing: It's taking "all the rows in t" and crossing that with "all the rows in d", therefore t*d rows. Normally you use joins (as I did) to limit that to a linear list of valid rows.

But in your OR query you allow either row to match o instead of both matching o, so for each row in one table that matches you're also selecting all the rows in the other table.

Jason Cohen
Still no difference. Runs for >1 minute before I have to kill it. Change the OR to an AND, and it executes in milliseconds. Same if you remove either of the two MATCH AGAINST phrases.
Doug Kaye
Try running against just the 't' table. Eliminate 'd' from BOTH the "FROM" and "WHERE" clauses. Then what is the running time? If the answer is "it takes forever," then it's not a matter of "OR," it's just that table scan is really slow.
Jason Cohen
You should be careful about which "OR" you change to an "AND": or you get t*d behavior. Also, you should put parentheses in cases of ambiguity: I don't know what the MySQL convention is, but if you say "foo and bar or car" then it could mean "(foo and bar) or car" which is bad!
David Grayson
Also try moving the "d.current=1" and "t.current=1" outside the "OR" as well -- just have the MATCH inside. Same reason.
Jason Cohen
+1  A: 

Jason's answer is spot on. Additionally I'd try to use the more modern ANSI join syntax to take the load off the WHERE clause so relieve the confusion there:

SELECT o.id
FROM programs o
JOIN titles_programs t ON t.object_id=o.id
JOIN descriptions_programs d ON d.object_id=o.id
WHERE MATCH (d.text) AGAINST ('+china' IN BOOLEAN MODE) AND d.current=1
OR MATCH (t.text) AGAINST ('+china' IN BOOLEAN MODE) AND t.current=1

This will stop the inadvertant cross-join causing a combinatorial explosion; I'd expect it to operate in reasonable time unless the database was really huge.

If not, can you post the results of an EXPLAIN SELECT of the above? Presumably one or both of the fulltext indexes is not being used. I could certainly imagine the query optimiser failing to use the second fulltext index, by doing something like trying to ‘fill in’ the rows that didn't match the first fulltext query instead of going straight to the index, or something.

Normally when you want to fulltext index over two columns in combination, you create one index over both columns. This would in any case be much faster. However it would mean you have to put titles and descriptions in the same table. This may not be such a hardship: since fulltext only works on MyISAM tables (and you don't typically want your canonical data in MyISAM tables) you can keep the definitive copy of your data in properly normalised InnoDB tables, with an additional MyISAM table containing only stripped and stemmed search-bait.

If none of that is any good... well, I guess I'd go back to the UNIONing you mentioned, coupled with an application-level filter to remove duplicate IDs.

bobince