views:

864

answers:

4

As a more general case of this question because I think it may be of interest to more people...What's the best way to perform a fulltext search on two tables? Assume there are three tables, one for programs (with submitter_id) and one each for tags and descriptions with object_id: foreign keys referring to records in programs. We want the submitter_id of programs with certain text in their tags OR descriptions. We have to use MATCH AGAINST for reasons that I won't go into here. Don't get hung up on that aspect.

programs
  id
  submitter_id
tags_programs
  object_id
  text
descriptions_programs
  object_id
  text

The following works and executes in a 20ms or so:

SELECT p.submitter_id
FROM programs p
WHERE p.id IN
    (SELECT t.object_id
    FROM titles_programs t
    WHERE MATCH (t.text) AGAINST ('china')
UNION ALL
    SELECT d.object_id
    FROM descriptions_programs d
    WHERE MATCH (d.text) AGAINST ('china'))

but I tried to rewrite this as a JOIN as follows and it runs for a very long time. I have to kill it after 60 seconds.

SELECT p.id 
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id=p.id AND MATCH (d.text) AGAINST ('china'))
OR    (t.object_id=p.id AND MATCH (t.text) AGAINST ('china'))

Just out of curiosity I replaced the OR with AND. That also runs in s few milliseconds, but it's not what I need. What's wrong with the above second query? I can live with the UNION and subselects, but I'd like to understand.

+1  A: 

Join after the filters (e.g. join the results), don't try to join and then filter.

The reason is that you lose use of your fulltext index.

Clarification in response to the comment: I'm using the word join generically here, not as JOIN but as a synonym for merge or combine.

I'm essentially saying you should use the first (faster) query, or something like it. The reason it's faster is that each of the subqueries is sufficiently uncluttered that the db can use that table's full text index to do the select very quickly. Joining the two (presumably much smaller) result sets (with UNION) is also fast. This means the whole thing is fast.

The slow version winds up walking through lots of data testing it to see if it's what you want, rather than quickly winnowing the data down and only searching through rows you are likely to actually want.

MarkusQ
Is the syntax for that any different than the first example?
Doug Kaye
I don't follow, Markus. (a) How would you write 'join after the filters?' and (b) 'you lose use of your fulltext index.???
Doug Kaye
A: 

If you join both tables you end up having lots of records to inspect. Just as an example, if both tables have 100,000 records, fully joining them give you with 10,000,000,000 records (10 billion!).

If you change the OR by AND, then you allow the engine to filter out all records from table descriptions_programs which doesn't match 'china', and only then joining with titles_programs.

Anyway, that's not what you need, so I'd recommend sticking to the UNION way.

Seb
Is that math correct? If I have 100,000 programs and each one has a title, why wouldn't the join of programs and tags yield just 100,000 rows? And if you also join 100,000 descriptions, don't you still have only 100,000 rows?
Doug Kaye
If you want to match programs with titles, then match then in the join clause. If you just join them without any ON clause, then all rows are matched. Do something like FROM descriptions_programs d JOIN tags_programs t ON d.object_id = t.objecT_id JOIN programs p ON t.object_id = p.id
Seb
A: 

The union is the proper way to go. The join will pull in both full text indexes at once and can multiple the number of checks actually preformed.

sfossen
A: 

Just in case you don't know: MySQL has a built in statement called EXPLAIN that can be used to see what's going on under the surface. There's a lot of articles about this, so I won't be going into any detail, but for each table it provides an estimate for the number of rows it will need to process. If you look at the "rows" column in the EXPLAIN result for the second query you'll probably see that the number of rows is quite large, and certainly a lot larger than from the first one.

The net is full of warnings about using subqueries in MySQL, but it turns out that many times the developer is smarter than the MySQL optimizer. Filtering results in some manner before joining can cause major performance boosts in many cases.

Emil H