Two tables:

emails: id (int10) | ownership (int10)

messages: emailid (int10, indexed) | message (mediumtext)

A subquery (which performs terribly in MySQL):

SELECT COUNT(*) FROM messages WHERE message LIKE '%word%' AND emailid IN (SELECT id FROM emails WHERE ownership = 32)


The usage here is that I run a search on emails (obviously simplified in the sample above) that generates a list of, say, 3,000 email IDs. I then want to run a search against messages, because I need to do a text match on the message, but only for those 3,000 emails.

The query against messages is expensive (message is not indexed) but this is fine because it would only ever be checking against a few rows.

Ideas:

i) A join. My attempts at this so far have not worked and have resulted in full table scans of the messages table (i.e. the emailid index is not used).
ii) A temporary table. This could work, I think; see the sketch below.
iii) Cache the IDs in the client and run two queries. This does work, but it is not elegant.
iv) A subquery. MySQL runs the inner query again for each row, so this does not work; maybe it will be fixed in MySQL 6.
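
For what it's worth, here is a minimal sketch of idea (ii) against the simplified schema above; the temporary-table name is made up, and this is only an illustration of the approach, not something tested against the real data:

CREATE TEMPORARY TABLE tmp_email_ids (id INT UNSIGNED PRIMARY KEY);

-- Fast, indexed search first: collect the matching email ids.
INSERT INTO tmp_email_ids (id)
SELECT id FROM emails WHERE ownership = 32;

-- The expensive LIKE then runs only against messages belonging to those ids.
SELECT COUNT(*)
FROM messages
JOIN tmp_email_ids ON tmp_email_ids.id = messages.emailid
WHERE messages.message LIKE '%word%';

DROP TEMPORARY TABLE tmp_email_ids;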

Ok, here is what I have so far. These are the actual field names (I had simplified a bit in question).

The query:

SELECT COUNT(*) FROM ticket LEFT JOIN ticket_subject 
ON (ticket_subject.ticketid = ticket.id) 
WHERE category IN (1) 
AND ticket_subject.subject LIKE "%about%"

The results:

id  select_type  table           type    possible_keys     key       key_len  ref                rows   Extra
1   SIMPLE       ticket          ref     PRIMARY,category  category  4        const              28874
1   SIMPLE       ticket_subject  eq_ref  PRIMARY           PRIMARY   4        deskpro.ticket.id  1      Using where

It takes 0.41 seconds and returns a count(*) of 113.

Running:

SELECT COUNT(*) FROM ticket WHERE category IN (1)

Takes 0.01 seconds and finds 33,000 results.

Running

SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE "%about%"

Takes 0.14 seconds and finds 1,300 results.

Both the ticket table and ticket_subject table have 300,000 rows.

There is an index on ticket_subject.ticketid and ticket.category.
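
For reference, those indexes would have been created with something like the following (a sketch only; the index names are taken from the EXPLAIN output, and the exact DDL is an assumption):

CREATE INDEX category ON ticket (category);
CREATE INDEX ticketid ON ticket_subject (ticketid);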

I realise now that using the LIKE syntax was a mistake, as it has been a bit of a red herring, drawing attention to FULLTEXT. That is not the issue. The issue is:

1) Table A: very fast query, run on an index. 0.001 seconds.
2) Table B: moderate-to-slow query, no index, so it does a full table scan. 0.1 seconds.

Both of these results are fine. The problem is that I have to JOIN them, and the combined search takes 0.3 seconds, which makes no sense to me: the slow part of the combined query, on Table B, should now be quicker, because we are only searching over a fraction of that table. In other words, it should not be doing a full table scan, because the field being JOINed on is indexed.

+3  A: 
SELECT COUNT(*) 
FROM messages 
join emails ON emails.id = messages.emailid
WHERE message LIKE '%word%' 
AND ownership = 32

The problem, though, is with the '%word%': this will always require a scan of message. You might want to look into full-text search if you are using MyISAM.

Martin Smith
I was really trying to illustrate a situation of joining the results from a quick search with the results from a slow search. But in this case, the search on %word% should be very quick because it is only searching against a few hundred or maybe a few thousand rows which have been selected by the index.
Chris Padfield
@Chris - Could you update your question with the best of the queries that you have tried so far and its EXPLAIN plan?
Martin Smith
Just done that.
Chris Padfield
+2  A: 

I think this is what you are looking for:

select count(*)
from messages m
  inner join emails e
    on e.id = m.emailid
where m.message like '%word%'
  and e.ownership = 32

Hard to tell for sure how it will perform. If the full table scan is because of the leading wildcard on 'word', then doing it this way won't solve the problem. But the good news is that perhaps the join will limit the records in the messages table you have to look at.

MJB
Thanks, this performed at the same speed as the answer by Martin. It is 3 times slower than running the slow query (%word% against messages) on its own.
Chris Padfield
@Chris - I think that one problem is that you are doing a join on a non-indexed column -- emails.id -- so there is not much you can do to speed it up unless you index that column. You are forcing a full table scan (FTS) on that table as well.
MJB
This column is indexed. I have put an EXPLAIN above.
Chris Padfield
A: 

Is it possible for you to turn the join the other way around? The second query seems to be the less expensive one, and since the whole thing is a simple join, you want to perform the less expensive query first to narrow the data set as much as possible, and then join to your more expensive query.
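
One way to spell out that ordering in MySQL is STRAIGHT_JOIN, which forces the left table to be read first. This is only a sketch against the question's simplified schema, and whether it actually helps depends on the plan the optimizer was already choosing:

SELECT COUNT(*)
FROM emails
STRAIGHT_JOIN messages ON messages.emailid = emails.id
WHERE emails.ownership = 32           -- cheap, indexed filter is applied first
  AND messages.message LIKE '%word%'  -- expensive scan only on the joined rows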

Am
Well, the subquery would be fast, I believe, because it would use the index to get the list of messages it needs to check, and then would only process those. The problem is I can't create a join that seems to work with that logic; all the joins I have tried are 3 times slower than running the expensive query against the whole table.
Chris Padfield
+8  A: 

Remember to take advantage of Boolean short-circuit evaluation:

SELECT COUNT(*) 
FROM messages 
join emails ON emails.id = messages.emailid
WHERE ownership = 32 AND message LIKE '%word%'

This filters by ownership before it evaluates the LIKE predicate. Always put your cheaper expressions on the left.

Also, I agree with @Martin Smith and @MJB that you should consider using MySQL's FULLTEXT indexing to make this faster.


Re your comment and additional information, here's some analysis:

explain SELECT COUNT(*) FROM ticket WHERE category IN (1)\G

           id: 1
  select_type: SIMPLE
        table: ticket
         type: ref
possible_keys: category
          key: category
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index

The note "Using index" is a good thing to see because it means it can satisfy the query just by reading the index data structure, not even touching the data of the table. This is certain to run very fast.

explain SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE '%about%'\G

           id: 1
  select_type: SIMPLE
        table: ticket_subject
         type: ALL
possible_keys: NULL        <---- no possible keys
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
        Extra: Using where

This shows that there are no possible keys that can benefit the wildcard LIKE predicate. It uses the condition in the WHERE clause, but it has to evaluate it by running a table-scan.

explain SELECT COUNT(*) FROM ticket LEFT JOIN ticket_subject 
ON (ticket_subject.ticketid = ticket.id) 
WHERE category IN (1) 
AND ticket_subject.subject LIKE '%about%'\G

           id: 1
  select_type: SIMPLE
        table: ticket
         type: ref
possible_keys: PRIMARY,category
          key: category
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index

           id: 1
  select_type: SIMPLE
        table: ticket_subject
         type: ref
possible_keys: ticketid
          key: ticketid
      key_len: 4
          ref: test.ticket.id
         rows: 1
        Extra: Using where

Likewise, accessing the ticket table is quick, but that's spoiled by the table-scan incurred by the LIKE condition.

ALTER TABLE ticket_subject ENGINE=MyISAM;

CREATE FULLTEXT INDEX ticket_subject_fulltext ON ticket_subject(subject);

explain SELECT COUNT(*) FROM ticket JOIN ticket_subject  
ON (ticket_subject.ticketid = ticket.id)  
WHERE category IN (1)  AND MATCH(ticket_subject.subject) AGAINST('about')

           id: 1
  select_type: SIMPLE
        table: ticket
         type: ref
possible_keys: PRIMARY,category
          key: category
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index

           id: 1
  select_type: SIMPLE
        table: ticket_subject
         type: fulltext
possible_keys: ticketid,ticket_subject_fulltext
          key: ticket_subject_fulltext          <---- now it uses an index
      key_len: 0
          ref: 
         rows: 1
        Extra: Using where

You're never going to make LIKE perform well. See my presentation Practical Full-Text Search in MySQL.


Re your comment: Okay, I've done some experiments on a dataset of similar size (the Users and Badges tables in the Stack Overflow data dump :-). Here's what I found:

select count(*) from users
where reputation > 50000

+----------+
| count(*) |
+----------+
|       37 |
+----------+
1 row in set (0.00 sec)

That's really fast, because I have an index on the reputation column.

           id: 1
  select_type: SIMPLE
        table: users
         type: range
possible_keys: users_reputation_userid_displayname
          key: users_reputation_userid_displayname
      key_len: 4
          ref: NULL
         rows: 37
        Extra: Using where; Using index

select count(*) from badges
where badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|     1319 |
+----------+
1 row in set, 1 warning (0.63 sec)

That's as expected, since the table has 700k rows, and it has to do a table-scan. Now let's do the join:

select count(*) from users join badges using (userid)
where users.reputation > 50000 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|       19 |
+----------+
1 row in set, 1 warning (0.03 sec)

That doesn't seem so bad. Here's the explain report:

           id: 1
  select_type: SIMPLE
        table: users
         type: range
possible_keys: PRIMARY,users_reputation_userid_displayname
          key: users_reputation_userid_displayname
      key_len: 4
          ref: NULL
         rows: 37
        Extra: Using where; Using index

           id: 1
  select_type: SIMPLE
        table: badges
         type: ref
possible_keys: badges_userid
          key: badges_userid
      key_len: 8
          ref: testpattern.users.UserId
         rows: 1
        Extra: Using where

This does seem like it's using indexes intelligently for the join, and it helps that I have a compound index including userid and reputation. Remember that MySQL can use only one index per table in a given query, so it's important to define the right compound indexes for the query you need to run.
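
For illustration, the compound index mentioned above would be created along these lines (a sketch; the column list is inferred from the index name in the EXPLAIN output, so treat it as an assumption):

CREATE INDEX users_reputation_userid_displayname
  ON users (reputation, userid, displayname);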


Re your comment: OK, I've tried this where reputation > 5000, and where reputation > 500, and where reputation > 50. These should match a much larger set of users.

select count(*) from users join badges using (userid)
where users.reputation > 5000 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|      194 |
+----------+
1 row in set, 1 warning (0.27 sec)

select count(*) from users join badges using (userid)
where users.reputation > 500 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|      624 |
+----------+
1 row in set, 1 warning (0.93 sec)

select count(*) from users join badges using (userid)
where users.reputation > 50 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|     1067 |
+----------+
1 row in set, 1 warning (1.72 sec)

The explain report is the same in all cases, but if the query finds more matching rows in the Users table, then it naturally has to evaluate the LIKE predicate against a lot more matching rows in the Badges table.

It's true that there is some cost to doing a join. It's a little surprising that it's so dramatically expensive. But this can be mitigated if you use indexes.

I know you said you have a query that can't use an index, but perhaps it's time to consider creating a redundant column with some transformed version of the data of your original column, so you can index it. In the example above, I might create a column creationdate_day and populate it from DAYOFYEAR(creationdate).


Here's what I mean:

ALTER TABLE Badges ADD COLUMN creationdate_day SMALLINT;
UPDATE Badges SET creationdate_day = DAYOFYEAR(creationdate);
CREATE INDEX badge_creationdate_day ON Badges(creationdate_day);

select count(*) from users join badges using (userid)
where users.reputation > 50 and badges.creationdate_day = dayofyear('2010-06-24')

+----------+
| count(*) |
+----------+
|     1067 |
+----------+
1 row in set, 1 warning (0.01 sec)  <---- not too shabby!

Here's the explain report:

          id: 1
  select_type: SIMPLE
        table: badges
         type: ref
possible_keys: badges_userid,badge_creationdate_day
          key: badge_creationdate_day    <---- here is our new index
      key_len: 3
          ref: const
         rows: 1318
        Extra: Using where

           id: 1
  select_type: SIMPLE
        table: users
         type: eq_ref
possible_keys: PRIMARY,users_reputation_userid_displayname
          key: PRIMARY
      key_len: 8
          ref: testpattern.badges.UserId
         rows: 1
        Extra: Using where
Bill Karwin
+1 never thought about that
DrColossos
I'm not familiar with MySQL, but are you sure it doesn't already reorder the where predicates when creating the execution plan?
Mike
@Mike: Yes, I am sure. No programming language that supports short-circuit evaluation should reorder Boolean expressions!
Bill Karwin
I just ran some tests on this - the order has no effect in my trials. What is strange is that if you run the SQL queries separately, the word one takes 0.12 seconds and the ownership one takes 0.01 seconds. The JOIN, however, takes 0.37 seconds.
Chris Padfield
What I am saying above is that it takes 3 times longer to run the JOIN than it takes to run the slow query as a full table scan.
Chris Padfield
@Chris Padfield: Then you need to analyze with EXPLAIN and probably create an index.
Bill Karwin
I have updated with an actual query and EXPLAIN of that query above. Indexes already exist.
Chris Padfield
Thanks for the update above, but unfortunately the issue here is not about the LIKE query (which is just an example of a slow query). The point is I *have* to search on something that is not indexed. However, this is not a problem when doing the search on its own (in this case SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE '%word%' is FAST(ish)). That is NOT the problem. The problem only appears when the JOIN is introduced. It would seem this should be faster, because we are now only searching some of this table's rows - but it is 3x slower.
Chris Padfield
Thanks for the update; this is useful. Any chance you can run the same query, but with a users condition that matches a lot more rows (say 10,000)? I think this may be the source of the problem.
Chris Padfield
Hi Bill, really useful info, which is what I expected. Why, however, would an index on creationdate have any effect? It's the JOIN (which is already indexed) that is causing the slowdown - it took 0.63 seconds when you just did the LIKE query, but 1.72 seconds with the JOIN. I don't think there is a solution to this?
Chris Padfield
Please see the example I've added to show that an index reduces the query time to 0.01 sec.
Bill Karwin
Now THAT is an answer, Bill. Big upvotes.
Andy Lester