views:

112

answers:

2

[Warning: long post ahead!]

I've banging my head at this for quite some time now but can't get on a common denominator what is going on. I've found a solution to workaround, see at the end, but my inner Zen is not satisfied yet.

I've a main table with forum messages (it's from Phorum), simplified looks like this (ignore the anon_user_id for the moment, I will get later to it):

CREATE TABLE `test_msg` (
  `message_id` int(10) unsigned NOT NULL auto_increment,
  `status` tinyint(4) NOT NULL default '2',
  `user_id` int(10) unsigned NOT NULL default '0',
  `datestamp` int(10) unsigned NOT NULL default '0',
  `anon_user_id` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`message_id`)
);

Messages can be anonymized by the software, in which case the user_id is set to 0. The software also allows posting complete anonymous messages which we endorse. In our case we need to still know which user posted a message, so through the hook system provided by Phorum we have a second table we update accordingly:

CREATE TABLE `test_anon` (
  `message_id` bigint(20) unsigned NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL,
  KEY `fk_user_id` (`user_id`),
  KEY `fk_message_id` (`message_id`)
);

For the view in the profile, I need to get the a list of messages from a user, no matter whether they've have been anonmized by him or not.

A user itself has always the right to see the message he wrote anonymously or later anonymized.

Because user_id gets set to 0 if anonymized, we can't simply use WHERE for it; we need to join with our own second table. Formulate the above into SQL looks like this (the status = 2 is required, other states would mean the post is hidden, pending approval, etc.):

SELECT * FROM  test_msg AS m
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (test_anon.user_id = 20 OR m.user_id = 20)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10

This query by itself, whenever the query cache is empty, takes a few second, something along 4 seconds currently. Things get worse when multiple users issue the query and the query cache is empty (which just happens; people post messages and the cached queries are invalid); we faced in our internal testing phase and reports were that the system sometimes slows down. We've seen queries taking 30 to 60 seconds because of the concurrent-ness. I don't want to start imagine what happens when we expand our user base ...

Now it's not like I didn't to any analysis about the bottleneck.

I tried rewriting the WHERE clause, adding indice and deleting them like hell.

This is when I found out that when I do not use any index, the query performs lighting fast under certain conditions. Using no index, the query looks like:

SELECT * FROM  test_msg AS m USE INDEX()
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (test_anon.user_id = 20 OR m.user_id = 20)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10

Now here comes the certain condition: the LIMIT limits the result to 10 rows. Assume my complete result n = 26. Using a LIMIT 0,10 to LIMIT 16,0 takes zero seconds (something along < 0.01s): these are the cases were the result is always 10 rows.

Starting with LIMIT 17,10 , the result will be only 9 rows. Starting at this point, the query starts taking around four seconds again. The is applicable for all results where the result set is smaller then the number of maximum rows limited through LIMIT. Irritating!

Going back to the first CREATE TABLE statement, I also conducted tests without the LEFT JOIN; we just assume user_id=0 and anon_user_id=<the previous user_id> for anonymized messages, in other words, completely bypassing the second table:

SELECT * FROM test_msg
WHERE status = 2 AND (user_id = 20 OR anon_user_id = 20)
ORDER BY m.datestamp DESC
LIMIT 20,10

Result: it did not matter. The performance is still within 4 or 5 seconds; forcing to not use an index with USE INDEX() does not speed up this query.

This is were I really got puzzled now. Index will always only be used for the status column, the OR prevents other indices from being used, this is also what the MySQL documentation told me in this regard.

An alternate solution I tried: do not use the test_anon table to only relate to anonymized messages, but simply to all messages. This allows me to write a query like this:

SELECT * FROM test_msg AS m, test_anon AS t
WHERE m.message_id = t.message_id
AND t.user_id = 20
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 20,10

This query always gave me instant results (== < 0.01 seconds), no matter what LIMIT, etc.

Yes, I've found a solution. I've not yet rewritten the whole application to the model though.

But I'ld like to better understand what the rational is behind my observed behavior (especially forcing no index speeding up queries). On paper nothing looked wrong with the original approach.

Some numbers (they aren't that big anyway):

  • ~ one million messages
  • message table data size is around 600MB
  • message table index size is around 350MB
  • number of anonymized messages in test_anon < 3% of all messages
  • number of messages from registered users < 25% of all messages

All tables are MyISAM; I tried with InnnoDB but performance was much more worse.

+1  A: 

The problem is that you're doing a join for the entire table. You need to tell the optimizer that you only need to join for two user ID's: zero and your desired user ID. Like this:

SELECT * FROM test_msg AS m
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (m.user_id = 20 OR m.user_id = 0)
AND (test_anon.user_id = 20 OR test_anon.user_id IS NULL)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10

Does this work better?

Justin Grant
I just later realized that it doesn't quite work: for my logic to properly work, the `AND (test_anon.user_id = 20 OR test_anon.user_id IS NULL)` must be a `AND (test_anon.user_id = 20 OR test_msg.user_id = 20)`, but then the performance issues with partial results (= less then limit) are back :(Trying to figure out if I can use your query as a basis.
mark
The key to good performance is restricting the query to return as few rows as possible from each table, before the join happens. Luckily, there are only two values from each table which are valid for this query: zero and 20 for the "main" table, and NULL and 20 for the "anon" table. The result is four theoretically possible cases (2x2). Your WHERE clause should restrict values to those 4 choices only. And you don't even need to screen out the two illegal choices (e.g. test_anon.user_id = 20 and m.user_id = 20) since your app won't allow those anyways.
Justin Grant
+1  A: 

You in fact have two different queries here which are better processed as separate queries.

To improve the LIMIT, you need to use LIMIT on LIMIT technique:

SELECT  *
FROM    (
        SELECT  *
        FROM    test_msg AS m
        WHERE   m.user_id = 20
                AND m.status = 2
        ORDER BY
                m.datestamp DESC
        LIMIT 20
        ) q1
UNION ALL
SELECT  *
        (
        SELECT  m.*
        FROM    test_msg m
        JOIN    test_anon a
        ON      a.message_id = m.message_id
        WHERE   a.user_id = 20
                AND m.user_id = 0
                AND m.status = 2
        ORDER BY
                m.datestamp DESC
        LIMIT 20
        ) q2
ORDER BY
        datestamp DESC
LIMIT 20

See this entry in my blog for more detail on this solution:

You need to create two composite indexes for this to work fast:

test_msg (status, user_id, datestamp)
test_msg (status, user_id, message_id, datestamp)

Then you need to choose what the index will be used for in the second query: ordering or filtering.

In your query, the index cannot be used for both, since you're filtering on a range on message_id.

See this article for more explainations:

In a couple of words:

  • If there are lots of anonymous messages from this user, i. e. there is high probability that the message will be found somewhere in the beginning of the index, then the index should be used for sorting. Use the first index.
  • If there are few anonymous messages from this user, i. e. there is low probability that the message will be found somewhere in the beginning of the index, then the index should be used for filtering. Use the second index.

If there is a possibility to redesign the tables, just add another column is_anonymous to the table test_msg.

It will solve lots of problems.

Quassnoi
I applied your query and it seems to work quite well. I'm not sure how I can use the inner `LIMIT` properly. I need to do pagination on the `UNION` result, this way I don't know how many rows to fetch from the individual results. For now I just left them out, still < 0.01s which is great!
mark
If you want to do, say, `LIMIT 100, 20` in the outer query, use the maximal possible `LIMIT 120` in each of the inner queries, and apply `LIMIT 100, 20` to the outer one.
Quassnoi
Lol, even simple math doesn't work anymore when I look at the problem for too long .. thanks :-)
mark
I've had it running for a few days now and it seems to work really well, thanks!
mark