I'm trying to offer a feature where I can show pages most viewed by friends. My friends table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 most viewed page id's by a person's friend.
Here's the query as I have it now:
SELECT page_id
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id
WHERE (`friendships`.`creator_id` = 143416)
GROUP BY page_id
ORDER BY count(views.user_id) desc
LIMIT 20
And here's how an explain looks:
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | friendships | ref | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4 | const | 271 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | views | ref | PRIMARY | PRIMARY | 4 | friendships.receiver_id | 11 | Using index |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).
If I run this query without the group by and limit, there's about 25,000 rows for this particular user - which is typical.
On the most recent real run, this query took 7 seconds too execute, which is way too long for a decent response in a web app.
One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.
Can you see any way the query can be rewritten to make it lightening fast?
Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the db, but do it in ruby afterwards. The overall time is much shorter - by about 80% it seems. Perhaps my early testing was flawed - but this definitely warrants more investigation. If it's true - then wtf is Mysql doing?