I'm trying to offer a feature where I can show the pages most viewed by friends. My friendships table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 page IDs most viewed by a person's friends.

Here's the query as I have it now:

SELECT page_id 
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id 
WHERE (`friendships`.`creator_id` = 143416) 
GROUP BY page_id 
ORDER BY count(views.user_id) desc 
LIMIT 20

And here's how an explain looks:

+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table       | type | possible_keys                           | key                             | key_len | ref                                     | rows | Extra                                        |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | friendships | ref  | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4       | const                                   |  271 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | views       | ref  | PRIMARY                                 | PRIMARY                         | 4       | friendships.receiver_id                 |   11 | Using index                                  | 
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+

The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).

If I run this query without the group by and limit, there are about 25,000 rows for this particular user - which is typical.

On the most recent real run, this query took 7 seconds to execute, which is way too long for a decent response in a web app.

One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.
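
For the record, the index change I'm considering would look something like this (the new index name is just a placeholder):

ALTER TABLE friendships
  DROP INDEX index_friendships_on_creator_id,
  ADD INDEX index_friendships_on_creator_receiver (creator_id, receiver_id);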

Can you see any way the query can be rewritten to make it lightning fast?

Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the db, but do it in Ruby afterwards. The overall time is much shorter - by about 80% it seems. Perhaps my early testing was flawed - but this definitely warrants more investigation. If it's true - then wtf is MySQL doing?

+1  A: 

As far as I know, the best way to make a query like that "lightning fast" is to create a summary table that tracks friend page views per page per creator.

You would probably want to keep it up-to-date with triggers. Then your aggregation is already done for you, and it is a simple query to get the most viewed pages. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed.
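
Just to illustrate, here's a rough sketch of what that might look like - the table name, column names, and trigger name below are made up, and it assumes views gets one row per (user_id, page_id):

CREATE TABLE friend_page_views (
  creator_id INT NOT NULL,
  page_id    INT NOT NULL,
  view_count INT NOT NULL DEFAULT 0,
  PRIMARY KEY (creator_id, page_id),
  KEY idx_creator_count (creator_id, view_count)
);

-- Bump the counter for every friend of the viewer whenever a view is recorded.
DELIMITER //
CREATE TRIGGER views_after_insert AFTER INSERT ON views
FOR EACH ROW
BEGIN
  INSERT INTO friend_page_views (creator_id, page_id, view_count)
  SELECT f.creator_id, NEW.page_id, 1
  FROM friendships f
  WHERE f.receiver_id = NEW.user_id
  ON DUPLICATE KEY UPDATE view_count = view_count + 1;
END//
DELIMITER ;

-- The read side is then a plain lookup: no join, no GROUP BY, and the
-- (creator_id, view_count) index lets it avoid a sort.
SELECT page_id
FROM friend_page_views
WHERE creator_id = 143416
ORDER BY view_count DESC
LIMIT 20;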

Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up-front, when the updates occur (infrequent) and then the queries (frequent) don't have to do any work.

If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that viewers don't take the hit of keeping the summary table up to date as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).

nathan
Thanks for the suggestion Nathan. Writes are pretty frequent - about 100 per read. There may be some ways I can track only a limited set though. I'll give it some thought.
Tim Haines
Are the writes happening from multiple threads? If so, I would definitely consider aggregating them in memory, and having a single thread/connection perform all the writes. Then you won't have many connections contending, writing to the same tables.
nathan
A: 

Hi Tim,

You should absolutely look into denormalizing this table. If you create a separate table that maintains the user IDs and the exact counts for every page they viewed, your query should become a lot simpler.

You can easily maintain this table with a trigger on your views table that updates the 'views_summary' table whenever an insert happens on the 'views' table.
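
Roughly something along these lines - the column names and trigger name are just placeholders, and it assumes a view can be recorded more than once per user per page:

CREATE TABLE views_summary (
  user_id    INT NOT NULL,
  page_id    INT NOT NULL,
  view_count INT NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, page_id)
);

DELIMITER //
CREATE TRIGGER views_summary_sync AFTER INSERT ON views
FOR EACH ROW
BEGIN
  -- Insert the pair on first sight, otherwise just increment the count.
  INSERT INTO views_summary (user_id, page_id, view_count)
  VALUES (NEW.user_id, NEW.page_id, 1)
  ON DUPLICATE KEY UPDATE view_count = view_count + 1;
END//
DELIMITER ;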

You might even be able to denormalize this further by looking at the actual relationships, or just maintain the top X pages per person.

Hope this helps,

Evert

Evert
Hi Evert, thanks for your answer. It's a little trickier than maintaining a view count for each page per user, because I'm counting the number of friends that have viewed the page - not the total number of users.
Tim Haines
Yeah, I realize this is not the final solution, but it will already simplify your query greatly. At that point you can maintain cached lists of your users and their top pages and sort within your frontend layer without touching your database at all.
Evert
A: 

Your indexes look correct, although if friendships has very big rows you might want the index on (creator_id, receiver_id) to avoid reading the full rows.

However, something's not right here - why are you doing a filesort for 271 rows? Make sure that your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size. That should make the GROUP BY faster.

sort_buffer should also have a sane value.
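
For example (the actual server variable is sort_buffer_size; the numbers below are only illustrative, not recommendations):

SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW VARIABLES LIKE 'sort_buffer_size';

-- In-memory temp tables are capped at min(tmp_table_size, max_heap_table_size),
-- so raise the two together. GLOBAL changes only affect new connections.
SET GLOBAL tmp_table_size       = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size  = 64 * 1024 * 1024;
SET SESSION sort_buffer_size    = 4 * 1024 * 1024;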

tmp_table_size is 33 MB, max_heap_table_size is 16 MB, sort_buffer is 2 MB. I agree something's not right. Ruby handles the group and sort much faster than MySQL at the moment.
Tim Haines
So that's an effective tmp_table_size of 16 MB (it's min(tmp_table_size, max_heap_table_size)), which is enough for 25K integers. Sending all that data to Ruby can't be good - have you tried it with only the GROUP BY?
An integer takes 640 bytes? Don't you mean 250k integers? I'll try increasing the tmp_table_size significantly and see if it has an impact.
Tim Haines
Read 'enough' as 'sufficient'. An integer should only take 4-8 bytes.