tags:

views: 76

answers: 5

Hello all,

I have a website that calculates a user's involvement/activity using multiple MySQL queries.

For a typical user I will ask:

How many updates have they made? How many photos have they uploaded? And so on.

These are just basic COUNT queries on the relevant tables (updates, photos, and so on), and I sum the COUNT values to get a score. Each query requires one JOIN and takes around 0.0006 seconds; with 10 queries per user, that's 0.006 seconds in all.

While not so bad for one user, I have to calculate this for 100,000 users, for a theoretical processing time of 10 minutes and some 1,000,000 DB queries. It feels like I'm approaching the problem in the wrong way and was wondering if anyone had any ideas?

I had thought about storing the user's score in their user record and incrementing it every time they carry out a particular action, but it's not as flexible (I can't go back and see how many points were accrued on a certain day, for instance).

Any help greatly appreciated!

+2  A: 

Have a user_activity link table. It takes an activity_id, user_id and a timestamp. So for instance, when a user uploads a photo, a record is created with activity_id = 2 (for 'photo upload', referenced from the activities table), the user_id and the current timestamp. That's easily queryable and eliminates the worry of long-running queries when you have zillions of users.
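
A minimal sketch of that schema, with illustrative table and column names (SQLite stands in for MySQL here, purely so the example is self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (activity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE user_activity (
    activity_id INTEGER REFERENCES activities(activity_id),
    user_id     INTEGER,
    created_at  TEXT
);
""")
conn.execute("INSERT INTO activities VALUES (1, 'update'), (2, 'photo upload')")

# Recording a photo upload for user 42:
conn.execute("INSERT INTO user_activity VALUES (2, 42, datetime('now'))")

# Scoring every user is then a single GROUP BY over one table:
rows = conn.execute(
    "SELECT user_id, COUNT(*) AS score FROM user_activity GROUP BY user_id"
).fetchall()
print(rows)  # [(42, 1)]
```

Because the timestamp is stored per action, a per-day breakdown (the flexibility the question asks for) is just an extra `GROUP BY date(created_at)`.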

karim79
A: 

Unless you want to do an 11-way join, I'd create a separate table for this purpose that you insert into after each user update.

That table should contain just the username, timestamp, section (the table it came from), and the unique id from the other table so you have a back-reference for deletes and such.

Daren Schwenke
A: 

It sounds to me as though you are trying to optimize before it is truly necessary. If you don't yet have 100,000 users, you don't need to worry about such issues.

With that being said, there's no reason not to optimize; just don't overthink the problem for a solution you don't yet need.

Although you might see slight inconsistencies, you could try caching the results of each user's counts when they log in (using memcached) and only update the cache when one of the counts changes. If users are very active, it would be more efficient to only update the cache every hour or so.
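
The cache-aside pattern described above can be sketched like this; a plain dict stands in for memcached (assumption: with a real memcached client you would use its get/set calls with a TTL instead), and the function names are illustrative:

```python
cache = {}

def get_activity_count(user_id, count_from_db):
    """Return the cached count, computing it on first access."""
    key = "activity_count:%d" % user_id
    if key not in cache:
        cache[key] = count_from_db(user_id)  # the expensive COUNT queries
    return cache[key]

def record_action(user_id):
    """Invalidate on write so the next read recomputes the count."""
    cache.pop("activity_count:%d" % user_id, None)

# Demo with a stub in place of the real database round trip:
calls = []
def count_from_db(user_id):
    calls.append(user_id)
    return 7  # pretend total from the ten COUNT queries

first = get_activity_count(42, count_from_db)
second = get_activity_count(42, count_from_db)  # cache hit, no DB call
record_action(42)                               # user acts; cache invalidated
third = get_activity_count(42, count_from_db)   # recomputed
print(first, second, third, calls)  # 7 7 7 [42, 42]
```

The hourly-refresh variant mentioned above would simply replace the invalidation with a TTL on the cache entry.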

cballou
40,000 users at the moment, so I'd be looking at 3-4 minutes or so. It's the 400,000 queries in a short space of time that had me slightly concerned!
Jim
+1  A: 

Assuming your tables are structured where each table has a user_id field of some kind, you can do something like this to get the total "activities" your users have done:

SELECT users.user_id, 
       (update_counts.update_count + photo_counts.photo_count) AS activity_count
FROM   users 
    INNER JOIN 
        (
        SELECT updates.user_id AS user_id, 
               COUNT(*) AS update_count
        FROM   updates
        GROUP BY user_id
        ) AS update_counts ON users.user_id = update_counts.user_id
    INNER JOIN 
        (
        SELECT photos.user_id AS user_id, 
               COUNT(*) AS photo_count
        FROM   photos
        GROUP BY user_id
        ) AS photo_counts ON users.user_id = photo_counts.user_id

Obviously you can add tables as you need to and you can weight things as you see fit. It should perform reasonably well if you have an index on the user_id field in each table, though it does depend on how big your tables get.
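
As a quick sanity check, the shape of that query can be exercised end to end; this sketch uses SQLite in place of MySQL, with made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (user_id INTEGER PRIMARY KEY);
CREATE TABLE updates (user_id INTEGER);
CREATE TABLE photos  (user_id INTEGER);
CREATE INDEX idx_updates_user ON updates(user_id);
CREATE INDEX idx_photos_user  ON photos(user_id);
""")
conn.execute("INSERT INTO users VALUES (1), (2)")
conn.executemany("INSERT INTO updates VALUES (?)", [(1,), (1,), (2,)])
conn.executemany("INSERT INTO photos  VALUES (?)", [(1,), (2,), (2,)])

# One pass over each table's user_id index, all users scored at once:
rows = conn.execute("""
SELECT users.user_id,
       update_counts.update_count + photo_counts.photo_count AS activity_count
FROM   users
    INNER JOIN (SELECT user_id, COUNT(*) AS update_count
                FROM updates GROUP BY user_id) AS update_counts
        ON users.user_id = update_counts.user_id
    INNER JOIN (SELECT user_id, COUNT(*) AS photo_count
                FROM photos GROUP BY user_id) AS photo_counts
        ON users.user_id = photo_counts.user_id
""").fetchall()
print(sorted(rows))  # [(1, 3), (2, 3)]
```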

Once your tables get huge, you are going to need to start caching an activity_count in a cache table. You can cache values by date if you need to, of course.

If you only need a rough estimate, you could run this query at some regular frequency (say, once every night) and cache the results; this would be less intrusive than writing triggers for every table to update a cache table.

James McNellis
Thanks, I'll do some tests, but would JOINs of this fashion be quicker than separate queries? I'm guessing the reduction in the overall number of queries helps, though.
Jim
From your original post, it sounds like you were calculating each user's score individually. Using the joins and getting everyone's result at the same time will be faster than getting each person's result one at a time, since it will only require one complete pass over each joined table's user_id index (if you have an index on user_id, it shouldn't need to touch the table itself to get the number of rows per user_id).
James McNellis
Thanks for your help, James. Ah, I see! Will this method calculate a score for each user regardless of whether they have activity or not (so a user with no activity gets a score of 0)? I've been testing it out and it seems to only return users that have activity in each table. Is it a simple matter of changing the INNER JOINs to LEFT JOINs and taking a NULL activity_count as 0?
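
That change can be sketched like this (again using SQLite and sample rows purely for illustration; COALESCE turns the NULL from an unmatched LEFT JOIN into 0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (user_id INTEGER PRIMARY KEY);
CREATE TABLE updates (user_id INTEGER);
""")
conn.execute("INSERT INTO users VALUES (1), (2)")  # user 2 has no activity
conn.execute("INSERT INTO updates VALUES (1)")

rows = conn.execute("""
SELECT users.user_id,
       COALESCE(update_counts.update_count, 0) AS activity_count
FROM   users
    LEFT JOIN (SELECT user_id, COUNT(*) AS update_count
               FROM updates GROUP BY user_id) AS update_counts
        ON users.user_id = update_counts.user_id
""").fetchall()
print(sorted(rows))  # [(1, 1), (2, 0)] -- user 2 kept, scored 0
```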
Jim
A: 

This may be overkill for your application, but you could always go the OLAP route. This would allow you to have pre-aggregated measures along multiple dimensions, such as users and time spans. This gives you a flexible framework for a variety of reporting needs. The SQL Server Analysis Services package has worked well for our company.

Jacob
Thanks, I'll take a look into that
Jim