views:

331

answers:

8

I have a website that has user ranking as a central part, but the user count has grown to over 50,000 and it is putting a strain on the server to loop through all of those to update the rank every 5 minutes. Is there a better method that can be used to easily update the ranks at least every 5 minutes? It doesn't have to be with php, it could be something that is run like a perl script or something if something like that would be able to do the job better (though I'm not sure why that would be, just leaving my options open here).

This is what I currently do to update ranks:

$get_users = mysql_query("SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
$i=0;
while ($a = mysql_fetch_array($get_users)) {
    $i++;
    mysql_query("UPDATE users SET month_rank = '$i' WHERE id = '$a[id]'");
}

UPDATE (solution):

Here is the solution code, which takes less than 1/2 of a second to execute and update all 50,000 rows (make rank the primary key as suggested by Tom Haigh).

mysql_query("TRUNCATE TABLE userRanks");
mysql_query("INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC");
mysql_query("UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id");
+1  A: 

Updating the users table row by row will be a time consuming task. It would be better if you could re-organise your query so that row by row updates are not required.

I'm not 100% sure of the syntax (as I've never used MySQL before) but here's a sample of the syntax used in MS SQL Server 2000

DECLARE @tmp TABLE
(
    [MonthRank] [INT] NOT NULL,
    [UserId] [INT] NOT NULL,
)

INSERT INTO @tmp ([UserId])
SELECT [id] 
FROM [users] 
WHERE [status] = '1' 
ORDER BY [month_score] DESC

UPDATE users 
SET month_rank = [tmp].[MonthRank]
FROM @tmp AS [tmp], [users]
WHERE [users].[Id] = [tmp].[UserId]

In MS SQL Server 2005/2008 you would probably use a CTE.

Kane
Hmm, not sure how that would be done in MySQL.
James Simpson
+3  A: 

My first question would be: why are you doing this polling-type operation every five minutes?

Surely rank changes will be in response to some event and you can localize the changes to a few rows in the database at the time when that event occurs. I'm pretty certain the entire user base of 50,000 doesn't change rankings every five minutes.

I'm assuming the "status = '1'" indicates that a user's rank has changed so, rather than setting this when the user triggers a rank change, why don't you calculate the rank at that time?

That would seem to be a better solution as the cost of re-ranking would be amortized over all the operations.

Now I may have misunderstood what you meant by ranking in which case feel free to set me straight.

paxdiablo
Status refers to active users, and there are literally hundreds of things on the site that can change the rankings. A change in one user's ranking can obviously move a lot of other user's rankings.
James Simpson
Okay, I understand now. But, if you're not willing to, or can't, change the rankings at event time, you may have to settle for a more realistic frequency for the ranking updates. Or, you could contact Jeff Atwood to see how SO does it :-)
paxdiablo
+1  A: 

Any time you have a loop of any significant size that executes queries inside, you've got a very likely antipattern. We could look at the schema and processing requirement with more info, and see if we can do the whole job without a loop.

How much time does it spend calculating the scores, compared with assigning the rankings?

le dorfier
+2  A: 

A simple alternative for bulk update might be something like:

set @rnk = 0;
update users 
set month_rank = (@rnk := @rnk + 1)
order by month_score DESC

This code uses a local variable (@rnk) that is incremented on each update. Because the update is done over the ordered list of rows, the month_rank column will be set to the incremented value for each row.

Aleris
Could you explain that code? Is this mysql, because I don't quite understand what is going on there?
James Simpson
See http://dev.mysql.com/doc/refman/5.1/en/user-variables.html
Bill Karwin
+1  A: 

Your problem can be handled in a number of ways. Honestly more details from your server may point you in a totally different direction. But doing it that way you are causing 50,000 little locks on a heavily read table. You might get better performance with a staging table and then some sort of transition. Inserts into a table no one is reading from are probably going to be better.

Consider

mysql_query("delete from month_rank_staging;");
while(bla){
  mysql_query("insert into month_rank_staging values ('$id', '$i');");
}
mysql_query("update month_rank_staging src, users set users.month_rank=src.month_rank where src.id=users.id;");

That'll cause one (bigger) lock on the table, but might improve your situation. But again, that may be way off base depending on the true source of your performance problem. You should probably look deeper at your logs, mysql config, database connections, etc.

andynu
This is a good idea that I am surprised I hadn't thought of. I think it is basically what Kane was saying, but not as easy to read since I'm not used to that syntax. I will test this out to see the results I get.
James Simpson
Please see my edits to my original question for the revised solution.
James Simpson
+1  A: 

Possibly you could use shards by time or other category. But read this carefully before...

Jet
+8  A: 

Make userRanks.rank an autoincrementing primary key. If you then insert userids into userRanks in descending rank order it will increment the rank column on every row. This should be extremely fast.

TRUNCATE TABLE userRanks;
INSERT INTO userRanks (userid) SELECT id FROM users WHERE status = '1' ORDER BY month_score DESC;
UPDATE users, userRanks SET users.month_rank = userRanks.rank WHERE users.id = userRanks.id;
Tom Haigh
Definitely the best answer here
ck
Wow, I didn't even know you could insert like that. Thanks!
James Simpson
+1  A: 

You can split up the rank processing and the updating execution. So, run through all the data and process the query. Add each update statement to a cache. When the processing is complete, run the updates. You should have the WHERE portion of the UPDATE reference a primary key set to auto_increment, as mentioned in other posts. This will prevent the updates from interfering with the performance of the processing. It will also prevent users later in the processing queue from wrongfully taking advantage of the values from the users who were processed before them (if one user's rank affects that of another). It also prevents the database from clearing out its table caches from the SELECTS your processing code does.

Nolte Burke