Inspiration needed: Selecting large amounts of data for a highscore.

views:

answers:

Inspiration needed: Selecting large amounts of data for a highscore.

I need some inspiration for a solution...

We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.

The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.

After the values are calculated we need to rank the users, and write the data to a highscore table.

My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.

Now we could just use brute force - load the 30 mio records every 30 minutes, calculate the values and rank them, and write them in to the database, but I'm worried about the strain this will cause on the database, the application server and the network - and if it's even possible.
I'm thinking the solution to this might be to break up the problem some how, but I can't see how. So I'm seeking for some inspiration on possible alternative solutions based on this information:

We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load data and generate the highscore 25.000 users which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data

Any suggestions, links to good articles about similar issues are welcome.

How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).

It sounds pretty straight forward... unless I'm missing your point... let me know.

Nestor 2009-11-05 20:10:21

The problem is that we need these millions of rows to calculate the value/score. This calculation needs to take place in code. After the values are calculated and ranked then the scores are written to the database.It's not a problem storing the finished scores or querying them - the problem is that to calculate them we need these many millions of rows transferred from SQL to the client.

Micael 2009-11-05 20:14:58

Is there any way to revise your scoring algorithm so that you don't need to query millions of rows for each score?

Paul McMillan 2009-11-05 20:50:18

We might be able to cut a little down on the 30 mio, maybe by 5-10%, but to narrow it down to something more reasonable like 1-2 mio, no I don't think that's possible. Not with anything resembling the current structure at least. I'm not sure what an alternative structure could be.

Micael 2009-11-05 20:58:45

You need to move the logic of calculating those scores to the server (in the form of a store procedure, for instance). If your logic is programmed in c#, you can embed it in sql 2005/2008 using the SQLCLR.

Nestor 2009-11-05 21:48:49

SQLCLR is idd awesome for this kind of stuff!! good point!

Sander Versluys 2009-11-05 23:41:54

Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?

Paul McMillan 2009-11-05 20:10:51

We've previously done the ranking in SQL, but found that performancewise this was not satisfactory. I should also note that one of the pieces of data we show is the rank of a previous turn - meaning that we actually need to rank the team twice in order to show a single highscore row. I would be very interested in hearing if anyone else have had experiences ranking large sets of data - maybe we just weren't doing it right?

Micael 2009-11-05 20:20:59

There is no way any hand-rolled code is going to beat the performance of your SQL server for indexing and sorting scores. I'm still confused about why you can't store the previous rank in the database as well and retrieve it for your highscore.

Paul McMillan 2009-11-05 20:49:28

+1 A:

Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.

rmeador 2009-11-05 20:15:08

Good point about calculating the value on inserts/updates - I will try to investigate this. I would be very happy to get rid of the Rank column and generate it on runtime, but I'm a little pessimistic about performance. It should also be noted that the game is divided into rounds, with a highscore for each round - also that we are running around 8 of these games every year, so the highscore table could contain around 500.000 x 8 games x 20 rounds = 80 mio records per year (we would probably not keep data more than 1-2 years, or at least truncate it somehow).

Micael 2009-11-05 20:28:05

Your comment leads me to believe you have missed an essential point of my answer. You will be getting rid of the rank column, as you say, but you will not be "generating it on runtime". You do not generate a rank at all. If you have some sort of absolute score stored in the database, the database is capable of sorting on that column, so you can retrieve the players in order of score whenever you want. "Rank" becomes entirely implicit in the query you use to retrieve the information.

rmeador 2009-11-05 20:50:48

But sorting and giving each column a number (i guess you imply) is not the same as ranking - two players with the same value will need to have the same rank.

Micael 2009-11-06 07:19:51

So as you iterate over the result set to print the high scores, don't increment the "rank" counter if the next value is the same as the current value :)

rmeador 2009-11-06 16:13:08

+1 A:

First and formost:

The computation has to take place somewhere.
User experience impact should be as low as possible.

One possible solution is:

Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.

Results are still going to take a while, but at least performance won't be impacted as much.

Jeremy Roberts 2009-11-05 20:20:48

+1 A:

Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.

In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.

2009-11-05 20:29:53

Good point - most of the data are indeed not changing or at least only being added. However this is not always the case. I rare cases there will be corrections to earlier data. So at least there would have to be some mechanism for invalidating the data, though the invalidation scheme could be quite complex.

Micael 2009-11-05 20:38:19

It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.

Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.

coreyward 2009-11-05 20:48:50

Seems to me this is clearly a job for chacheing, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.

Sorting the local score records should be trivial. (They are nearly in order to begin with.)

If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.

Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.

Mike Dunlavey 2009-11-05 23:36:38

ansaurus

tags:

views:

answers:

Inspiration needed: Selecting large amounts of data for a highscore.

related questions