I have an online game where I record lots of gameplay stats. These stats tables get large very quickly, and I have to be careful because simply recording more stats can cause the game's performance to get pretty bad, once the table gets big enough.

My strategy, which isn't a very good one, is to keep the stats tables small. I have an automatic process that creates a new table every 24 hours, preventing the performance from getting too out-of-hand. But my solution is ugly and is a sort of "rotation" of stats tables. I use InnoDB and have set up a couple of indexes to improve performance, and then I just keep 30 of these tables around (each one is 24 hours, so I'm saving a month of stats). Every 24 hours, my automated process deletes the "stats30" table, then renames all the numbered tables to have a higher number, and then it creates a new, blank table called simply "stats". This is the "live" table, into which stats are actively being recorded.
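
In case it helps, the rotation boils down to roughly this (sketch only; the real column list and indexes are omitted):

    DROP TABLE stats30;
    RENAME TABLE stats29 TO stats30;
    -- ...and so on down through the numbered tables...
    RENAME TABLE stats1 TO stats2;
    RENAME TABLE stats TO stats1;
    -- Entries arriving between the rename above and the CREATE below have
    -- nowhere to go, which is where the occasional lost inserts come from.
    CREATE TABLE stats LIKE stats1;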

These tables basically record every transaction between every player and every other player in the game that they interact with, thus a combinatorial explosion of data. When a new transaction occurs, it checks to see if there is already a row for the transactions between these two players during that day. If there is, it updates the row with changes to their transactions. Otherwise, it creates a new row. A pair of players who interact 1000 times in a day and a pair who interact just a single time will both have just a single row in the table for that day. Each action on the database involves a SELECT and then either an UPDATE or an INSERT, so it is pretty even between reads and writes as currently designed. Reading the data in a larger sense, i.e. for analysis of stats across multiple players, is done very infrequently relative to the single SELECTs, UPDATEs and INSERTs. There are about 150,000 rows created per day.
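
Per transaction, the logic is roughly this (the column names here are just placeholders, not my real schema):

    -- One row per interacting pair per day.
    SELECT total, tx_count
      FROM stats
     WHERE player_a = 42 AND player_b = 99;

    -- If a row came back, fold the new transaction into it...
    UPDATE stats
       SET total = total + 10,
           tx_count = tx_count + 1
     WHERE player_a = 42 AND player_b = 99;

    -- ...otherwise create the row for this pair.
    INSERT INTO stats (player_a, player_b, total, tx_count)
    VALUES (42, 99, 10, 1);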

I know this could be better. I can't easily reduce the amount of data I'm recording, but I'm concerned about 1. performance and 2. simplicity. I could increase performance even more by having a new table created every 4 hours, for example, but then I have to mess with 180 tables. Conversely, I could make it simpler by just using one table, and then everything comes to a screeching halt.

Note that I do need to update rows in these tables, so I can't use something like the ARCHIVE storage engine, but I only need to INSERT or UPDATE on the "live" stats table.

There is also the minor issue that when the daily rotation process occurs, any queries coming in at that moment may be lost. (If it is in the process of renaming all tables and creating a new one, new entries may fail.) Losing a few inserts is not a big problem, but a solution where this error can't occur, or where the rotation happens "atomically", would be better.

Thanks for any ideas that might help! :)

A: 

I'm no database guru, but have you tried using a different db engine (InnoDB is slow, try MyISAM) and making sure the data you are selecting is indexed?

This may seem like a silly question, but that solution you have is pretty crazy.

also try this: http://forums.mysql.com/read.php?24,92131,92131 for performance tuning.

Steve
Since he chose InnoDB I assume he's looking to have the ACID compliance that comes with it.
Ben S
I've added another paragraph explaining the issue better. I definitely need something with row-level locking.
ZenBlender
A: 

The first thing to do is to profile your application to find what is actually taking the most time. There is no use continuing with this table rotation if another change would give better performance.

Run EXPLAIN on all your queries and make sure you understand the results and optimize your queries appropriately.
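
For example, something like this (the query shape and column names are purely illustrative) shows whether the per-pair lookup can use an index:

    EXPLAIN SELECT *
              FROM stats
             WHERE player_a = 42 AND player_b = 99;
    -- A "type" of ALL in the output means a full table scan.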

I also highly recommend you read the Reference Manual for optimizing MySQL. It gives many tips on how to configure the server for optimal performance and may give you insight into what is causing your performance bottlenecks.

If the performance of the live table really is the issue and you need to reduce the number of rows in that table, then simply move the rows to an archive table. This way you always have a slim live table and can easily run longer-running queries on the archive.
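
A minimal sketch of such a move, assuming a stats_archive table with the same structure and a date column (the names here are assumptions):

    START TRANSACTION;

    -- Copy everything older than today into the archive...
    INSERT INTO stats_archive
    SELECT * FROM stats WHERE stat_day < CURDATE();

    -- ...then remove it from the live table.
    DELETE FROM stats WHERE stat_day < CURDATE();

    COMMIT;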

Ben S
A: 

How many rows are we talking about here? Millions? Hundreds of thousands?

Since you said you only need to update rows in the live table, it sounds like you could probably do this with just two tables - a live stats table and a stats_archive table. Your batch job would then just move rows from the live stats table to the stats_archive table periodically. Another job could clean out rows from the archive table after they get to a certain age. It would also be a good idea to have a third job that recalculates statistics on the archive table periodically to make sure MySQL can generate good plans for your queries.
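
Roughly, the age-based cleanup and the statistics refresh could look like this (table name, date column and retention period are assumptions):

    -- Purge archived rows older than 30 days...
    DELETE FROM stats_archive
     WHERE stat_day < CURDATE() - INTERVAL 30 DAY;

    -- ...and refresh index statistics so the optimizer keeps choosing good plans.
    ANALYZE TABLE stats_archive;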

Eric Petroelje
Good idea! I've added more background info to the post. It's in the hundreds-of-thousands per day. I may try a dual-table approach...
ZenBlender
A: 

I would like to know more about your domain to give a more precise answer. But the short answer is that you would need some kind of partitioning, based on month, year, or geography, so that, for example, all the stats of users belonging to a particular city go into their own tables. In this way your selects can become faster.
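
As a sketch, date-based partitioning within MySQL itself (built-in table partitioning, available from 5.1) could look something like this, using an illustrative schema rather than your actual one:

    CREATE TABLE stats (
        stat_day DATE NOT NULL,
        player_a INT  NOT NULL,
        player_b INT  NOT NULL,
        total    INT  NOT NULL,
        PRIMARY KEY (stat_day, player_a, player_b)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(stat_day)) (
        PARTITION p200903 VALUES LESS THAN (TO_DAYS('2009-04-01')),
        PARTITION p200904 VALUES LESS THAN (TO_DAYS('2009-05-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- Dropping an old month is then a single cheap operation:
    ALTER TABLE stats DROP PARTITION p200903;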

And of course, the usual yada yada about adding indexes...

Tell me more, and I may be able to help you...

CodeToGlory
+2  A: 

With 150k rows per day, what's the avg. size of a row? Do these rows contain redundant data that you might be able to minimize by just keeping references?
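
Purely as a hypothetical illustration: if each stats row currently repeats descriptive player data, keeping only a numeric reference can shrink the rows considerably:

    -- Descriptive attributes live once per player...
    CREATE TABLE players (
        player_id INT         NOT NULL PRIMARY KEY,
        name      VARCHAR(64) NOT NULL
    ) ENGINE=InnoDB;

    -- ...and the stats rows carry only player_id values (a few bytes each)
    -- instead of repeating names or other attributes on every row.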

In general it's always a good thing to keep tables small so that index updates go through fast. Also, as mentioned by Ben S above, your queries should be optimized at least so that there's no access on columns that are missing an index, etc. You can find some possible issues using EXPLAIN and in the slow query log of your MySQL server if you've got it enabled.

One thing that might help you with performance issues is the memcached daemon. Using it you can delay writes to your DB, thereby taking some of the load off, while still not suffering from dirty caches and the like. Though depending on the application framework you're using (if any at all), it requires some work to implement this in your application.

For archiving and statistics purposes I suggest you take a look at InfoBright (http://www.infobright.org/). It's an open source MySQL drop-in replacement (based on MySQL). Its designated purpose is to be a data warehouse store, but you can use it for all sorts of high-volume data analysis. It has a very nice compression feature that in our case reduces ~23TB of raw data to about 1.2TB of compressed data. I guess it's needless to say that querying specific rows of compressed data can/will be quite slow. But for statistics it's pretty darn fast. So if you're not querying specific rows but instead analyzing things like "how many rows got updated with value foo > bar between Dec 08 and Feb 09", it will give you very good performance. In fact, as you use the DB it will analyze your usage and create a knowledge grid that will optimize its data for your specific queries.
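
To illustrate, that kind of question is just an aggregate query over the whole data set (column and table names here are assumed):

    SELECT COUNT(*)
      FROM stats_archive
     WHERE foo > bar
       AND stat_day BETWEEN '2008-12-01' AND '2009-02-28';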

The next question that would come to my mind is... if it's "only" stats/session data that you're keeping for a day, or some hours, is a relational DB the right tool for the job? Without knowing the exact nature of your application, I could imagine some sort of in-memory sessions (that could reside in a Terracotta cluster, for instance) that write a transaction log and commit their data every so often might be better suited. But as I said, that depends heavily on your application's nature and the amount of data in question.

lloesche