views: 340
answers: 9

What do I have to do to make 20k MySQL inserts per second possible during peak hours (around 1k/sec during slower times)? I've been doing some research and I've seen suggestions like "INSERT DELAYED", writing to a flat file with "fopen(file,'a')" and then running a cron job to dump the "needed" data into MySQL, etc. I've also heard you need multiple servers and "load balancers", which I've never heard of, to make something like this work. I've also been looking at these "cloud server" thing-a-ma-jigs and their automatic scalability, but I'm not sure what's actually scalable.

The application is just a tracker script, so if I have 100 websites that get 3 million page loads a day, there will be around 300 million inserts a day. The data will be run through a script every 15-30 minutes that will normalize it and insert it into another MySQL table.

How do the big dogs do it? How do the little dogs do it? I can't afford a huge server anymore, so if there are multiple ways of going at it, any intuitive ways you smart people can think of... please let me know :)

+1  A: 

That's impressive. Most of my data has come from massive bulk inserts at once. One thing I find is that bulk inserts do a lot better than individual inserts. Also, the design of your tables, indexes, etc. has a lot to do with insert speed. The problem with using cron and bulk inserting is the edge cases (what happens right when it goes to do the inserts).

Additionally, with flat files you can easily run into concurrency issues when writing the inserts to the file. If you are writing 1k+ inserts a second, you'll quickly run into lots of conflicts and data loss whenever the file writing has issues.

monksy
Well, I need to write them individually into any medium that will take them. Then I need to take that data, normalize it, and put it into a very small and tidy MySQL table.
John
+4  A: 

How do the big dogs do it?

Multiple servers. Load balancing.

How do the little dogs do it?

Multiple servers. Load balancing.

You really want to save up inserts and push them to the database in bulk. 20k individual inserts a second is a crapton of overhead, and simplifying that down to one big insert each second eliminates most of that.
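
As a rough sketch of what that "one big insert" could look like in PHP with PDO (the table and column names here are just placeholders, and the buffering mechanism itself is left out):

    <?php
    // Sketch of "one big insert": buffer rows somewhere, then flush them in a
    // single multi-row INSERT. Table and column names are invented.
    $pdo = new PDO('mysql:host=localhost;dbname=tracker', 'user', 'pass');

    // $buffer would be filled elsewhere, e.g. rows collected over the last second.
    $buffer = array(
        array(1, '/index.html', '1.2.3.4', '2010-01-25 12:00:00'),
        array(1, '/about.html', '5.6.7.8', '2010-01-25 12:00:01'),
    );

    $placeholders = implode(',', array_fill(0, count($buffer), '(?, ?, ?, ?)'));
    $sql = "INSERT INTO raw_hits (site_id, url, ip, hit_time) VALUES $placeholders";

    $params = array();
    foreach ($buffer as $row) {
        foreach ($row as $value) {
            $params[] = $value;
        }
    }

    // One round trip instead of count($buffer) separate INSERT statements.
    $pdo->prepare($sql)->execute($params);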

Anon.
I guess the question is, "How do I save them up?".
John
A: 

This is not a problem you can handle in PHP alone.

If you have 20,000 requests a second hitting your "low-budget" server (as I understood from the undertone of your question), then it will reach its limit before most of them even reach the PHP processor (and, eventually, MySQL).

If you have a traffic tracker script, you'll very likely cause problems for all the sites you track too.

Joel L
I guess that's another question then. Will the server be able to handle loading up the PHP that makes this happen 20,000 times per second? :(
John
Cache, cache, cache. And if you can, put the database on one server and the PHP on the other. You might just have to pony up the money to throw bigger hardware at it, in the hope that you'll make enough as it grows to cover the cost.
Tyler Smith
+5  A: 

A couple of ways:

Firstly, you will reach a point where you need to partition or shard your data to split it across multiple servers. This could be as simple as A-C on server1, D-F on server2 and so on.

Secondly, defer writing to the database. Instead write to a fast memory store using either beanstalkd or memcached directly. Have another process collect those states and write aggregated data to the database. Periodically amalgamate those records into summary data.
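
A minimal sketch of the memcached write path, assuming a made-up key scheme (the hard part, knowing which keys exist when the collector runs, is only noted in a comment):

    <?php
    // Tracker write path sketch: count hits in memcache instead of inserting rows.
    // increment() fails if the key doesn't exist yet, hence the add()/increment()
    // fallback. A real version also needs to record which keys were created so
    // the collector job can find them later.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $siteId = 1;                       // example values; normally taken from the request
    $url    = '/index.html';
    $ip     = $_SERVER['REMOTE_ADDR'];

    $key = 'hits:' . md5($siteId . '|' . $url . '|' . $ip);

    if ($mc->increment($key) === false) {
        // First hit for this key (or a race): add() is atomic, so only one request
        // wins; the loser simply increments the freshly created counter.
        if (!$mc->add($key, 1)) {
            $mc->increment($key);
        }
    }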

cletus
The 20k inserts per second are only temporary data. Once it's collected, I will run a script at 15-30 minute intervals that takes all the data and normalizes it. For example, if the same IP visits the same webpage 100 times, there will be 100 rows of data in the temporary table, and the normalized table will just update one row to reflect the 100 extra visits.
John
A: 

PHP is not well-suited to high volume web traffic, IMHO. However, the database will likely bog you down before PHP performance does, especially with PHP's connection model (it opens a new connection for every request).

I have two suggestions for you:

  1. Give SQL Relay a look: http://sqlrelay.sourceforge.net/
  2. Check out some PHP accelerators: http://en.wikipedia.org/wiki/List%5Fof%5FPHP%5Faccelerators

SQL Relay effectively allows PHP to take advantage of connection pooling, and that will give much better performance for a high volume database application.
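
SQL Relay's own client API isn't shown here; purely to illustrate the connection-reuse idea in plain PHP, PDO's built-in persistent connections look like this:

    <?php
    // Not SQL Relay itself, just PDO's persistent connections, which keep the
    // MySQL connection open across requests within the same PHP worker instead
    // of reconnecting every time.
    $pdo = new PDO(
        'mysql:host=localhost;dbname=tracker',
        'user',
        'pass',
        array(PDO::ATTR_PERSISTENT => true)
    );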

PHP accelerators (generally speaking) cache the PHP opcodes, which saves the overhead of interpreting the PHP code on every request.

Good Luck!

jckdnk111
The person could at least have commented on why they thought this was a bad suggestion; it looks like great information, thanks!
John
It's probably the "PHP is not well-suited to high volume web traffic", which sites like Facebook disprove.
ceejayoz
To be fair, I did state it was only my opinion. I should have been more specific though. PHP is not as efficient as some other languages and therefore is not well-suited for high volume web traffic (not that it can't or won't work; it's just not the best tool for the job in my opinion). http://slashdot.org/story/09/12/20/1433257/The-Environmental-Impact-of-PHP-Compared-To-C-On-Facebook
jckdnk111
A: 

Writing to a file is great, but you still need to synchronize your file writes, which puts you back at square one.
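
If you do go the flat-file route, PHP can at least serialize the appends with an exclusive lock; a minimal sketch (the path and field layout are made up), which a cron job could later feed to something like LOAD DATA INFILE:

    <?php
    // Append one tracking line per request. LOCK_EX makes concurrent appends
    // wait for each other instead of interleaving partial writes. The file path
    // and field layout are invented for the example.
    $siteId = 1;
    $url    = '/index.html';
    $ip     = $_SERVER['REMOTE_ADDR'];

    $line = implode("\t", array($siteId, $url, $ip, time())) . "\n";
    file_put_contents('/var/spool/tracker/hits.log', $line, FILE_APPEND | LOCK_EX);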

Suggestions:

  • An MQ system, though sometimes the DB can be faster,
  • On the MQ idea: an in-memory queue. I know you said PHP, but I've seen this done quite well in Java/Servlets,
  • Depending on what it is you're tracking, you can deploy a static file into a CDN (the cloud thing you talked about) and aggregate the access logs in batch. That lets you rent the scaling out,
  • INSERT DELAYED is a good idea, but I don't know what the backlog/queue size is for that in MySQL (anyone?). See the syntax sketch below.
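
For reference, INSERT DELAYED is just a keyword on the statement; note it only works on MyISAM, MEMORY and ARCHIVE tables, and was deprecated in later MySQL versions, so check your server before relying on it. Table and values here are placeholders:

    <?php
    // INSERT DELAYED sketch: MySQL queues the row server-side and returns to
    // the client immediately. Values are hard-coded literals only to keep the
    // example short; real code should bind or escape them.
    $pdo = new PDO('mysql:host=localhost;dbname=tracker', 'user', 'pass');
    $pdo->exec("INSERT DELAYED INTO raw_hits (site_id, url, ip, hit_time)
                VALUES (1, '/index.html', '1.2.3.4', NOW())");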
Xepoch
Thanks for the comment. I'll look into this.
John
+1  A: 

I'd recommend memcaching, too.

Write your data into a memcache and have a periodically running job aggregate it and do the inserts.

Writing to an actual file would probably DECREASE your performance, since file system access is mostly slower than talking to a database, which can handle write access much more efficiently.
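
A sketch of what that periodically running job could look like, assuming the tracker counts hits in memcache under keys like the ones below and also keeps track of which keys it created (that bookkeeping is simply hard-coded here):

    <?php
    // Periodic (cron) collector sketch: read counters from memcache and upsert
    // them into the normalized table. Assumes a unique key on (site_id, url, ip).
    // The key list is hard-coded here; a real tracker would record the keys it
    // creates so the collector can find them.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $pdo = new PDO('mysql:host=localhost;dbname=tracker', 'user', 'pass');
    $upsert = $pdo->prepare(
        'INSERT INTO page_stats (site_id, url, ip, visits) VALUES (?, ?, ?, ?)
         ON DUPLICATE KEY UPDATE visits = visits + VALUES(visits)'
    );

    $keyList = array(
        'hits:' . md5('1|/index.html|1.2.3.4') => array(1, '/index.html', '1.2.3.4'),
    );

    foreach ($keyList as $key => $meta) {
        $count = $mc->get($key);
        if ($count === false || (int) $count === 0) {
            continue;
        }
        $upsert->execute(array($meta[0], $meta[1], $meta[2], (int) $count));
        $mc->decrement($key, (int) $count);   // subtract what was flushed, keep any new hits
    }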

Techpriester
A: 

Since you're tracking impressions, what if you try only saving, say, one in every 5? Then you still have a completely "random" sample, and you can just apply the percentages to the bigger dataset.
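
A sketch of that sampling idea (the 1-in-5 rate and the record_hit() helper are both placeholders):

    <?php
    // Sampling sketch: record roughly 1 in every 5 hits and scale the counts
    // back up by 5 when reporting. record_hit() is a hypothetical stand-in for
    // whatever storage path ends up being used.
    $sampleRate = 5;

    if (mt_rand(1, $sampleRate) === 1) {
        record_hit();
    }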

Tyler Smith
A: 

Ever heard of batching inserts? Look it up if not.

Joe Joe