(How can/What should) I implement a database that scales to the upper tens of thousands requests/second?

views:

654

answers:

+6 Q:

(How can/What should) I implement a database that scales to the upper tens of thousands requests/second?

By Upper tens of thousands requests/second I want to see 60,000 -> +90,000 requests/second.

My Setup consists of the following:

user ---> web app --> message queue --> parser --> database?

I should mention that the parser currently can parse/stuff around 18750 records/second using COPY so we are limited on that end until we start adding more parsers -- this isn't a huge concern for me now.

I have a system that requires the ability to bulk upload as fast as I can as many records as I can. This same system (or it can be different depending on how you would approach it) should be able to respond to analytical type queries such as this:

wonq = "select sum(amount) from actions where player = '@player' and " +
       "(type = 'award' or type = 'return') and hand = hand_num"
lostq = "select sum(amount) from actions where player = 'player' and " +
        "type != 'award' and type != 'return' and hand = hand_num"

.....10-15 thousand times (PER USER) since they are keyed off to another table. Needless to say we paginate these results at 10/page for now.

I've looked at the following: (assuming these are all on the same server)

mysql (reg. run of the mill rdbms) -- was able to get into the 15-20 thousand requests/second range; under current conditions if we try to scale this out we need a seperate host/database everytime we need to scale -- this is not doable
couchdb (document oriented db) -- didn't break 700 requests/second; I was really hoping this was going to save our ass -- not a chance!
vertica (columnar oriented db) -- was hitting 60000 request/second, closed source, very pricey; this is still an option but I personally did not like it at all
tokyocabinet (hash based db) -- is currently weighing in at 45,000 inserts/second and 66,000 selects/second; yesterday when I wrote this I was using a FFI based adapater that was performing at around 5555 requests/second; this is by-far THE fastest most awesome database I've seen yet!!
terracotta -- (vm cluster) currently evaluating this along with jmaglev (can't wait until maglev itself comes out) -- this is THE SLOWEST!

maybe I'm just approaching this problem wrong but I've ALWAYS heard that RDBMS were slow as all hell -- so where are these super fast systems that I've heard about?

Testing Conditions::

Just so ppl know my specs on my dev box are:

dual 3.2ghz intel, 1 gig ram

Mysql mysql.cnf edits were:

key_buffer = 400M               # was 16M
innodb_log_file_size = 100M     # non existent before
innodb_buffer_pool_size = 200M  # non existent before

UPDATE::

It turns out that terracotta might have a place in our application structure but it flat out WILL NOT be replacing our database anytime soon as it's speeds are terrible and it's heap utilization sucks.

On the other hand, I was very happy to see that tokyocabinet's NON-FFI ruby library (meaning tyrant/cabinet) is super fast and right now that is first place.

+5 A:

For crazy-big scalability, you'll want to focus on two things:

Sharding: Split your data set into groups that don't overlap. Have an easy, fast way to map from a request to a server. (Player starting with a-f, server 1; g-q, server 2... etc...)
Caching: Use Memcache to remember the output of some really common select queries, so you don't have to go to disk as often.

ojrac 2009-02-17 23:12:35

+1 A:

Well the big player in the game is Oracle but thats big bucks.

If you want to go cheap then you will have to pay the price in a different terms:

by partioning the DB across multiple instances and distributing the load.
Potentially caching results so actual DB access is reduced.

Martin York 2009-02-17 23:17:27

user ---> web app --> message queue --> parser --> database?

What do you need the message queue for? Those are a big performance issue normally.

Andomar 2009-02-17 23:25:16

good question, however, the message queue adds almost NO noticeable performance hit... the reason it is there is because eventually we want to have multiple parsers pulling from it and I want the jobs from the web server to be IMMEDIATELY thrown in the queue so the web server can do it's best

feydr 2009-02-18 00:33:07

Sharding and caching as ojrac said.

Another option is to take a step back and figure out to do the work with less queries! From the little information you gave I can't help but think "there must be a better way". From the examples you gave some summary tables (with optional caching) might be an easy win.

Hypertable etc gives better performance for some data access patterns, but yours sound very suited for the typical databases.

And yeah, CouchDB is disappointingly slow.

Ask Bjørn Hansen 2009-02-18 00:21:10

had no idea CouchDB was so weak! I imagined it was atleast like 10k

Robert Gould 2009-02-18 01:28:50

we have done summary tables in the past which more or less worked, however as of right now I'm back to bare-bones "how fast can we throw stuff in and grab it out"

feydr 2009-02-18 16:22:03

have you tried postgresql? it should be faster than mysql. but anyhow, you would need to balance the load over multiple servers (split database). you can have multiple databases (e.g. for each client) and then one centralized one that will sync with those small ones...

dusoft 2009-02-18 22:12:59

I have not tried postgresql yet, although I've used it in past projects and it is industry quality strength -- I know from past experiences that it does not have the speed I require however..

feydr 2009-02-18 23:38:12

memory database in mysql?

2009-06-09 04:34:22

Have you tried redis? They promise the speed of 110000 SETs/second, 81000 GETs/second. It's an advanced key-value db with support for lists and sets.

AlexD 2009-09-22 22:23:56

actually evaluated redis and like it quite a lot -- I have several problems with it however for this problem -- the main one being that you need enough memory to match what you want to store.... without being distributed that's a big gotcha

feydr 2009-09-23 15:27:37

Yes, for the same reason Redis doesn't look very suitable for our project. In this context LightCloud project looks interesting since it builds distributed key-value database on top of Tokyo Tyrant or Redis.

AlexD 2009-09-25 15:05:22

I doubt any system will give you the out-of-the-box performance that you need. You are probably going to start hitting hard limits on the machine you are on (with just about any write-intensive db you will hit I/O limits pretty fast). Some analysis might be required, but the disk is nearly always the bottleneck. More RAM will help, as will using Solid State Disks.

However, you will probably need clustering of some kind regardless of which actual db you use. You can shard the data itself, or with MySQL, setting up read-slaves will spread the load across nodes and should give you the throughput you are looking for.

Also: MongoDB is awesome. Might be worth a look.

Toby Hede 2009-09-22 22:41:59

have looked at mongodb and I like it much better than couch (both being doc-oriented dbs) as it's much faster.. I was getting 8,000-10,000 requests/second on my laptopyou are right about the clustering... as of right now we are looking at using hdfs/hbase in the hadoop stack.. not as fast but it should do what we need

feydr 2009-09-23 15:29:23

The typical way to quickly store data durably in a write-heavy app is to use an append-only log. If properly deployed s.t. the log file is on its own spinning disk, the disk seek time is minimized per write/append operation.

One can update metadata to know the offset for some primary key after each write.

There is a mysql storage engine that does this is you want to use mysql. Another option is one of the new nosql databases like fleetdb.

Have you tried using a SSD as well?

There are lots of options to solve this problem but they are likely going to require some manual labor.

z8000 2010-01-08 13:33:57

ansaurus

tags:

views:

answers:

(How can/What should) I implement a database that scales to the upper tens of thousands requests/second?

related questions