We are planning to implement a system for logging a high frequency of market ticks into a DB for further analysis. To get a rough idea of what kind of storage performance we can expect from different DB solutions, I created a little application that inserts a basic row of tick information. Running the same code against a couple of different DBs gave some interesting results.

The data being inserted is very simple, as follows:

CREATE TABLE [dbo].[price](
    [product_code] [char](15) NULL,
    [market_code] [char](10) NULL,
    [currency] [nchar](6) NULL,
    [timestamp] [datetime] NULL,
    [value] [float] NULL,
    [price_type] [char](4) NULL
) ON [PRIMARY]

Microsoft SQL Server :

Total test time : 32 seconds. 3,099 prices per second.

MySQL Server :

Total test time : 18 seconds. 5,349 prices per second.

MongoDB Server :

Total test time : 3 seconds. 25,555 prices per second.

The purpose of this testing is simply to get a rough indication of what kind of "raw performance" can be expected from the underlying systems. When actually implementing a solution we would of course add buffering, bulk inserts, etc.

We only care about the speed of the inserts, as the querying is done "offline" later.

Does anyone have any suggestions for other databases that could fit? I will be trying HDF5 and MonetDB later tonight too. It's required to support multi-client access.

Thanks for any suggestions!

UPDATED :

Sorry, but I did a major edit of my question before posting, and it seems I left out the server versions and some details of the hardware. All tests were on an 8-core server with 12 GB RAM running Windows 2008 x64.

Microsoft SQL Server 2008 Enterprise x64. MySQL 5.1.44 using InnoDB tables. MongoDB 1.2.4 x64.

The current test is a simple loop of single-row inserts into the DBs, using real historical data from NASDAQ compiled into a CSV file and already loaded into memory. The code is C# on .NET 4, x64.
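Roughly, the test loop has the following shape. This is only a simplified sketch, not the exact test code: the Tick class and the connection string are placeholders, the CSV parsing is omitted, and the column names match the price table above.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class Tick
{
    public string ProductCode, MarketCode, Currency, PriceType;
    public DateTime Timestamp;
    public double Value;
}

static class TickLoader
{
    public static void InsertTicks(string connectionString, IEnumerable<Tick> ticks)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var cmd = new SqlCommand(
                "INSERT INTO price (product_code, market_code, currency, timestamp, value, price_type) " +
                "VALUES (@product, @market, @currency, @timestamp, @value, @type)", conn))
            {
                var product   = cmd.Parameters.Add("@product", SqlDbType.Char, 15);
                var market    = cmd.Parameters.Add("@market", SqlDbType.Char, 10);
                var currency  = cmd.Parameters.Add("@currency", SqlDbType.NChar, 6);
                var timestamp = cmd.Parameters.Add("@timestamp", SqlDbType.DateTime);
                var value     = cmd.Parameters.Add("@value", SqlDbType.Float);
                var type      = cmd.Parameters.Add("@type", SqlDbType.Char, 4);

                foreach (var tick in ticks)   // one round trip to the server per tick
                {
                    product.Value   = tick.ProductCode;
                    market.Value    = tick.MarketCode;
                    currency.Value  = tick.Currency;
                    timestamp.Value = tick.Timestamp;
                    value.Value     = tick.Value;
                    type.Value      = tick.PriceType;
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}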

The MS SQL and MySQL servers were "tuned" to perfect settings, while MongoDB was just set up with defaults. The SQL tables have no indices, as the purpose of the DB is simply to act as a staging ground before the data is transferred into the main analysis system.

Many suggested bulk inserts; however, that is a difficult way of doing it, as we have several clients pushing single ticks into the DB independently from live streams. To allow for such methods, we would have to expand the layer in front of the DB beyond what we have a chance to test right now. However, I imagine something like that will have to be done for the final architecture, as the numbers we are getting from everything except MongoDB are not enough to handle the number of inputs needed.
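For what it's worth, the kind of layer I have in mind would look roughly like the sketch below, assuming SQL Server as the staging DB and reusing the placeholder Tick type from above: the clients receiving live streams push single ticks into a shared thread-safe queue, and a background task drains the queue and bulk-copies batches into the price table. The batch size and connection handling are made up for illustration, not something we have tested.

using System;
using System.Collections.Concurrent;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

class TickBuffer
{
    private readonly BlockingCollection<Tick> queue = new BlockingCollection<Tick>();
    private readonly string connectionString;
    private const int BatchSize = 1000;   // made-up flush threshold

    public TickBuffer(string connectionString)
    {
        this.connectionString = connectionString;
        Task.Factory.StartNew(FlushLoop, TaskCreationOptions.LongRunning);
    }

    // Called by the client threads receiving live ticks.
    public void Add(Tick tick)
    {
        queue.Add(tick);
    }

    private void FlushLoop()
    {
        var batch = new DataTable();
        batch.Columns.Add("product_code", typeof(string));
        batch.Columns.Add("market_code", typeof(string));
        batch.Columns.Add("currency", typeof(string));
        batch.Columns.Add("timestamp", typeof(DateTime));
        batch.Columns.Add("value", typeof(double));
        batch.Columns.Add("price_type", typeof(string));

        foreach (var tick in queue.GetConsumingEnumerable())
        {
            batch.Rows.Add(tick.ProductCode, tick.MarketCode, tick.Currency,
                           tick.Timestamp, tick.Value, tick.PriceType);
            if (batch.Rows.Count >= BatchSize)
            {
                // One bulk load per batch instead of one round trip per tick.
                using (var bulk = new SqlBulkCopy(connectionString))
                {
                    bulk.DestinationTableName = "price";
                    bulk.WriteToServer(batch);
                }
                batch.Clear();
            }
        }
    }
}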

UPDATE 2: SSD drives are indeed great for exactly this, and we are using them ourselves. However, the final product will be installed at a few different customers, who all provide their own iron, and getting servers with SSDs out of an IT department is still hard... :(

UPDATE 3:

I tried the BulkCopy approach suggested. Running the same loop as the other tests, but inserting first into a DataTable and then bulk-copying into SQL Server, gave the following:

Microsoft SQL Server (Bulk) :

Total test time : 2 seconds. 39,401 prices per second.
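For reference, a minimal sketch of what that bulk-copy loop looks like, assuming the same placeholder Tick type and connection string as earlier: the rows are collected in a DataTable first and then written with a single SqlBulkCopy call.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class BulkLoader
{
    public static void BulkInsertTicks(string connectionString, IEnumerable<Tick> ticks)
    {
        var table = new DataTable();
        table.Columns.Add("product_code", typeof(string));
        table.Columns.Add("market_code", typeof(string));
        table.Columns.Add("currency", typeof(string));
        table.Columns.Add("timestamp", typeof(DateTime));
        table.Columns.Add("value", typeof(double));
        table.Columns.Add("price_type", typeof(string));

        foreach (var tick in ticks)
            table.Rows.Add(tick.ProductCode, tick.MarketCode, tick.Currency,
                           tick.Timestamp, tick.Value, tick.PriceType);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "price";
            bulk.WriteToServer(table);   // one bulk load instead of one round trip per row
        }
    }
}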

+4  A: 

I can only really comment on sql-server, but there are some things to try:

  • command batching (i.e. do multiple INSERT in a single hit to the db)
  • bulk insert (via SqlBulkCopy)

either should give significant improvements over single-row inserts (the latter being the fastest)
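For the batching option, a rough sketch against the price table from the question might look like the following (parameter names and batch handling are illustrative only, the Tick type is the placeholder from the question, and the connection is assumed to be open; note SQL Server caps a single command at 2,100 parameters, so keep batches small):

using System.Data.SqlClient;
using System.Text;

static class BatchedInserter
{
    public static void InsertBatch(SqlConnection conn, Tick[] batch)
    {
        var sql = new StringBuilder();
        using (var cmd = conn.CreateCommand())
        {
            for (int i = 0; i < batch.Length; i++)
            {
                // Each row gets its own INSERT statement and parameter set,
                // all concatenated into a single command text.
                sql.AppendFormat(
                    "INSERT INTO price (product_code, market_code, currency, timestamp, value, price_type) " +
                    "VALUES (@p{0}, @m{0}, @c{0}, @t{0}, @v{0}, @y{0});", i);
                cmd.Parameters.AddWithValue("@p" + i, batch[i].ProductCode);
                cmd.Parameters.AddWithValue("@m" + i, batch[i].MarketCode);
                cmd.Parameters.AddWithValue("@c" + i, batch[i].Currency);
                cmd.Parameters.AddWithValue("@t" + i, batch[i].Timestamp);
                cmd.Parameters.AddWithValue("@v" + i, batch[i].Value);
                cmd.Parameters.AddWithValue("@y" + i, batch[i].PriceType);
            }
            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();   // all INSERTs in one round trip
        }
    }
}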

Marc Gravell
+1 - I recently blogged a performance comparison using SqlBulkCopy vs. batched updates using SqlDataAdapter here: http://www.adathedev.co.uk/2010/02/sqlbulkcopy-bulk-load-to-sql-server.html. Result being 0.8229s to insert 100,000 records on my home PC.
AdaTheDev
@AdaTheDev - good link, thanks
Marc Gravell
Indeed very interesting. But SqlBulkCopy has the problem of requiring exclusive access to the table when doing the insert, no?
Erik
@Erik - it depends what you choose for the lock mode. "table" is fastest but exclusive. There are more granular options.
Marc Gravell
A: 

There are many ways to optimize performance, and different databases handle data very differently as well. SQL Server, for example, is protecting your data: it has to be sure the data is valid and on disk before it lets you know the insert has been successful. Neither MySQL nor MongoDB is doing that, so they can be faster. So, what are you looking for? An RDBMS, or some storage where you can afford to lose some data?

Frank Heikens
+2  A: 

The purpose of this testing is simply to get a little indication of what kind of "raw performance" can be expected of the systems in the bottom. When actually implementing a solution we would of course do buffering, bulk inserts etc.

You could at least share the details of your tests. Omitting such crucial information as which MySQL engine you used is unpardonable. And the "raw performance" of a non-batched insert on a buffer-pool based DB (like SQL Server or InnoDB) is nonsense; it's like measuring the "raw performance" of a Ferrari in first gear and then publishing that "it only goes 50 mph".

But anyway, if you want a highly scalable, write-optimized DB, look at Cassandra from the Apache Incubator. The rumor mill says Twitter will adopt it soon.

Remus Rusanu
A: 

BerkeleyDB might be worth a look if your data can be represented as key/value pairs (as in a Perl hash or similar data structure). It's fast, multi-client, and transaction-safe, even if it is not the latest whiz-bang thing.

Ollie Jones
A: 

If you want insert-only operations, you can get more out of MySQL by using the ARCHIVE engine and INSERT DELAYED.
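A rough sketch of what that could look like from C# with MySQL Connector/NET (MySql.Data), assuming the price table has been created with ENGINE=ARCHIVE, an open connection, and the placeholder Tick type from the question; exact connector details may vary by version:

using MySql.Data.MySqlClient;

static class MySqlDelayedInserter
{
    public static void InsertDelayed(MySqlConnection conn, Tick tick)
    {
        // Assumes the table was created once with ENGINE=ARCHIVE, e.g.:
        //   CREATE TABLE price (...) ENGINE=ARCHIVE;
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText =
                "INSERT DELAYED INTO price (product_code, market_code, currency, timestamp, value, price_type) " +
                "VALUES (@product, @market, @currency, @timestamp, @value, @type)";
            cmd.Parameters.AddWithValue("@product", tick.ProductCode);
            cmd.Parameters.AddWithValue("@market", tick.MarketCode);
            cmd.Parameters.AddWithValue("@currency", tick.Currency);
            cmd.Parameters.AddWithValue("@timestamp", tick.Timestamp);
            cmd.Parameters.AddWithValue("@value", tick.Value);
            cmd.Parameters.AddWithValue("@type", tick.PriceType);
            cmd.ExecuteNonQuery();   // DELAYED returns without waiting for the row to be written
        }
    }
}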

Otherwise, try any of the local-storage KV engines: BDB, QDBM, Tokyo Cabinet, etc.

viraptor
A: 

Did you test with several application instances connected to the database server and inserting data at the same time, or with just one application?

I think you should be testing with multiple instances, especially for bulk insertion, and see what configuration works for you. Different transaction isolation modes can greatly affect performance for concurrent access (especially write access). With SQL Server, for example, I found that an isolation level lower than ReadCommitted should be used in a highly concurrent environment, or you will see a lot of timeouts. This should of course only be used when the risk of dirty reads is not a concern (which fits your case, judging from your description).
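If it helps, here is a minimal sketch of specifying a lower isolation level from ADO.NET against SQL Server; the insert itself mirrors the question's loop, and the Tick type and connection string are the same placeholders as in the question:

using System.Data;
using System.Data.SqlClient;

static class LowIsolationInserter
{
    public static void InsertTick(string connectionString, Tick tick)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            // Run the transaction at ReadUncommitted instead of the default
            // ReadCommitted, so reads inside it take no shared locks.
            using (var tx = conn.BeginTransaction(IsolationLevel.ReadUncommitted))
            using (var cmd = conn.CreateCommand())
            {
                cmd.Transaction = tx;
                cmd.CommandText =
                    "INSERT INTO price (product_code, market_code, currency, timestamp, value, price_type) " +
                    "VALUES (@product, @market, @currency, @timestamp, @value, @type)";
                cmd.Parameters.AddWithValue("@product", tick.ProductCode);
                cmd.Parameters.AddWithValue("@market", tick.MarketCode);
                cmd.Parameters.AddWithValue("@currency", tick.Currency);
                cmd.Parameters.AddWithValue("@timestamp", tick.Timestamp);
                cmd.Parameters.AddWithValue("@value", tick.Value);
                cmd.Parameters.AddWithValue("@type", tick.PriceType);
                cmd.ExecuteNonQuery();
                tx.Commit();
            }
        }
    }
}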

PS: Forgive me if I'm stating the obvious here.

Jaya Wijaya
+2  A: 

How do these compare to simply logging to a flat file in the file system? If querying is done later, I'm not sure why you are bringing the data into a relational database at this stage. Is there a need for transactions, or for concurrent access to the data, at all during this recording stage?
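For comparison, a flat-file logger is about as simple as it gets; a minimal sketch, assuming the placeholder Tick type from the question and an illustrative CSV layout:

using System;
using System.Collections.Generic;
using System.IO;

static class FlatFileLogger
{
    public static void AppendTicks(string path, IEnumerable<Tick> ticks)
    {
        // Append one CSV line per tick; the writer is opened once per run.
        using (var log = new StreamWriter(path, append: true))
        {
            foreach (var tick in ticks)
            {
                log.WriteLine("{0},{1},{2},{3:o},{4},{5}",
                    tick.ProductCode, tick.MarketCode, tick.Currency,
                    tick.Timestamp, tick.Value, tick.PriceType);
            }
        }
    }
}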

Cade Roux
Exactly: if the querying is done later, nothing beats the performance of simply appending to a text file.
Codism