Are document databases good for storing large amounts of Stock Tick data?

views:

107

answers:

+1 Q:

Are document databases good for storing large amounts of Stock Tick data?

I was thinking of using a database like mongodb or ravendb to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational such as Sql Server.

The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.

Example data: 500 symbols * 60 min * 60sec * 300 days... (per record we store: date, open, high,low,close, volume, openint - all decimal/float)

So what do you guys think?

Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation/trend based analysis on it.

If you use a document based db to act as your source, the loading and manipulation of each row of data (CRUD operations) is very simple. Very efficient, very straight forward, basically lovely.

What sucks is that there are very few (if any) options to pull this data out and cram it into a structure more prone for statical analysis (columnar database, cube, etc). If you load it into a basic relational database there are a host of tools (both comercial and open source - http://www.pentaho.com/) that will accommodate the ETL and analysis very nicely.

Ultimately though, what you want to keep in mind, is that every financial firm in the world has a stock-analysis/auto-trader application, they just caused a major US Stock tumble and they are not toys. :)

Bobby B 2010-07-08 20:27:29

A simple datastore such as a key-value or document database is also beneficial in cases where performing analytics reasonably exceeds a single system's capacity. (Or it will require an exceptionally large machine to handle the load.) In these cases, it makes sense to use a simple store since the analytics require batch processing anyway. I would personally look at finding a horizontally scaling processing method to coming up with the unit/time analytics required.

I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or some higher level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience but many do and depend on it for their businesses.)

Nick 2010-07-09 00:42:23

+1 A:

The answer here will depend on scope.

MongoDB is great way to get the data "in" and it's really fast at querying individual pieces. It's also nice as it is built to scale horizontally.

However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".

As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).

In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.

This is honestly pretty close to what you probably want to do. However, there are some limitations here:

Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.

On the other hand, you'll run into different variants of these problems with SQL.

Of course there are some benefits here:

Horizontal scalability. If you have lots of boxes then you can shard them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is lot more costly and expensive.
Really fast speed and as with point #1, you get the ability to add RAM horizontally to keep up the speed.

As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.

Gates VP 2010-07-09 04:59:06

Thanks for the responses, looks like I will have to do a few test scenarios and play around first. But the analysis tools support is something I over looked. Thanks.

dvkwong 2010-07-17 21:35:04

ansaurus

tags:

views:

answers:

Are document databases good for storing large amounts of Stock Tick data?

related questions