I am trying to figure out exactly what these newfangled data stores such as Bigtable, HBase and Cassandra really are.

I work with massive amounts of stock market data, billions of rows of price/quote data that can add up to 100s of gigabytes every day (although these text files often compress by at least an order of magnitude). This data is basically a handful of numbers, two or three short strings and a timestamp (usually millisecond level). If I had to pick a unique identifier for each row, I would have to pick the whole row (since an exchange may generate multiple values for the same symbol in the same millisecond).

I suppose the simplest way to map this data to Bigtable (I'm including its derivatives) is by symbol name and date (which may return a very large time series; more than a million data points isn't unheard of). From reading their descriptions, it looks like multiple keys can be used with these systems. I'm also assuming that decimal numbers are not good candidates for keys.

Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?

What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?

Reading relevant papers seems to show that these systems are not a very good fit for massive time series systems. However, if systems such as Google Maps are based on them, I think time series should work as well. For example, think of time as the x-axis, prices as the y-axis and symbols as named locations. All of a sudden it looks like Bigtable should be the ideal store for time series (if the whole earth can be stored, retrieved, zoomed and annotated, stock market data should be trivial).

Can some expert point me in the right direction or clear up any misunderstandings?

Thanks

+6  A: 

I am not an expert yet, but I've been playing with Cassandra for a few days now, and I have some answers for you:

  1. Don't worry about the amount of data. It is largely irrelevant with systems like Cassandra, provided you have the $$$ for a large hardware cluster.

Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?

Cassandra is very useful when you know how to work with keys. It can sift through keys very quickly. So to search for MSFT between 11:00 am and 1:30 pm, you'd have to key your rows like this:

MSFT-timestamp, GOOG-timestamp, etc. Then you can tell Cassandra to find all keys in a range, for example everything from MSFT-now up to MSFT-now+1hour.
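To make the key idea concrete, here is a minimal in-memory sketch in Python. It is not Cassandra client code: `SortedKeyStore`, `make_key` and the zero-padded millisecond format are names and choices I made up for illustration, and it assumes the store keeps its keys in sorted order (which, as far as I understand, depends on how the cluster is set up to partition keys). Each key holds a list of values, since the exchange can emit several ticks in the same millisecond.

```python
import bisect
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def make_key(symbol, ts):
    """Build 'SYMBOL-<epoch millis>' with zero padding so string order matches time order."""
    millis = (ts - EPOCH) // timedelta(milliseconds=1)
    return f"{symbol}-{millis:016d}"

class SortedKeyStore:
    """Hypothetical stand-in for a store that keeps its keys sorted."""

    def __init__(self):
        self._keys = []      # sorted list of distinct keys
        self._values = {}    # key -> list of values seen for that key

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
            self._values[key] = []
        # Several ticks may share a millisecond, so append instead of overwrite.
        self._values[key].append(value)

    def range_scan(self, start_key, end_key):
        """Return (key, values) pairs with start_key <= key < end_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

A query for MSFT between 11:00 and 13:30 then becomes `store.range_scan(make_key("MSFT", t_1100), make_key("MSFT", t_1330))`. The end of the range is exclusive in this sketch, and the zero padding matters: without it, string comparison would not agree with numeric order.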

What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

I am not an expert, but as far as I can tell Cassandra doesn't search by values at all. So if you want to do the above, you will have to make another table dedicated just to this problem and design your schema to fit the case. But it won't be much different from what I described above. It's all about naming your keys and columns. Cassandra can find them very quickly!
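A rough sketch of what such a dedicated "search by price" table could look like, again as plain Python rather than real Cassandra calls. The key layout (`day-price-symbol`), the helper names, and the choice to store prices as zero-padded integer cents are all my own assumptions; the cents trick is there because decimal numbers make poor keys, as the question suspects.

```python
import bisect

def price_key(day, price_cents, symbol):
    """Index key 'YYYYMMDD-<zero-padded cents>-SYMBOL'; day and price contain no hyphens."""
    return f"{day}-{price_cents:010d}-{symbol}"

def symbols_in_price_range(index_keys, day, low_cents, high_cents):
    """Symbols with at least one price in [low_cents, high_cents] on the given day.

    index_keys must be a sorted list of keys built with price_key().
    """
    lo = bisect.bisect_left(index_keys, f"{day}-{low_cents:010d}-")
    hi = bisect.bisect_left(index_keys, f"{day}-{high_cents + 1:010d}-")
    # The symbol is everything after the second hyphen, so 'BRK-B' survives intact.
    return sorted({key.split("-", 2)[2] for key in index_keys[lo:hi]})
```

For the $10 to $10.25 question you would call `symbols_in_price_range(index_keys, "20100104", 1000, 1025)`. The cost is that your program has to write every tick twice, once into the main table and once into this index.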

What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?

Correct, all logic is done inside your program. This is not MySQL. This is just a storage engine. (But I am sure future versions will offer this sort of thing.)
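For what it's worth, the client-side part is small once both series have been fetched. A minimal sketch, assuming each series arrives as a dict of timestamp to price and that you only want the difference where both series have a value (the function name and the alignment rule are my own choices, not anything the store gives you):

```python
def subtract_series(a, b):
    """Return {timestamp: a[t] - b[t]} for timestamps present in both series."""
    common = sorted(a.keys() & b.keys())
    return {t: a[t] - b[t] for t in common}
```

Timestamps that appear in only one series are dropped here. If you would rather carry the last known price forward, that alignment step is also something your program has to do.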

Please remember that I am a novice at this. If I am wrong, feel free to correct me.

Gotys
+7  A: 

If you're dealing with a massive time series database, then the standards are:

These aren't cheap, but they can handle your data very efficiently.

Shane