Suppose there are on the order of tens of thousands of keys, where each key corresponds to a stream of events. I'd like to support the following operations:

  • push(key, timestamp, event) - pushes event onto the event queue for key, marked with the given timestamp. It is guaranteed that event timestamps for a particular key arrive in sorted or almost-sorted order.
  • tail(key, timestamp) - gets all events for key since the given timestamp. The timestamps in tail requests for a given key are usually almost monotonically increasing, roughly in step with the pushes for that key.
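
Here is a minimal in-memory sketch (Python, no persistence) of the semantics I mean; the names and details are just illustrative:

    import bisect
    from collections import defaultdict

    # key -> parallel lists of timestamps and events, kept in timestamp order
    _times = defaultdict(list)
    _events = defaultdict(list)

    def push(key, timestamp, event):
        ts, ev = _times[key], _events[key]
        # Timestamps arrive almost sorted, so the insertion point is
        # usually at (or very near) the end of the list.
        i = bisect.bisect_right(ts, timestamp)
        ts.insert(i, timestamp)
        ev.insert(i, event)

    def tail(key, timestamp):
        ts = _times[key]
        # Return every event at or after the requested timestamp.
        i = bisect.bisect_left(ts, timestamp)
        return list(zip(ts[i:], _events[key][i:]))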

This data has to be persistent (though it is not absolutely necessary to persist pushes immediately, or to keep tails strictly in sync with pushes), so I'm going to use some kind of database.

What is the optimal kind of database structure for this task? Would it be better to use a relational database, a key-value store, or something else?

+2  A: 

Working with financial data? ;) I have an app here handling 1.5 million such streams (the complete CME feed) in tests ;)

Relational - you CAN do it, but it is wasteful. What I did instead is binary storage PER STREAM, with the values in a space-efficient binary delta format (timestamps always go up, so there is no need to store them in full - only the small delta from the last one). I currently store the data in 15-minute slices, and the system that retrieves the tail knows how to reassemble it. This also puts a LOT less load on the relational side.
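
For illustration, a minimal sketch of such a delta format (the varint layout is my assumption, not necessarily the exact format described above):

    def encode_varint(n):
        # Little-endian base-128 varint for a non-negative integer.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def encode_slice(timestamps):
        # Timestamps within a slice are assumed non-decreasing, so each
        # gap is a small non-negative number. The first value is the full
        # timestamp (delta from zero); the rest are gaps from the previous.
        buf = bytearray()
        prev = 0
        for t in timestamps:
            buf += encode_varint(t - prev)
            prev = t
        return bytes(buf)

With millisecond timestamps, the gaps inside a 15-minute slice typically encode to one or two bytes each instead of eight.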

There are specialized databases for this, but they are obscenely priced (10,000 USD per processor core, minimum license of 8 cores - yeah, right).

Some applications go with flat files (one per key) - even trading-style applications. Personally, I don't like that approach.

TomTom
Thanks, that looks similar to what I had thought of on my own; however, I'm still interested in other solutions :) (By the way, it is *not* about financial data.)
jkff
+2  A: 

Do you have any say in the hardware that will be used? Assuming this workload will have more reads than writes, it could be an ideal application for SSDs, coupled with what TomTom mentioned - storing the events as files in a dedicated directory.

If you go that way, I suggest having a directory for each key, and organizing them into subdirectories.

For example, suppose you have a key like this: HJ029084930A

You should have:

/streams
/streams/HJ02
/streams/HJ02/9084
/streams/HJ02/9084/930A/HJ029084930A
/streams/HJ02/9084/930A/HJ029084930A/20100315/230257.trc
/streams/HJ02/9084/930A/HJ029084930A/20100316/000201.trc
/streams/HJ02/9084/930A/HJ029084930A/20100316/000203.trc
/streams/HJ02/9084/930A/HJ029084930A/20100316/010054.trc
...
/streams/HJ02/9084/930A/HJ029084930A/20100317/010230.trc

What I am hinting at is that you should do your best to avoid having "too many" files (or directories) inside a single directory, or the filesystem could slow down when retrieving them.
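
For illustration, a sketch that maps a key and a slice start time onto this layout (the 4-character fan-out and the .trc suffix mirror the listing above; the exact chunking rule is my assumption):

    import os
    from datetime import datetime

    def stream_path(root, key, start):
        # Fan the key out into 4-char chunks: HJ029084930A -> HJ02/9084/930A
        chunks = [key[i:i + 4] for i in range(0, len(key), 4)]
        day = start.strftime("%Y%m%d")
        name = start.strftime("%H%M%S") + ".trc"
        return os.path.join(root, *chunks, key, day, name)

    # stream_path("/streams", "HJ029084930A", datetime(2010, 3, 15, 23, 2, 57))
    # -> "/streams/HJ02/9084/930A/HJ029084930A/20100315/230257.trc"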

One possible problem is a stream that overlaps from the end of one day into the start of the next. See if you can split it so that one file finishes at 23:59:59 and a new one starts at 00:00:00 the next day. Whether this matters depends on the exact semantics of tail() in your case; a sketch of locating the candidate files follows.
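
For illustration, here is how tail() could locate candidate slice files under this layout (the 15-minute slice length is borrowed from the previous answer; everything else is my assumption):

    import os
    from datetime import datetime, timedelta

    def tail_files(stream_dir, since):
        # Yield slice files that may contain events at or after `since`,
        # walking day directories forward from the requested timestamp.
        day = since.date()
        last = datetime.utcnow().date()  # assumes timestamps are UTC
        while day <= last:
            day_dir = os.path.join(stream_dir, day.strftime("%Y%m%d"))
            if os.path.isdir(day_dir):
                for name in sorted(os.listdir(day_dir)):
                    # File names are HHMMSS.trc, the slice's start time.
                    start = datetime.strptime(
                        day.strftime("%Y%m%d") + name[:6], "%Y%m%d%H%M%S")
                    # A slice that started before `since` may still hold
                    # later events, so events inside each yielded file must
                    # still be filtered by timestamp.
                    if start + timedelta(minutes=15) > since:
                        yield os.path.join(day_dir, name)
            day += timedelta(days=1)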

p.marino