Suppose you're interested in a bunch of independent time-varying values, each of which represents the current state of something. The values don't change on any fixed schedule and new values can't be predicted from old ones. For the sake of a concrete example, let's say you have a bunch of stocks and you're interested in keeping track of their values, and you get an update about an individual stock whenever a trade is made on that stock. (My actual problem isn't about stocks, but hopefully they make what I'm getting at more understandable.)

In addition to just knowing the current price of each stock, you'd also like to be able to pick an arbitrary point in the past and get a "snapshot" that told you what the most recent trading price for each stock was at that time. So for instance, you should be able to say "What was the most recent value of every stock I'm tracking as of 4:53PM last Tuesday?" and get a precise answer efficiently.

I can think of three ways to do this, but I'm not very happy with any of them.

1. Keep a journal. Maintain a single list of all trades in time-sequence order. Update just appends to the list, and Snapshot is a linear scan backwards in time, starting at the last entry whose timestamp is at or before the specified timestamp and collecting the most recent price seen for each stock. Update is O(1), but the scan might have to cover the entire journal before it has found a value for every stock, so Snapshot is O(u), where u is the total number of updates. Memory required is O(u) for obvious reasons.

2. Write checkpoints. Maintain a single journal as before, but instead of each entry containing just the new price of one stock, each entry contains the current price (as of that update) of every stock. This is cheap to calculate: since the previous entry has all that information too, you just copy it over except for the one stock whose price actually changed. Now Snapshot is an O(log u) operation (binary search on the journal for the last entry at or before the specified timestamp). However, Update becomes O(s), where s is the number of stocks in the system, and the total memory required goes from O(u) in the first strategy to O(s*u) --- both of which are problems if s and u are large.

3. Separated journals. Maintain a separate journal for each stock and write each stock's updates into its own journal, again in chronological order. To snapshot, go over each journal and binary-search for the last update at or before the requested time (a rough sketch follows this list). This requires O(u) memory, Update is O(1), and Snapshot takes O(s * log u) time. It's my favorite of the three, but I feel like it could probably be improved upon, since it ignores any relationship between the timing of updates across different stocks.
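For concreteness, here is a minimal Python sketch of strategy 3; the class and method names are mine, and it assumes each stock's updates arrive in chronological order:

```python
import bisect
from collections import defaultdict

class SeparatedJournals:
    def __init__(self):
        # Per-stock parallel lists of timestamps and prices, kept in time order.
        self._times = defaultdict(list)
        self._prices = defaultdict(list)

    def update(self, symbol, timestamp, price):
        # O(1) append, assuming updates arrive in chronological order.
        self._times[symbol].append(timestamp)
        self._prices[symbol].append(price)

    def snapshot(self, timestamp):
        # O(s * log u): one binary search per stock's journal.
        result = {}
        for symbol, times in self._times.items():
            i = bisect.bisect_right(times, timestamp)  # first entry after the cutoff
            if i:                                      # skip stocks with no trades yet
                result[symbol] = self._prices[symbol][i - 1]
        return result
```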

Is there a better way I'm missing? Is this a problem that has been studied and has a generally-accepted solution?

+2  A: 
John Kugelman
If I remember correctly, #4 is what Mercurial uses to store the history of each file.
Matthieu M.
+4  A: 

Have a look at the literature on Persistent Data Structures. In particular, this early paper describes the construction of a persistent binary search tree that maintains logarithmic operations but can be accessed at any version (i.e. point in time). Accesses to parts of the structure that weren't updated in some particular version naturally look to the last preceding version. So you would have your natural operations in O(log s) time, and the structure could occupy O(u) space if you know all your keys up front and never have to rebalance, or O(u * log s) space if every update modifies O(log s) pointers.

These class notes seem to describe what you would need to implement in fairly simple terms.
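As a rough illustration of the path-copying flavor of persistence (my own sketch, with an unbalanced tree for brevity, so the O(log s) bounds only hold to the extent the tree stays balanced):

```python
import bisect

class Node:
    """Immutable BST node; updates copy the search path and share the rest."""
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    """Return a new root; only the nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)   # same key: replace the price

def lookup(root, key):
    while root is not None:
        if key < root.key:
            root = root.left
        elif key > root.key:
            root = root.right
        else:
            return root.value
    return None

# One root per version, plus the timestamps that produced them.
timestamps, roots = [], [None]

def update(timestamp, symbol, price):
    timestamps.append(timestamp)
    roots.append(insert(roots[-1], symbol, price))   # old roots remain usable

def price_at(timestamp, symbol):
    # Binary search for the last version created at or before the timestamp.
    v = bisect.bisect_right(timestamps, timestamp)
    return lookup(roots[v], symbol)
```

Each update allocates new nodes only along the search path, which is where the O(u * log s) space figure comes from.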

Novelocrat
+2  A: 

The persistent data structures approach presented by Novelocrat seems to be the best solution for the general case, and I suppose it will work fine in your case.

I just thought of a variation of (2). Manage a dynamic array ordered by modification timestamp. Each entry corresponds to a version and consists of an array of s items. Instead of storing a record for every stock in every version, fill them in lazily: when a version is created, only the one stock whose value changed gets a new record; the other s-1 items point to null.

When performing a search for stock S at time T, scan the versions linearly backwards, starting from the latest version at or before time T, until you find a non-null value for S. Then fix all the null pointers for S that you passed along the way, so that subsequent queries on them are efficient.

This solution provides O(1) addition time, and an amortized query time of O(log u). Complete snapshot queries take O(s+log u), which is better than implementation (4). The space is still O(u*s).

The amortized cost of queries follows from the fact that whenever we query item S at version V, the S entries of all versions <= V get fixed. Therefore, a sequence of u unique queries performs 2*u visits to the arrays (regardless of their order!), resulting in an average of 2 visits per query. So we stay with the initial lookup time of O(log u).
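A small Python sketch of this variation, as I understand it (names are mine; I use a dict per version in place of a null-filled array, which also keeps the append O(1)):

```python
import bisect

class LazyCheckpoints:
    def __init__(self):
        self.timestamps = []   # version timestamps, in time order
        self.versions = []     # one sparse "row" per version: {symbol: price}

    def update(self, timestamp, symbol, price):
        # O(1): the new version initially records only the stock that traded.
        self.timestamps.append(timestamp)
        self.versions.append({symbol: price})

    def price_at(self, timestamp, symbol):
        v = bisect.bisect_right(self.timestamps, timestamp) - 1   # latest version <= T
        scan = v
        while scan >= 0 and symbol not in self.versions[scan]:
            scan -= 1                          # walk back to the last real record
        if scan < 0:
            return None                        # no trade for this stock before T
        price = self.versions[scan][symbol]
        for i in range(scan + 1, v + 1):       # fill the gaps we just walked over
            self.versions[i][symbol] = price
        return price
```

A complete snapshot is then just one price_at call per tracked stock.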

Eyal Schneider