For a particular project, we acquire data for a number of events and collect variables about those events at the same time. After the data has been collected, we perform a user-customizable analysis on said data to determine whatever it is that the user is interested in.

The data is collected in a form similar to this:

Timestamp    Event
0            x = 0
0            y = 1
3            Event A occurred
3            x = 1
4            Event A occurred
4            x = 2
9            Event B occurred
9            y = 2
9            x = 0

To understand the entire state at any time, the most straightforward approach is to walk over the entire set of data. For example, if I start at time 0, and "analyze" until timestamp 5, I know that at that point x = 2, y = 1, and Event A has occurred twice. That's a really simple example. The user might be (and often is) interested in the time between events, say from A to B, and they might specify the first occurrence of A, then B, or the last occurrence of A, then B (respectively, 9-3 = 6 or 9-4 = 5). Like I said, this is easy to analyze when you're walking over the entire set.
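Here is a minimal Python sketch of that "walk the whole data set" approach; the record layout and the names `records` and `analyze_until` are illustrative, not taken from the actual project:

    # A minimal sketch of replaying the full log up to a given time.
    records = [
        (0, "x", 0), (0, "y", 1),
        (3, "Event A", None), (3, "x", 1),
        (4, "Event A", None), (4, "x", 2),
        (9, "Event B", None), (9, "y", 2), (9, "x", 0),
    ]

    def analyze_until(records, end_time):
        """Replay every record with timestamp <= end_time, in order."""
        state = {}          # variable name -> latest value
        event_counts = {}   # event name -> number of occurrences
        for ts, name, value in records:
            if ts > end_time:
                break       # records are assumed sorted by timestamp
            if value is None:
                event_counts[name] = event_counts.get(name, 0) + 1
            else:
                state[name] = value
        return state, event_counts

    state, counts = analyze_until(records, 5)
    print(state)    # {'x': 2, 'y': 1}
    print(counts)   # {'Event A': 2}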

Now, we need to adapt the model to analyze an arbitrary window of time. If we look at 0-N, that's the easy case. But if I look at 1-5, for instance, I have no notion of y unless I begin at 0 and know that y was initially 1 and did not change in the window 1-5.

Our approach is essentially to create a dictionary of variables and run callbacks on events. If one analysis was "What is x when Event A occurs and time is > 3", then we would run that callback on the first Event A, and it would immediately return because time is not greater than 3. It would run again at 4, and it would report that x was 1 at t=4.

To adapt to the "time-windowing", I think I am going to (in the background) tack on additional conditions to the analysis. If their analysis is just "What is x when Event A occurs", and the current window is 1-5, then I will change it to "What is x when Event A occurs and time >= 1 and time <= 5". Then if the next window is 6-10, I can readjust the condition as necessary.
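For what it's worth, here is a small sketch of that callback idea combined with the window rewriting, reusing the `records` list from the earlier sketch; the names `with_window` and `run` are made up for illustration:

    def base_condition(ts, state, event):
        # The user's analysis: "What is x when Event A occurs and time > 3"
        return event == "Event A" and ts > 3

    def with_window(condition, start, end):
        """Tack the current window's bounds onto a user condition."""
        return lambda ts, state, event: (start <= ts <= end
                                         and condition(ts, state, event))

    def run(records, condition):
        state, hits = {}, []
        for ts, name, value in records:
            if value is None:                      # an event record
                if condition(ts, state, name):
                    hits.append((ts, state.get("x")))
            else:                                  # a variable update
                state[name] = value
        return hits

    # Window 1-5: effectively "... and time >= 1 and time <= 5".
    print(run(records, with_window(base_condition, 1, 5)))   # [(4, 1)]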

My main question is: what pattern does this fit? We are obviously not the first people to approach a problem like this, but I have not been able to find how others have approached it. I probably just don't know exactly what to search for on Google. Is there any other approach besides keeping a dictionary of the entire global state for looking up a single state at a given time? Note also that the data could have several thousand, maybe tens of thousands, of records, so the fewer iterations over the data set, the better.

+1  A: 

You can search by time in log(n), and you have a feel for how many updates are acceptable... hence here's my solution:

Pick a number, N, of updates that are acceptable in order to return a result. 256 might be good, given the scales you've mentioned so far.

Every N records, commit an entry of all state to a dictionary, with a timestamp.

Now you have a tradeoff: dictionary size against speed. N -> infinity is regular searching over the whole data set; N = 1 is your current solution; any N in between requires less memory but is slower.

Your implementation is now (for time X): a log(n) search of the subsampled global dictionary for the snapshot timestamped at or before X (call that timestamp Y), a log(n) search of the event list to timestamp Y, and then fewer than N updates to replay.
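A rough Python sketch of that lookup, with the names and the tiny N chosen purely for illustration; storing the record index alongside each snapshot stands in for the second log(n) search of the event list:

    import bisect

    records = [(0, "x", 0), (0, "y", 1), (3, "Event A", None), (3, "x", 1),
               (4, "Event A", None), (4, "x", 2), (9, "Event B", None),
               (9, "y", 2), (9, "x", 0)]

    N = 4   # snapshot interval; in practice a power of two such as 256

    def build_snapshots(records):
        """Commit a copy of the full state every N records."""
        snapshots = [(0, 0, {})]   # (timestamp, record index, state copy)
        state = {}
        for i, (ts, name, value) in enumerate(records, start=1):
            if value is not None:
                state[name] = value
            if i % N == 0:
                snapshots.append((ts, i, dict(state)))
        return snapshots

    def state_at(records, snapshots, when):
        """log(n) search for the last snapshot at or before `when`,
        then replay at most N records forward."""
        keys = [ts for ts, _, _ in snapshots]
        idx = bisect.bisect_right(keys, when) - 1
        _, start, snapshot = snapshots[idx]
        state = dict(snapshot)
        for ts, name, value in records[start:]:
            if ts > when:
                break
            if value is not None:
                state[name] = value
        return state

    snapshots = build_snapshots(records)
    print(state_at(records, snapshots, 5))   # {'x': 2, 'y': 1}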

Picking N as a power of two even lets you use shift tricks to do the rounded-down integer divide nice and fast.

Dave Gamble
+3  A: 

I think your best approach here would be to take periodic "snapshots" of the full state data, say every 1000 samples, along with recording the deltas. When you store your data as offsets from some original value (i.e. deltas), you have no choice but to reconstruct the full data starting from those original values. Storing periodic snapshots lessens the amount of reconstruction you have to do - the design tradeoff is between low storage requirements but long reconstruction time on the one hand, and higher storage requirements but shorter reconstruction time on the other.

MPEGs, for example, store each frame as the differences between the current frame and the previous frame. Ordinarily, this would force an MPEG to be viewed from the beginning, but the format also periodically stores full frames so that the decoder doesn't have to backtrack all the way to the beginning of the file.

MusiGenesis
Specifically long-GOP MPEG (long "group of pictures").
Dave Gamble
Now that you mention it, I think that's similar to what Flash movies do with "keyframes" IIRC. Good example.
Mark Rushakoff