tags:

views:

100

answers:

5

I have a large sequence of tuples on disk in the form (t1, k1) (t2, k2) ... (tn, kn)

ti is a monotonically increasing timestamp and ki is a key (assume a fixed length string if needed). Neither ti nor ki are guaranteed to be unique. However, the number of unique tis and kis is huge (millions). n itself is very large (100 Million+) and the size of k (approx 500 bytes) makes it impossible to store everything in memory.

I would like to find out periodic occurrences of keys in this sequence.

For example, if I have the sequence (1, a) (2, b) (3, c) (4, b) (5, a) (6, b) (7, d) (8, b) (9, a) (10, b)

The algorithm should emit (a, 4) and (b, 2). That is a occurs with a period of 4 and b occurs with a period of 2.

If I build a hash of all keys and store the average of the difference between consecutive timestamps of each key and a std deviation of the same, I might be able to make a pass, and report only the ones that have an acceptable std deviation(ideally, 0). However, it requires one bucket per unique key, whereas in practice, I might have very few really periodic patterns. Any better ways?

+2  A: 

This is more or less the reason for which Fourier transforms (Fast Fourier Transforms, etc.) were invented.

You are essentially transforming a sequence from the time (or some similar dimension) domain to a frequency domain. This is a very old problem, predating the application of computers, and there is an immense body of theory on the subject. Also see discrete fourier transform.

EDIT: You would have to transform your values k1, k2, ... somehow, but assuming that that's feasible, this approach should be, too.

Rob Lachlan
Note that the data are not necessarily uniformly sampled (we only know that the timestamps are monotonically increasing), so common techniques such as FFT may not be applicable here.
Paul R
For data that in non-uniform on the time axis, you can bin them and say average the values in the bins then FFT on the binned data. Unfortunately it looks like his K's are discrete values, not a normal varying signal.
phkahler
+3  A: 
Norman Ramsey
+1, Yup, I like it.
Rob Lachlan
Same comment as for Rob - if the data are not uniformly sampled then a lot of conventional discrete DSP techniques are off the table.
Paul R
A: 

If I build a hash of all keys and store the average of the difference between consecutive timestamps of each key and a std deviation of the same, I might be able to make a pass, and report only the ones that have an acceptable std deviation(ideally, 0). However, it requires one bucket per unique key, whereas in practice, I might have very few really periodic patterns. Any better ways?

Personally, I think this is probably the best you're going to get unless you can identify more structure to the problem.

andand
A: 

Let's label a (timestamp,string) tuple as (key,value). Some constraints: 1. There is a discrete set of values, i.e. the match between periodic appearances of these values is exact: aaabb ... aaabb, not aaabb ... aaabc. 2. The set of all instances of a value can be fitted into memory.

Algorithm: 1. Get a complete list of all unique values 2. For each unique value, get all tuples, producing an ordered list of timestamps. 3. Apply an algorithm to look for patterns in this data. Ideally a non-uniform discrete Fourier transform, or autocorrelation.

Phil H
A: 

You really have two separate problems:

  1. you have m different signals in your data, defined by the m unique keys. You need to separate each signal and store is separately.

  2. given one of these unique signals, you must determine if it is periodic, this is an application of autocorrelation or the Discrete Fourier Transform, whichever you prefer. For example, the DFT gives you the coefficients of an interpolation periodic functions of your data. If only one coefficient in the DFT is non-zero, there is a clear period.

If you apply the DFT or autocorrelation to the data without separating signals you will get a compounded problem where you won't know if one of the "periodic" signals found is made up of one unique signal or several.

ldog