Update

Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):

Obvious

  • Count
  • Sum
  • Mean
  • Max*
  • Min*
  • Median**

Less Obvious

  • Variance
  • Standard Deviation
  • Skewness
  • Kurtosis
  • Mode***
  • Weighted Average
  • Weighted Moving Average****

OK, so to put it more accurately: these are not "all" of the statistics I'm aware of; they're just the ones I can remember off the top of my head right now. (For the variance family, a sketch of the maintenance trick follows the footnotes below.)

*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.

**Recalculated in O(1) for a sorted, indexed collection only.

***Requires a fairly complex data structure to recalculate in O(1).

****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
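
As a concrete illustration of the "less obvious" entries, here is a minimal sketch (my own code, nothing official) of maintaining the variance family with running power sums; the class and member names are invented. Skewness and kurtosis work the same way using the third and fourth power sums. Note that the naive power-sum formula can lose precision when values are large relative to their spread, so Welford-style updates are safer in practice:

using System;

class RollingStats {
    private int n;             // count
    private double sum;        // running sum of values
    private double sumSq;      // running sum of squared values

    public void Add(double value) {
        n++;
        sum += value;
        sumSq += value * value;
    }

    public void Remove(double value) {
        // Assumes the value is actually present in the collection.
        n--;
        sum -= value;
        sumSq -= value * value;
    }

    public double Mean { get { return sum / n; } }

    // Population variance: E[x^2] - (E[x])^2
    public double Variance { get { return sumSq / n - Mean * Mean; } }

    public double StdDev { get { return Math.Sqrt(Variance); } }
}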


Original Question

Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...

Option 1: Iterate through the collection, adding all the values:

double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];

Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:

void Add(double value) {
    values.Add(value);
    sum += value;
}

void Remove(double value) {
    // Only adjust the sum if the value was actually present.
    if (values.Remove(value))
        sum -= value;
}

EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:

Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I ask, "What's the sum?", you'd have to think, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said "11" you thought "11," when I said "16" you thought "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," and you've been keeping track of the sum in your head, you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.

So my question is, what other calculations, other than the obvious ones like sum and average, are like this?


SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
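
To make that concrete, here is a purely hypothetical sketch (all names are my own invention, and it assumes positive integers, as in the example above) of trying to maintain such a count alongside the collection; notice that any change to the min, on addition or removal, forces an O(n) pass:

using System.Collections.Generic;
using System.Linq;

class DivisibleByMinCounter {
    private readonly List<int> values = new List<int>();
    private int min = int.MaxValue;
    private int divisibleByMin;

    public void Add(int value) {
        values.Add(value);
        if (value < min) {
            // A new min invalidates the old count: O(n) rescan even on Add.
            min = value;
            divisibleByMin = values.Count(v => v % min == 0);
        } else if (value % min == 0) {
            divisibleByMin++;  // the common O(1) case
        }
    }

    public void Remove(int value) {
        values.Remove(value);
        if (value == min && !values.Contains(min)) {
            // The last copy of the min is gone; only a full pass reveals
            // the new min and, with it, the new divisibility count.
            min = values.Count > 0 ? values.Min() : int.MaxValue;
            divisibleByMin = values.Count(v => v % min == 0);
        } else if (value % min == 0) {
            divisibleByMin--;
        }
    }
}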

So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones?

What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I'm after have certainly seen all the data; they just don't need to iterate over it again every time the collection changes.

+14  A: 

First, the term that you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that the median and mode cannot be calculated online.

Jason
That is very good to know about; thanks for the link. I think we are talking about *slightly* different things, though; it looks like an online algorithm is one that can essentially perform its calculation while receiving the data. The scenario I'm concerned with (and I might not have been all that clear on this) is one where the entirety of the data *has* been received, but I want to know which calculated values can be readily available at any point without iterating through all the data (which has already been processed in some way).
Dan Tao
You can store any and all statistical information you want if you can process it first.
tster
@tster: You are thinking of a static set of data. There is certain statistical information that becomes invalid once the data changes; to retrieve it again, you must iterate through the data. As a trivial example, consider max/min for unsorted data: once the current max is removed, the data must be iterated through again to find the new max, and likewise for the min.
Dan Tao
+3  A: 

To consistently maintain the high/low, you store your data in sorted order. There are algorithms for maintaining data structures that preserve ordering.

Median is trivial if the data is ordered.

If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
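
A rough sketch of the frequency-table idea might look like the following (the class and names are mine, not S.Lott's; the O(n) fallback on removal is precisely why true O(1) mode maintenance needs a fancier structure):

using System.Collections.Generic;
using System.Linq;

class ModeTracker {
    private readonly Dictionary<double, int> freq = new Dictionary<double, int>();
    public double Mode { get; private set; }

    public void Add(double value) {
        int count;
        freq.TryGetValue(value, out count);
        freq[value] = count + 1;
        if (freq.Count == 1 || freq[value] > freq[Mode])
            Mode = value;
    }

    public void Remove(double value) {
        int count;
        if (!freq.TryGetValue(value, out count)) return;
        if (count == 1) freq.Remove(value);
        else freq[value] = count - 1;
        // Decrementing the current mode may dethrone it: O(n) fallback.
        if (value == Mode && freq.Count > 0)
            Mode = freq.OrderByDescending(kv => kv.Value).First().Key;
    }
}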

S.Lott
This is a great suggestion, but there is a trade-off of sorts: if you keep the data sorted, it becomes more difficult to keep track of the order in which it was added. (You can still do this, but it's suddenly a lot more complicated to make, for example, a rolling set.)
Dan Tao
@Dan: That's the point. The blanket "which stats can I maintain?" requires a specific, detailed list of the transactions that will be supported. You gave no such list, so it's a random mix of update transactions and statistical summaries that can be held invariant.
S.Lott
@S. Lott: I didn't mean to suggest that I wasn't satisfied with this answer. Of course for some statistics there will be trade-offs; you presented a scenario in which maintaining the high/low is indeed possible, which I (embarrassed to admit this) hadn't even considered -- probably because it is a different scenario from the one I am presently working on. It's still a great answer. Anyway, I'm certainly not under any illusions that *all* stats which can be maintained can be available under all circumstances. Conditional answers are fine.
Dan Tao
+1  A: 

As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.

ire_and_curses
Except he also wants a Remove operation, which rules out online algorithms like min and max.
xan
+1  A: 

Not really a direct answer to your question, but for many statistics that are not online statistics, you can usually find rules that let you recalculate by iteration only part of the time, serving a cached value the rest of the time. Is this possibly good enough for you?

For the high value, for example:

private const double Tolerance = 1e-9;  // stand-in for the original WithinTolerance check

public void Add(double value) {
    values.Add(value);
    if (value > highValue)
        highValue = value;
}

public void Remove(double value) {
    if (!values.Remove(value))
        return;  // value wasn't present; nothing to update
    // Rescan only when we may have just removed the current max.
    if (Math.Abs(value - highValue) <= Tolerance)
        highValue = RecalculateHighValueByIteration();
}

private double RecalculateHighValueByIteration() {
    double high = double.NegativeInfinity;
    foreach (double v in values)
        if (v > high) high = v;
    return high;
}
John at CashCommons
John, I do think this is a good approach; in fact, it's the one I use. Strangely enough, I hadn't even really thought about S.Lott's idea... but as he said, if the list is random, it's not really applicable (in which case I think your idea is probably the best).
Dan Tao
+1  A: 

It's not possible to maintain high or low with constant-time add and remove operations because that would give you a linear-time sorting algorithm. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.

And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
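
For what it's worth, here is a hedged sketch of the search-tree idea in .NET terms (my own illustration, not jk's code): SortedDictionary, a balanced tree, acting as a multiset with O(log n) Add/Remove.

using System.Collections.Generic;
using System.Linq;

class MinMaxMultiset {
    private readonly SortedDictionary<double, int> counts =
        new SortedDictionary<double, int>();

    public void Add(double value) {
        int c;
        counts.TryGetValue(value, out c);
        counts[value] = c + 1;
    }

    public void Remove(double value) {
        int c;
        if (!counts.TryGetValue(value, out c)) return;
        if (c == 1) counts.Remove(value);
        else counts[value] = c - 1;
    }

    // First key in sorted order: cheap to reach.
    public double Min { get { return counts.Keys.First(); } }

    // Caveat: SortedDictionary exposes no efficient max accessor;
    // Keys.Last() enumerates the whole tree, so a production version
    // would pair two structures (or use a priority queue, as above).
    public double Max { get { return counts.Keys.Last(); } }
}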

jk
+1  A: 

The answers to this question on online algorithms might be useful. Regarding their usability for your needs, I'd say that while some online algorithms can be used for estimating summary statistics from partial data, others can be used to maintain them from a data flow, just as you'd like.

You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.

Ville Koskinen
A: 

If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into the techniques used in Streaming Algorithms.

PeterAllenWebb
A: 

It does sound (even after your 2nd edit) like you are describing online algorithms, with the additional requirement that you want to allow "delete" operations. An example of this is the family of "sketch algorithms" used for finding frequent items in a stream.
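
For concreteness, here is a toy Count-Min sketch in the "turnstile" model, where a removal is just a negative update (a hedged sketch of the general idea, written by me rather than taken from any particular paper); estimates come back as approximate upper bounds on true frequencies, never exact counts:

using System;

class CountMinSketch {
    private readonly int width, depth;
    private readonly long[,] table;
    private readonly int[] seeds;

    public CountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        table = new long[depth, width];
        seeds = new int[depth];
        var rng = new Random(42);  // fixed seed so the hashes stay stable
        for (int i = 0; i < depth; i++) seeds[i] = rng.Next();
    }

    private int Bucket(int row, int item) {
        unchecked {
            int h = (item ^ seeds[row]) * (int)0x9E3779B1;  // cheap hash mix
            return (int)((uint)h % (uint)width);
        }
    }

    public void Add(int item)    { Update(item, +1); }
    public void Remove(int item) { Update(item, -1); }

    private void Update(int item, long delta) {
        for (int row = 0; row < depth; row++)
            table[row, Bucket(row, item)] += delta;
    }

    // The smallest counter across rows; an overestimate of the true
    // count as long as no item's net count ever goes negative.
    public long Estimate(int item) {
        long best = long.MaxValue;
        for (int row = 0; row < depth; row++)
            best = Math.Min(best, table[row, Bucket(row, item)]);
        return best;
    }
}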

Jouni K. Seppänen