ansaurus

Question

What statistics can be maintained for a set of numerical data without iterating?

Answer 1

+14 A:

First, the term that you want here is online algorithm. The mean, variance (and hence standard deviation), skew, and other statistics built from moments can all be calculated online. So can the minimum and maximum, as long as values are only ever added. Note that the median and mode cannot be calculated online.
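As a rough illustration of what "online" means for the mean and variance, here is a minimal sketch based on Welford's update. The class name `RunningStats` and its members are made up for this example, and the `Remove` method is an assumption that goes beyond the answer: it simply reverses the update and trusts the caller to remove only values that were added earlier.

```csharp
using System;

public class RunningStats
{
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    public long Count => count;
    public double Mean => count > 0 ? mean : double.NaN;
    public double PopulationVariance => count > 0 ? m2 / count : double.NaN;
    public double SampleVariance => count > 1 ? m2 / (count - 1) : double.NaN;

    // O(1) update when a value arrives.
    public void Add(double x)
    {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    // O(1) reversal of Add. The caller must only remove values that were
    // previously added; this is not checked.
    public void Remove(double x)
    {
        if (count == 0) throw new InvalidOperationException("No values to remove.");
        if (count == 1) { count = 0; mean = 0; m2 = 0; return; }
        double meanWithoutX = (count * mean - x) / (count - 1);
        m2 -= (x - mean) * (x - meanWithoutX);
        mean = meanWithoutX;
        count--;
    }
}
```

Repeated removals can accumulate floating-point error, so a long-lived instance may need an occasional full recalculation.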

Jason 2009-10-15 18:56:19

That is very good to know about; thanks for the link. I think we are talking about *slightly* different things, though; it looks like an online algorithm is one that can basically be doing a calculation while receiving the data. The scenario I'm concerned with (and I might not have been all that clear on this) is where the entirety of the data *has* been received; but I want to know what calculated values can be readily available at any point without iterating through all the data (which has already been processed in some way).

Dan Tao 2009-10-15 19:04:28

You can store any and all statistical information you want if you can process it first.

tster 2009-10-15 19:17:59

@tster: You are thinking of a static set of data. There is certain statistical information that becomes invalid once the data changes and to retrieve it again you must iterate through the data to find it. As a trivial example, consider max/min for unsorted data: once the current max is removed, the data must be iterated through again to find whatever the new max is, and likewise for the min.

Dan Tao 2009-10-15 19:25:14

Answer 2

+3 A:

To consistently maintain the high/low, you store your data in sorted order. There are algorithms for maintaining data structures that preserve ordering.

Median is trivial if the data is ordered.

If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
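One way the frequency-table idea could look is sketched below. The `ModeTracker` class and the extra count-to-values index are assumptions of this example, not something the answer spells out; the index is what lets the mode survive removals without rescanning the table. When several values tie for the highest count, an arbitrary one of them is returned.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class ModeTracker
{
    // value -> how many times it is currently present
    private readonly Dictionary<double, int> counts = new Dictionary<double, int>();
    // count -> set of values that currently have exactly that count
    private readonly Dictionary<int, HashSet<double>> byCount = new Dictionary<int, HashSet<double>>();
    private int maxCount = 0;

    public void Add(double value)
    {
        counts.TryGetValue(value, out int c); // c is 0 when the value is new
        Move(value, c, c + 1);
        if (c + 1 > maxCount) maxCount = c + 1;
    }

    public bool Remove(double value)
    {
        if (!counts.TryGetValue(value, out int c)) return false;
        Move(value, c, c - 1);
        // If the top bucket just emptied, the removed value now sits one bucket lower.
        if (c == maxCount && !byCount.ContainsKey(c)) maxCount = c - 1;
        return true;
    }

    public double Mode =>
        maxCount == 0 ? throw new InvalidOperationException("No values.") : byCount[maxCount].First();

    private void Move(double value, int from, int to)
    {
        if (from > 0)
        {
            byCount[from].Remove(value);
            if (byCount[from].Count == 0) byCount.Remove(from);
        }
        if (to > 0)
        {
            if (!byCount.TryGetValue(to, out var set)) byCount[to] = set = new HashSet<double>();
            set.Add(value);
            counts[value] = to;
        }
        else
        {
            counts.Remove(value);
        }
    }
}
```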

S.Lott 2009-10-15 18:57:35

This is a great suggestion, but there is a trade-off of sorts: if you keep the data sorted, it becomes more difficult to keep track of the order in which it was added. (You can still do this, but it's suddenly a lot more complicated to make, for example, a rolling set.)

Dan Tao 2009-10-15 19:18:59

@Dan: That's the point. The blanket "which stats can I maintain" requires a specific, detailed list of the transactions that will be supported. You gave no such list, so it's a random mix of update transactions and statistical summaries that can be held invariant.

S.Lott 2009-10-15 19:54:07

@S. Lott: I didn't mean to suggest that I wasn't satisfied with this answer. Of course for some statistics there will be trade-offs; you presented a scenario in which maintaining the high/low is indeed possible, which I (embarrassed to admit this) hadn't even considered -- probably because it is a different scenario from the one I am presently working on. It's still a great answer. Anyway, I'm certainly not under any illusions that *all* stats which can be maintained can be available under all circumstances. Conditional answers are fine.

Dan Tao 2009-10-15 23:05:45

Answer 3

+1 A:

As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.

ire_and_curses 2009-10-15 19:02:27

Except he also wants a Remove operation, which rules out online algorithms like min and max.

xan 2009-10-21 16:57:19

Answer 4

+1 A:

Not really a direct answer to your question, but for many statistics that are not online statistics you can usually find some rules to calculate by iteration only part of the time, and cache the correct value the rest of the time. Is this possibly good enough for you?

For high value for example:

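// Assumes values is a List<double>, highValue starts at double.NegativeInfinity,
// and WithinTolerance / RecalculateHighValueByIteration are helpers defined elsewhere.
public void Add(double value) {
    values.Add(value);
    // A new value can only raise the high, so no iteration is needed.
    if (value > highValue)
        highValue = value;
}

public void Remove(double value) {
    // Nothing changes if the value was not in the collection.
    if (!values.Remove(value))
        return;
    // Iterate only when the removed value was the cached high.
    if (value.WithinTolerance(highValue))
        highValue = RecalculateHighValueByIteration();
}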

John at CashCommons 2009-10-15 19:04:24

John, I do think this is a good approach, and it is in fact the one I use. Strangely enough, I had not even really thought about S.Lott's idea... but as he said, if the list is random, it's not really applicable (in which case I think your idea is probably the best).

Dan Tao 2009-10-15 19:17:15

Answer 5

+1 A:

It's not possible to maintain the high or low with constant-time add and remove operations, because that would give you a linear-time sorting algorithm (add every value, then repeatedly read and remove the high), which is impossible for comparison-based sorting. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.
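A minimal sketch of the search-tree approach for the high and low only, assuming .NET's `SortedSet<double>` as the balanced tree and a separate dictionary to allow duplicate values. The `SortedMultiset` name is hypothetical. The median is left out because the standard `SortedSet<T>` does not expose subtree sizes; that part would need a custom order-statistic tree.

```csharp
using System;
using System.Collections.Generic;

public class SortedMultiset
{
    // Distinct values kept in sorted order by a balanced search tree.
    private readonly SortedSet<double> distinct = new SortedSet<double>();
    // Number of copies of each distinct value.
    private readonly Dictionary<double, int> counts = new Dictionary<double, int>();

    public void Add(double value)
    {
        if (counts.TryGetValue(value, out int c))
        {
            counts[value] = c + 1;
        }
        else
        {
            counts[value] = 1;
            distinct.Add(value); // logarithmic-time insert
        }
    }

    public bool Remove(double value)
    {
        if (!counts.TryGetValue(value, out int c)) return false;
        if (c > 1)
        {
            counts[value] = c - 1;
        }
        else
        {
            counts.Remove(value);
            distinct.Remove(value); // logarithmic-time delete
        }
        return true;
    }

    public double Min =>
        distinct.Count > 0 ? distinct.Min : throw new InvalidOperationException("No values.");

    public double Max =>
        distinct.Count > 0 ? distinct.Max : throw new InvalidOperationException("No values.");
}
```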

And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
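The priority-queue idea might be sketched as follows, assuming .NET 6 or later for `PriorityQueue<TElement, TPriority>`. Since that heap has no efficient removal of an arbitrary element, this example uses lazy deletion: removals are recorded and only applied when the stale value reaches the top. Lazy deletion and the `LazyMaxHeap` name are choices made for this sketch, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;

public class LazyMaxHeap
{
    // PriorityQueue dequeues the smallest priority first, so the comparer
    // is reversed to make the largest value come out on top.
    private readonly PriorityQueue<double, double> heap =
        new PriorityQueue<double, double>(Comparer<double>.Create((a, b) => b.CompareTo(a)));
    // Copies of each value that are logically present.
    private readonly Dictionary<double, int> live = new Dictionary<double, int>();
    // Copies of each value removed logically but still sitting in the heap.
    private readonly Dictionary<double, int> pending = new Dictionary<double, int>();

    public void Add(double value)
    {
        heap.Enqueue(value, value);
        live.TryGetValue(value, out int c);
        live[value] = c + 1;
    }

    public bool Remove(double value)
    {
        if (!live.TryGetValue(value, out int c) || c == 0) return false;
        live[value] = c - 1;
        pending.TryGetValue(value, out int p);
        pending[value] = p + 1;
        return true;
    }

    public double Max
    {
        get
        {
            // Discard entries at the top that were removed earlier.
            while (heap.Count > 0
                   && pending.TryGetValue(heap.Peek(), out int p) && p > 0)
            {
                double stale = heap.Dequeue();
                pending[stale] = p - 1;
            }
            if (heap.Count == 0) throw new InvalidOperationException("No values.");
            return heap.Peek();
        }
    }
}
```

The trade-off is that removed values stay in memory until they surface at the top of the heap.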

jk 2009-10-15 19:04:47

Answer 6

+1 A:

The answers to this question on online algorithms might be useful. Regarding the usability for your needs, I'd say that while some online algorithms can be used for estimating summary statistics with partial data, others may be used to maintain them from a data flow just as you like.

You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.

Ville Koskinen 2009-10-15 19:17:51

Answer 7

A:

If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into techniques used in Streaming Algorithms.

PeterAllenWebb 2009-10-16 01:48:50

Answer 8

A:

It does sound (even after your 2nd edit) like you are describing on-line algorithms, with the additional requirement that you want to allow "delete" operations. Examples of this are the "sketch algorithms" used for finding frequent items in a stream.
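The answer does not name a particular sketch. As one possible illustration, a Count-Min sketch keeps approximate per-value counts in fixed memory and handles deletes by decrementing. The sketch below is an assumption-laden example: the `CountMinSketch` class, the default sizes, and the use of `HashCode.Combine` as the hash family are all choices made here, not part of the answer.

```csharp
using System;

public class CountMinSketch
{
    private readonly long[,] table;
    private readonly int depth;
    private readonly int width;

    public CountMinSketch(int depth = 4, int width = 1024)
    {
        this.depth = depth;
        this.width = width;
        table = new long[depth, width];
    }

    // One hash function per row, derived by mixing the row index into the hash.
    private int Column(int row, double value) =>
        (int)(unchecked((uint)HashCode.Combine(row, value)) % (uint)width);

    public void Add(double value) => Update(value, +1);

    // Only valid for values that were added before; not checked.
    public void Remove(double value) => Update(value, -1);

    private void Update(double value, long delta)
    {
        for (int row = 0; row < depth; row++)
            table[row, Column(row, value)] += delta;
    }

    // Never underestimates the true count; collisions can only inflate it.
    public long Estimate(double value)
    {
        long best = long.MaxValue;
        for (int row = 0; row < depth; row++)
            best = Math.Min(best, table[row, Column(row, value)]);
        return best;
    }
}
```

The estimate is an upper bound on the true count, so it suits "is this value frequent?" checks rather than exact statistics.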

Jouni K. Seppänen 2009-10-16 12:15:04
