Incremental median computation with max memory efficiency

views:

176

answers:

+8 Q:

Incremental median computation with max memory efficiency

I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.

If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?

Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.

i think, in order to find median you need to sort your data first. the best sorting algorims are O(n log n).

not sure is there any other way to find median number?

mohammad shamsi 2010-07-30 13:52:04

Finding the median is an O(n) operation. The algorithm is related to the quicksort partition algorithm and is somewhat tricky. Google is your friend.

deinst 2010-07-30 13:58:20

@mohammad: I don't care about how to compute the median, nor the time complexity really.

Mau 2010-07-30 14:07:54

+1 A:

You can

Use statistics, if that's acceptable - for example, you could use sampling.
Use knowledge about your number stream
- using a counting sort like approach: k distinct values means storing O(k) memory)
- or toss out known outliers and keep a (high,low) counter.
- If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).

Stephen 2010-07-30 13:53:35

+3 A:

You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could be the median. It is probably simplest to just store the points and find the median. If saving ceil(n/2) points is of value, then read in the first n/2 points into a sorted list (a binary tree is probably best), then as new points are added throw out the low or high points and keep track of the number of points on either end thrown out.

Edit:

If the stream length is unknown, then obviously, as Stephen observed in the comments, then we have no choice but to remember everything. If duplicate items are likely, we could possibly save a bit of memory using Dolphins idea of storing values and counts.

deinst 2010-07-30 13:57:08

No, I do not think so. With this n = 13, and we only need to store at most 7. I'm not sure what your n is. With this stream we read in the first 7, and then throw out zeros as we read the 2's. I really do not understand your objection.

deinst 2010-07-30 14:05:24

OK, I read the question as a stream of unknown length, but now I realize that wasn't stated... Either way `13/2==6` for me :) Anyways, this is a true observation. Unfortunately, I can't reverse the -1, because I didn't do it. And `n/2` is still `O(n)` :)

Stephen 2010-07-30 14:10:58

I edited the text to change it to ceil. Thanks.

deinst 2010-07-30 14:13:02

deinst: could you please help me to know how you are going to find median for this list with saveing first n/2 points: 0,3,2,1,5,6,8,7,4

mohammad shamsi 2010-07-30 14:16:34

Keep at most 5 points, because ceil(9/2)==5: `[0], [0,3], [0,2,3], [0,1,2,3], [0,1,2,3,5], (1)[1,2,3,5,6], (2)[2,3,5,6,8], (3)[3,5,6,7,8], (3)[3,4,5,6,7](1)`. 5th item is 4. (0,1,2,3,4,5,6,7,8) -> middle item is 4.

Stephen 2010-07-30 14:21:54

Thanks Stephen. that is less muddled than mine was.

deinst 2010-07-30 14:24:15

If there is a good chance that an average sample is going to be repeated more than 2 times, you could save some memory by storing the number and a count, basically Run Length Encoding the values you are keeping.

Dolphin 2010-07-30 16:06:56

@deinst, you are assuming prior knowledge of the number of samples that are going to be in the stream. Without prior knowledge of the number of samples, you have to store all the samples before computing the median because anytime you throw out a sample, there exists a stream that would make that discarded sample the median.

SargeATM 2010-07-30 15:08:19

@sargeatm: Use comments. And yes, the assumption is pretty clear, and it's ok.

Mau 2010-07-30 15:33:27

@Mau: Don't think we can be surprised this happens when there's a minimum rep required for leaving comments.

Mark Peters 2010-07-30 15:42:11

@mark-peters: ouch i had forgotten :(

Mau 2010-07-30 15:55:59

If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.

Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored number gets to (say) 12,000 you could discard the highest 1000 and lowest 1000, dropping storage back to 10,000.

If the distribution of values is fairly consistent, this would work well. However if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically if you discard a "high" value that is less than the (eventual) median or a "low" value that is equal or greater than the (eventual) median then your calculation is off.

Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.

Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.

However, lets say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
With a median of 6 which is incorrect.

If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.

Michael J 2010-07-30 15:54:03

ansaurus

tags:

views:

answers:

Incremental median computation with max memory efficiency

related questions