views:

176

answers:

5

I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.

If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?

Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.

A: 

i think, in order to find median you need to sort your data first. the best sorting algorims are O(n log n).

not sure is there any other way to find median number?

mohammad shamsi
Finding the median is an O(n) operation. The algorithm is related to the quicksort partition algorithm and is somewhat tricky. Google is your friend.
deinst
@mohammad: I don't care about how to compute the median, nor the time complexity really.
Mau
+1  A: 

You can

  • Use statistics, if that's acceptable - for example, you could use sampling.
  • Use knowledge about your number stream
    • using a counting sort like approach: k distinct values means storing O(k) memory)
    • or toss out known outliers and keep a (high,low) counter.
    • If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).
Stephen
+3  A: 

You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could be the median. It is probably simplest to just store the points and find the median. If saving ceil(n/2) points is of value, then read in the first n/2 points into a sorted list (a binary tree is probably best), then as new points are added throw out the low or high points and keep track of the number of points on either end thrown out.

Edit:

If the stream length is unknown, then obviously, as Stephen observed in the comments, then we have no choice but to remember everything. If duplicate items are likely, we could possibly save a bit of memory using Dolphins idea of storing values and counts.

deinst
No, I do not think so. With this n = 13, and we only need to store at most 7. I'm not sure what your n is. With this stream we read in the first 7, and then throw out zeros as we read the 2's. I really do not understand your objection.
deinst
OK, I read the question as a stream of unknown length, but now I realize that wasn't stated... Either way `13/2==6` for me :) Anyways, this is a true observation. Unfortunately, I can't reverse the -1, because I didn't do it. And `n/2` is still `O(n)` :)
Stephen
I edited the text to change it to ceil. Thanks.
deinst
deinst: could you please help me to know how you are going to find median for this list with saveing first n/2 points: 0,3,2,1,5,6,8,7,4
mohammad shamsi
Keep at most 5 points, because ceil(9/2)==5: `[0], [0,3], [0,2,3], [0,1,2,3], [0,1,2,3,5], (1)[1,2,3,5,6], (2)[2,3,5,6,8], (3)[3,5,6,7,8], (3)[3,4,5,6,7](1)`. 5th item is 4. (0,1,2,3,4,5,6,7,8) -> middle item is 4.
Stephen
Thanks Stephen. that is less muddled than mine was.
deinst
If there is a good chance that an average sample is going to be repeated more than 2 times, you could save some memory by storing the number and a count, basically Run Length Encoding the values you are keeping.
Dolphin
A: 

@deinst, you are assuming prior knowledge of the number of samples that are going to be in the stream. Without prior knowledge of the number of samples, you have to store all the samples before computing the median because anytime you throw out a sample, there exists a stream that would make that discarded sample the median.

SargeATM
@sargeatm: Use comments. And yes, the assumption is pretty clear, and it's ok.
Mau
@Mau: Don't think we can be surprised this happens when there's a minimum rep required for leaving comments.
Mark Peters
@mark-peters: ouch i had forgotten :(
Mau
A: 

If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.

Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored number gets to (say) 12,000 you could discard the highest 1000 and lowest 1000, dropping storage back to 10,000.

If the distribution of values is fairly consistent, this would work well. However if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically if you discard a "high" value that is less than the (eventual) median or a "low" value that is equal or greater than the (eventual) median then your calculation is off.

Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.

Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.

However, lets say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
With a median of 6 which is incorrect.

If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.

Michael J