tags:
views: 1116
answers: 5

I came across an interesting algorithm question in an interview. I gave my answer, but I'm not sure whether there is a better idea, so I welcome everyone to share their thoughts.

You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and that they are distinct (by the definition of a set, we don't consider two elements with the same value).

Every time a new element is added to the set, the set's median is asked for. The median is defined as in mathematics: the middle element of a sorted list. Here, specially, when the size of the set is even, say size = 2*x, the median is the x-th element of the set.

An example: Start with an empty set, when 12 is added, the median is 12, when 7 is added, the median is 7, when 8 is added, the median is 8, when 11 is added, the median is 8, when 5 is added, the median is 8, when 16 is added, the median is 8, ...

Notice that, first, elements are added to the set one by one and, second, we don't know in advance which elements will be added.

My answer.

Since it is a question about finding the median, some ordering is needed. The easiest solution is to keep a normal array sorted. When a new element comes, use binary search to find its position (O(log n)) and insert it there. Since it is a plain array, the rest of the array must be shifted, which takes O(n) time. Once the element is inserted, the median can be read off in constant time.

The WORST-case time complexity is: O(log n + n + 1) = O(n).
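A minimal sketch of this sorted-array approach in Python (using the standard bisect module for the binary-search insert; the function name is mine, not from the question):

```python
import bisect

def running_medians(stream):
    """Keep a sorted list; insert each element and report the median."""
    sorted_vals = []
    medians = []
    for x in stream:
        bisect.insort(sorted_vals, x)  # O(log n) search + O(n) shift
        n = len(sorted_vals)
        # Question's convention: for even n = 2*x the median is the
        # x-th element (1-based), i.e. index n//2 - 1; for odd n, index n//2.
        idx = n // 2 if n % 2 == 1 else n // 2 - 1
        medians.append(sorted_vals[idx])
    return medians
```

Running it on the example stream 12, 7, 8, 11, 5, 16 reproduces the medians 12, 7, 8, 8, 8, 8 from the question.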

Another solution is to use a linked list. The point of a linked list is to remove the need to shift the array, but finding the location for the new element then requires a linear search. Inserting the element takes constant time, and afterwards we find the median by walking through half of the list, which always takes n/2 steps.

The WORST-case time complexity is: O(n + 1 + n/2) = O(n).
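The linked-list variant can be sketched like this (a hypothetical minimal cell class of my own; linear search for the insertion point, then a walk to the middle for the median):

```python
class Cell:
    """Singly linked list cell."""
    def __init__(self, value, nxt=None):
        self.value, self.nxt = value, nxt

def insert_sorted(head, x):
    """Insert x into the sorted list, returning the (possibly new) head."""
    if head is None or x < head.value:
        return Cell(x, head)
    cur = head
    while cur.nxt is not None and cur.nxt.value < x:  # O(n) linear search
        cur = cur.nxt
    cur.nxt = Cell(x, cur.nxt)
    return head

def median(head, size):
    # For size 2*x the median is the x-th element (1-based):
    # walk (size - 1) // 2 links from the head.
    for _ in range((size - 1) // 2):
        head = head.nxt
    return head.value
```

The walk to the middle is what keeps this linear even though the insertion itself is O(1) once the position is found.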

The third solution is to use a binary search tree. With a tree we avoid shifting the array, but using a plain binary search tree to find the median is not very attractive. So I change the binary search tree so that the left and right subtrees are always balanced in size: at any time, either both subtrees have the same number of nodes, or the right subtree has exactly one node more than the left. In other words, it is ensured that at any time the root element is the median. Of course, this requires changes in how the tree is built; the technical details are similar to the rotations of a red-black tree.

If the tree is maintained properly, it is ensured that the WORST-case time complexity is O(n).

So all three algorithms are linear in the size of the set. If no sub-linear algorithm exists, they can be considered optimal. Since they don't differ much from each other, the best is the easiest to implement, which is the second one, using a linked list.

So what I really wonder is: is there a sub-linear algorithm for this problem, and if so, what does it look like? Any ideas?

Steve.

+5  A: 

Your complexity analysis is confusing. Let's say n items total are added; we want to output the stream of n medians (where the i-th in the stream is the median of the first i items) efficiently.

I believe this can be done in O(n lg n) time using two priority queues (e.g. binary or Fibonacci heaps): one queue for the items below the current median (so the largest element is at the top), and the other for items above it (in this heap, the smallest is at the top). Note that in Fibonacci (and some other) heaps, insertion is O(1) amortized; it's only popping an element that's O(lg n).

This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)

If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.
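The two-heap scheme can be sketched with Python's heapq (a sketch of mine, not from the original answer; heapq provides only a min-heap, so the lower half is stored negated to simulate a max-heap):

```python
import heapq

def running_medians_heaps(stream):
    """Two-heap running median: lo is a max-heap (stored negated) holding the
    lower half, hi is a min-heap holding the upper half."""
    lo, hi = [], []  # invariant: len(lo) == len(hi) or len(lo) == len(hi) + 1
    medians = []
    for x in stream:
        if lo and x > -lo[0]:
            heapq.heappush(hi, x)
        else:
            heapq.heappush(lo, -x)
        # Rebalance so the top of lo is always the median.
        if len(lo) > len(hi) + 1:
            heapq.heappush(hi, -heapq.heappop(lo))
        elif len(hi) > len(lo):
            heapq.heappush(lo, -heapq.heappop(hi))
        # With the question's convention (for even size 2*x, take the
        # x-th element), the median is always the top of lo.
        medians.append(-lo[0])
    return medians
```

Each element is pushed once and popped at most a constant number of times per step, giving the O(lg n) per-operation bound discussed above.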

wrang-wrang
Yes, I am sorry for being confusing. The time complexity is for one iteration, that is, adding one element and returning the median of the current set, not for adding all n elements and outputting n medians.
Steve
"insertion is O(1) amortized; it's only popping an element that's O(lg n)" - you will have to pop elements sometimes, though, won't you? Because if a lot of "large" elements come in, then medium-sized elements which previously were greater than the median will eventually be smaller than the median, so you'll have to pop them and push them on the other heap.
Steve Jessop
Yes, absolutely. That's why I said O(n*lg n) and not O(n). Anyway, Fibonacci heaps aren't practical for small sizes; if I wanted the O(1) ops I'd probably use http://www.cs.tau.ac.il/~zwick/papers/meld-talg.pdf
wrang-wrang
A: 

In order to find the median in linear time you can try this (it just came to my mind). You need to store some values every time you add a number to your set, and you won't need sorting. Here it goes.

#include <limits.h>

typedef struct
{
        int number;
        int lesser;   /* how many elements in the set are smaller */
        int greater;  /* how many elements in the set are larger  */
} record;

int median(record numbers[], int count, int n)
{
        int i;
        int m = INT_MAX;  /* sentinel: "no exact middle element found" */
        int a = 0, b = 0;

        /* append the new element at index count */
        numbers[count].number = n;
        numbers[count].lesser = 0;
        numbers[count].greater = 0;

        for (i = 0; i < count; i++)
        {
                if (n < numbers[i].number)
                {
                        numbers[i].lesser++;
                        numbers[count].greater++;
                }
                else
                {
                        numbers[i].greater++;
                        numbers[count].lesser++;
                }
        }

        /* odd size: one element has as many lesser as greater */
        for (i = 0; i <= count; i++)
                if (numbers[i].greater - numbers[i].lesser == 0)
                        m = numbers[i].number;

        /* even size: average the two middle elements */
        if (m == INT_MAX)
        {
                for (i = 0; i <= count; i++)
                {
                        if (numbers[i].greater - numbers[i].lesser == -1)
                                a = numbers[i].number;
                        if (numbers[i].greater - numbers[i].lesser == 1)
                                b = numbers[i].number;
                }
                m = (a + b) / 2;
        }

        return m;
}

What this does is: each time you add a number to the set, you record how many numbers are lesser than it and how many are greater. So, if a number has the same "lesser than" and "greater than" counts, it is in the very middle of the set, without any sorting. In the case that you have an even amount of numbers you have two candidates for the median, so you just return the mean of those two. BTW, this is C code; I hope it helps.

nairdaen
Thanks for the code-level description. To my understanding, in the median() function, numbers is the array holding the set, n is the new element added to the set, count is the length of the set before adding n, and m is the median. The time complexity is linear for adding one element. Notice that we cannot assume the numbers array is big enough, so we need to check and possibly expand it. Your method doesn't require the array to be sorted, so the new element can always be inserted at the end, but you need a linear scan, which is more expensive than keeping the array sorted.
Steve
he said he wants sub-linear algorithms
yairchu
A: 

EDIT: forget it then.

scragar
Unfortunately, we cannot make any assumption on the boundaries of the input elements, or the distribution.
Steve
it can also be solved much more efficiently with the information given. your answer is both for a different question and overkill
yairchu
+2  A: 

Although wrang-wrang already answered, I wish to describe a modification of your binary search tree method that is sub-linear.

  • We use a binary search tree that is balanced (AVL/Red-Black/etc), but not super-balanced like you described. So adding an item is O(log n)
  • One modification to the tree: for every node we also store the number of nodes in its subtree. This doesn't change the complexity. (For a leaf this count would be 1, for a node with two leaf children this would be 3, etc)

We can now access the Kth smallest element in O(log n) using these counts:

def get_kth_item(subtree, k):  # k is 0-indexed
  left_size = 0 if subtree.left is None else subtree.left.size
  if k < left_size:
    return get_kth_item(subtree.left, k)
  elif k == left_size:
    return subtree.value
  else: # k > left_size
    return get_kth_item(subtree.right, k-1-left_size)

A median is a special case of Kth smallest element (given that you know the size of the set).

So all in all this is another O(log n) solution.
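A minimal sketch of the size-augmented tree described above (my own illustration; the tree is left unbalanced for brevity, whereas a real implementation would rebalance as in AVL/red-black insertion, maintaining the size counts through rotations):

```python
class Node:
    """BST node augmented with the size of its subtree."""
    def __init__(self, value):
        self.value, self.left, self.right, self.size = value, None, None, 1

def insert(node, value):
    """Insert value, maintaining subtree sizes on the way down."""
    if node is None:
        return Node(value)
    node.size += 1
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node

def get_kth_item(subtree, k):
    """Return the k-th smallest element (k is 0-indexed), as above."""
    left_size = 0 if subtree.left is None else subtree.left.size
    if k < left_size:
        return get_kth_item(subtree.left, k)
    elif k == left_size:
        return subtree.value
    else:  # k > left_size
        return get_kth_item(subtree.right, k - 1 - left_size)

def median(root):
    # Question's convention: for size 2*x the median is the x-th (1-based)
    # element, which is index (size - 1) // 2 when 0-indexed.
    return get_kth_item(root, (root.size - 1) // 2)
```

With balancing in place, both insert and the median query are O(log n) per element, as claimed.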

yairchu
+1  A: 

I received the same interview question and came up with the two-heap solution in wrang-wrang's post. As he says, the time per operation is O(log n) worst-case. The expected time is also O(log n) because you have to "pop an element" 1/4 of the time assuming random inputs.

I subsequently thought about it further and figured out how to get constant expected time; indeed, the expected number of comparisons per element becomes 2+o(1). You can see my writeup at http://denenberg.com/omf.pdf .

BTW, the solutions discussed here all require space O(n), since you must save all the elements. A completely different approach, requiring only O(log n) space, gives you an approximation to the median (not the exact median). Sorry I can't post a link (I'm limited to one link per post) but my paper has pointers.

Larry Denenberg