tags:
views: 1116
answers: 5

I came across an interesting algorithm question in an interview. I gave my answer, but I'm not sure whether there is a better idea, so I welcome everyone to share their thoughts.

You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and that they are distinct (by the definition of a set, we don't consider two elements with the same value).

Every time a new element is added to the set, the set's median is asked for. The median is defined as in mathematics: the middle element of a sorted list. Here, specially, when the size of the set is even, say size = 2*x, the median is the x-th element of the set.

An example: Start with an empty set, when 12 is added, the median is 12, when 7 is added, the median is 7, when 8 is added, the median is 8, when 11 is added, the median is 8, when 5 is added, the median is 8, when 16 is added, the median is 8, ...

Notice that, first, elements are added to the set one by one and, second, we don't know in advance which elements will be added.

My answer.

Since it is a question about finding the median, some ordering is needed. The easiest solution is to keep a normal array sorted. When a new element comes, use binary search to find its position (O(log n)) and insert it there. Since it is a plain array, the rest of the array must be shifted, which takes O(n) time. Once the element is inserted, the median can be read off in constant time.

The WORST-case time complexity is: O(log n + n + 1) = O(n).
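A minimal sketch of this sorted-array approach in Python (using the standard bisect module for the binary-search insert; the function name is mine, not from the question):

```python
import bisect

def running_medians(stream):
    """Keep a sorted list; insert each element and report the median."""
    sorted_vals = []
    medians = []
    for x in stream:
        bisect.insort(sorted_vals, x)  # O(log n) search + O(n) shift
        n = len(sorted_vals)
        # Question's convention: for even n = 2*x the median is the
        # x-th element (1-based), i.e. index n//2 - 1; for odd n, index n//2.
        idx = n // 2 if n % 2 == 1 else n // 2 - 1
        medians.append(sorted_vals[idx])
    return medians
```

Running it on the example stream 12, 7, 8, 11, 5, 16 reproduces the medians 12, 7, 8, 8, 8, 8 from the question.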

Another solution is to use a linked list. The point of a linked list is to remove the need to shift the array, but finding the location for the new element then requires a linear search. Inserting the element takes constant time, and afterwards we find the median by walking through half of the list, which always takes n/2 steps.

The WORST-case time complexity is: O(n + 1 + n/2) = O(n).
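The linked-list variant can be sketched like this (a hypothetical minimal cell class of my own; linear search for the insertion point, then a walk to the middle for the median):

```python
class Cell:
    """Singly linked list cell."""
    def __init__(self, value, nxt=None):
        self.value, self.nxt = value, nxt

def insert_sorted(head, x):
    """Insert x into the sorted list, returning the (possibly new) head."""
    if head is None or x < head.value:
        return Cell(x, head)
    cur = head
    while cur.nxt is not None and cur.nxt.value < x:  # O(n) linear search
        cur = cur.nxt
    cur.nxt = Cell(x, cur.nxt)
    return head

def median(head, size):
    # For size 2*x the median is the x-th element (1-based):
    # walk (size - 1) // 2 links from the head.
    for _ in range((size - 1) // 2):
        head = head.nxt
    return head.value
```

The walk to the middle is what keeps this linear even though the insertion itself is O(1) once the position is found.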

The third solution is to use a binary search tree. With a tree we avoid shifting the array, but using a plain binary search tree to find the median is not very attractive. So I change the binary search tree so that the left and right subtrees are always balanced in size: at any time, either both subtrees have the same number of nodes, or the right subtree has exactly one node more than the left. In other words, it is ensured that at any time the root element is the median. Of course, this requires changes in how the tree is built; the technical details are similar to the rotations of a red-black tree.

If the tree is maintained properly, it is ensured that the WORST-case time complexity is O(n).

So all three algorithms are linear in the size of the set. If no sub-linear algorithm exists, they can be considered optimal. Since they don't differ much from each other, the best is the easiest to implement, which is the second one, using a linked list.

So what I really wonder is: is there a sub-linear algorithm for this problem, and if so, what does it look like? Any ideas?

Steve.

+5  A: 

Your complexity analysis is confusing. Let's say n items total are added; we want to output the stream of n medians (where the i-th in the stream is the median of the first i items) efficiently.

I believe this can be done in O(n lg n) time using two priority queues (e.g. binary or Fibonacci heaps): one queue for the items below the current median (so the largest element is at the top), and the other for items above it (in this heap, the smallest is at the top). Note that in Fibonacci (and some other) heaps, insertion is O(1) amortized; it's only popping an element that's O(lg n).

This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)

If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.
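The two-heap scheme can be sketched with Python's heapq (a sketch of mine, not from the original answer; heapq provides only a min-heap, so the lower half is stored negated to simulate a max-heap):

```python
import heapq

def running_medians_heaps(stream):
    """Two-heap running median: lo is a max-heap (stored negated) holding the
    lower half, hi is a min-heap holding the upper half."""
    lo, hi = [], []  # invariant: len(lo) == len(hi) or len(lo) == len(hi) + 1
    medians = []
    for x in stream:
        if lo and x > -lo[0]:
            heapq.heappush(hi, x)
        else:
            heapq.heappush(lo, -x)
        # Rebalance so the top of lo is always the median.
        if len(lo) > len(hi) + 1:
            heapq.heappush(hi, -heapq.heappop(lo))
        elif len(hi) > len(lo):
            heapq.heappush(lo, -heapq.heappop(hi))
        # With the question's convention (for even size 2*x, take the
        # x-th element), the median is always the top of lo.
        medians.append(-lo[0])
    return medians
```

Each element is pushed once and popped at most a constant number of times per step, giving the O(lg n) per-operation bound discussed above.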

wrang-wrang
Yes, I am sorry for being confusing. The time complexity is for one iteration, that is, adding one element and returning the median of the current set, not for adding all n elements and outputting n medians.
Steve
"insertion is O(1) amortized; it's only popping an element that's O(lg n)" - you will have to pop elements sometimes, though, won't you? Because if a lot of "large" elements come in, then medium-sized elements which previously were greater than the median will eventually be smaller than the median, so you'll have to pop them and push them on the other heap.
Steve Jessop
Yes, absolutely. That's why I said O(n*lg n) and not O(n). Anyway, Fibonacci heaps aren't practical for small sizes; if I wanted the O(1) ops I'd probably use http://www.cs.tau.ac.il/~zwick/papers/meld-talg.pdf
wrang-wrang
A: 

In order to find the median in linear time you can try this (it just came to my mind). You need to store some values every time you add a number to your set, and you won't need sorting. Here it goes.

#include <limits.h>

typedef struct
{
        int number;
        int lesser;   /* how many elements in the set are smaller */
        int greater;  /* how many elements in the set are larger  */
} record;

int median(record numbers[], int count, int n)
{
        int i;
        int m = INT_MAX;  /* sentinel: "no exact middle element found" */
        int a = 0, b = 0;

        /* append the new element at index count */
        numbers[count].number = n;
        numbers[count].lesser = 0;
        numbers[count].greater = 0;

        for (i = 0; i < count; i++)
        {
                if (n < numbers[i].number)
                {
                        numbers[i].lesser++;
                        numbers[count].greater++;
                }
                else
                {
                        numbers[i].greater++;
                        numbers[count].lesser++;
                }
        }

        /* odd size: one element has as many lesser as greater */
        for (i = 0; i <= count; i++)
                if (numbers[i].greater - numbers[i].lesser == 0)
                        m = numbers[i].number;

        /* even size: average the two middle elements */
        if (m == INT_MAX)
        {
                for (i = 0; i <= count; i++)
                {
                        if (numbers[i].greater - numbers[i].lesser == -1)
                                a = numbers[i].number;
                        if (numbers[i].greater - numbers[i].lesser == 1)
                                b = numbers[i].number;
                }
                m = (a + b) / 2;
        }

        return m;
}

What this does is: each time you add a number to the set, you record how many numbers are lesser than it and how many are greater. So, if a number has the same "lesser than" and "greater than" counts, it is in the very middle of the set, without any sorting. In the case that you have an even amount of numbers you have two candidates for the median, so you just return the mean of those two. BTW, this is C code; I hope it helps.

nairdaen
Thanks for the code-level description. To my understanding, in the median() function, numbers is the array holding the set, n is the new element added to the set, count is the length of the set before adding n, and m is the median. The time complexity is linear for adding one element. Notice that we cannot assume the numbers array is big enough, so we need to check and possibly expand it. Your method doesn't require the array to be sorted, so the new element can always be inserted at the end, but you need a linear scan, which is more expensive than keeping the array sorted.
Steve
he said he wants sub-linear algorithms
yairchu
A: 

EDIT: forget it then.

scragar
Unfortunately, we cannot make any assumption on the boundaries of the input elements, or the distribution.
Steve
it can also be solved much more efficiently with the information given. your answer is both for a different question and overkill
yairchu
+2  A: 

Although wrang-wrang already answered, I wish to describe a modification of your binary search tree method that is sub-linear.

  • We use a binary search tree that is balanced (AVL/Red-Black/etc), but not super-balanced like you described. So adding an item is O(log n)
  • One modification to the tree: for every node we also store the number of nodes in its subtree. This doesn't change the complexity. (For a leaf this count would be 1, for a node with two leaf children this would be 3, etc)

We can now access the Kth smallest element in O(log n) using these counts:

def get_kth_item(subtree, k):  # k is 0-indexed
  left_size = 0 if subtree.left is None else subtree.left.size
  if k < left_size:
    return get_kth_item(subtree.left, k)
  elif k == left_size:
    return subtree.value
  else: # k > left_size
    return get_kth_item(subtree.right, k-1-left_size)

A median is a special case of Kth smallest element (given that you know the size of the set).

So all in all this is another O(log n) solution.
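A minimal sketch of the size-augmented tree described above (my own illustration; the tree is left unbalanced for brevity, whereas a real implementation would rebalance as in AVL/red-black insertion, maintaining the size counts through rotations):

```python
class Node:
    """BST node augmented with the size of its subtree."""
    def __init__(self, value):
        self.value, self.left, self.right, self.size = value, None, None, 1

def insert(node, value):
    """Insert value, maintaining subtree sizes on the way down."""
    if node is None:
        return Node(value)
    node.size += 1
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node

def get_kth_item(subtree, k):
    """Return the k-th smallest element (k is 0-indexed), as above."""
    left_size = 0 if subtree.left is None else subtree.left.size
    if k < left_size:
        return get_kth_item(subtree.left, k)
    elif k == left_size:
        return subtree.value
    else:  # k > left_size
        return get_kth_item(subtree.right, k - 1 - left_size)

def median(root):
    # Question's convention: for size 2*x the median is the x-th (1-based)
    # element, which is index (size - 1) // 2 when 0-indexed.
    return get_kth_item(root, (root.size - 1) // 2)
```

With balancing in place, both insert and the median query are O(log n) per element, as claimed.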

yairchu
+1  A: 

I received the same interview question and came up with the two-heap solution in wrang-wrang's post. As he says, the time per operation is O(log n) worst-case. The expected time is also O(log n) because you have to "pop an element" 1/4 of the time assuming random inputs.

I subsequently thought about it further and figured out how to get constant expected time; indeed, the expected number of comparisons per element becomes 2+o(1). You can see my writeup at http://denenberg.com/omf.pdf .

BTW, the solutions discussed here all require space O(n), since you must save all the elements. A completely different approach, requiring only O(log n) space, gives you an approximation to the median (not the exact median). Sorry I can't post a link (I'm limited to one link per post) but my paper has pointers.

Larry Denenberg