views:

1052

answers:

7

Wikipedia says:

Selection algorithms: Finding the min, max, both the min and max, median, or even the k-th largest element can be done in linear time using heaps.

All it says is that it can be done, and not how.

Can you give me some start on how this can be done using heaps?

A: 

Check out this previous StackOverflow question.

NickLarsen
+2  A: 

See this wikipedia page on selection algorithms. In particular, look at the BFPRT algorithm and the Median of Medians algorithm. BFPRT is probabilistically linear, and is modelled on quicksort; Median of Medians is guaranteed linear, but has a large constant factor and so might take longer in practice, depending on the size of your dataset.

If you only have a few hundred or thousand elements from which to select the median, I suspect that a simple quicksort followed by direct indexing is easiest.

Dale Hagglund
@Dale Hagglund: "using heaps"?
Lazer
"linear" is incompatible with "using heaps" unless you're throwing in the pre-processing cost for free. However, I should have made that clear at the beginning of my post.
Dale Hagglund
Is it really that hard to apply the heap concept to partitions and pivots?
tloflin
@tlofin: sorry, but I'm not sure what you're asking.
Dale Hagglund
@Dale: Sorry, I was replying to eSKay. Guess I need to start using these "@"s.
tloflin
A: 

There are likely better algorithms out there, but here's how I'd do it:

Have two buckets and a value. The value is the median, the two buckets are "bigger than median" and "smaller than median". For each element x in the array, rebalance the buckets such that big_bucket and small_bucket differ by no more than 1 in their size. When moving items from the big bucket to the small bucket they first must pass through the median value to get there (that is, a difference of 2 will successfully push an element from one bucket to the next - a difference of 1 will push an element from one bucket to the median value.) At the end of your first pass through the array the value should be your median.

fbrereto
@fbrereto: What would be the time complexity of your algorithm? I think this algorithm is NOT linear.
Lazer
It'd be one pass through the original array, and the bucket operations would be push/pop, which can be done in constant time (as their size is known to be no more than N/2+1), so off the top of my head I suspect it can be done in O(N). Please correct me if I've missed something.
fbrereto
Hrm... one would have to keep the buckets sorted, which is not an O(N) operation (mod a radix sort).
fbrereto
A: 

I believe they are not talking about just finding the median of a given set. (I think, but probably wrong in thinking that)

It is about finding the median of a set which could be changing, assuming an initial O(n) 'heapify'/preprocess kind of step on an initial set of n elements.

If you allow O(n) preprocessing, you can actually implement subsequent median finding in O(1) time, even if you allow inserts and deletes later.

The preprocess step does this:

  • Find median using the selection algorithm (median of medians etc), in O(n) time.
  • Create two heaps, one max-heap with the smallest n/2 elements, and another min-heap with the largest n/2 elements.

Now once you have done this preprocessing, the subsequent inserts and deletes try to maintain the invariant that the median is always the max element of the max-heap while maintaining O(logn) insert/delete time. I leave that to you.

Moron
Min of the max, or max of the min. But yes, sounds sensible.
ANeves
The O(log n) makes sense for inserts, but how could that work with arbitrary deletes? Do you maintain two additional heaps to track deleted items, and then adjust the median location by subtracting out their sizes, or something like that?
Jeffrey L Whitledge
@Jeffrey. By delete I meant delete-median/delete-min type things. Heaps really don't support arbitrary deletes.
Moron
A: 

Obviously, min and max in O(n) is easy and don't require a heap.

K'th largest can be done fairly simply by maintaining a k-sized heap of the top k values so far. Runtime would be O(n*logk). You could call that linear time if k is fixed size and k << n.

I don't think median is possible though. Just creating an O(n) sized heap requires O(n*logn) time.

Edit: Ok, after thinking about this a bit more, IVlad is right. You can create a heap in O(n), for a fixed size. But... that doesn't help the OP with his Median question. The linear heap creation technique only produces a valid heap as its final output. The simple approach of doing n inserts, resulting in a valid heap after each step is O(n*logn).

It seems to me that using heaps to find the median would require the use of those running sub-heaps. For instance, there was an answer posted here (that seems to be deleted now) that linked to a blog post suggesting an algorithm for this problem. It tracked the running median using two heaps (the smaller half and the larger half) as it does a single pass through the data. That would require the slower, naive heap approach becuase it depends on maintaining valid heaps as it inserts and removes from them.

Is there some other way to find the median using the linear one-shot heap creation technique?

Alan
"Just creating an O(n) sized heap requires O(n *logn) time" - wrong, you can create a heap in O(N) time.
IVlad
@IVlad - You can create a heap for already-sorted data in O(n) time, and you can create a fixed-size heap in O(n) time, but I don't see either of those preconditions in the quesion.
Jeffrey L Whitledge
If the data is already sorted, then you don't need a heap to find the median, or any of the other objectives in the OP.
Alan
@Jeffrey L Whitledge - you can create a heap for unsorted data in O(n) time as well. A sorted set of data is already a heap, so creating a heap out of that is O(1) actually. I'm pretty sure the question refers to a fixed-sized input, "selection algorithm" implies that.
IVlad
If the question refers to a "fixed-size input", then the expression "linear time" is meaningless. By "fixed-size heap" I meant one that may be used to find, say, the 10th largest element of an unsorted set. An algorithm to find the m-th largest or smallest value would be O(n), where n is the input size and m is a fixed number intrensic to the algorithm. If m is allowed to vary as part of the input, then heap-manipulation is no longer constant time, but becomes O(n log m).
Jeffrey L Whitledge
A: 

if you know more about heap data structure, you will easily understand that that s actually the case. heap structure can be built in O(n) time, there is min heap and max heap. min heap root element will give you the smallest element. max heap root element will give you the max element. Just by building the heap you find the min and max. same idea for median and kth largest, while building your heap, you can find the median and kth largest by looking at left or right branch of the tree and by keeping a constant amount of memory to store the element number. etc.

@user177883: how will you build a heap so that the root is the median?
Lazer
+1  A: 

You would use a min-max-median heap to find the min, max and median in constant time (and take linear time to build the heap). You can use order-statistics trees to find the kth smallest/largest value. Both of these data structures are described in this paper on min-max heaps [pdf link]. Min-max heaps are binary heaps that alternate between min-heaps and max-heaps.

From the paper: A min-max-median heap is a binary heap with the following properties:

1) The median of all elements is located at the root

2) The left subtree of the root is a min-max heap Hl of size ceiling[((n-1)/2)] containing elements less than or equal to the median. The right subtree is a max-min heap Hr of size floor[((n-1)/2)] containing only elements greater than or equal to the median.

The paper goes on to explain how to build such a heap.

Edit: Upon reading the paper more thoroughly it appears as though building the min-max-median heaps requires that you first find the median (FTA: "Find the median of all n elements using any one of the known linear-time algorithms"). That said, once you have built the heap you can maintain the median simply by maintaining the balance between the min-max heap on the left and the max-min heap on the right. DeleteMedian replaces the root with either the min of the max-min heap or the max of the min-max heap (whichever maintains the balance).

So if you plan on using a min-max-median heap to find the median of a fixed data set you're SOL but if you are using it on a changing data set it is possible.

Niki Yoshiuchi