Is there an algorithm to split a sequence of random numbers into two groups based on a median value determined on the fly (without sorting them)?

Ex. If I have the sequence 2-3-6-7-1-4-5, the result would be two separated groups:

A) 1 2 3

B) 5 6 7

Median value: 4

+3  A: 

You can find the median of an array (and split) in linear time.

sdcvvc
That algorithm will estimate the median in linear time, but is not guaranteed to give you the median...
BlueRaja - Danny Pflughoeft
It is guaranteed to give the median exactly. For n=2k+1 take the (k+1)-th element; for n=2k take the average of the k-th and (k+1)-th elements.
sdcvvc
@sdcvvc: Unfortunately, no, it's not. Imagine the case of splitting the numbers 1-9 into thirds (instead of fifths, for simplicity): if our groups are (1,2,6), (3,4,5), (7,8,9), the pivot chosen will be 4 (the median is 5). The article itself admits the pivot is only guaranteed to be in the middle 40% (when splitting into 5ths).
BlueRaja - Danny Pflughoeft
correction: Middle 60%
BlueRaja - Danny Pflughoeft
@BlueRaja - Danny Pflughoeft: The pivot is not the final result. The algorithm is run recursively on 7/10 of sequence.
sdcvvc
Ah, I see, it mentions that above the section you linked. In that case, it's the same as my answer.
BlueRaja - Danny Pflughoeft
A: 

You can find the median by averaging the floor(n/2)th largest item and the floor(n/2)th smallest item. This can be done with the help of this previous SO question.

After that, simply iterate through your array, putting elements greater than the median into one group and elements lower than the median into the other.
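The splitting pass described above could be sketched as follows. This is only an illustration: the helper name `split_around` is made up here, and it assumes the median is already known and that elements equal to the median belong to neither group.

```ruby
# Hypothetical helper (not from the answer): a single pass over the
# sequence, given a median that has already been computed.
def split_around(seq, median)
  lower = []
  upper = []
  seq.each do |x|
    if x < median
      lower << x
    elsif x > median
      upper << x
    end
    # Assumption: elements equal to the median go into neither group.
  end
  [lower, upper]
end

p split_around([2, 3, 6, 7, 1, 4, 5], 4)  # => [[2, 3, 1], [6, 7, 5]]
```

The pass touches each element once, so it is O(n) regardless of how the median was obtained.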


Alternatively, if you know the size of your sequence, you could create two collections of size floor(n/2): one "smallest half" (S) and one "largest half" (L), and then, one element at a time:

  • Take out one element in your sequence, call it e.
  • Put it into S if S is not full.
  • If S is full, find the largest element of (S | e) (the union of the two) and add it to L. (This can be implemented by iterating through S until an element larger than e is found; if none is found, the largest is e; otherwise it is the found element.) If this largest element was in S, put e in S to refill it.
  • If L is full, find the smallest element of (L | e) and remove it, adding e into L if e was not removed.

I believe this is O(n) time; someone correct me if I'm wrong. The worst case scenario I could imagine is the original sequence being sorted in descending order.

Ruby implementation (with many inefficient shortcuts):

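def split_into_halves(to_split)
  s = []  # "smallest half"
  l = []  # "largest half"
  medianlimit = to_split.size / 2
  for e in to_split
    if s.size < medianlimit
      s.push(e)
    else

      # The largest of (S | e) leaves S; if it came from S, e refills S.
      if s.max >= e
        max = s.max
        s.delete_at(s.index(max))  # remove one copy only, not every duplicate
        s.push(e)
      else
        max = e
      end

      # The evicted element goes to L; when L is full, the smallest of
      # (L | max) is dropped.
      if l.size < medianlimit
        l.push(max)
      elsif l.min < max
        l.delete_at(l.index(l.min))
        l.push(max)
      end

    end
  end

  return [s, l]
end

k = [2,3,6,7,1,4,5]
split_into_halves(k) #=> [[2, 3, 1], [7, 6, 5]]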
Justin L.
why the downvote? :(
Justin L.
I didn't downvote, but I guess it is because you seem to be using a non-standard definition of median, and the main question which the OP has, you have dismissed by a link instead of saying use that... :-)
Moron
"iterating through S until an element larger than e is found" is going to be O(N), since S is half N in size. You need to do this for half of the elements, once S is full. So your algorithm is O(N squared)
Pete Kirkham
Also, the answer you reference will not be O(N) as you are not working with a fixed k, but with k=n/2 to find the values you mention.
Pete Kirkham
A: 

Yes, this can be done in O(n).

First of all, if we already knew the median, we could easily split the sequence in two in O(n) by iterating the sequence and comparing each value with the median. So how do we find the median in O(n)?

The basic idea is to use quicksort, but instead of recursively sorting both sides of the pivot, only sort the half that contains the median (i.e. the half that encompasses the index ⌈n/2⌉). If our selection of a pivot guarantees geometric convergence of quicksort (like median-of-medians does), then our overall algorithm will be O(n).

Algorithm

Let's call the current size of our array k, and the reduction due to median-of-medians c, i.e. our pivot guarantees the array shrinks by a factor of at least c each step.

  1. Estimate the median of the array using median-of-medians - O(k)
  2. Partition the array quicksort-style (with the estimate as our pivot) - O(k)
  3. Choose the half of the array containing the median (index ⌈n/2⌉). This new sub-array will have size no greater than k/c. Repeat steps 1 & 2 recursively until we've determined the element whose position in the original array is ⌈n/2⌉.
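The steps above (often called quickselect) might look roughly like this in Ruby. Treat it as a sketch under stated assumptions rather than a tuned implementation: the names `median_of_medians`, `select_kth` and `split_by_median` are invented for illustration, it copies sub-arrays instead of partitioning in place, and for even-length input it simply takes the element at 0-based index n/2 instead of averaging the two middle values. Sorting is only applied to groups of at most five elements, which is constant work per group.

```ruby
# Step 1: pivot estimate. Take the median of each group of five, then
# recursively select the median of those medians.
# Only called on arrays with more than five elements.
def median_of_medians(arr)
  medians = arr.each_slice(5).map { |g| g.sort[g.size / 2] }
  select_kth(medians, medians.size / 2)
end

# Returns the element that would be at 0-based index k if arr were sorted.
def select_kth(arr, k)
  return arr.sort[k] if arr.size <= 5  # tiny input: constant-time base case

  pivot = median_of_medians(arr)
  # Step 2: partition around the pivot.
  lower = arr.select { |x| x < pivot }
  equal = arr.count(pivot)

  # Step 3: recurse only into the part that contains index k.
  if k < lower.size
    select_kth(lower, k)
  elsif k < lower.size + equal
    pivot
  else
    upper = arr.select { |x| x > pivot }
    select_kth(upper, k - lower.size - equal)
  end
end

# Assumption: for even sizes this picks the upper of the two middle
# elements rather than averaging them.
def split_by_median(seq)
  median = select_kth(seq, seq.size / 2)
  [seq.select { |x| x < median }, median, seq.select { |x| x > median }]
end

p split_by_median([2, 3, 6, 7, 1, 4, 5])  # => [[2, 3, 1], 4, [6, 7, 5]]
```

Because the pivot is always an element of the array, both `lower` and `upper` are strictly smaller than the input, so the recursion terminates even with duplicates.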

The asymptotic running time of this algorithm is

2 O(n) + 2 O(n/c) + 2 O(n/c^2) + 2 O(n/c^3) + ...
= O(n)
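To see why the sum collapses to O(n), it can be bounded by a geometric series. This is a worked version of the line above, assuming the shrink factor c is a constant strictly greater than 1:

```latex
\sum_{i=0}^{\infty} 2\,O\!\left(\frac{n}{c^{i}}\right)
  = O\!\left(2n \sum_{i=0}^{\infty} c^{-i}\right)
  = O\!\left(\frac{2n}{1 - 1/c}\right)
  = O\!\left(\frac{2c}{c-1}\, n\right)
  = O(n), \qquad c > 1.
```

The factor 2c/(c-1) does not depend on n, so it disappears into the O-notation.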

BlueRaja - Danny Pflughoeft
A: 

The BFPRT (Blum-Floyd-Pratt-Rivest-Tarjan) algorithm (see Wikipedia) can find the median in linear time, i.e. in O(n).

However, the constant "hidden" in the O-notation is so large that in practice it is faster to sort the array in O(n log n) for reasonable array sizes.
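For comparison, the sort-based alternative mentioned here is only a few lines. This is a minimal sketch, with `split_by_sorting` as a made-up name; it assumes that for odd-length input the middle element (the median) is left out of both groups.

```ruby
# Hypothetical helper: sort once, then take the two outer halves.
def split_by_sorting(seq)
  sorted = seq.sort
  half = sorted.size / 2
  # For odd sizes the middle element is the median and is dropped.
  [sorted.first(half), sorted.last(half)]
end

p split_by_sorting([2, 3, 6, 7, 1, 4, 5])  # => [[1, 2, 3], [5, 6, 7]]
```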

phimuemue