views:

278

answers:

3

In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

  1. Get value x
  2. Insert x in an already sorted array at the back
  3. swap x down until the array is sorted
  4. Read the element at position array[array.size * 3/4]

Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially if the array gets larger. Is there any way to optimize this?

UPDATE

Thanks Nikita! Since I am using C++ this is the solution easiest to implement. Here is the code:

template<class T>
class IterativePercentile {
public:
  /// Percentile has to be in range [0, 1(
  IterativePercentile(double percentile)
    : _percentile(percentile)
  { }

  // Adds a number in O(log(n))
  void add(const T& x) {
    if (_lower.empty() || x <= _lower.front()) {
      _lower.push_back(x);
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
    } else {
      _upper.push_back(x);
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
    }

    unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
    if (_lower.size() > size_lower) {
      // lower to upper
      std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.push_back(_lower.back());
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.pop_back();
    } else if (_lower.size() < size_lower) {
      // upper to lower
      std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.push_back(_upper.back());
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.pop_back();
    }            
  }

  /// Access the percentile in O(1)
  const T& get() const {
    return _lower.front();
  }

  void clear() {
    _lower.clear();
    _upper.clear();
  }

private:
  double _percentile;
  std::vector<T> _lower;
  std::vector<T> _upper;
};
A: 

You can use binary search to do find the correct position in O(log n). However, shifting the array up is still O(n).

Matthew Flaschen
+10  A: 

You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(logn) time complexity and heaps are also included in standard libraries of most programming languages.

First heap (heap A) contains smallest 75% elements, another heap (heap B) - the rest (biggest 25%). First one has biggest element on the top, second one - smallest.

  1. Adding element.

See if new element x is <= max(A). If it is, add it to heap A, otherwise - to heap B.
Now, if we added x to heap A and it became too big (holds more than 75% of elements), we need to remove biggest element from A (O(logn)) and add it to heap B (also O(logn)).
Similar if heap B became too big.

  1. Finding "0.75 median"

Just take the largest element from A (or smallest from B). Requires O(logn) or O(1) time, depending on heap implementation.

edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B).

Nikita Rybak
but how do you determine if heap A became too big?
Raze2dust
@Raze2dust Heap A should hold approx 75% of elements. If it's size goes beyond that, it became too big.
Nikita Rybak
@Raze2dust If you mean, "how to get heap size", it's an O(1) operation :)
Nikita Rybak
I think this idea will work, but I think a few changes are necessary. First, one of the heaps should always have the item you are looking for on it. This way you cann figure out what size each heap should be for a given number of elements `heap A=floor(n*.75) and heap B=ceil(n*.25)` (in this case). Next, when you add an item, determine which heap needs to grow. If heap A needs to grow and the item is less than the the top of B, add it to A. Otherwise remove the top of B, add it to A, then add the new item to B. (The remove then add would be more efficient as a modify).
Dolphin
@Dolphin Sorry, I don't completely understand your suggestions. Are you saying that algorithm has mistake? Or it can become simpler or asymptotically faster?
Nikita Rybak
great idea! to find out where to add a number, I think you can do it this way: given `size` is total size of A+B. When adding a number, calculate `(int)(size * 0.75)` and `(int)((size+1)*0.75)`. If both numbers are the same, grow A, otherwise grow B.
martinus
@martinus Don't forget, any element in B should be >= any element in A. So, if you choose where to add depending on the size, you'll need afterwards to compare max(A) and min(B) and exchange them if second one is smaller.
Nikita Rybak
@Nikita - no, just a couple tweaks. Defining which heap should grow makes the add operation slightly simpler (your add can do 3 O(logn) operations (add, remove, add). My suggestion is two (modify, add) in the worst case. It doesn't really matter which heap you choose, but picking the small heap to always have the item will keep the size of the heaps closer, for a (probably insignificant) performance gain.
Dolphin
Nice solution! Since you only remove max from heap A and min from heap B, maybe you should mention that heap A is a max-heap and heap B is a min-heap.
Eyal Schneider
@Nikita Ah yeah, now I know why they say sleep is necessary.. :D
Raze2dust
+4  A: 

A simple Order Statistics Tree is enough for this.

A balanced version of this tree supports O(logn) time insert/delete and access by Rank. So you not only get the 75% percentile, but also the 66% or 50% or whatever you need without having to change your code.

If you access the 75% percentile frequently, but only insert less frequently, you can always cache the 75% percentile element during an insert/delete operation.

Most standard implementations (like Java's TreeMap) are order statistic trees.

Moron
+1 for a useful technique. But you have a mistake: Java's TreeSet (or Map) won't give you tools necessary to iterate from tree root down to leafs. IIRC, STL version too. You'll have to write your own balanced tree or hack someone else's code. Hardly enjoyable.
Nikita Rybak
+1 - But you can't index a Java `TreeSet` by rank. You _can_ use Java's `TreeSet` if the values will not repeat; you just need to keep track of your current 75th percentile and the number of items to the left and to the right. When you add something, place it into the set and update the left/right numbers. If you now have too many on the right, use `higher` to get the next one; if too many on the left, use `lower` to get the previous; if you're okay, don't do anything. If the values repeat, you'll have to create a map from key to some collection (list?), and then a similar trick works.
Rex Kerr
@Nikita: I believe TreeMap has it! Look at the comments to this answer:http://stackoverflow.com/questions/3071497/list-or-container-o1-ish-insertion-deletion-performance-with-array-semantics/3071566#3071566. @Rex, I was talking of TreeMap. Of course I haven't used Java in a while.
Moron
@Moron I don't see any reference to particular TreeMap method there. In fact, to go from root of tree down you need some kind of Node struct having references to left and right children. Neither Java, nor STL (IIRC) provide you with such structure: it's considered implementation detail.
Nikita Rybak
But Rex's idea should work (although it's not terribly simple to implement)
Nikita Rybak
@Nikita: I am not claiming that you _have_ to traverse the tree yourself. I am claiming that the data structure provides API for accessing/inserting/deleting by position. Anyway I am not so sure about TreeMap now...
Moron
Ive tried it with a tree, but the heap implementation is several times faster for my use case.
martinus
@martinus: Did you try caching? Anyway, glad this forum worked out for you :-)
Moron
For me caching is no use for me since after each insert I call one get() operation. I think the heap solution is faster because it can use two arrays as the backend
martinus
@Martinus: I see. If your 75% is fixed, I agree the heap will be faster: you have partitioned it based on the 75% element. So insertions will be faster etc.
Moron