ansaurus

Question

Fast algorithm for repeated calculation of percentile?

Answer 1

A:

You can use binary search to find the correct insertion position in O(log n). However, shifting the array elements to make room is still O(n).
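
To illustrate the trade-off, here is a minimal Python sketch that assumes the data lives in a plain sorted list. The class name `SortedPercentile` and the index convention `floor(n * fraction)` are my own assumptions, not something stated in the question.

```python
import bisect


class SortedPercentile:
    """Keeps all values in a sorted list and reads the percentile by index."""

    def __init__(self, fraction=0.75):
        self.fraction = fraction
        self.values = []

    def add(self, x):
        # Binary search locates the slot in O(log n), but the list insert
        # still shifts every later element, so the call as a whole is O(n).
        bisect.insort(self.values, x)

    def percentile(self):
        # Assumed convention: the element at index floor(n * fraction).
        return self.values[int(len(self.values) * self.fraction)]


tracker = SortedPercentile()
for x in [5, 1, 4, 2, 3, 8, 7, 6]:
    tracker.add(x)
print(tracker.percentile())  # 7
```

The lookup is O(1) here; only the insert pays the O(n) shifting cost.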

Matthew Flaschen 2010-09-17 19:29:17

Answer 2

+10 A:

You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(logn) time complexity per insertion, and heaps are included in the standard libraries of most programming languages.

The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the biggest 25%). The first one has its biggest element on top, the second one its smallest.

Adding element.

See if the new element x is <= max(A). If it is, add it to heap A; otherwise add it to heap B.
Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), we need to remove the biggest element from A (O(logn)) and add it to heap B (also O(logn)).
Similarly, if heap B became too big, remove its smallest element and add it to heap A.

Finding "0.75 median"

Just take the largest element from A (or the smallest from B). This requires O(logn) or O(1) time, depending on the heap implementation.

edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B), where array is the sorted input and the index uses integer division.
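
The steps above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the class name `TwoHeapPercentile` is made up, the values are numeric (heap A is simulated as a max-heap by storing negated values, because `heapq` only provides min-heaps), and the sizes follow the size(A) = floor(n * 0.75) convention from the edit.

```python
import heapq


class TwoHeapPercentile:
    """Tracks the element at sorted index floor(0.75 * n) using two heaps."""

    def __init__(self):
        self.low = []   # heap A: smallest ~75%, stored negated (max on top)
        self.high = []  # heap B: largest ~25%, plain min-heap

    def add(self, x):
        # Route x by comparing it with max(A).
        if self.low and x <= -self.low[0]:
            heapq.heappush(self.low, -x)
        else:
            heapq.heappush(self.high, x)

        # Rebalance so that size(A) == floor(0.75 * n).
        n = len(self.low) + len(self.high)
        target = (3 * n) // 4
        if len(self.low) > target:
            # A is too big: move its biggest element to B.
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.low) < target:
            # B is too big: move its smallest element to A.
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def percentile(self):
        # min(B) is the element at sorted index floor(0.75 * n).
        return self.high[0]


tracker = TwoHeapPercentile()
for x in [5, 1, 4, 2, 3, 8, 7, 6]:
    tracker.add(x)
print(tracker.percentile())  # 7
```

Since the target size of A changes by at most one per insert, a single move between the heaps is always enough to restore the sizes.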

Nikita Rybak 2010-09-17 19:44:25

but how do you determine if heap A became too big?

Raze2dust 2010-09-17 19:47:33

@Raze2dust Heap A should hold approx 75% of the elements. If its size goes beyond that, it became too big.

Nikita Rybak 2010-09-17 19:48:41

@Raze2dust If you mean, "how to get heap size", it's an O(1) operation :)

Nikita Rybak 2010-09-17 19:51:48

I think this idea will work, but I think a few changes are necessary. First, one of the heaps should always have the item you are looking for on it. This way you can figure out what size each heap should be for a given number of elements: `heap A=floor(n*.75)` and `heap B=ceil(n*.25)` (in this case). Next, when you add an item, determine which heap needs to grow. If heap A needs to grow and the item is less than the top of B, add it to A. Otherwise remove the top of B, add it to A, then add the new item to B. (The remove then add would be more efficient as a modify.)
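
A rough Python sketch of this variant, for illustration only. The class name `GrowAwarePercentile` is hypothetical, values are assumed numeric (heap A stores negated values to act as a max-heap), and `heapq.heapreplace` stands in for the "modify" step. The case where heap B needs to grow is not spelled out in the comment; it is filled in here symmetrically as an assumption.

```python
import heapq


class GrowAwarePercentile:
    """Decides up front which heap must grow, then does at most two heap ops."""

    def __init__(self):
        self.low = []   # heap A: negated values, size floor(0.75 * n)
        self.high = []  # heap B: min-heap, size ceil(0.25 * n)

    def add(self, x):
        n = len(self.low) + len(self.high) + 1  # size after this insert
        if (3 * n) // 4 > len(self.low):
            # Heap A needs to grow. B is never empty at this point.
            if x <= self.high[0]:
                heapq.heappush(self.low, -x)
            else:
                # Swap x in for the top of B in one step, then move the
                # old top of B over to A.
                moved = heapq.heapreplace(self.high, x)
                heapq.heappush(self.low, -moved)
        else:
            # Heap B needs to grow (symmetric case, assumed).
            if not self.low or x >= -self.low[0]:
                heapq.heappush(self.high, x)
            else:
                moved = -heapq.heapreplace(self.low, -x)
                heapq.heappush(self.high, moved)

    def percentile(self):
        # Heap B always holds the element at sorted index floor(0.75 * n).
        return self.high[0]
```

In the worst case each insert costs one `heapreplace` plus one `heappush`, instead of a push followed by a pop and another push.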

Dolphin 2010-09-17 20:17:05

@Dolphin Sorry, I don't completely understand your suggestions. Are you saying that the algorithm has a mistake? Or that it can become simpler or asymptotically faster?

Nikita Rybak 2010-09-17 20:41:36

great idea! To find out which heap should grow, I think you can do it this way: let `size` be the total size of A+B. When adding a number, calculate `(int)(size * 0.75)` and `(int)((size+1)*0.75)`, i.e. the target size of A before and after the insert. If both numbers are the same, grow B, otherwise grow A.

martinus 2010-09-17 20:51:44

@martinus Don't forget, any element in B should be >= any element in A. So, if you choose where to add depending on the size, you'll need afterwards to compare max(A) and min(B) and exchange them if second one is smaller.

Nikita Rybak 2010-09-17 21:12:27

@Nikita - no, just a couple of tweaks. Defining which heap should grow makes the add operation slightly simpler: your add can do three O(logn) operations (add, remove, add), while my suggestion does two (modify, add) in the worst case. It doesn't really matter which heap you choose, but picking the small heap to always have the item will keep the sizes of the heaps closer, for a (probably insignificant) performance gain.

Dolphin 2010-09-17 21:16:33

Nice solution! Since you only remove max from heap A and min from heap B, maybe you should mention that heap A is a max-heap and heap B is a min-heap.

Eyal Schneider 2010-09-17 22:44:56

@Nikita Ah yeah, now I know why they say sleep is necessary.. :D

Raze2dust 2010-09-18 14:28:09

Answer 3

+4 A:

A simple Order Statistics Tree is enough for this.

A balanced version of this tree supports O(logn) time insert/delete and access by Rank. So you not only get the 75% percentile, but also the 66% or 50% or whatever you need without having to change your code.
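
As an illustration of access by rank, here is a compact Python sketch. It uses a treap (a randomized tree that is balanced in expectation) as the balancing scheme, which is my own choice and not something the answer prescribes. All names (`Node`, `insert`, `select`) are hypothetical, and deletion is left out for brevity.

```python
import random


class Node:
    __slots__ = ("key", "prio", "left", "right", "size")

    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random heap priority keeps the tree shallow
        self.left = None
        self.right = None
        self.size = 1                # number of nodes in this subtree


def size(node):
    return node.size if node else 0


def update(node):
    node.size = 1 + size(node.left) + size(node.right)


def split(node, key):
    """Split into (keys < key, keys >= key)."""
    if node is None:
        return None, None
    if node.key < key:
        left, right = split(node.right, key)
        node.right = left
        update(node)
        return node, right
    left, right = split(node.left, key)
    node.left = right
    update(node)
    return left, node


def merge(a, b):
    """Merge two treaps where every key in a is <= every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        update(a)
        return a
    b.left = merge(a, b.left)
    update(b)
    return b


def insert(root, key):
    """Insert key (duplicates allowed) and return the new root."""
    left, right = split(root, key)
    return merge(merge(left, Node(key)), right)


def select(node, rank):
    """Return the key with the given 0-based rank, using subtree sizes."""
    while node:
        left_size = size(node.left)
        if rank < left_size:
            node = node.left
        elif rank == left_size:
            return node.key
        else:
            rank -= left_size + 1
            node = node.right
    raise IndexError("rank out of range")


root = None
for x in [5, 1, 4, 2, 3, 8, 7, 6]:
    root = insert(root, x)
print(select(root, (3 * size(root)) // 4))  # 75th percentile: 7
print(select(root, size(root) // 2))        # median by the same convention: 5
```

Any percentile is then a single `select` call with a different rank, which is the flexibility the answer refers to.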

If you access the 75% percentile frequently, but only insert less frequently, you can always cache the 75% percentile element during an insert/delete operation.

Most standard implementations (like Java's TreeMap) are order statistic trees.

Moron 2010-09-17 23:05:46

+1 for a useful technique. But you have a mistake: Java's TreeSet (or Map) won't give you the tools necessary to walk from the tree root down to the leaves. IIRC, the same goes for the STL version. You'll have to write your own balanced tree or hack someone else's code. Hardly enjoyable.

Nikita Rybak 2010-09-18 00:13:41

+1 - But you can't index a Java `TreeSet` by rank. You _can_ use Java's `TreeSet` if the values will not repeat; you just need to keep track of your current 75th percentile and the number of items to the left and to the right. When you add something, place it into the set and update the left/right numbers. If you now have too many on the right, use `higher` to get the next one; if too many on the left, use `lower` to get the previous; if you're okay, don't do anything. If the values repeat, you'll have to create a map from key to some collection (list?), and then a similar trick works.

Rex Kerr 2010-09-18 02:05:06

@Nikita: I believe TreeMap has it! Look at the comments to this answer: http://stackoverflow.com/questions/3071497/list-or-container-o1-ish-insertion-deletion-performance-with-array-semantics/3071566#3071566 . @Rex, I was talking of TreeMap. Of course I haven't used Java in a while.

Moron 2010-09-18 02:36:10

@Moron I don't see any reference to a particular TreeMap method there. In fact, to go from the root of the tree down you need some kind of Node struct with references to the left and right children. Neither Java nor the STL (IIRC) provides you with such a structure: it's considered an implementation detail.

Nikita Rybak 2010-09-18 02:42:15

But Rex's idea should work (although it's not terribly simple to implement)

Nikita Rybak 2010-09-18 02:46:13

@Nikita: I am not claiming that you _have_ to traverse the tree yourself. I am claiming that the data structure provides API for accessing/inserting/deleting by position. Anyway I am not so sure about TreeMap now...

Moron 2010-09-18 06:06:18

I've tried it with a tree, but the heap implementation is several times faster for my use case.

martinus 2010-09-22 07:17:22

@martinus: Did you try caching? Anyway, glad this forum worked out for you :-)

Moron 2010-09-22 13:41:14

Caching is no use for me, since after each insert I call one get() operation. I think the heap solution is faster because it can use two arrays as the backend.

martinus 2010-09-22 14:36:53

@Martinus: I see. If your 75% is fixed, I agree the heap will be faster: you have partitioned it based on the 75% element. So insertions will be faster etc.

Moron 2010-09-22 16:12:48
