tags:

views:

70

answers:

3

I have an idea of what I want the algorithm to do, but I'm looking to see if this has been implemented for me in python and I need to know the algorithm's name.

I want to insert a set of values in the range [0,65535] into the container. I want to then ask the container for an arbitrary order statistic. Both insert and query should be logarithmic.

A: 

I think you're looking for the quickselect or median-of-medians algorithm.

Steve Emmerson
Hmm, thanks for the answer, but I don't think quickselect is what I'm looking for. It is O(1) insert, O(n) query. Also, it has large space requirements. I think I figured out the algorithm, and I'll leave the question open a while longer in case someone knows the name.
Neil G
A: 

Here's the algorithm. However, I still don't know what it's called.

#!/usr/bin/env python
"""
This module declares Histogram, a class that maintains a histogram.
It can insert and remove values from a predetermined range, and return order
statistics.  Insert, remove, and query are logarithmic time in the size of the
range.  The space requirement is linear in the size of the range.
"""

import numpy as np

class Histogram:
    def __init__(self, size):
        """Create the data structure that holds elements in the range
           [0, size)."""
        self.__data = np.zeros(size, np.int32)
        self.total = 0

    def size(self):
        return self.__data.shape[0]

    def __find(self, o, a, b):
        if b == a + 1:
            return a
        mid = (b - a) / 2 + a
        if o > self.__data[mid]:
            return self.__find(o - self.__data[mid], mid, b)
        return self.__find(o, a, mid)

    def find(self, o):
        """Return the o'th smallest element in the data structure.  Takes
           O(log(size)) time."""
        return self.__find(o + 1, 0, self.size())

    def __alter(self, x, a, b, delta):
        if b == a + 1:
            self.total += delta
            return
        mid = (b - a) / 2 + a
        if x >= mid:
            self.__alter(x, mid, b, delta)
        else:
            self.__data[mid] += delta
            self.__alter(x, a, mid, delta)

    def insert(self, x):
        """Inserts element x into the data structure in O(log(size)) time."""
        assert(0 <= x < self.size())
        self.__alter(x, 0, self.size(), +1)

    def remove(self, x):
        """Removes element x from the data structure in O(log(size)) time."""
        assert(0 <= x < self.size())
        self.__alter(x, 0, self.size(), -1)

    def display(self):
        print self.__data

def histogram_test():
    size = 100
    total = 100
    h = Histogram(size)
    data = np.random.random_integers(0, size - 1, total)
    for x in data:
        h.insert(x)
    data.sort()
    for i in range(total):
        assert(h.find(i) == data[i])
    assert(h.find(total + 1) == size - 1)
    h.display()
Neil G
+1  A: 

The standard way to do this is by using an augmented binary search tree. Essentially, in addition to keeping the set of keys stored in the tree, you keep a count of the nodes stored in each subtree. This lets you computer order statistics efficiently.

Since you're dealing with bounded integers, you can just keep a binary search tree with the 65536 values stored in it, and keep a count of the number of elements stored in each subtree. This yields a running time of O(lg 65536) instead of O(lg n).

jonderry
That's a nice way of thinking about it. It's essentially what I did.
Neil G