views:

328

answers:

3

If I have a large set of continuous ranges ( e.g. [0..5], [10..20], [7..13],[-1..37] ) and can arrange those sets into any data-structure I like, what's the most efficient way to test which sets a particular test_number belongs to?

I've thought about storing the sets in a balanced binary tree based on the low number of a set ( and each node would have all the sets that have the same lowest number of their set). This would allow you to efficiently prune the number of sets based on whether the test_number you're testing against the sets is less than the lowest number of a set, and then prune that node and all the nodes to the right of that node ( which have a low number in their range which is greater than the test_number) . I think that would prune about 25% of the sets on average, but then I would need to linearly look at all the rest of the nodes in the binary tree to determine whether the test_number belonged in those sets. ( I could further optimize by sorting the lists of sets at any one node by the highest number in the set, which would allow me to do binary search within a specific list to determine which set, if any, contain the test_number. Unfortunately, most of the sets I'll be dealing with don't have overlapping set boundaries.)

I think that this problem has been solved in graphics processing since they've figured out ways to efficiently test which polygons in their entire model contribute to a specific pixel, but I don't know the terminology of that type of algorithm.

A: 

I think I would organise them in the same way Mediawiki indexes pages - as a bucket sort. I don't know that it's the most efficient algorithm out there, but it should be fast, and is pretty easy to implement (even I've managed it, and in SQL at that!!).

Basically, the algorithm for sorting is

For Each SetOfNumbers
   For Each NumberInSet
      Put SetOfNumbers into Bin(NumberInSet)

Then to query, you can just count the number of items in Bin(MyNumber)

This approach will work well when your SetOfNumbers rarely changes, although if they change regularly it's generally not too hard to keep the Bins updated either. It's chief disadvantage is that it trades space, and initial sorting time, for very fast queries.

Note that in the algorithm I've expanded the Ranges into SetsOfNumbers - enumerating every number in a given range.

RB
I think bucket sort is irrelevant here. In bucket sort, buckets do not have any intersection. Here, we have intersection in sets.
Mehrdad Afshari
I don't think I follow you. In my algorithm I'm expanding the set of numbers to contain all the numbers in the range, rather than just the range delimiters. This makes it very space inefficient, but very time efficient. The intersections between the buckets are not relevant.
RB
+5  A: 

Your intuition about the relevance of your problem to graphics is correct. Consider building and querying a segment tree. It is particularly-well suited for the counting query you want. See also its description in Computational Geometry.

Diomidis Spinellis
A segment tree is not the fastest method to just count the number of sets. As it'll require O(m.(log(n)+k)) where m is the number of checks, and k is the number of sets it falls into, n is total number of sets. My algorithm is O(m.log(n))
Mehrdad Afshari
Mehrdad, your idea is unbeatable for appropriate data sets. But the segment tree is drastically more flexible. It can handle doubles while yours is limited to integers. And it will effortlessly handle huge ranges (say [0..2000000000] which will make yours an enormous hog of space and time.
Sol
If you're only interested in counting, you just store the number of sets in the segment tree, and then the cost for retrieving the count becomes O(n log n).
Diomidis Spinellis
Sol, my idea does not depend on the range, only number of intervals.
Mehrdad Afshari
+1  A: 

I think building a tree structure will speed things up considerably (provided you have enough sets and numbers to check that it's worth the initial cost). Instead of a binary tree it should be a ternary tree. Each node should have left, middle, and right nodes, where the left node contains a set that is strictly less than the node set, the right is strictly greater, and the middle has overlap.

                Set1
              /  |  \
             /   |   \
            /    |    \
         Set2  Set3   Set4

It's quick and easy to tell if there's overlap in the sets, since you only have to compare the min and max values to order them. In the simple case above, Set2[max] < Set1[min], Set4[min] > Set1[max], and Set1 and Set3 have some overlap. This will speed up your search because if the number you're searching for is in Set1, it won't be in Set2 or Set4, and you don't have to check them.

I just want to point out that using a scheme like this only saves time over the naive implementation of checking every set if you have more numbers to check than you have sets.

Bill the Lizard