ansaurus

Question

efficient algorithm to test _which_ sets a particular number belongs to

Answer 1

A:

I think I would organise them in the same way Mediawiki indexes pages - as a bucket sort. I don't know that it's the most efficient algorithm out there, but it should be fast, and is pretty easy to implement (even I've managed it, and in SQL at that!!).

Basically, the algorithm for sorting is

For Each SetOfNumbers
   For Each NumberInSet
      Put SetOfNumbers into Bin(NumberInSet)

Then to query, you can just count the number of items in Bin(MyNumber)

This approach will work well when your SetOfNumbers rarely changes, although if they change regularly it's generally not too hard to keep the Bins updated either. It's chief disadvantage is that it trades space, and initial sorting time, for very fast queries.

Note that in the algorithm I've expanded the Ranges into SetsOfNumbers - enumerating every number in a given range.

RB 2008-12-19 16:22:48

I think bucket sort is irrelevant here. In bucket sort, buckets do not have any intersection. Here, we have intersection in sets.

Mehrdad Afshari 2008-12-19 16:29:55

I don't think I follow you. In my algorithm I'm expanding the set of numbers to contain all the numbers in the range, rather than just the range delimiters. This makes it very space inefficient, but very time efficient. The intersections between the buckets are not relevant.

RB 2008-12-19 16:35:15

Answer 2

+5 A:

Your intuition about the relevance of your problem to graphics is correct. Consider building and querying a segment tree. It is particularly-well suited for the counting query you want. See also its description in Computational Geometry.

Diomidis Spinellis 2008-12-19 16:30:40

A segment tree is not the fastest method to just count the number of sets. As it'll require O(m.(log(n)+k)) where m is the number of checks, and k is the number of sets it falls into, n is total number of sets. My algorithm is O(m.log(n))

Mehrdad Afshari 2008-12-19 16:39:31

Mehrdad, your idea is unbeatable for appropriate data sets. But the segment tree is drastically more flexible. It can handle doubles while yours is limited to integers. And it will effortlessly handle huge ranges (say [0..2000000000] which will make yours an enormous hog of space and time.

Sol 2008-12-19 18:12:12

If you're only interested in counting, you just store the number of sets in the segment tree, and then the cost for retrieving the count becomes O(n log n).

Diomidis Spinellis 2008-12-19 18:26:10

Sol, my idea does not depend on the range, only number of intervals.

Mehrdad Afshari 2008-12-19 20:22:24

Answer 3

+1 A:

I think building a tree structure will speed things up considerably (provided you have enough sets and numbers to check that it's worth the initial cost). Instead of a binary tree it should be a ternary tree. Each node should have left, middle, and right nodes, where the left node contains a set that is strictly less than the node set, the right is strictly greater, and the middle has overlap.

                Set1
              /  |  \
             /   |   \
            /    |    \
         Set2  Set3   Set4

It's quick and easy to tell if there's overlap in the sets, since you only have to compare the min and max values to order them. In the simple case above, Set2[max] < Set1[min], Set4[min] > Set1[max], and Set1 and Set3 have some overlap. This will speed up your search because if the number you're searching for is in Set1, it won't be in Set2 or Set4, and you don't have to check them.

I just want to point out that using a scheme like this only saves time over the naive implementation of checking every set if you have more numbers to check than you have sets.

Bill the Lizard 2008-12-19 16:47:45

ansaurus

tags:

views:

answers:

efficient algorithm to test _which_ sets a particular number belongs to

related questions