views:

222

answers:

5

I have a buckets of numbers e.g. - 1 to 4, 5 to 15, 16 to 21, 22 to 34,.... I have roughly 600,000 such buckets. The range of numbers that fall in each of the bucket varies. I need to store these buckets in a suitable data structure so that the lookups for a number is as fast as possible.

So my question is what is the suitable data structure and a sorting mechanism for this type of problem.

Thanks in advance

+2  A: 

You will probably want some kind of sorted tree, like a B-Tree, B+ Tree, or Binary Search tree.

Jared Updike
+2  A: 

If I understand you correctly, you have a list of buckets and you want, given an arbitrary integer, to find out which bucket it goes in.

Assuming that none of the bucket ranges overlap, I think you could implement this in a binary search tree. That would make the lookup possible in O(logn) (whenere n=number of buckets).

It would be simple to do this, just define the left branch to be less than the low end of the bucket, the right branch to be greater than the right end. So in your example we'd end up with a tree something like:

    16-21
    /    \
  5-15  22-34
  /
1-4

To search for, say, 7, you just check the root. Less than 16? Yes, go left. Less than 5? No. Greater than 15? No, you're done.

You just have to be careful to balance your tree (or use a self balancing tree) in order to keep your worst-case performance down. this is really important if your input (the bucket list) is already sorted.

SoapBox
+6  A: 

If the buckets are contiguous and disjoint, as in your example, you need to store in a vector just the left bound of each bucket (i.e. 1, 5, 16, 22) plus, as the last element, the first number that doesn't fall in any bucket (35). (I assume, of course, that you are talking about integer numbers.)

Keep the vector sorted. You can search the bucket in O(log n), with kind-of-binary search. To search which bucket does a number x belong to, just go for the only index i such that vector[i] <= x < vector[i+1]. If x is strictly less than vector[0], or if it is greater than or equal to the last element of vector, then no bucket contains it.

EDIT. Here is what I mean:

#include <stdio.h>

// ~ Binary search. Should be O(log n)
int findBucket(int aNumber, int *leftBounds, int left, int right)
{
    int middle;

    if(aNumber < leftBounds[left] || leftBounds[right] <= aNumber) // cannot find
        return -1;
    if(left + 1 == right) // found
        return left;

    middle = left + (right - left)/2;

    if( leftBounds[left] <= aNumber && aNumber < leftBounds[middle] )
        return findBucket(aNumber, leftBounds, left, middle);
    else
        return findBucket(aNumber, leftBounds, middle, right);
}


#define NBUCKETS 12
int main(void)
{
    int leftBounds[NBUCKETS+1] = {1, 4, 7, 15, 32, 36, 44, 55, 67, 68, 79, 99, 101};
    // The buckets are 1-3, 4-6, 7-14, 15-31, ...

    int aNumber;
    for(aNumber = -3; aNumber < 103; aNumber++)
    {
        int index = findBucket(aNumber, leftBounds, 0, NBUCKETS);
        if(index < 0)
            printf("%d: Bucket not found\n", aNumber);
        else
            printf("%d belongs to the bucket %d-%d\n", aNumber, leftBounds[index], leftBounds[index+1]-1);
    }   
    return 0;
}
Federico Ramponi
I think the time to search a key will be much more on a list-based solution than a tree-based based solution
BlitzKrieg
I didn't say 'list'; I said 'vector' (or, an array). If you mean linked list, I agree :) Put the left bounds in any data structure that keeps them ordered and lets you search in O(log n)...
Federico Ramponi
@BlitzKrieg The average height of a balanced binary search tree is O(log n). Lookups, therefore, are O(log n). The same O(log n) that lookups in a sorted array of buckets would be. The difference in speed between the two will have to do with memory usage and memory access patterns. On both those counts, the sorted array wins: it has no memory usage overhead (a balanced binary tree has at least two pointers of overhead, usually a little more, e.g., for a red/black tag) and its memory locality, at least toward the end, will be better.
jemfinch
@Federico: I think you mean "Should be O(log n)" in your code. Also, you really shouldn't have to write your own binary search code; it's unfortunate that C's `bsearch` function returns NULL on lookup failure, rather than a pointer to the greatest element less than the key (which C++'s `std::binary_search` returns).
jemfinch
@jemfinch: corrected O(n), thank you.
Federico Ramponi
You're right guys. Thanks Federico !!
BlitzKrieg
A: 

A simple way to store and sort these in C++ is to use a pair of sorted arrays that represent the lower and upper bounds on each bucket. Then, you can use int bucket_index= std::distance(lower_bounds.begin(), std::lower_bound(lower_bounds, value)) to find the bucket that the value will match with, and if (upper_bounds[bucket_index]>=value), bucket_index is the bucket you want.

You can replace that with a single struct holding the bucket, but the principle will be the same.

MSN
A: 

+1 to the kind-of binary search idea. It's simple and gives good performance for 600000 buckets. That being said, if it's not good enough, you could create an array with MAX BUCKET VALUE - MIN BUCKET VALUE = RANGE elements, and have each element in this array reference the appropriate bucket. Then, you get a lookup in guaranteed constant [O(1)] time, at the cost of using a huge amount of memory.

If A) the probability of accessing buckets is not uniform and B) you knew / could figure out how likely a given set of buckets were to be accessed, you could probably combine these two approaches to create a kind of cache. For example, say bucket {0, 3} were accessed all the time, as was {7, 13}, then you can create an array CACHE. . .

int cache_low_value = 0;

int cache_hi_value = 13;

CACHE[0] = BUCKET_1

CACHE[1] = BUCKET_1

...

CACHE[6] = BUCKET_2

CACHE[7] = BUCKET_3

CACHE[8] = BUCKET_3

...

CACHE[13] = BUCKET_3

. . . which will allow you to find a bucket in O(1) time assuming the value you're trying to associate a value with a bucket is between cache_low_value and cache_hi_value (if Y <= cache_hi_value && Y >= cache_low_value; then BUCKET = CACHE[Y]). On the up side, this approach wouldn't use all the memory on your machine; on the downside, it'd add the equivalent of an additional operation or two to your bsearch in the case you can't find your number / bucket pair in the cache (since you had to check the cache in the first place).

cubic1271