views:

814

answers:

3

The Scenario

I have several number ranges. Those ranges are not overlapping - as they are not overlapping, the logical consequence is that no number can be part of more than one range at any time. Each range is continuously (there are no holes within a single range, so a range 8 to 16 will really contain all numbers between 8 and 16), but there can be holes between two ranges (e.g. range starts at 64 and goes to 128, next range starts at 256 and goes to 384), so some numbers may not belong to any range at all (numbers 129 to 255 would not belong to any range in this example).

The Problem

I'm getting a number and need to know to which range the number belongs to... if it belongs to any range at all. Otherwise I need to know that it does not belong to any range. Of course speed is important; I can not simply check all the ranges which would be O(n), as there might be thousands of ranges.

Simple Solutions

A simple solution was keeping all numbers in a sorted array and run a binary search on it. That would give me at least O(log n). Of course the binary search must be somewhat modified as it must always check against the smallest and biggest number of a range. If the number to look for is in between, we have found the correct range, otherwise we must search ranges below or above the current one. If there is only one range left in the end and the number is not within that range, the number is within no range at all and we can return a "not found" result.

Ranges could also be chained together in some kind of tree structure. This is basically like a sorted list with binary search. The advantage is that it will be faster to modify a tree than a sorted array (adding/removing range), but unlike we waste some extra time on keeping the tree balanced, the tree might get very unbalanced over the time and that will lead to much slower searches than a binary search on a sorted array.

One can argue which solution is better or worse as in practice the number of searches and modification operations will be almost balanced (there will be an equal number of searches and add/remove operations performed per second).

Question

Is there maybe a better data structure than a sorted list or a tree for this kind of problem? Maybe one that could be even better than O(log n) in best case and O(log n) in worst case?

Some additional information that might help here is the following: All ranges always start and end at multiple of a power of two. They always all start and end at the same power of two (e.g. they all start/end at a multiple of 4 or at a multiple of 8 or at a multiple of 16 and so on). The power of two cannot change during run time. Before the first range is added, the power of two must be set and all ranges ever added must start/end at a multiple of this value till the application terminates. I think this can be used for optimization, as if they all start at a multiple of e.g. 8, I can ignore the first 3 bits for all comparison operations, the other bits alone will tell me the range if any.

I read about section and ranges trees. Are these optimal solutions to the problem? Are there possibly better solutions? The problem sounds similar to what a malloc implementation must do (e.g. every free'd memory block belongs to a range of available memory and the malloc implementation must find out to which one), so how do those commonly solve the issue?

+3  A: 

Create a sorted list and sort by the lower margin / start. That's easiest to implement and fast enough unless you have millions of ranges (and maybe even then).

When looking for a range, find the range where start <= position. You can use a binary search here since the list is sorted. The number is in the range if position <= end.

Since the end of any range is guaranteed to be smaller than start of the next range, you don't need to care about the end until you have found a range where the position might be contained.

All other data structures become interesting when you get intersections or you have a whole lot of ranges and when you build the structure one and query often.

Aaron Digulla
Yup, you are right. I don't have to look at start and end all the time during searching as the ranges don't overlap. This already somewhat simplifies the problem. Thanks :-)
Mecki
if you have a important set of range, you could do a binary search on the sorted ranges to improve on this solution.
PATRY
Using binary search is the obvious algorithmic optimization.
hughdbrown
I've improved my answer.
Aaron Digulla
+5  A: 

A balanced, sorted tree with ranges on each node seems to be the answer. I can't prove it's optimal, but if I were you I wouldn't look any further.

reinierpost
A balanced tree is certainly a good solution, but I'm somewhat scared wasting to much time on keeping that tree balanced at all times as the fluctuation of ranges will be huge.
Mecki
Several languages provide flexible implementations of balanced trees; C++'s set being the obvious example. If you're doing many inserts/deletes you will benefit greatly as these operations will be O(log(n)) instead of O(n) with the array implementation.
chrispy
Well, the best data structure depends on what your data fluctuations look like. But you have to modify a lot more than you look up before using a balanced tree will become a bad idea.
reinierpost
+4  A: 

After running various benchmarks, I came to the conclusion that only a tree like structure can work here. A sorted list shows of course good lookup performance - O(log n) - but it shows horribly update performance (inserts and removals are slower by more than the factor 10 compared to trees!).

A balanced binary tree also has O(log n) lookup performance, however it is much faster to update, also around O(log n), while a sorted list is more like O(n) for updates (O(log n) to find the position for insert or the element to delete, but then up to n elements must be moved within the list and this is O(n)).

I implemented an AVL tree, a red-black tree, a Treap, an AA-Tree and various variations of B-Trees (B means Bayer Tree here, not Binary). Result: Bayer trees almost never win. Their lookup is good, but their update performance is bad (as within each node of a B-Tree you have a sorted list again!). Bayer trees are only superior in cases where reading/writing a node is a very slow operation (e.g. when the nodes are directly read or written from/to hard disk) - as a B-Tree must read/write much less nodes than any other tree, so in such a case it will win. If we are having the tree in memory though, it stands no chance against other trees, sorry for all the B-Tree fans out there.

A Treap was easiest to implement (less than half the lines of code you need for other balanced trees, only twice the code you need for an unbalanced tree) and shows good average performance for lookups and updates... but we can do better than that.

An AA-Tree shows amazing good lookup performance - I have no idea why. They sometimes beat all other trees (not by far, but still enough to not be coincident)... and the removal performance is okay, however unless I'm too stupid to implement them correctly, the insert performance is really bad (it performs much more tree rotations on every insert than any other tree - even B-Trees have faster insert performance).

This leaves us with two classics, AVL and RB-Tree. They are both pretty similar but after hours of benchmarking, one thing is clear: AVL Trees definitely have better lookup performance than RB-Trees. The difference is not gigantic, but in 2/3 out of all benchmarks they will win the lookup test. Not too surprising, after all AVL Trees are more strictly balanced than RB-Trees, so they are closer to the optimal binary tree in most cases. We are not talking about a huge difference here, it is always a close race.

On the other hand RB Trees beat AVL Trees for inserts in almost all test runs and that is not such a close race. As before, that is expected. Being less strictly balanced RB Trees perform much less tree rotations on inserts compared to AVL Trees.

How about removal of nodes? Here it seems to depend a lot on the number of nodes. For small node numbers (everything less than half a million) RB Trees again own AVL Trees; the difference is even bigger than for inserts. Rather unexpected is that once the node number grows beyond a million nodes AVL Trees seems to catch up and the difference to RB Trees shrinks until they are more or less equally fast. This could be an effect of the system, though. It could have to do with memory usage of the process or CPU caching or the like. Something that has a more negative effect on RB Trees than it has on AVL Trees and thus AVL Trees can catch up. The same effect is not observed for lookups (AVL usually faster, regardless how many nodes) and inserts (RB usually faster, regardless how many nodes).

Conclusion:
I think the fastest I can get is when using RB-Trees, since the number of lookups will only be somewhat higher than the number of inserts and deletions and no matter how fast AVL is on lookups, the overall performance will suffer from their worse insert/deletion performance.

That is, unless anyone here may come up with a much better data structure that will own RB Trees big time ;-)

Mecki