views:

64

answers:

3

I have an ordered list of numbers with gaps in it. I want to find a particular number in as few steps as possible.

Are there any improvements on this algorithm? You can assume the set size is known without difficulty - it is stored and updated every time a new item is added.

Your objective is to get your cursor over the value x.

The first number (smallest) is s, and the last number (greatest) is g.

  1. Take the midpoint m1 of the set and check whether x < m1.
  2. If yes, then s <= x < m1.
  3. If no, then m1 < x <= g.
  4. If m1 = x, you're done.

Keep repeating till you find x - basically dividing the set into two halves with each iteration until you hit x.
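The steps above are a textbook binary search. A minimal sketch in Python (function and variable names are illustrative, not from the question):

```python
def binary_search(nums, x):
    """Return the index of x in the sorted list nums, or -1 if absent."""
    lo, hi = 0, len(nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if nums[mid] == x:
            return mid        # cursor is over x: done
        elif x < nums[mid]:
            hi = mid - 1      # x lies in the lower half
        else:
            lo = mid + 1      # x lies in the upper half
    return -1                 # x is not in the set


print(binary_search([2, 5, 9, 14, 20], 14))  # -> 3
```

Each iteration halves the remaining range, so the search takes at most about log2(n) comparisons.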

The purpose is to retrieve a numerical id from a very large table to then find the associated other records.

I would imagine this is the most trivial kind of indexing available, are there improvements?

+1  A: 

Binary search is fast, really fast...


You might want to focus your optimisation efforts here:

  • Using the most efficient data type for the number field.

  • Reducing repeated access to the same fields.


Also you might want to have a look at these to implement similar code yourself:

Good luck!!

CVS-2600Hertz
"Binary sort is fast..." While that's a true statement, the algorithm described is a binary search... might want to edit.
andand
@andand Thanks. Done.
CVS-2600Hertz
+1  A: 

As others have mentioned, you have described a binary search. As easy as it sounds to implement, there are some nuances that could come back to bite you if you're not careful. I suggest you read this section of the Wikipedia link to become aware of them.

andand
+2  A: 

If you want to use an ordered data structure, binary search is optimal in an asymptotic sense. However, if you use an auxiliary tree, you can gain a large constant factor in time performance if you pay attention to locality.

Specifically, if you are accessing your data from disk, then disk access time will dominate everything else. In that case, you want to reduce the number of distinct blocks of data that need to be randomly accessed from the disk. This is what B trees, B+ trees, and similar do: they store the data in the form of a tree, and make sure that the nodes have a large fanout so they can limit the depth, and therefore don't need to do very many random seeks.
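A quick back-of-the-envelope illustration of why large fanout matters (the key count and fanout values here are hypothetical, just to show the scale):

```python
import math

n = 10**9  # a billion keys in the index (hypothetical)

# Depth of a balanced search tree is roughly log_fanout(n):
# a binary tree vs. a B-tree-style node holding ~128 children.
for fanout in (2, 128):
    depth = math.ceil(math.log(n, fanout))
    print(f"fanout {fanout:>3}: about {depth} levels")
```

A depth of ~5 instead of ~30 means ~5 random disk seeks per lookup instead of ~30, which is exactly the win B trees and B+ trees are built for.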

If accessing data in ram, you can do something similar by paying attention to the cache lines; Judy trees are one example of this.

If you're doing an exact match, you can do hashing in constant time -- whether your numbers are ordered or not. Hashing can have significant overhead in time and space, though, and ordered methods are often competitive, so you really want to decide on a case-by-case basis.
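For the exact-match case, a hash table gives constant-time lookups regardless of key order. A sketch using Python's built-in dict (the ids and records are made up for illustration):

```python
# Build a hash index once: id -> record. Construction costs time and
# memory, but each subsequent probe is O(1) on average.
records = {101: "Alice", 205: "Bob", 307: "Carol"}

print(records.get(205))  # exact match found
print(records.get(999))  # missing key: returns None
```

Note that unlike binary search or a B tree, this gives you no way to do range queries or find the nearest key, which is one reason ordered methods stay competitive.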

comingstorm
B trees sound like they would be good for indexing text ... will read more to find out.. thanks
Ankur