I'm looking to optimize this linear search:

static int
linear (const int *arr, int n, int key)
{
        int i = 0;
        while (i < n) {
                if (arr[i] >= key)
                        break;
                ++i;
        }
        return i;
}

The array is sorted and the function is supposed to return the index of the first element that is greater than or equal to the key. The array is not large (below 200 elements) and will be prepared once for a large number of searches. Array elements after the n-th can, if necessary, be initialized to something appropriate if that speeds up the search.

No, binary search is not allowed, only linear search.

Edit: All my knowledge about this topic is now summarized in this blog post.

A: 

Well, you could use pointers...

static int linear(const int *array, int arraySize, int key) {
    int i;

    for(i = 0; i < arraySize; ++i) {
        if(*array >= key) {
            return i;
        }

        ++array;
    }

    return arraySize;
}
strager
Why is it faster?
the_drow
@the_drow: In theory, incrementing a pointer then testing it should be faster than incrementing a counter, then adding that to an address, then testing that. In practice, I doubt they're any different. Stylistically I prefer the answer presented here, because it removes the guesswork (and personally I think it flows better).
GMan
Yeah, but the compiler will probably optimize that bit anyway. You could also try loop unrolling.
BobbyShaftoe
Look at the output from your compiler on that one, it's probably the same as the OP's code. (gcc has been doing this optimization since <2.95, which is where I first noticed it.) The "counter" variable will be initialized to n and each time through the loop the counter is decremented while the pointer is advanced by 4 (or whatever sizeof(int) returns).
dash-tom-bang
I don't think this helps at all. It just means you're incrementing an extra variable every loop. Unless dereferencing a pointer is faster than array[i]...
Chris Cooper
@Shaftoe, Yes; this kind of micro-optimization I have a hard time doing with a clean conscience.
strager
@GMan: Just about any compiler that offers code optimizations will reduce the counter + array index to pointer arithmetic in the generated code.
dthorpe
@Chris: Did you read the details of what I wrote? There is no "extra" variable being incremented; incrementing n was replaced with incrementing a pointer, and the indexing math disappears. @dthorpe: Indeed, hence why I said "In practice..."
GMan
Last time I tested on X86, array[i] was faster than *array because you only have to do one increment instead of two.
Mark Ransom
This will make your code slower at worst, because you're doing one extra increment per iteration. If the compiler's smart, it will give the same performance as the original code.
Mark Probst
-1 Oh, the joys of hair splitting.
Romain Hippeau
@GMan - The code has changed since I made my comment, so it is no longer applicable.
Chris Cooper
@Chris: The code hasn't been changed at all. No edit history, and your comment was made 6 minutes after the answer was posted (so no free edits.)
GMan
@GMan - Unless it took me a while to actually write the comment I assume...? Anyway, other people have said similar things. I re-read the post, and my comment IS still applicable. But I agree with Romain... Too much hair-splitting here. =P
Chris Cooper
@Chris: Well, I also have my personal testimony that the code hasn't changed, but I prefer evidence-based claims. Either way, this answer is fair, though obviously not what the OP is looking for. (With such a poorly worded question, strager gives a fair answer.)
GMan
@GMan - Agreed. And regardless of whether it's changed or not, the change I testify to was simply a moving of the "++array" from the for loop's parentheses to the end of the loop body, which makes no functional difference. =P
Chris Cooper
+9  A: 

Since you can put known values after the last valid entry, add an extra (n+1)-th element equal to the maximum possible value, to make sure the loop doesn't go past the end of the array without having to test for i < n.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                ++i;
        }
        return i;
}

You could also try unrolling the loop, with the same sentinel value:

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (true) {
                if (arr[i++] >= key)
                        break;
                if (arr[i++] >= key)
                        break;
                if (arr[i++] >= key)
                        break;
                if (arr[i++] >= key)
                        break;
        }
        return --i;
}
Mark Ransom
Correct in principle, but incorrect in detail. The sentinel must be greater than or equal to the key, not less.
Mark Probst
Took a few edits to get these right, sorry if anybody is confused.
Mark Ransom
Also, the assert is incorrect, apart from the sign. The element after the last one has index n, not n+1.
Mark Probst
@Mark, thanks for spotting n+1, I guess I'm not done editing. And I think you're right about the sentinel too, which is how I had it first - I'm trying to do this too fast.
Mark Ransom
As long as we're micro-optimizing, you may as well start `i` at `-1` and then pre-increment it in the array indexing in your unrolled loop example. That will save you the additional `--i` at the end.
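A sketch of what I mean (untested; same single-sentinel assumption as above):

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);  /* sentinel */
        int i = -1;
        for (;;) {
                if (arr[++i] >= key)
                        break;
                if (arr[++i] >= key)
                        break;
                if (arr[++i] >= key)
                        break;
                if (arr[++i] >= key)
                        break;
        }
        return i;
}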
Matt B.
@Matt, since the decrement is outside the loop it won't make a measurable difference in the results. I would also need to change all the postincrements to preincrements. Thanks for the suggestion though.
Mark Ransom
@Mark Ransom: Why unrolling 4 times? Why not 2 or 8 or 16?
Lazer
@Lazer, you could certainly do that but there's a point of diminishing returns. Also the larger the code, the more possibility of cache problems.
Mark Ransom
@Mark Ransom: yes, I do understand that, but how did you arrive at 4 anyways? Also, I am not sure about the "of course" part of ["For n < 4 this unrolling will not speed up search at all, of course"](http://schani.wordpress.com/2010/04/30/linear-vs-binary-search/).
Lazer
A: 

Loop backwards. This:

// loop backward
for (int i = arraySize - 1; i >= 0; --i)

...might be translated to this ("could be" faster):

    mov cx, arraySize - 1
detectionHere:
    ...
    loop detectionHere

Other than that, only binary search can make searching faster
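For this particular problem a backward scan would look something like the sketch below (untested; the function name is mine). Note it only pays off when keys tend to be large, since the work done is proportional to the number of elements that are >= key:

static int linear_backward(const int *arr, int n, int key)
{
    int i = n;
    /* walk down while the element just below is still >= key */
    while (i > 0 && arr[i - 1] >= key)
        --i;
    return i;  /* n if no element is >= key */
}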

Michael Buen
`loop` isn't fast; most complex instructions are slower than multiple simple instructions nowadays. Also, wouldn't this make bad use of the caches?
Bastien Léonard
hence the "could be" faster. one less instruction, one less cycle, just my thoughts
Michael Buen
A: 

This might force vector instructions (suggested by GMan):

for (int i = 0; i < N; i += 4) {
    bool found = false;
    found |= (array[i+0] >= key);
    found |= (array[i+1] >= key);
    found |= (array[i+2] >= key);
    found |= (array[i+3] >= key);
    // a slight variation would be to use a max intrinsic
    if (found) return i;  // then do a quick search among these four elements
}

This also makes for fewer branch instructions. You may help by ensuring the input array is aligned to a 16-byte boundary.

Another thing that may help vectorization (doing a vertical max comparison):

for (int i = 0; i < N; i += 8) {
    bool found = false;
    found |= max(array[i+0], array[i+4]) >= key;
    found |= max(array[i+1], array[i+5]) >= key;
    found |= max(array[i+2], array[i+6]) >= key;
    found |= max(array[i+3], array[i+7]) >= key;
    if (found) return i;  // then search among these eight elements
}
aaa
This is interesting. Can you explain this code?
the_drow
@the_drow Basically, you're hoping to use vector instructions to do four things at a time. Many compilers can be forced to use such instructions. In the first loop you load four elements; in the second you load eight elements and eliminate half by using a vector max function. The result is the range in which the index is located (four or eight elements long). After this you have to search that small range for the exact index.
aaa
+18  A: 
  1. Tell your boss you can make it 50% faster, but it will take 6 months, and some money.
  2. Wait six months.
  3. Buy new hardware.

Well, it makes about as much sense as a linear search through a sorted array!

(More seriously, can you give us some clues about why no binary search?)

Oddthinking
If you read through all the comments, you can see that he asked this as a mental exercise. I like your answer, it's a classic! Definitely thinking outside the box. Unfortunately it doesn't really address the spirit of the question, which is how you would write the code differently.
Mark Ransom
I can top @Mark Ransom... just overclock your processor.
rlbond
-1 The answer is funny, but not quite helpful.
Andrei Ciobanu
Andrei, I upvoted your comment. Of course, you are right that it isn't helpful. There's a meta-discussion around here somewhere about how much effort the community should give to problems that are really self-imposed thought exercises and aren't tagged as "puzzle"/"golf" or the like.
Oddthinking
I just went and looked at his blog post. His justification is given there. And his blog is named after four of my favourite activities. Now I feel bad :-(
Oddthinking
+4  A: 

If you had a quantum computer, you could use Grover's algorithm to search your data in O(√N) time using O(log N) storage space. Otherwise, your question is pretty silly. Binary search or one of its variants (ternary search, for example) is really your best choice. Doing micro-optimizations on a linear search is stupid when you can pick a superior algorithm.
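For reference, a sketch of the standard lower-bound binary search for the OP's "first element greater than or equal to key" formulation (untested):

static int binary_lower_bound(const int *arr, int n, int key)
{
    int lo = 0, hi = n;  /* invariant: the answer lies in [lo, hi] */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)
            lo = mid + 1;  /* answer is to the right of mid */
        else
            hi = mid;      /* arr[mid] >= key, so mid may be the answer */
    }
    return lo;  /* == n if no element is >= key */
}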

George
Ok, Mister Smarty-Pants, I have a Core i7 and need to search in an array of size 64, and it needs to be super-fast. Linear or binary? Any further optimizations?
Mark Probst
George: For small arrays, cache misses and branch mispredictions will dominate the time to run a binary search. A linear search can use prefetch to eliminate cache misses and will be able to predict most branches.
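For example (a sketch; assumes GCC's __builtin_prefetch, and the 16-element prefetch distance is a guess that would need tuning):

static int linear_prefetch(const int *arr, int n, int key)
{
    for (int i = 0; i < n; ++i) {
        /* hint upcoming cache lines; args: address, read (0), low temporal locality (0) */
        __builtin_prefetch(&arr[i + 16], 0, 0);
        if (arr[i] >= key)
            return i;
    }
    return n;
}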
Gabe
@Mark Probst, In your case, I can do it in constant time...
aioobe
Yes, you can do almost everything in constant time, if you just make the constant large enough. But that was not the question.
Mark Probst
In theory, a fixed size array is searched in constant time. In theory, there is no difference between theory and practice. In practice, that is not true.
Alan
@Mark: True, but if you're searching an array of size 64 why go through all this effort to optimize the search?
BlueRaja - Danny Pflughoeft
I could ask the same question for any array size, couldn't I?
Mark Probst
A: 

You could compare more than one int at a time; platform-specifically, this can be much faster or slower depending on how the larger data read is handled. For instance, on a 64-bit system, reading the array two elements at a time and checking the high/low halves separately could run faster due to fewer memory reads. Still, this is an O(n) search no matter what.
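As an illustration only (untested; assumes a little-endian machine, and the narrowing casts assume ordinary two's-complement behavior):

#include <stdint.h>
#include <string.h>

static int linear_pairs(const int32_t *arr, int n, int32_t key)
{
    int i = 0;
    /* read two elements per 8-byte load; memcpy sidesteps aliasing rules */
    for (; i + 1 < n; i += 2) {
        uint64_t pair;
        memcpy(&pair, &arr[i], sizeof pair);
        int32_t lo = (int32_t)(pair & 0xffffffffu);  /* arr[i]   on little-endian */
        int32_t hi = (int32_t)(pair >> 32);          /* arr[i+1] on little-endian */
        if (lo >= key) return i;
        if (hi >= key) return i + 1;
    }
    if (i < n && arr[i] >= key) return i;  /* odd leftover element */
    return n;
}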

Michael Dorgan
A: 

If a target-specific solution is acceptable then you can quite easily use SIMD (SSE, AltiVec, or whatever you have available) to get a ~4x speed-up by testing 4 elements at a time rather than just 1.

Out of interest I put together a simple SIMD implementation as follows:

#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

int linear_search_ref(const int32_t *A, int32_t key, int n)
{
    int result = -1;
    int i;

    for (i = 0; i < n; ++i)
    {
        if (A[i] >= key)
        {
            result = i;
            break;
        }
    }
    return result;
}

int linear_search(const int32_t *A, int32_t key, int n)
{
#define VEC_INT_ELEMS 4
#define BLOCK_SIZE (VEC_INT_ELEMS * 32)
    const __m128i vkey = _mm_set1_epi32(key);
    int vresult = -1;
    int result = -1;
    int i, j;

    for (i = 0; i <= n - BLOCK_SIZE; i += BLOCK_SIZE)
    {
        __m128i vmask0 = _mm_set1_epi32(-1);
        __m128i vmask1 = _mm_set1_epi32(-1);
        int mask0, mask1;

        for (j = 0; j < BLOCK_SIZE; j += VEC_INT_ELEMS * 2)
        {
            /* aligned loads: A must be 16-byte aligned */
            __m128i vA0 = _mm_load_si128((const __m128i *)&A[i + j]);
            __m128i vA1 = _mm_load_si128((const __m128i *)&A[i + j + VEC_INT_ELEMS]);
            __m128i vcmp0 = _mm_cmpgt_epi32(vkey, vA0);
            __m128i vcmp1 = _mm_cmpgt_epi32(vkey, vA1);
            vmask0 = _mm_and_si128(vmask0, vcmp0);
            vmask1 = _mm_and_si128(vmask1, vcmp1);
        }
        mask0 = _mm_movemask_epi8(vmask0);
        mask1 = _mm_movemask_epi8(vmask1);
        if ((mask0 & mask1) != 0xffff)
        {
            vresult = i;
            break;
        }
    }
    if (vresult > -1)
    {
        result = vresult + linear_search_ref(&A[vresult], key, BLOCK_SIZE);
    }
    else if (i < n)
    {
        result = i + linear_search_ref(&A[i], key, n - i);
    }
    return result;
#undef BLOCK_SIZE
#undef VEC_INT_ELEMS
}

On a 2.67 GHz Core i7, using OpenSUSE x86-64 and gcc 4.3.2, I get around 7x - 8x improvement around a fairly broad "sweet spot" where n = 100000 with the key being found at the midpoint of the array (i.e. result = n / 2). Performance drops off to around 3.5x when n gets large and the array therefore exceeds cache size (presumably becoming memory bandwidth-limited in this case). Performance also drops off when n is small, due to inefficiency of the SIMD implementation (it was optimised for large n of course).

Paul R
You can use SIMD, but the speedup will not be 4x, especially not for small arrays. Tested with SSE2 on a Core i7. I'd be interested in your implementation.
Mark Probst
For small arrays, perhaps not, but for large arrays I think you should be able to hit 4x using SIMD. I would unroll the main loop by 2 so that you have two vector loads issued per iteration and you should then be able to hide most of the latencies.
Paul R
I've spent quite some time fiddling with this, and the best speedup I can get with SSE2 over my best non-SSE2 implementation is 2.6x for large arrays. I'd be happy to test your implementation, though :-)
Mark Probst
@Mark: OK - challenge accepted - I'll code it up later and get back to you...
Paul R
For large buffers, around 2.5x is the typical speedup I've seen for carefully optimized sse2 code over carefully optimized straight c math.
Alan
@Alan: it depends a great deal on what CPU you are using, and also to some extent on what compiler. Prior to Woodcrest, when SSE2 was only a 64-bit implementation under the hood, SSE speed-ups were modest compared to full 128-bit SIMD implementations such as AltiVec, but from Core 2 Duo onwards you should be able to get a 4x improvement for float/int.
Paul R
@Mark Probst: OK, I've added a simple SIMD implementation to my answer above. It's around `8x` faster than scalar code at best, with array sizes of the order of 100000 and the key value being found half way through the array. It drops off to around `3.5x` for very large arrays.
Paul R
For small array sizes this is much slower than my fastest implementations (and slower, actually, than my slowest). For N=1000 it's about as fast as my fastest non-SIMD implementation, but still not even half the speed of my best SSE2 implementation. At N=10000 it's almost as fast as my best SSE2 implementation, but it never catches up fully. GCC 4.2.1 on a Core i7.
Mark Probst
@Mark: I wonder how you're compiling it, and also how you're timing it? I'm using `gcc -O3` and it's an x86-64 executable (twice as many SSE registers as x86). When I time it I'm doing it in a loop (100 iterations) and taking the minimum time - this means that for all but the first iteration the caches will be primed. If you're just timing one iteration then your measurements will be skewed. And yes, of course the performance will be poor for small arrays - that is expected since the routine evaluates blocks of the array rather than individual elements or vectors.
Paul R
@Mark: just one further sanity check: what are your absolute times? At 2.67 GHz I'm seeing around 1.0 ns / element searched for the scalar code and under 0.15 ns / element searched for my SIMD code (both for the N = 100000 case, where the key is at N / 2, hence times are equal to total time / 50000).
Paul R
0.15ns / element searched means you're doing 2.5 elements/cycle. My code is doing 2.6 elements/cycle. Feel free to try it out yourself http://github.com/schani/linbin - I made a branch "paulr" that includes your implementation. Test linear_sentinel_sse2_nobranch vs linear_sse2_paulr.
Mark Probst
@Mark: Thanks - I have some more ideas for further optimisations which I'll try out later today. One involves using SSE4 but I'll #ifdef that code in case you want to restrict this to SSE2.
Paul R
@Mark: well my further optimisations didn't help much - I suspect that performance has reached the point where it is limited by cache/memory read bandwidth. This is always the problem when you have very little computation relative to memory I/O.
Paul R
Could well be. Good for us, then, isn't it? :-)
Mark Probst
+1  A: 

Unroll with fixed array indices.

int linear( const int *array, int n, int key ) {
  int i = 0;
  if ( array[n-1] >= key ) {  /* guarantees the do-loop below terminates */
     do {
       if ( array[0] >= key ) return i+0;
       if ( array[1] >= key ) return i+1;
       if ( array[2] >= key ) return i+2;
       if ( array[3] >= key ) return i+3;
       array += 4;
       i += 4;
     } while ( true );
  }
  return -1;
}
drawnonward
+2  A: 

If you're on an Intel platform:

int linear (const int *array, int n, int key)
{
  __asm
  {
    mov edi,array   ; edi = start of array
    mov ecx,n       ; ecx = number of elements to scan
    mov eax,key     ; eax = value to scan for
    repne scasd     ; scan until [edi] == eax or ecx == 0
    mov eax,-1      ; assume not found (mov leaves the flags intact)
    jne end         ; ZF clear: no exact match was found
    mov eax,n
    sub eax,ecx     ; elements scanned = n - ecx
    dec eax         ; index of the match
end:
  }
}

but that only finds exact matches, not greater-than-or-equal matches.

In C, you can also use Duff's Device:

int linear (const int *array, int n, int key)
{
  const int
    *end = &array [n];

  int
    result = 0;

  switch (n % 8)
  {
    do {
  case 0:
    if (*(array++) >= key) break;
    ++result;
  case 7:
    if (*(array++) >= key) break;
    ++result;
  case 6:
    if (*(array++) >= key) break;
    ++result;
  case 5:
    if (*(array++) >= key) break;
    ++result;
  case 4:
    if (*(array++) >= key) break;
    ++result;
  case 3:
    if (*(array++) >= key) break;
    ++result;
  case 2:
    if (*(array++) >= key) break;
    ++result;
  case 1:
    if (*(array++) >= key) break;
    ++result;
    } while(array < end);
  }

  return result;
}
Skizz
Be careful recommending Duff's device. It's clever C code, for some value of "clever", but because it's extremely unstructured, it can sometimes defeat modern optimizing compilers.
Dale Hagglund
@Dale: You're right, modern compilers almost certainly would do a better job of loop unrolling than this.
Skizz
+2  A: 

You can do it in parallel.

If the list is small, maybe it won't be worth splitting the search, but if you have to process lots of searches, then you can definitely run them in parallel. That wouldn't reduce the latency of the individual operations, but it would improve throughput.
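A rough sketch of the throughput idea with POSIX threads (all names are mine; queries are batched up front and split across a fixed number of threads, and error handling is omitted):

#include <pthread.h>

#define NTHREADS 4

struct job {
    const int *arr; int n;           /* shared sorted array */
    const int *keys; int *results;   /* this thread's slice of the query batch */
    int nkeys;
};

static int linear(const int *arr, int n, int key)
{
    int i = 0;
    while (i < n && arr[i] < key) ++i;
    return i;
}

static void *worker(void *p)
{
    struct job *j = p;
    for (int q = 0; q < j->nkeys; ++q)
        j->results[q] = linear(j->arr, j->n, j->keys[q]);
    return NULL;
}

/* run nq independent searches, split across NTHREADS threads */
static void search_all(const int *arr, int n,
                       const int *keys, int *results, int nq)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    int per = (nq + NTHREADS - 1) / NTHREADS;
    int t, started = 0;
    for (t = 0; t < NTHREADS && started < nq; ++t) {
        int count = nq - started < per ? nq - started : per;
        jobs[t] = (struct job){ arr, n, keys + started, results + started, count };
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
        started += count;
    }
    while (t-- > 0)
        pthread_join(tid[t], NULL);
}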

fortran
There's almost no way that creating even one thread will be cheaper than a linear scan of 100-200 items.
Dale Hagglund
Still, if there are going to be lots of searches, those can be done in parallel, and the threads can be in a pool and reused.
fortran
In this case, if you are searching <60 items, there is no need to do it in parallel. However, there are some use cases (I have one now) where an array of items is not ordered and the order cannot be changed. Binary search cannot be used then, and if the size of the array is rather large (it would have to be somewhere around 10,000 to make it worth the extra effort), dividing the array and searching in parallel would definitely be a viable solution.
Richard
+2  A: 

You've received many suggestions for improvements, but you need to measure each optimization to see which is best given your hardware and compiler.

As an example of this, in the first version of this response, I guessed that at 100-200 array elements the slightly higher overhead of binary search would easily be paid for by far fewer probes into the array. However, in the comments below, Mark Probst reports that he sees linear search stay ahead up to about 500 entries on his hardware. This reinforces the need to measure when searching for the very best performance.
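A minimal sketch of such a measurement harness (assumes POSIX clock_gettime; the candidate function and repeat count are placeholders). Run it several times and keep the fastest result, as the measurement discussions elsewhere on this page do:

#include <time.h>

/* time `reps` searches with `search` and return nanoseconds per search */
static double bench(int (*search)(const int *, int, int),
                    const int *arr, int n, const int *keys, int reps)
{
    struct timespec t0, t1;
    volatile int sink = 0;  /* keeps the compiler from discarding the calls */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; ++r)
        sink += search(arr, n, keys[r]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / reps;
}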

Note: Edited following Mark's comments below on his measurements of linear versus binary search for reasonably small N.

Dale Hagglund
My best linear search beats a standard binary search up to N=550 on a Core i7.
Mark Probst
Thanks for the information. I've updated my comment to reflect this.
Dale Hagglund
The common rules of optimization: 1) Don't. 2) Measure. Given that this was all a thought exercise, #1 doesn't apply. But #2 must always apply. I'm glad that somebody brought this up!
Harold Bamford
A: 

This answer is a little more obscure than my other one, so I'm posting it separately. It relies on the fact that C guarantees a boolean result of false=0 and true=1. x86 can produce booleans without branching, so it might be faster, but I haven't tested it. Micro-optimizations like these will always be highly dependent on your processor and compiler.

As before, the caller is responsible for putting a sentinel value at the end of the array to ensure that the loop terminates.

Determining the optimum amount of loop unrolling takes some experimentation. You want to find the point of diminishing (or negative) returns. I'm going to take a SWAG and try 8 this time.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
        }
        return i;
}

Edit: As Mark points out, this function introduces a dependency in each line on the line preceding it, which limits the ability of the processor pipeline to run operations in parallel. So let's try a small modification to the function to remove the dependency. Now the function does indeed require 8 sentinel elements at the end.

static int 
linear (const int *arr, int n, int key) 
{ 
        assert(arr[n] >= key);
        assert(arr[n+7] >= key);
        int i = 0; 
        while (arr[i] < key) {
                int j = i;
                i += (arr[j] < key); 
                i += (arr[j+1] < key); 
                i += (arr[j+2] < key); 
                i += (arr[j+3] < key); 
                i += (arr[j+4] < key); 
                i += (arr[j+5] < key); 
                i += (arr[j+6] < key); 
                i += (arr[j+7] < key); 
        }
        return i;
}
Mark Ransom
Good one, but I don't think it will perform well, because it introduces a data dependency on the index i, which the more straightforward linear search does not. I'll benchmark it. Also, you need 8 sentinel values, not just one.
Mark Probst
The data's in - it performs atrociously :-). It's beaten even by a straightforward, non-sentinel, non-unrolled linear search by almost a factor of 2.
Mark Probst
Oh well, it was worth a shot. And you still only need one sentinel, because the index stops incrementing as soon as you reach it.
Mark Ransom
But the assert will fail ;-). You're right, though.
Mark Probst
@Mark Probst, try my newest wrinkle.
Mark Ransom
Much better. About 30% faster than the bog-standard linear search, but still only about half the speed of unrolled linear search with sentinel. My code is now online at http://github.com/schani/linbin/ - feel free to play around with it.
Mark Probst
A: 

In one of the comments you said the array length is 64.

Well, if you must do it linearly, you can do:

int i = -1;
do {
  if (arr[0] >= key){i = 0; break;}
  if (arr[1] >= key){i = 1; break;}
  ...
  if (arr[62] >= key){i = 62; break;}
  if (arr[63] >= key){i = 63; break;}
} while(0);

However, I seriously doubt that it is faster than this binary search: *

int i = 0;
if (key >= arr[i+32]) i += 32;
if (key >= arr[i+16]) i += 16;
if (key >= arr[i+ 8]) i +=  8;
if (key >= arr[i+ 4]) i +=  4;
if (key >= arr[i+ 2]) i +=  2;
if (key >= arr[i+ 1]) i +=  1;

*Thanks to Jon Bentley for that one.

Added: Since you said this table is prepared once for a large number of searches, and you want it fast, you could allocate some space somewhere and generate machine code with the values hard-wired into it. It could be either a linear or a binary search. If binary, the machine code would look like what the compiler would generate from this:

if (key < value32){
  if (key < value16){
    ...
  }
  else {
    ...
  }
}
else {
  if (key < value48){
    ...
  }
  else {
    ...
  }
}

Then you just copy that into a place where you can call it.

Or you could print the code above, compile and link it on the fly into a DLL, and load the DLL.
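A sketch of the on-the-fly route on POSIX (untested; file and symbol names are made up, error handling is omitted, and it generates the linear variant for brevity):

#include <dlfcn.h>   /* link with -ldl */
#include <stdio.h>
#include <stdlib.h>

typedef int (*search_fn)(int key);

/* emit a search with the sorted values hard-wired in, build it, load it */
static search_fn build_search(const int *arr, int n)
{
    FILE *f = fopen("gen_search.c", "w");
    fprintf(f, "int search(int key) {\n");
    for (int i = 0; i < n; ++i)
        fprintf(f, "  if (%d >= key) return %d;\n", arr[i], i);
    fprintf(f, "  return %d;\n}\n", n);
    fclose(f);

    system("cc -O2 -shared -fPIC gen_search.c -o gen_search.so");
    void *h = dlopen("./gen_search.so", RTLD_NOW);
    return (search_fn)dlsym(h, "search");
}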

Mike Dunlavey
A: 

In reality, the answer to this question is 100% dependent on the platform you're writing the code for. For example:

CPU : Memory speed | Example CPU | Type of optimisation
===================+=============+=======================
       Equal       |    8086     | (1) Loop unrolling
-------------------+-------------+-----------------------
     CPU > RAM     |   Pentium   | (2) None

  1. Avoiding the conditional branch required to loop through the data will give a slight performance improvement.
  2. Once the CPU starts to get faster than the RAM, it doesn't matter how efficient the loop becomes (unless it's a really bad loop); it will stall waiting for data to be brought in from RAM. SIMD doesn't really help, since the advantage of parallel testing is still outweighed by having to wait for more data to arrive. SIMD really comes into its own when you're CPU-limited.
Skizz
The data (http://schani.wordpress.com/2010/04/30/linear-vs-binary-search/) disagrees with your theory of reality.
Mark Probst
@Mark: Your method appears to eliminate RAM overhead by throwing away the two slowest times, so you're effectively measuring the algorithm, not the whole system. After a couple of runs, the array will be loaded into L1 and L2 cache and be reasonably quick to access. It would be interesting to see the two slowest times included in the timings - if you could guarantee the data was in RAM and not any cache then the algorithm would have less of an effect on the timings.
Skizz
I'm not throwing away the two slowest individual search times - I can't time a search that takes only a handful of cycles. I do, say, the same 20 million random searches, 10 times over, and throw away the times for the two slowest and the two fastest of those 10 runs. I average the 6 that remain and divide the average time by 20 million to get the average time for one individual search. If you know how to reliably time such a search from RAM, i.e. with "empty" L2 and L3 caches, please let me know.
Mark Probst
+7  A: 

So far you have received multiple suggestions, most of which state that linear search makes no sense on sorted data, since binary search will work much more efficiently instead. This often happens to be one of those popular "sounds right" assertions made by people who don't care to give the problem too much thought. In reality, if you consider the bigger picture, given the right circumstances, linear search can be much more efficient than binary search.

Note that if we consider a single search query on a sorted array, binary search is a significantly more efficient method than linear search. There's no argument about that. Also, when you perform multiple completely random queries on the same data, binary search still wins over linear search.

However, the picture begins to change if the queries are not exactly random. Imagine that queries arrive in sorted order, i.e. each query is for a higher value than the previous one. The queries don't have to be globally and strictly sorted: from time to time the query sequence might get "reset", i.e. a low value is queried, but on average the consecutive queries should arrive in increasing order. In other words, the queries arrive in series, each series sorted in ascending order. In this case, if the average length of a series is comparable to the length of your array, linear search will outperform binary search by a huge margin. However, to take advantage of this situation, you have to implement your search in an incremental manner. It is simple: if the next query is greater than the previous one, you don't need to start the search from the beginning of the array; instead, you can search from the point where the previous search stopped. The most simplistic implementation (just to illustrate the idea) might look as follows:

#include <limits.h>  /* INT_MIN */

static int linear(const int *arr, int n, int key)
{
  static int previous_key = INT_MIN;
  static int previous_i = 0;

  int i = key >= previous_key ? previous_i : 0;

  while (i < n) {
    if (arr[i] >= key)
      break;
    ++i;
  }

  previous_key = key;
  previous_i = i;

  return i;
}

(Disclaimer: the above implementation is terribly ugly, for the obvious reason that the array arrives from outside as a parameter while the previous search state is stored internally. Of course, this is the wrong way to do it in practice. But again, the above is intended to illustrate the idea and no more.)
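In practice you would make the state explicit; a minimal sketch of the same idea (the struct and function names are mine):

#include <limits.h>

struct search_state {
    int previous_key;
    int previous_i;
};

static void search_init(struct search_state *s)
{
    s->previous_key = INT_MIN;
    s->previous_i = 0;
}

static int linear_incremental(struct search_state *s,
                              const int *arr, int n, int key)
{
    /* resume from the previous stopping point for ascending queries */
    int i = key >= s->previous_key ? s->previous_i : 0;
    while (i < n && arr[i] < key)
        ++i;
    s->previous_key = key;
    s->previous_i = i;
    return i;
}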

Note that the complexity of processing each series of ordered queries using the above approach is always O(N), regardless of the length of the series. Using binary search, the complexity would be O(M * log N). So, for obvious reasons, when M is close to N, i.e. the queries arrive in sufficiently long ordered series, the above linear search will significantly outperform binary search, while for small M binary search will win.

Also, even if the ordered series of queries are not very long, the above modification might still give you a noticeable improvement in search performance, considering that you have to use linear search.

P.S. As an additional piece of information about the structure of the problem:

When you need to perform a search in an ordered array of length N and you know in advance that the queries will arrive in ordered series of [approximate, average] length M, the optimal algorithm will look as follows:

  1. Calculate the stride value S = [N/M]. It might also make sense to "snap" the value of S to the [nearest] power of 2. Think of your sorted array as a sequence of blocks of length S, so-called S-blocks.
  2. After receiving a query, perform an incremental linear search for the S-block that potentially contains the queried value, i.e. an ordinary linear search with stride S (of course, remember to start from the block where the previous search left off).
  3. After finding the S-block, perform a binary search within it for the queried value. (A sketch follows below.)

The above is the most optimal incremental search algorithm possible, in the sense that it achieves the theoretical limit on the asymptotic efficiency of repetitive search. Note that if the value of M is much smaller than N, the algorithm "automatically" shifts itself towards binary search, while when M gets close to N the algorithm "automatically" favors linear search. The latter makes sense because in such an environment linear search is significantly more efficient than binary search.
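A sketch of steps 1-3 (untested; the caller computes the stride S = [N/M] once, and `resume` holds the start of the block where the previous search stopped, initialized to 0; the strictly-greater restart test keeps duplicate keys correct):

static int stride_search(const int *arr, int n, int key, int S, int *resume)
{
    /* resume from the previous block only for ascending queries */
    int b = (*resume < n && key > arr[*resume]) ? *resume : 0;
    /* step 2: linear scan with stride S for the block that may hold the key */
    while (b + S < n && arr[b + S - 1] < key)
        b += S;
    *resume = b;
    /* step 3: lower-bound binary search inside [b, min(b+S, n)) */
    int lo = b, hi = b + S < n ? b + S : n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return lo;  /* == n if no element is >= key */
}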

This is all just to illustrate the fact that blanket statements like "linear search on a sorted array is always useless" indicate nothing more than a lack of knowledge on the part of those who make them.

AndreyT
I think this is the best answer, since the OP said "for a large number of searches".
temp2290