ansaurus

Question

Answer 1

A:

Well, you could use pointers...

static int linear(const int *array, int arraySize, int key) {
    int i;

    for(i = 0; i < arraySize; ++i) {
        if(*array >= key) {
            return i;
        }

        ++array;
    }

    return arraySize;
}

strager 2010-04-30 01:55:00

Why is it faster?

the_drow 2010-04-30 01:57:32

@the_drow: In theory incrementing a pointer then testing that should be faster than: increment a counter, then adding that to an address, then testing that. In practice, I doubt their any different. Stylistically I prefer the answer presented here, because it removes the guesswork (and personally I think it flows better.)

GMan 2010-04-30 01:58:38

Yeah, but the compiler will probably optimize that bit anyway. You could also try loop unrolling.

BobbyShaftoe 2010-04-30 02:01:03

Look at the output from your compiler on that one, it's probably the same as the OP's code. (gcc has been doing this optimization since <2.95, which is where I first noticed it.) The "counter" variable will be initialized to n and each time through the loop the counter is decremented while the pointer is advanced by 4 (or whatever sizeof(int) returns).

dash-tom-bang 2010-04-30 02:01:18

I don't think this helps at all. It just means you're incrementing an extra variable every loop. Unless dereferencing a pointer is faster than array[i]...

Chris Cooper 2010-04-30 02:01:30

@Shaftoe, Yes; this kind of microoptimization I have a hard time doing with a clean conscience.

strager 2010-04-30 02:03:36

@GMan: Just about any compiler that offers code optimizations will reduce the counter + array index to pointer arithmetic in the generated code.

dthorpe 2010-04-30 02:04:10

@Chris: Did you read the details of what I wrote? There is no "extra" variable being incremented, incrementing n was replaced with incrementing a pointer, and the indexing math disappears. @dthrope: Indeed, hence why I said "In practice..."

GMan 2010-04-30 02:04:26

Last time I tested on X86, array[i] was faster than *array because you only have to do one increment instead of two.

Mark Ransom 2010-04-30 02:04:27

This will make your code slower at worst, because you're doing one extra increment per iteration. If the compiler's smart, it will give the same performance as the original code.

Mark Probst 2010-04-30 02:08:20

-1 Oh, the joys of hair splitting.

Romain Hippeau 2010-04-30 02:11:08

@GMan - The code has changed since I made my comment, so it is no longer applicable.

Chris Cooper 2010-04-30 03:10:07

@Chris: The code hasn't been changed at all. No edit history, and your comment was made 6 minutes after the answer was posted (so no free edits.)

GMan 2010-04-30 03:20:09

@GMan - Unless it took me a while to actually write the comment I assume...? Anyway, other people have said similar things. I re-read the post, and my comment IS still applicable. But I agree with Romain... Too much hair-splitting here. =P

Chris Cooper 2010-04-30 04:28:03

@Chris: Well I also have my personal testimony that the code hasn't changed, but I prefer evidence-based claims. Either way, this answer is fair, though obviously not what OP is looking for. (Which such a poorly worded question, stager gives a fair answer.)

GMan 2010-04-30 05:00:07

@GMan - Agreed. And regardless of if it's changed or not, the change I testify to was simply a moving of the "++array" from the for loop () to the end of the loop, which has no function difference. =P

Chris Cooper 2010-04-30 05:16:32

Answer 2

+9 A:

Since you can put known values after the last valid entry, add an extra element n+1 = max to make sure the loop doesn't go past the end of the array without having to test for i < n.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                ++i;
        }
        return i;
}

You could also try unrolling the loop, with the same sentinel value:

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (true) {
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
                if (arr [i++] >= key)
                        break;
        }
        return --i;
}

Mark Ransom 2010-04-30 02:03:33

Correct in principle, but incorrect in detail. The sentinel must be greater or equal to the key, not less.

Mark Probst 2010-04-30 02:10:43

Took a few edits to get these right, sorry if anybody is confused.

Mark Ransom 2010-04-30 02:10:54

Also, the assert is incorrect, apart from the sign. The element after the last one has index n, not n+1.

Mark Probst 2010-04-30 02:12:33

@Mark, thanks for spotting n+1, I guess I'm not done editing. And I think you're right about the sentinel too, which is how I had it first - I'm trying to do this too fast.

Mark Ransom 2010-04-30 02:14:48

As long as we're micro-optomizing, you may as well start `i` at `-1` and then pre-increment it in the array indexing in your unrolled loop example. That will save you the additional `--i` at the end.

Matt B. 2010-04-30 17:16:36

@Matt, since the decrement is outside the loop it won't make a measurable difference in the results. I would also need to change all the postincrements to preincrements. Thanks for the suggestion though.

Mark Ransom 2010-04-30 21:52:27

@Mark Ransom: Why unrolling 4 times? Why not 2 or 8 or 16?

Lazer 2010-06-13 10:41:49

@Lazer, you could certainly do that but there's a point of diminishing returns. Also the larger the code, the more possibility of cache problems.

Mark Ransom 2010-06-13 14:51:26

@Mark Ransom: yes, I do understand that, but how did you arrive at 4 anyways? Also, I am not sure about the "of course" part of ["For n < 4 this unrolling will not speed up search at all, of course"](http://schani.wordpress.com/2010/04/30/linear-vs-binary-search/).

Lazer 2010-06-13 17:16:14

Answer 3

A:

loop backwards, this might be translated...

// loop backward

for (int i = arraySize - 1; i >=0; --i)

...to this( "could be" faster ):

    mov cx, arraySize - 1
detectionHere:
    ...
    loop detectionHere

Other than that, only binary search can make searching faster

Michael Buen 2010-04-30 02:04:11

`loop` isn't fast; most complex instructions are slower than multiple simple instructions nowadays. Also, woudln't this make a bad use of caches?

Bastien Léonard 2010-04-30 09:35:16

hence the "could be" faster. one less instruction, one less cycle, just my thoughts

Michael Buen 2010-04-30 11:52:12

Answer 4

A:

this might force vector instructions (suggested by Gman):

for (int i = 0; i < N; i += 4) {
    bool found = false;   
    found |= (array[i+0] >= key);
    ...
    found |= ( array[i+3] >= key);
    // slight variation would be to use max intrinsic
    if (found) return i;
}
...
// quick search among four elements

this also makes fewer branch instructions. you make help by ensuring input array is aligned to 16 byte boundary

another thing that may help vectorization (doing vertical max comparison):

for (int i = 0; i < N; i += 8) {
    bool found = false;   
    found |= max(array[i+0], array[i+4]) >= key;
    ...
    found |= max(array[i+3], array[i+7] >= key;
    if (found) return i;
}
// have to search eight elements

aaa 2010-04-30 02:10:19

This is interesting.Can you explain this code?

the_drow 2010-04-30 02:31:56

@the_drow basically, you hoping to use vector instructions to do 4x things in one time. many compilers can be forced to use such instructions.in the first, you loading 4 elements, in the second, you loading eight elements, and eliminate half by using vector max function.the result is range in which index is located (four or eight elements long).after this you have to search small range for exact index

aaa 2010-04-30 03:22:47

Answer 5

+18 A:

Tell your boss you can make it 50% faster, but it will take 6 months, and some money.
Wait six months.
Buy new hardware.

Well, it makes about as much sense as a linear search through a sorted array!

(More seriously, can you give us some clues about why no binary search?)

Oddthinking 2010-04-30 02:34:09

If you read through all the comments, you can see that he asked this as a mental exercise. I like your answer, it's a classic! Definitely thinking outside the box. Unfortunately it doesn't really address the spirit of the question, which is how you would write the code differently.

Mark Ransom 2010-04-30 02:39:58

I can top @Mark Ransom... just overclock your processor.

rlbond 2010-04-30 04:06:55

-1 The answer is funny, but not quite helpful.

Andrei Ciobanu 2010-04-30 13:45:28

Andrei, I upvoted your comment. Of course, you are right it isn't helpful. There's a meta-discussion around here somewhere about how much effort the community should give to problems which is really a self-imposed thought exercise and isn't tagged as "puzzle"/"golf" or the like.

Oddthinking 2010-04-30 16:39:29

I just went and looked at his blog post. His justification is given there. And his blog is named after four of my favourite activities. Now I feel bad :-(

Oddthinking 2010-05-01 03:30:10

Answer 6

+4 A:

If you had a quantum computer, you could use Grover's algorithm to search your data in O(N^1/2) time and using O(log N) storage space. Otherwise, your question is pretty silly. Binary search or one of its variants (trinary search, for example) is really your best choice. Doing micro-optimizations on a linear search is stupid when you can pick a superior algorithm.

George 2010-04-30 02:43:17

Ok, Mister Smarty-Pants, I have a Core i7 and need to search in an array of size 64, and it needs to be super-fast. Linear or binary? Any further optimizations?

Mark Probst 2010-04-30 02:56:18

George: For small arrays, cache misses and branch mispredictions will dominate the time to run a binary search. A linear search can use prefetch to eliminate cache misses and will be able to predict most branches.

Gabe 2010-04-30 03:41:15

@Mark Probst, In your case, I can do it in constant time...

aioobe 2010-04-30 08:19:24

Yes, you can do almost everything in constant time, if you just make the constant large enough. But that was not the question.

Mark Probst 2010-04-30 11:07:46

In theory, a fixed size array is searched in constant time. In theory, there is no difference between theory and practice. In practice, that is not true.

Alan 2010-04-30 17:05:47

@Mark: True, but if you're searching an array of size 64 why go through all this effort to optimize the search?

BlueRaja - Danny Pflughoeft 2010-04-30 20:42:13

I could ask the same question for any array size, couldn't I?

Mark Probst 2010-05-02 00:15:04

Answer 7

A:

You could search for a larger element than an int at a time - platform specifically, this can be much faster or slower depending on how it handles the larger data read. For instance, on a 64-bit system, reading in the array 2 elements at a time and checking the hi/low elements seperately could run faster due to less I/O. Still, this is an O(n) variety sort no matter what.

Michael Dorgan 2010-04-30 03:02:04

Answer 8

A:

If a target-specific solution is acceptable then you can quite easily use SIMD (SSE, AltiVec, or whatever you have available) to get ~ 4x speed-up by testing 4 elements at a time rather than just 1.

Out of interest I put together a simple SIMD implementation as follows:

int linear_search_ref(const int32_t *A, int32_t key, int n)
{
    int result = -1;
    int i;

    for (i = 0; i < n; ++i)
    {
        if (A[i] >= key)
        {
            result = i;
            break;
        }
    }
    return result;
}

int linear_search(const int32_t *A, int32_t key, int n)
{
#define VEC_INT_ELEMS 4
#define BLOCK_SIZE (VEC_INT_ELEMS * 32)
    const __m128i vkey = _mm_set1_epi32(key);
    int vresult = -1;
    int result = -1;
    int i, j;

    for (i = 0; i <= n - BLOCK_SIZE; i += BLOCK_SIZE)
    {
        __m128i vmask0 = _mm_set1_epi32(-1);
        __m128i vmask1 = _mm_set1_epi32(-1);
        int mask0, mask1;

        for (j = 0; j < BLOCK_SIZE; j += VEC_INT_ELEMS * 2)
        {
            __m128i vA0 = _mm_load_si128(&A[i + j]);
            __m128i vA1 = _mm_load_si128(&A[i + j + VEC_INT_ELEMS]);
            __m128i vcmp0 = _mm_cmpgt_epi32(vkey, vA0);
            __m128i vcmp1 = _mm_cmpgt_epi32(vkey, vA1);
            vmask0 = _mm_and_si128(vmask0, vcmp0);
            vmask1 = _mm_and_si128(vmask1, vcmp1);
        }
        mask0 = _mm_movemask_epi8(vmask0);
        mask1 = _mm_movemask_epi8(vmask1);
        if ((mask0 & mask1) != 0xffff)
        {
            vresult = i;
            break;
        }
    }
    if (vresult > -1)
    {
        result = vresult + linear_search_ref(&A[vresult], key, BLOCK_SIZE);
    }
    else if (i < n)
    {
        result = i + linear_search_ref(&A[i], key, n - i);
    }
    return result;
#undef BLOCK_SIZE
#undef VEC_INT_ELEMS
}

On a 2.67 GHz Core i7, using OpenSUSE x86-64 and gcc 4.3.2, I get around 7x - 8x improvement around a fairly broad "sweet spot" where n = 100000 with the key being found at the midpoint of the array (i.e. result = n / 2). Performance drops off to around 3.5x when n gets large and the array therefore exceeds cache size (presumably becoming memory bandwidth-limited in this case). Performance also drops off when n is small, due to inefficiency of the SIMD implementation (it was optimised for large n of course).

Paul R 2010-04-30 07:51:35

You can use SIMD, but the speedup will not be 4x, especially not for small arrays. Tested with SSE2 on a Core i7. I'd be interested in your implementation.

Mark Probst 2010-04-30 11:09:47

For small arrays, perhaps not, but for large arrays I think you should be able to hit 4x using SIMD. I would unroll the main loop by 2 so that you have two vector loads issued per iteration and you should then be able to hide most of the latencies.

Paul R 2010-04-30 13:18:07

I've spent quite some time fiddling with this, and the best speedup I can get with SSE2 over my best non-SSE2 implementation is 2.6x for large arrays. I'd be happy to test your implementation, though :-)

Mark Probst 2010-04-30 15:04:30

@Mark: OK - challenge accepted - I'll code it up later and get back to you...

Paul R 2010-04-30 15:55:17

For large buffers, around 2.5x is the typical speedup I've seen for carefully optimized sse2 code over carefully optimized straight c math.

Alan 2010-04-30 17:13:06

@Alan: it depends a great deal on what CPU you are using, and also some extent on what compiler. Prior to Woodcrest when SSE2 was only a 64 bit implementation under the hood, SSE speed-ups were modest compared to full 128 bit SIMD implementations such as AltiVec, but from Core 2 Duo onwards you should be able to get 4x improvement for float/int.

Paul R 2010-04-30 19:06:24

@Mark Probst: OK, I've added a simple SIMD implementation to my answer above. It's around `8x` faster than scalar code at best, with array sizes of the order of 100000 and the key value being found half way through the array. It drops off to around `3.5x` for very large arrays.

Paul R 2010-05-01 17:17:51

For small array sizes this is much slower than my fastest implementations (and slower, actually, than my slowest). For N=1000 it's about as fast as my fastest non-SIMD implementation, but still not even half the speed of my best SSE2 implementation. At N=10000 it's almost as fast as my best SSE2 implementation, but it never catches up fully. GCC 4.2.1 on a Core i7.

Mark Probst 2010-05-01 20:10:29

@Mark: I wonder how you're compiling it, and also how you're timing it ? I'm using `gcc -O3` and it's an x86-64 executable (twice as many SSE registers as x86). When I time it I'm doing it in a loop (100 iterations) and taking the minimum time - this means that for all but the first iteration the caches will be primed. If you're just timing one iteration then your measurements will be skewed. And yes, of course the performance will be poor for small arrays - that is expected since the routine evaluates blocks of the array rather than individual elements or vectors.

Paul R 2010-05-02 07:27:43

@Mark: just one further sanity check: what are your absolute times ? At 2.67 GHz I'm seeing around 1.0 ns / element searched for the scalar code and under 0.15 ns / element searched for my SIMD code (both for the N = 100000 case, where the key is at N / 2, hence times are equal to total time / 50000).

Paul R 2010-05-02 10:07:40

0.15ns / element searched means you're doing 2.5 elements/cycle. My code is doing 2.6 elements/cycle. Feel free to try it out yourself http://github.com/schani/linbin - I made a branch "paulr" that includes your implementation. Test linear_sentinel_sse2_nobranch vs linear_sse2_paulr.

Mark Probst 2010-05-02 11:00:56

@Mark: Thanks - I have some more ideas for further optimisations which I'll try out later today. One involves using SSE4 but I'll #ifdef that code in case you want to restrict this to SSE2.

Paul R 2010-05-02 13:54:03

@Mark: well my further optimisations didn't help much - I suspect that performance has reached the point where it is limited by cache/memory read bandwidth. This is always the problem when you have very little computation relative to memory I/O.

Paul R 2010-05-02 15:22:40

Could well be. Good for us, then, isn't it? :-)

Mark Probst 2010-05-02 15:25:11

Answer 9

+1 A:

unroll with fixed array indices.

int linear( const int *array, int n, int key ) {
  int i = 0;
  if ( array[n-1] >= key ) {
     do {
       if ( array[0] >= key ) return i+0;
       if ( array[1] >= key ) return i+1;
       if ( array[2] >= key ) return i+2;
       if ( array[3] >= key ) return i+3;
       array += 4;
       i += 4;
     } while ( true );
  }
  return -1;
}

drawnonward 2010-04-30 08:14:20

Answer 10

+2 A:

If you're on an Intel platform:

int linear (const int *array, int n, int key)
{
  __asm
  {
    mov edi,array
    mov ecx,n
    mov eax,key
    repne scasd
    mov eax,-1
    jne end
    mov eax,n
    sub eax,ecx
    dec eax
end:
  }
}

but that only finds exact matches, not greater than or equal matches.

In C, you can also use Duff's Device:

int linear (const int *array, int n, int key)
{
  const int
    *end = &array [n];

  int
    result = 0;

  switch (n % 8)
  {
    do {
  case 0:
    if (*(array++) >= key) break;
    ++result;
  case 7:
    if (*(array++) >= key) break;
    ++result;
  case 6:
    if (*(array++) >= key) break;
    ++result;
  case 5:
    if (*(array++) >= key) break;
    ++result;
  case 4:
    if (*(array++) >= key) break;
    ++result;
  case 3:
    if (*(array++) >= key) break;
    ++result;
  case 2:
    if (*(array++) >= key) break;
    ++result;
  case 1:
    if (*(array++) >= key) break;
    ++result;
    } while(array < end);
  }

  return result;
}

Skizz 2010-04-30 08:38:09

Be careful recommending Duff's device. It's clever C code, for some value of "clever", but because it's extremely unstructured, it can sometimes defeat modern optimizing compilers.

Dale Hagglund 2010-04-30 09:13:37

@Dale: You're right, modern compilers almost certainly would do a better job of loop unrolling than this.

Skizz 2010-04-30 20:25:08

Answer 11

+2 A:

You can do it in parallel.

If the list is small, maybe it won't be worth to split the search, but if have to process lots of searches, then you can definitively run them in parallel. That wouldn't reduce the latency of the operations, but would improve the throughput.

fortran 2010-04-30 09:15:19

There's almost no way that creating even one thread will be cheaper than a linear scan of 100-200 items.

Dale Hagglund 2010-04-30 09:23:40

Still, if there are going to be lots of searches, those can be done in parallel, and the threads can be in a pool and reused.

fortran 2010-04-30 10:25:43

In this case, if you are searching <60 items, there is no need for doing it in parallel. However, there are some use cases (I have one now) where an Array of items is not ordered and the order cannot be changed. Binary search cannot be used in this case and if the size of the Array is rather large (it would have to be somewhere around 10,000 to make it worth the extra effort), dividing the array and searching in parallel would definitely be a viable solution

Richard 2010-09-09 11:17:48

Answer 12

+2 A:

You've received many suggestions for improvements, but you need to measure each optimization to see which is best given your hardware and compiler.

As an example of this, in the first version of this response, I guessed that by 100-200 array elements, the slightly higher overhead of binary search should easily be paid for by far fewer probes into the array. However, in the comments below, Mark Probst reports that he sees linear search ahead up to about 500 entries on his hardware. This reinforces the need to measure when searching for the very best performance.

Note: Edited following Mark's comments below on his measurements of linear versus binary search for reasonably small N.

Dale Hagglund 2010-04-30 09:21:43

My best linear search beats a standard binary search up to N=550 on a Core i7.

Mark Probst 2010-04-30 11:11:06

Thanks for the information. I've updated my comment to reflect this.

Dale Hagglund 2010-04-30 15:27:31

The common rules of optimization: 1) Don't, 2) MeasureGiven that this was all a thought exercise, #1 doesn't apply. But #2 must always apply. I'm glad that somebody brought this up!

Harold Bamford 2010-04-30 17:18:58

Answer 13

A:

This answer is a little more obscure than my other one, so I'm posting it separately. It relies on the fact that C guarantees a boolean result false=0 and true=1. X86 can produce booleans without branching, so it might be faster, but I haven't tested it. Micro-optimizations like these will always be highly dependent on your processor and compiler.

As before, the caller is responsible for putting a sentinel value at the end of the array to ensure that the loop terminates.

Determining the optimum amount of loop unrolling takes some experimentation. You want to find the point of diminishing (or negative) returns. I'm going to take a SWAG and try 8 this time.

static int
linear (const int *arr, int n, int key)
{
        assert(arr[n] >= key);
        int i = 0;
        while (arr[i] < key) {
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
                i += (arr[i] < key);
       }
       return i;
}

Edit: As Mark points out, this function introduces a dependency in each line on the line preceding, which limits the ability of the processor pipeline to run operations in parallel. So lets try a small modification to the function to remove the dependency. Now the function does indeed require 8 sentinel elements at the end.

static int 
linear (const int *arr, int n, int key) 
{ 
        assert(arr[n] >= key);
        assert(arr[n+7] >= key);
        int i = 0; 
        while (arr[i] < key) {
                int j = i;
                i += (arr[j] < key); 
                i += (arr[j+1] < key); 
                i += (arr[j+2] < key); 
                i += (arr[j+3] < key); 
                i += (arr[j+4] < key); 
                i += (arr[j+5] < key); 
                i += (arr[j+6] < key); 
                i += (arr[j+7] < key); 
       } 
       return i; 
}

Mark Ransom 2010-04-30 11:40:04

Good one, but I don't think it will perform well because it introduces data dependency for the index i, which the more straightforward linear search does not. I'll benchmark it. Also, you need 8 sentinel values, not just one.

Mark Probst 2010-04-30 12:58:17

The data's in - it performs atrociously :-). It's beaten even by a straightforward, non-sentinel, non-unrolled linear search by almost a factor of 2.

Mark Probst 2010-04-30 13:08:13

Oh well, it was worth a shot. And you still only need one sentinel, because the index stops incrementing as soon as you reach it.

Mark Ransom 2010-04-30 13:28:43

But the assert will fail ;-). You're right, though.

Mark Probst 2010-04-30 15:00:44

@Mark Probst, try my newest wrinkle.

Mark Ransom 2010-04-30 16:17:19

Much better. About 30% faster than the bog-standard linear search, but still only about half the speed of unrolled linear search with sentinel. My code is now online at http://github.com/schani/linbin/ - feel free to play around with it.

Mark Probst 2010-04-30 16:49:52

Answer 14

A:

In one of the comments you said the array length is 64.

Well if you must do it linearly, you can do:

int i = -1;
do {
  if (arr[0] >= key){i = 0; break;}
  if (arr[1] >= key){i = 1; break;}
  ...
  if (arr[62] >= key){i = 62; break;}
  if (arr[63] >= key){i = 63; break;}
} while(0);

However, I seriously doubt if that is faster than this binary search: *

int i = 0;
if (key >= arr[i+32]) i += 32;
if (key >= arr[i+16]) i += 16;
if (key >= arr[i+ 8]) i +=  8;
if (key >= arr[i+ 4]) i +=  4;
if (key >= arr[i+ 2]) i +=  2;
if (key >= arr[i+ 1]) i +=  1;

*Thanks to Jon Bentley for that one.

Added: since you said this table is prepared once for a large number of searches, and you want fast, you could allocate some space somewhere and generate machine code with the values hard-wired into it. It could either be linear or binary search. If binary, the machine code would look like what the compiler would generate from this:

if (key < value32){
  if (key < value16){
    ...
  }
  else {
    ...
  }
}
else {
  if (key < value48){
    ...
  }
  else {
    ...
  }
}

Then you just copy that into a place where you can call it.

OR you could print the code above, compile and link it on the fly into a dll, and load the dll.

Mike Dunlavey 2010-04-30 20:17:58

Answer 15

A:

In reality, the answer to this question is 100% dependent on the platform you're writing the code for. For example:

CPU : Memory speed | Example CPU | Type of optimisation
========================================================================
    Equal          |    8086     | (1) Loop unrolling
------------------------------------------------------------------------
  CPU > RAM        |  Pentium    | (2) None

Avoiding the conditional branch required to loop though the data will give a slight performance improvement.
Once the CPU starts to get faster than the RAM, it doesn't matter how efficient the loop becomes (unless it's a really bad loop), it will be stalling due to having to wait for the data to be brought in from RAM. SIMD doesn't really help since the advantage of parallel testing is still outweighed by having to wait for more data to arrive. SIMD really comes into its own when you're CPU limited.

Skizz 2010-04-30 20:54:52

The data (http://schani.wordpress.com/2010/04/30/linear-vs-binary-search/) disagrees with your theory of reality.

Mark Probst 2010-05-01 00:05:36

@Mark: Your method appears to eliminate RAM overhead by throwing away the two slowest times, so you're effectively measuring the algorithm, not the whole system. After a couple of runs, the array will be loaded into L1 and L2 cache and be reasonably quick to access. It would be interesting to see the two slowest times included in the timings - if you could guarantee the data was in RAM and not any cache then the algorithm would have less of an effect on the timings.

Skizz 2010-05-01 21:31:09

I'm not throwing away the two slowest individual search times - I can't time a search that takes only a handful of cycles. I do, say, the same 20 million random searches, 10 times over, and throw away the times for the two slowest and the two fastest of those 10 runs. I average the 6 that remain and divide the average time by 20 million to get the average time for one individual search.If you know how to reliably time such a search from RAM, i.e. with "empty" L2 and L3 caches, please let me know.

Mark Probst 2010-05-02 00:19:44

Answer 16

+7 A:

So far you received multiple advices most of which state that linear search makes no sense on sorted data, when binary search will work much more efficiently instead. This often happens to be one of those popular "sounds right" assertions made by people who don't care to give the problem too much thought. In reality, if you consider the bigger picture, given the right circumstances, linear search can be much more efficient than binary search.

Note, that if we consider a single search query for a sorted array, binary search is significantly more efficient method than linear search. There's no argument about that. Also, when you perform multiple completely random queries to the same data binary search still wins over linear search.

However, the picture begins to change if the queries are not exactly random. Imagine that queries arrive in sorted order, i.e. each next query is for a higher value than the previous query. I.e. the queries are also sorted. BTW, they don't have to be globally and strictly sorted, from time to time the query sequence might get "reset", i.e. a low value is queried, but on average the consequent queries should arrive in increasing order. In other words, the queries arrive in series, each series sorted in ascending order. In this case, if the average length of the series is comparable to the length of your array, linear search will outperform binary search by a huge margin. However, to take advantage of this situation, you have to implement your search in incremental manner. It is simple: if the next query is greater than the previous one, you don't need to start the search from the beginning of the array. Instead, you can search from the point where the previous search stopped. The most simplistic implementation (just to illustrate the idea) might look as follows

static int linear(const int *arr, int n, int key)
{
  static int previous_key = INT_MIN;
  static int previous_i = 0;

  i = key >= previous_key ? previous_i : 0;

  while (i < n) {
    if (arr[i] >= key)
      break;
    ++i;
  }

  previous_key = key;
  previous_i = i;

  return i;
}

(Disclaimer: the above implementation is terribly ugly for the obvious reason that the array is arriving from outside as a parameter, while the previous search state is stored internally. Of course, this is wrong way to do it in practice. But again, the above is intended to illustrate the idea and no more).

Note, that the complexity of processing each series of ordered queries using the above approach is always O(N), regardless of the length of the series. Using the binary search, the complexity would be O(M * log N). So, for obvious reasons when M is close to N, i.e. queries arrive in sufficiently long ordered series, the above linear search will significantly outperform binary search, while for small M the binary search will win.

Also, even if the ordered series of queries are not very long, the above modification might still give you a noticeable improvement in search performance, considering that you have to use linear search.

P.S. As an additional piece of information about the structure of the problem:

When you need to perform the search in an ordered array of length N and you know in advance that the queries will arrive in ordered series of [approximate, average] length M, the optimal algorithm will look as follows

Calculate the stride value S = [N/M]. It might also make sense to "snap" the value of S to the [nearest] power of 2. Think of your sorted array as a sequence of blocks of length S - S-blocks.
After receiving a query, perform incremental linear search for the S-block that potentially contains the queried value, i.e. it is an ordinary linear search with stride S (of course, remember to start from the block where the previous search left off).
After finding the S-block, perform the binary search within the S-block for the queried value.

The above is the most optimal incremental search algorithm possible, in a sense that it achieves the theoretical limit on the asymptotic efficiency of repetitive search. Note, that if the value of M is much smaller then N, the algorithm "automatically" shifts itself towards binary search, while when M gets close to N the algorithm "automatically" favors linear search. The latter makes sense because in such environment linear search is significantly more efficient than binary search.

This all is just to illustrate the fact that blanket statements like "linear search on a sorted array is always useless" indicate nothing else than lack of knowledge on the part of those who make such statements.

AndreyT 2010-04-30 21:51:19

I think this is the best answer since the OP said "for a large number of searches".

temp2290 2010-05-03 15:36:40

ansaurus

tags:

views:

answers:

How fast can you make linear search?

related questions