views:

861

answers:

14

By 'empty buffer,' I mean a buffer full of zeroes.

I am searching for a faster method of accomplishing this:

int is_empty(char * buf, int size) {
 int i;
 for(i = 0; i < size; i++) {
  if(buf[i] != 0) return 0;
 }
 return 1;
}

I realize I am searching micro optimization unnecessary except in extreme cases, but I know a faster method exists, and I'm curious what it is.

A: 
int is_empty(char * buf, int size)
{
   return buf[0] == '\0';
}

If your buffer is not a character string, I think that's the fastest way to check...

memcmp() would require you to create a buffer the same size and then use memset to set it all as 0. I doubt that would be faster...

Kieveli
`memcmp()` doesn't _require_ that. You could make a really big static buffer initialized with `{ 0 }` and compare only portions of that to the buffer in question.
Chris Lutz
+5  A: 

One potential way, inspired by Kieveli's dismissed idea:

int is_empty(char *buf, size_t size)
{
    static const char zero[999] = { 0 };
    return !memcmp(zero, buf, size > 999 ? 999 : size);
}

Note that you can't make this solution work for arbitrary sizes. You could do this:

int is_empty(char *buf, size_t size)
{
    char *zero = calloc(size);
    int i = memcmp(zero, buf, size);
    free(zero);
    return i;
}

But any dynamic memory allocation is going to be slower than what you have. The only reason the first solution is faster is because it can use memcmp(), which is going to be hand-optimized in assembly language by the library writers and will be much faster than anything you could code in C.

EDIT: An optimization no one else has mentioned, based on earlier observations about the "likelyness" of the buffer to be in state X: If a buffer isn't empty, will it more likely not be empty at the beginning or the end? If it's more likely to have cruft at the end, you could start your check at the end and probably see a nice little performance boost.

EDIT 2: Thanks to Accipitridae in the comments:

int is_empty(char *buf, size_t size)
{
    return buf[0] == 0 && !memcmp(buf, buf + 1, size - 1);
}

This basically compares the buffer to itself, with an initial check to see if the first element is zero. That way, any non-zero elements will cause memcmp() to fail. I don't know how this would compare to using another version, but I do know that it will fail quickly (before we even loop) if the first element is nonzero. If you're more likely to have cruft at the end, change buf[0] to buf[size] to get the same effect.

Chris Lutz
On the other hand, the memcmp version is going to be bound by the memory bandwidth, and so will the smarter C version if the compiler is not too stupid, but the C version will have *half* the memory to read. Without trying, I predict the C version to be at least 50% faster than the memcmp one.
Pascal Cuoq
a) What "C" versions are you comparing? b) Why would `memcmp()` take longer?
Chris Lutz
Accipitridae
@Pascal - Timing it, the first `memcmp()` version is almost the same speed as the OP's version, clocking it consistently _slightly_ (like, 0.001s) faster.
Chris Lutz
You might want to make that `static` array `const`, too. That version is trivially inline-able, which could be another (tiny) win.
caf
+13  A: 

On many architectures, comparing 1 byte takes the same amount of time as 4 or 8, or sometimes even 16. 4 bytes is normally easy (either int or long), and 8 is too (long or long long). 16 or higher probably requires inline assembly to e.g., use a vector unit.

Also, a branch mis-predictions really hurt, it may help to eliminate branches. For example, if the buffer is almost always empty, instead of testing each block against 0, bit-or them together and test the final result.

derobert
+1 for branch information, since the poster is looking for micro-optimization.
Ben M
Let it be noted that comparing groups of 1 bytes as a single integral value brings up issues of alignment, which makes it difficult to do safely.
Chris Lutz
There is no way you could really optimize this loop in regards to branch prediction. A modern processor will favor taking a backwards jump over falling through if it has no prediction information and after that it will continue correctly predict the branch until the loop ends. The end result is one misprediction per buffer, I can't imagine a way to get this down to 0.
Falaina
He's not talking about prediction of the loop, he's talking about prediction of testing that the buffer contents are zero (and improving it by doing *fewer* comparisons)
Stephen Canon
Alignment doesn't complicate things too much. Process bytes one at a time until you have the necessary alignment, then fall into a main loop that uses larger comparisons, and have some single-byte comparisons to clean up the leftover items when you're done. Alignment is only a substantial pain when you're dealing with multiple buffers that may have different relative alignments.
Stephen Canon
+2  A: 

Try checking the buffer using an int-sized variable where possible (it should be aligned).

Off the top of my head (uncompiled, untested code follows - there's almost certainly at least one bug here. This just gives the general idea):

/* check the start of the buf byte by byte while it's unaligned */
while (size && !int_aligned( buf)) {
    if (*buf != 0) {
        return 0;
    }

    ++buf;
    --size;
}


/* check the bulk of the buf int by int while it's aligned */

size_t n_ints = size / sizeof( int);
size_t rem = size / sizeof( int);

int* pInts = (int*) buf;

while (n_ints) {
    if (*pInt != 0) {
        return 0;
    }

    ++pInt;
    --n_ints;
}


/* now wrap up the remaining unaligned part of the buf byte by byte */

buf = (char*) pInts;

while (rem) {
    if (*buf != 0) {
        return 0;
    }

    ++buf;
    --rem;
}

return 1;
Michael Burr
+1  A: 

The Hackers Delight book/site is all about optimized C/assembly. Lots of good references from that site also and is fairly up to date (AMD64, NUMA techniques also).

RandomNickName42
+7  A: 

For something so simple, you'll need to see what code the compiler is generating.

$ gcc -S -O3 -o empty.s empty.c

And the contents of the assembly:

        .text
        .align 4,0x90
.globl _is_empty
_is_empty:
        pushl       %ebp
        movl        %esp, %ebp
        movl        12(%ebp), %edx  ; edx = pointer to buffer
        movl        8(%ebp), %ecx   ; ecx = size
        testl       %edx, %edx
        jle L3
        xorl        %eax, %eax
        cmpb        $0, (%ecx)
        jne L5
        .align 4,0x90
L6:
        incl        %eax            ; real guts of the loop are in here
        cmpl        %eax, %edx
        je  L3
        cmpb        $0, (%ecx,%eax) ; compare byte-by-byte of buffer
        je  L6
L5:
        leave
        xorl        %eax, %eax
        ret
        .align 4,0x90
L3:
        leave
        movl        $1, %eax
        ret
        .subsections_via_symbols

This is very optimized. The loop does three things:

  • Increase the offset
  • Compare the offset to the size
  • Compare the byte-data in memory at base+offset to 0

It could be optimized slightly more by comparing at a word-by-word basis, but then you'd need to worry about alignment and such.

When all else fails, measure first, don't guess.

Sufian
It could be optimized MUCH more by using string instruction REP SCASx
Tomas
A: 

Edit: Bad answer

A novel approach might be

int is_empty(char * buf, int size) {
    char start = buf[0];
    char end = buff[size-1];
    buf[0] = 'x';
    buf[size-1] = '\0';
    int result = strlen(buf) == 0;
    buf[0] = start;
    buff[size-1] = end;
    return result;
}

Why the craziness? because strlen is one of the library function that's more likely to be optimized. Storing and replacing the first character is to prevent the false positive. Storing and replacing the last character is to make sure it terminates.

Calyth
This won't work for the binary data `0 0 1 0` among others. It doesn't check whether or not a string is empty.
Chris Lutz
Hm... Good point.
Calyth
+3  A: 

Look at fast memcpy - it can be adapted for memcmp (or memcmp against a constant value).

Vitali
interesting....
Jason S
A: 

Cast the buffer to int and iterate it as if it were an int array rather than a char array. The number of iterations being reduced to:

size / sizeof(int)

Further you can unroll the loop. On some architectures Duff's Device may present a gain, but that is by no means a given.

Clifford
This has been suggested already, and is probably unsafe due to alignment issues. It's best to read the answers already posted before posting more.
Chris Lutz
+1  A: 

Four functions for testing zeroness of a buffer with simple benchmarking:

#include <stdio.h> 
#include <string.h> 
#include <wchar.h> 
#include <inttypes.h> 

#define SIZE (8*1024) 
char zero[SIZE] __attribute__(( aligned(8) ));

#define RDTSC(var)  __asm__ __volatile__ ( "rdtsc" : "=A" (var)); 

#define MEASURE( func ) { \ 
  uint64_t start, stop; \ 
  RDTSC( start ); \ 
  int ret = func( zero, SIZE ); \ 
  RDTSC( stop ); \ 
  printf( #func ": %s   %12"PRIu64"\n", ret?"non zero": "zero", stop-start ); \ 
} 


int func1( char *buff, size_t size ){
  while(size--) if(*buff++) return 1;
  return 0;
}

int func2( char *buff, size_t size ){
  return *buff || memcmp(buff, buff+1, size-1);
}

int func3( char *buff, size_t size ){
  return *(uint64_t*)buff || memcmp(buff, buff+sizeof(uint64_t), size-sizeof(uint64_t));
}

int func4( char *buff, size_t size ){
  return *(wchar_t*)buff || wmemcmp((wchar_t*)buff, (wchar_t*)buff+1, size/sizeof(wchar_t)-1);
}

int main(){
  MEASURE( func1 );
  MEASURE( func2 );
  MEASURE( func3 );
  MEASURE( func4 );
}

Result on my old PC:

func1: zero         108668
func2: zero          38680
func3: zero           8504
func4: zero          24768
sambowry
intriguing, but func3 and func4 need some minor fixup to address non-integer multiples of wordsize, and alignment issues
Jason S
A: 

I see a lot of people saying things about alignment issues preventing you from doing word sized accesses, but that's not always true. If you're looking to make portable code, then this is certainly an issue, however x86 will actually tolerate misaligned accesses. For exmaple this will only fail on the x86 if alignment checking is turned on in EFLAGS (and of course buf is actuallly not word aligned).

int is_empty(char * buf, int size) {
 int i;
 for(i = 0; i < size; i+= 4) {
   if(*(int *)(buf + i) != 0) {
     return 0;
   }   
 }

 for(; i < size; i++) {
   if(buf[i] != 0) 
     return 0;
 }

 return 1;
}

Regardless the compiler CAN convert your original loop into a loop of word-based comparisons with extra jumps to handle alignment issues, however it will not do this at any normal optimization level because it lacks information. For cases when size is small, unrolling the loop in this way will make the code slower, and the compiler wants to be conservative.

A way to get around this is to make use of profile guided optimizations. If you let GCC get profile information on the is_empty function then re-compile it, it will be willing to unroll the loop into word-sized comparisons with an alignment check. You can also force this behavior with -funroll-all-loops

Falaina
A: 

int is_empty(char * buf, int size) {

int i;
int content=0;
for(i = 0; !content && i < size; i++)
{
content=content | buf(i); // bitwise or
}
return (content==0);
}

Emilio M Bumachar
a) Please use formatting. b) `||` is not bitwise OR, it's logical OR.
Chris Lutz
c) For efficiency, change your `for()` loop to `for(i = 0; !content i++)` so that it won't keep looping once bad content is found.
Chris Lutz
A: 

If your program is x86 only or x64 only, you can easily optimize using inline assambler. The REPE SCASD instruction will scan a buffer until a non EAX dword is found.

Since there is no equivalent standard library function, no compiler/optimizer will probably be able to use these instructions (as Confirmed by Sufian's code).

From the head, something like this would do if your buffer length is 4-bytes aligned (MASM syntax):

_asm {
   XOR EAX, EAX       ; search for non-zero
   LEA ESI, [buf]     ; search in buf
   MOV ECX, [buflen]  ; search buflen bytes
   SHL ECX, 2         ; using dwords so len/=4
   REPE SCASD         ; perform scan
   JCXZ bufferEmpty:  ; completes? then buffer is 0
}

Tomas

Tomas
A: 

Did anyone mention unrolling the loop? In any of these loops, the loop overhead and indexing is going to be significant.

Also, what is the probability that the buffer will actually be empty? That's the only case where you have to check all of it. If there typically is some garbage in the buffer, the loop should stop very early, so it doesn't matter.

If you plan to clear it to zero if it's not zero, it would probably be faster just to clear it with memset(buf, 0, sizeof(buf)), whether or not it's already zero.

Mike Dunlavey