views:

411

answers:

10

The challenge is to find the fastest way to determine in C/C++ the length of a c-string using bitwise operations in C.

char thestring[16];

The c-string has a max size of 16 chars and is inside a buffer If the string is equal to 16 chars doesn't have the null byte at the end.

I am sure can be done but didn't got it right yet.

I am working on this at the moment, but assuming the string is memcpied on a zero-filled buffer.

len =   buff[0] != 0x0 +
            buff[1] != 0x0 +
            buff[2] != 0x0 +
            buff[3] != 0x0 +
            buff[4] != 0x0 +
            buff[5] != 0x0 +
            buff[6] != 0x0 +
            buff[7] != 0x0 +
            buff[8] != 0x0 +
            buff[9] != 0x0 +
            buff[10] != 0x0 +
            buff[11] != 0x0 +
            buff[12] != 0x0 +
            buff[13] != 0x0 +
            buff[14] != 0x0 +
            buff[15] != 0x0;

Note: the buffer is zero-filled "\0123456789abcde" can't happen.

+4  A: 

This would work fine since buf is initialized with zero. Your solution has != which will use jump instruction. If the GPU has multiple XOR units, the following code can be pipelined quite nicely. On other hand, JUMP instruction would cause flushing of the pipeline.

len = !!buf[0] +
      !!buf[1] +
      //...
      !!buf[15]

Update: The above code and OP's code produce the same assembly code when compiled by GCC with -O3 flags. (different if no optimization flags are provided)

N 1.1
What's the point in xor'ing with 0?
GMan
@Gman: I was thinking of two solutions, mixed up while posting. Thanks :)
N 1.1
++1 Wow, I didn't know this !! (see why I love sharing problems on SO)
fabrizioM
If the GPU has multiple XOR units, the above code can be pipelined quite nicely. On other hand, JUMP instruction would cause flushing of the pipeline.
N 1.1
Only works when the buffer is zero initialized, otherwise just counts the number of non-nul characters, including garbage after the first nul.
ergosys
@ergo: Yes. fabrizioM is assuming the buffer has been memcpied with 0.
N 1.1
`!!buf[0]` and `buf[0] != 0` are, with any modern compiler, likely to result in identical machine language code.
caf
@caf: gcc 4.4.3 produces different code for the two.
N 1.1
gcc 4.3.2 doesn't (at least on -O3, x86 target).
caf
@caf: `-O3` does makes them same. I thought `!=` would be implemented using JMP. But compilers are super smart. (without optimization, they are different though :)
N 1.1
+3  A: 

The code you have won't work correctly. Just for example, consider a buffer containing something like:

"\0123456789abcde";

According to your code, this has a length of 15, but in reality its length is 0, because of the initial "\0".

As nice as it would be to do the computation in parallel, the simple fact is that the definition of the string more or less mandates starting from the beginning and counting characters only up to the point that you encounter a "\0" (or, in your case, get to 16).

Jerry Coffin
+1  A: 

Bitwise operations... maybe something like:

// TODO: optimize for 64-bit architectures
uint32_t *a = (uint32_t*)thestring;

for (int i = 0; i < 4; i++) // will be unwound
    for (int j = 0; j < 4; j++)
        if (a[i] & 0xff << j == 0)
           return 4*i+j;
return 16;
el.pescado
+1 it gave me some ideas on how to use 4 chars at a time (registers are 32bit)
fabrizioM
Just make sure those uints are aligned properly. Misalignment may be costly, if it doesn't cause other problems.
Maciej Hehl
+1  A: 

You can start with

template <typename T>
bool containsANull(T n) {
   return (n  - ((T) -1)/255) & ((T) -1)/255*128) & ~n;
}

and build something. To be worthwize T probably should be an unsigned 64 bits type, but even then there is some adjustment to do that makes me wonder if your buffer is long enough for that trick being usefull.

How does it works?

(T)-1/255 is the bit pattern 0x01010101 repeated as long as needed

(T)-1/255*128 is thus the bit pattern 0x80808080 repeated

if n is                        0x0123456789ABCDEF
n - 0x1111..1 is               0xF0123456789ABCDE
(n-0x1111...1) & 0x8888...8 is 0x8000000008888888
~n is                          0xFEDCBA9876543210 
so the result is               0x8000000000000000

The only way to get a non nul byte here is to start with a null byte.

AProgrammer
Can you elaborate on what does your logic ?
fabrizioM
+1 For the explanation. And I figured out that yours is a generic version of @sparky's link!
fabrizioM
Cool, but you're not going to win "fastest" with divisions and multiplications.
Seth
Yes, (I got this ages ago from Usenet and I used the division tricks to get a size independant version. The name Mycroft rings a bell, by 87 is too early for me; my source probably cited him).
AProgrammer
@Seth, they are constant expressions. If your compiler don't remark this, just use the correct bit pattern adjusted to the size of the longest available type.
AProgrammer
+1  A: 

Please refer yourself to fstrlen() implemented by Paul Hsieh at ...

http://www.azillionmonkeys.com/qed/asmexample.html

Although not precisely what you are looking for, with a little tweaking it should do it for you.

The algorithm attempts to check four bytes at once for the end-of-string character using some bit-twiddling.

Sparky
+2  A: 

Here's a little trick I read about in Hacker's Delight called SWAR (SIMD-within-a-register), assuming 8-bits per character:

#define CHAR_BITS 8
uint_fast_16_t all_character_bits[CHAR_BITS]= { 0 };

for (int bit_index= 0; bit_index<CHAR_BITS; ++bit_index)
{
    for (int character_index= 0; character_index<16; ++character_index)
    {
        all_character_bits[bit_index]|= ((buff[character_index] >> bit_index) & 1) << character_index;
    }
}

uint_fast_32_t zero_byte_character_mask= ~0;

for (int bit_index= 0; bit_index<CHAR_BITS; ++bit_index)
{
    zero_byte_character_mask&= (0xffff0000 | ~all_character_bits[bit_index]);
}

uint_fast_8_t first_null_byte= first_bit_set(zero_byte_character_mask);

where first_bit_set is any number of popular and fast implementations of finding the first bit set in an integer.

The basic idea here is to take the 16 characters as a 8x16 bit matrix and AND the bitwise-NOT of all the columns together. Any row that is all zeros will have that row's bit set in the result. Then, we just find the first bit set in the result and that's the lenght of the string. This particular implementation ensures that bits 16-31 are set in the result in case all the characters are not NULL. The actual bit transposition could be a lot faster (meaning without branches) as well.

MSN
A: 

In a hypothetical C++-like language, assuming 2's complement and little-endian,

int128_t v = *reinterpret_cast<int128_t*>(thestring);
const int bit_count = 128;
int eight = ((1 << 64) - 1 - v) >> (bit_count - 4) & 8;
v >>>= 8 * eight;
int four  = ((1 << 32) - 1 - v) >> (bit_count - 3) & 4;
v >>>= 8 * four;
int two   = ((1 << 16) - 1 - v) >> (bit_count - 2) & 2;
v >>>= 8 * two;
int one   = ((1 <<  8) - 1 - v) >> (bit_count - 1) & 1;
return (one | two | four | eight) + !!v;

(Modified from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog.)

KennyTM
hey Kenny is >>> valid?
fabrizioM
KennyTM
Just make the type unsigned or cast it to unsigned for the shift
nategoose
A: 

You can bit-twiddle all you want, but you probably won't beat this:

int fast1(const char *s)
{ 
    if (!*s++) return 0; 
    if (!*s++) return 1; 
    if (!*s++) return 2; 
    if (!*s++) return 3; 
    if (!*s++) return 4; 
    if (!*s++) return 5; 
    if (!*s++) return 6; 
    if (!*s++) return 7; 
    if (!*s++) return 8; 
    if (!*s++) return 9; 
    if (!*s++) return 10; 
    if (!*s++) return 11; 
    if (!*s++) return 12; 
    if (!*s++) return 13; 
    if (!*s++) return 14; 
    if (!*s++) return 15; 
}

Alternatively, you can do this: (whether this is faster depends on your processor and compiler).

int fast2(const char *s)
{ 
    if (!s[0]) return 0; 
    if (!s[1]) return 1; 
    if (!s[2]) return 2; 
    if (!s[3]) return 3; 
    if (!s[4]) return 4; 
    if (!s[5]) return 5; 
    if (!s[6]) return 6; 
    if (!s[7]) return 7; 
    if (!s[8]) return 8; 
    if (!s[9]) return 9; 
    if (!s[10]) return 10; 
    if (!s[11]) return 11; 
    if (!s[12]) return 12; 
    if (!s[13]) return 13; 
    if (!s[14]) return 14; 
    if (!s[15]) return 15; 
}

Update:

I profiled both of these functions on my Core2Duo T7200 @ 2.0 GHz, Windows XP pro, Visual Studio 2008 with optimizations turned off. (Turning on the optimizer causes VS to notice that there's no output in my timing loop, so it removes it entirely).

I called each function in a loop 222 times, then took the average over 8 runs.

fast1 takes about 87.20 ns per function call.

fast2 takes about 45.46 ns per function call.

So on my CPU, the array indexing version is almost twice as fast as the pointer version.

I wasn't able to get any of the other functions posted here to work, so I wasn't able to compare. The closest is the original poster's function, which compiles, but doesn't always return the correct value. When it does, it executes in about 59 ns per function call.

Update 2

This function is pretty fast too, at about 60 ns per call. I'd guess that the pointer dereference is being performed by the address unit and the multiplication by the integer unit, so the operations are pipelining. In my other examples, all the work is being done by the address unit.

int fast5(const char *s)
{
    return  /* 0 * (s[0] == 0) + don't need to test 1st byte */
            1 * (s[1] == 0)  +
            2 * (s[2] == 0)  +
            3 * (s[3] == 0)  +
            4 * (s[4] == 0)  +
            5 * (s[5] == 0)  +
            6 * (s[6] == 0)  +
            7 * (s[7] == 0)  +
            8 * (s[8] == 0)  +
            9 * (s[9] == 0)  +
            10 * (s[10] == 0) +
            11 * (s[11] == 0) +
            12 * (s[12] == 0) +
            13 * (s[13] == 0) +
            14 * (s[14] == 0) +
            15 * (s[15] == 0);
}
Seth
Multiple exit paths for GPU?
KennyTM
Have you measured the speed of this?
Peter Jansson
@KennyTM - I don't know much specifics about GPUs, why would multiple returns matter? @Peter - I posted some results
Seth
There is no way all those memory references and comparisons beat bit twiddling.
drawnonward
A: 

Assuming 64 bit long and little endian system:

long a = ((long *)string)[0];
long b = ((long *)string)[1];

a = (a - 0x0101010101010101UL) & ~a & 0x8080808080808080UL;
b = (b - 0x0101010101010101UL) & ~b & 0x8080808080808080UL;

return a ? count_trailing_zeros( a ) / 8 : b ? 8 + count_trailing_zeros( b ) / 8 : 16;

For big endian count leading zeros. Any system strlen implementation will use this.

drawnonward
+1  A: 

From what you've said I believe that what you are trying to do is avoid jumps, so that's what I'm working towards.

I'm pretty sure that the code that you posted only looks slick, but wouldn't actually be that great when compiled for many processors, though it might on yours. Most processors that I know of don't actually have an easy way to get a 1 out of a comparison, so that would likely end up being a conditional jump or conditional operation of the form:

set R1, 0
test R2+0, 0
cinc R1                   ; conditional increment
test R2+1, 0
cinc R1
...

This might work well for a GPU if it can do conditional increments and work well on octet sized items.

If the compiler did an excelent job, on many processors this would end up being something like:

set R1, 0
test R2+0, 0
jz end  ; jump if zero
inc R1
test R2+1, 0
jz end
inc R1
...

This could also be acceptable if non-followed conditional jumps don't hurt you much since then you only have one followed conditional jump (the first one where you find a 0).

Since you said that you were targeting a GPU and those tend to be very math friendly you might be able to do:

int acc = 0;
acc += str[0]/str[0];
acc += str[1]/str[1];
...

if you are able to trap on divide by zero without too much cost and just handle the mess from the trap. That would probably end up being expensive, though.

If your machine has registers that may hold more than one octet of your string then you could try doing a limited number of jumps and test for 0 more than one byte at a time, and then examine the last non-zero word at the byte level.

You should check out Bit Twiddling Hacks for a cool way to speed up strlen that works well for large register sizes.

Something else you may want to consider is starting your measurement from the end of the string (you know the max length). As long as the null termination byte is followed by more nulls this would work, and if you're likely to have longer strings this can be a win even if you do throw a jump in there.

nategoose
+1 the division trick could actually be faster than the !!str[0] (1 op instead of 2)on the gpus the registers are 32 bits, I wonder if this works on 32 too ..
fabrizioM