ansaurus

Question

Strlen of MAX 16 chars string using bitwise operators

Answer 1

+4 A:

This would work fine since buf is initialized with zero. Your solution has != which will use jump instruction. If the GPU has multiple XOR units, the following code can be pipelined quite nicely. On other hand, JUMP instruction would cause flushing of the pipeline.

len = !!buf[0] +
      !!buf[1] +
      //...
      !!buf[15]

Update: The above code and OP's code produce the same assembly code when compiled by GCC with -O3 flags. (different if no optimization flags are provided)

N 1.1 2010-04-19 17:26:08

What's the point in xor'ing with 0?

GMan 2010-04-19 17:29:15

@Gman: I was thinking of two solutions, mixed up while posting. Thanks :)

N 1.1 2010-04-19 17:33:48

++1 Wow, I didn't know this !! (see why I love sharing problems on SO)

fabrizioM 2010-04-19 17:37:43

If the GPU has multiple XOR units, the above code can be pipelined quite nicely. On other hand, JUMP instruction would cause flushing of the pipeline.

N 1.1 2010-04-19 17:42:29

Only works when the buffer is zero initialized, otherwise just counts the number of non-nul characters, including garbage after the first nul.

ergosys 2010-04-19 18:05:00

@ergo: Yes. fabrizioM is assuming the buffer has been memcpied with 0.

N 1.1 2010-04-19 18:09:19

`!!buf[0]` and `buf[0] != 0` are, with any modern compiler, likely to result in identical machine language code.

caf 2010-04-20 02:18:32

@caf: gcc 4.4.3 produces different code for the two.

N 1.1 2010-04-20 02:39:44

gcc 4.3.2 doesn't (at least on -O3, x86 target).

caf 2010-04-20 02:46:22

@caf: `-O3` does makes them same. I thought `!=` would be implemented using JMP. But compilers are super smart. (without optimization, they are different though :)

N 1.1 2010-04-20 02:59:48

Answer 2

+3 A:

The code you have won't work correctly. Just for example, consider a buffer containing something like:

"\0123456789abcde";

According to your code, this has a length of 15, but in reality its length is 0, because of the initial "\0".

As nice as it would be to do the computation in parallel, the simple fact is that the definition of the string more or less mandates starting from the beginning and counting characters only up to the point that you encounter a "\0" (or, in your case, get to 16).

Jerry Coffin 2010-04-19 17:27:06

Answer 3

+1 A:

Bitwise operations... maybe something like:

// TODO: optimize for 64-bit architectures
uint32_t *a = (uint32_t*)thestring;

for (int i = 0; i < 4; i++) // will be unwound
    for (int j = 0; j < 4; j++)
        if (a[i] & 0xff << j == 0)
           return 4*i+j;
return 16;

el.pescado 2010-04-19 17:27:28

+1 it gave me some ideas on how to use 4 chars at a time (registers are 32bit)

fabrizioM 2010-04-19 18:02:17

Just make sure those uints are aligned properly. Misalignment may be costly, if it doesn't cause other problems.

Maciej Hehl 2010-04-20 23:58:05

Answer 4

+1 A:

You can start with

template <typename T>
bool containsANull(T n) {
   return (n  - ((T) -1)/255) & ((T) -1)/255*128) & ~n;
}

and build something. To be worthwize T probably should be an unsigned 64 bits type, but even then there is some adjustment to do that makes me wonder if your buffer is long enough for that trick being usefull.

How does it works?

(T)-1/255 is the bit pattern 0x01010101 repeated as long as needed

(T)-1/255*128 is thus the bit pattern 0x80808080 repeated

if n is                        0x0123456789ABCDEF
n - 0x1111..1 is               0xF0123456789ABCDE
(n-0x1111...1) & 0x8888...8 is 0x8000000008888888
~n is                          0xFEDCBA9876543210 
so the result is               0x8000000000000000

The only way to get a non nul byte here is to start with a null byte.

AProgrammer 2010-04-19 17:55:06

Can you elaborate on what does your logic ?

fabrizioM 2010-04-19 17:59:34

+1 For the explanation. And I figured out that yours is a generic version of @sparky's link!

fabrizioM 2010-04-19 18:46:22

Cool, but you're not going to win "fastest" with divisions and multiplications.

Seth 2010-04-19 19:17:48

Yes, (I got this ages ago from Usenet and I used the division tricks to get a size independant version. The name Mycroft rings a bell, by 87 is too early for me; my source probably cited him).

AProgrammer 2010-04-19 19:32:54

@Seth, they are constant expressions. If your compiler don't remark this, just use the correct bit pattern adjusted to the size of the longest available type.

AProgrammer 2010-04-19 19:36:19

Answer 5

+1 A:

Please refer yourself to fstrlen() implemented by Paul Hsieh at ...

http://www.azillionmonkeys.com/qed/asmexample.html

Although not precisely what you are looking for, with a little tweaking it should do it for you.

The algorithm attempts to check four bytes at once for the end-of-string character using some bit-twiddling.

Sparky 2010-04-19 18:24:08

Answer 6

+2 A:

Here's a little trick I read about in Hacker's Delight called SWAR (SIMD-within-a-register), assuming 8-bits per character:

#define CHAR_BITS 8
uint_fast_16_t all_character_bits[CHAR_BITS]= { 0 };

for (int bit_index= 0; bit_index<CHAR_BITS; ++bit_index)
{
    for (int character_index= 0; character_index<16; ++character_index)
    {
        all_character_bits[bit_index]|= ((buff[character_index] >> bit_index) & 1) << character_index;
    }
}

uint_fast_32_t zero_byte_character_mask= ~0;

for (int bit_index= 0; bit_index<CHAR_BITS; ++bit_index)
{
    zero_byte_character_mask&= (0xffff0000 | ~all_character_bits[bit_index]);
}

uint_fast_8_t first_null_byte= first_bit_set(zero_byte_character_mask);

where first_bit_set is any number of popular and fast implementations of finding the first bit set in an integer.

The basic idea here is to take the 16 characters as a 8x16 bit matrix and AND the bitwise-NOT of all the columns together. Any row that is all zeros will have that row's bit set in the result. Then, we just find the first bit set in the result and that's the lenght of the string. This particular implementation ensures that bits 16-31 are set in the result in case all the characters are not NULL. The actual bit transposition could be a lot faster (meaning without branches) as well.

MSN 2010-04-19 18:25:39

Answer 7

A:

In a hypothetical C++-like language, assuming 2's complement and little-endian,

int128_t v = *reinterpret_cast<int128_t*>(thestring);
const int bit_count = 128;
int eight = ((1 << 64) - 1 - v) >> (bit_count - 4) & 8;
v >>>= 8 * eight;
int four  = ((1 << 32) - 1 - v) >> (bit_count - 3) & 4;
v >>>= 8 * four;
int two   = ((1 << 16) - 1 - v) >> (bit_count - 2) & 2;
v >>>= 8 * two;
int one   = ((1 <<  8) - 1 - v) >> (bit_count - 1) & 1;
return (one | two | four | eight) + !!v;

(Modified from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog.)

KennyTM 2010-04-19 18:46:22

hey Kenny is >>> valid?

fabrizioM 2010-04-19 19:09:14

KennyTM 2010-04-19 19:16:33

Just make the type unsigned or cast it to unsigned for the shift

nategoose 2010-04-20 23:21:25

Answer 8

A:

You can bit-twiddle all you want, but you probably won't beat this:

int fast1(const char *s)
{ 
    if (!*s++) return 0; 
    if (!*s++) return 1; 
    if (!*s++) return 2; 
    if (!*s++) return 3; 
    if (!*s++) return 4; 
    if (!*s++) return 5; 
    if (!*s++) return 6; 
    if (!*s++) return 7; 
    if (!*s++) return 8; 
    if (!*s++) return 9; 
    if (!*s++) return 10; 
    if (!*s++) return 11; 
    if (!*s++) return 12; 
    if (!*s++) return 13; 
    if (!*s++) return 14; 
    if (!*s++) return 15; 
}

Alternatively, you can do this: (whether this is faster depends on your processor and compiler).

int fast2(const char *s)
{ 
    if (!s[0]) return 0; 
    if (!s[1]) return 1; 
    if (!s[2]) return 2; 
    if (!s[3]) return 3; 
    if (!s[4]) return 4; 
    if (!s[5]) return 5; 
    if (!s[6]) return 6; 
    if (!s[7]) return 7; 
    if (!s[8]) return 8; 
    if (!s[9]) return 9; 
    if (!s[10]) return 10; 
    if (!s[11]) return 11; 
    if (!s[12]) return 12; 
    if (!s[13]) return 13; 
    if (!s[14]) return 14; 
    if (!s[15]) return 15; 
}

Update:

I profiled both of these functions on my Core2Duo T7200 @ 2.0 GHz, Windows XP pro, Visual Studio 2008 with optimizations turned off. (Turning on the optimizer causes VS to notice that there's no output in my timing loop, so it removes it entirely).

I called each function in a loop 2²² times, then took the average over 8 runs.

fast1 takes about 87.20 ns per function call.

fast2 takes about 45.46 ns per function call.

So on my CPU, the array indexing version is almost twice as fast as the pointer version.

I wasn't able to get any of the other functions posted here to work, so I wasn't able to compare. The closest is the original poster's function, which compiles, but doesn't always return the correct value. When it does, it executes in about 59 ns per function call.

Update 2

This function is pretty fast too, at about 60 ns per call. I'd guess that the pointer dereference is being performed by the address unit and the multiplication by the integer unit, so the operations are pipelining. In my other examples, all the work is being done by the address unit.

int fast5(const char *s)
{
    return  /* 0 * (s[0] == 0) + don't need to test 1st byte */
            1 * (s[1] == 0)  +
            2 * (s[2] == 0)  +
            3 * (s[3] == 0)  +
            4 * (s[4] == 0)  +
            5 * (s[5] == 0)  +
            6 * (s[6] == 0)  +
            7 * (s[7] == 0)  +
            8 * (s[8] == 0)  +
            9 * (s[9] == 0)  +
            10 * (s[10] == 0) +
            11 * (s[11] == 0) +
            12 * (s[12] == 0) +
            13 * (s[13] == 0) +
            14 * (s[14] == 0) +
            15 * (s[15] == 0);
}

Seth 2010-04-19 19:46:54

Multiple exit paths for GPU?

KennyTM 2010-04-19 19:50:26

Have you measured the speed of this?

Peter Jansson 2010-04-20 15:46:41

@KennyTM - I don't know much specifics about GPUs, why would multiple returns matter? @Peter - I posted some results

Seth 2010-04-20 19:32:04

There is no way all those memory references and comparisons beat bit twiddling.

drawnonward 2010-04-20 23:55:29

Answer 9

A:

Assuming 64 bit long and little endian system:

long a = ((long *)string)[0];
long b = ((long *)string)[1];

a = (a - 0x0101010101010101UL) & ~a & 0x8080808080808080UL;
b = (b - 0x0101010101010101UL) & ~b & 0x8080808080808080UL;

return a ? count_trailing_zeros( a ) / 8 : b ? 8 + count_trailing_zeros( b ) / 8 : 16;

For big endian count leading zeros. Any system strlen implementation will use this.

drawnonward 2010-04-20 03:35:50

Answer 10

+1 A:

From what you've said I believe that what you are trying to do is avoid jumps, so that's what I'm working towards.

I'm pretty sure that the code that you posted only looks slick, but wouldn't actually be that great when compiled for many processors, though it might on yours. Most processors that I know of don't actually have an easy way to get a 1 out of a comparison, so that would likely end up being a conditional jump or conditional operation of the form:

set R1, 0
test R2+0, 0
cinc R1                   ; conditional increment
test R2+1, 0
cinc R1
...

This might work well for a GPU if it can do conditional increments and work well on octet sized items.

If the compiler did an excelent job, on many processors this would end up being something like:

set R1, 0
test R2+0, 0
jz end  ; jump if zero
inc R1
test R2+1, 0
jz end
inc R1
...

This could also be acceptable if non-followed conditional jumps don't hurt you much since then you only have one followed conditional jump (the first one where you find a 0).

Since you said that you were targeting a GPU and those tend to be very math friendly you might be able to do:

int acc = 0;
acc += str[0]/str[0];
acc += str[1]/str[1];
...

if you are able to trap on divide by zero without too much cost and just handle the mess from the trap. That would probably end up being expensive, though.

If your machine has registers that may hold more than one octet of your string then you could try doing a limited number of jumps and test for 0 more than one byte at a time, and then examine the last non-zero word at the byte level.

You should check out Bit Twiddling Hacks for a cool way to speed up strlen that works well for large register sizes.

Something else you may want to consider is starting your measurement from the end of the string (you know the max length). As long as the null termination byte is followed by more nulls this would work, and if you're likely to have longer strings this can be a win even if you do throw a jump in there.

nategoose 2010-04-20 23:17:36

+1 the division trick could actually be faster than the !!str[0] (1 op instead of 2)on the gpus the registers are 32 bits, I wonder if this works on 32 too ..

fabrizioM 2010-04-20 23:29:26

ansaurus

tags:

views:

answers:

Strlen of MAX 16 chars string using bitwise operators

related questions