I've been reading about Unicode and UTF-8 over the last couple of days, and I often come across a bitwise comparison similar to this:

int strlen_utf8(char *s) 
{
  int i = 0, j = 0;
  while (s[i]) 
  {
    if ((s[i] & 0xc0) != 0x80) j++;
    i++;
  }
  return j;
}

Can someone clarify the comparison with 0xc0 and how it checks the most significant bits?

Thank you!

EDIT: ANDed, not comparison, used the wrong word ;)

+11  A: 

It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.

The bit mask 0xc0 is 11 00 00 00, so what the AND is doing is extracting only the top two bits:

    ab cd ef gh
AND 11 00 00 00
    -- -- -- --
  = ab 00 00 00

This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking to see if the top two bits of the value are not equal to 10.
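To see the mask in isolation, here's a quick sketch using two example bytes (0xc3 and 0xa9 just happen to be the UTF-8 encoding of é, U+00e9):

#include <stdio.h>

int main(void)
{
  unsigned char lead = 0xc3; /* 11000011: lead byte of a two-byte sequence */
  unsigned char cont = 0xa9; /* 10101001: continuation byte */

  /* 0xc0 is 11000000, so (byte & 0xc0) keeps only the top two bits. */
  printf("%#x\n", lead & 0xc0); /* prints 0xc0 -> not 0x80, so it gets counted */
  printf("%#x\n", cont & 0xc0); /* prints 0x80 -> continuation byte, skipped  */

  return 0;
}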

Why? In UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx
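As a concrete example, here's a rough sketch of my own that hand-encodes the Euro sign U+20ac following the three-byte row above; the expected bytes are 0xe2 0x82 0xac:

#include <assert.h>

int main(void)
{
  unsigned int cp = 0x20ac; /* Euro sign, falls in the U+000800-U+00ffff row */

  /* Distribute the 16 value bits across 1110yyyy 10yyyyxx 10xxxxxx. */
  unsigned char b0 = 0xe0 | (cp >> 12);         /* 1110yyyy */
  unsigned char b1 = 0x80 | ((cp >> 6) & 0x3f); /* 10yyyyxx */
  unsigned char b2 = 0x80 | (cp & 0x3f);        /* 10xxxxxx */

  assert(b0 == 0xe2 && b1 == 0x82 && b2 == 0xac);
  return 0;
}

Only b1 and b2 start with 10, so they're exactly the bytes that the (s[i] & 0xc0) == 0x80 test treats as continuation bytes.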

So, what this little snippet is doing is going through every byte of your UTF-8 string and counting all the bytes that aren't continuation bytes (i.e., it's getting the length of the string, as advertised). See the Wikipedia article on UTF-8 for more detail, and Joel Spolsky's article on Unicode for a primer.
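And just to see it run, here's a quick test using the function from the question (the word "café" is an arbitrary example; its é is the two bytes 0xc3 0xa9):

#include <stdio.h>
#include <string.h>

int strlen_utf8(char *s)
{
  int i = 0, j = 0;
  while (s[i])
  {
    if ((s[i] & 0xc0) != 0x80) j++;
    i++;
  }
  return j;
}

int main(void)
{
  char word[] = "caf\xc3\xa9"; /* "café": 5 bytes, 4 code points */

  printf("strlen      : %zu\n", strlen(word));      /* prints 5 */
  printf("strlen_utf8 : %d\n",  strlen_utf8(word)); /* prints 4 */
  return 0;
}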

paxdiablo