It's not a comparison with 0xc0; it's a bitwise AND operation with 0xc0.
The bit mask 0xc0 is 11 00 00 00, so what the AND is doing is extracting only the top two bits:
    ab cd ef gh
AND 11 00 00 00
    -----------
  = ab 00 00 00
This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking whether the top two bits of the value are not equal to 10.
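To see the test in action, here's a quick standalone sketch (the sample bytes are my own illustration, not from your snippet):

    #include <stdio.h>

    int main(void)
    {
        /* "Aé": 'A' is 0x41, and "é" is the two-byte UTF-8 sequence
           0xc3 (lead byte) followed by 0xa9 (continuation byte). */
        unsigned char bytes[] = { 0x41, 0xc3, 0xa9 };

        for (int i = 0; i < 3; i++) {
            if ((bytes[i] & 0xc0) != 0x80)
                printf("0x%02x begins a character\n", bytes[i]);
            else
                printf("0x%02x is a continuation byte\n", bytes[i]);
        }
        return 0;
    }

Only 0xa9 has 10 as its top two bits, so only it is reported as a continuation byte.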
Why? In UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:
Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx
U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx
U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx
U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx
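For example, U+20ac (the euro sign, €) falls in the U+000800-U+00ffff range, so its 16-bit value 00100000 10101100 is spread across three bytes as 11100010 10000010 10101100, i.e. 0xe2 0x82 0xac. Only the first of those bytes starts with something other than 10.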
So, what this little snippet is doing is going through every byte of your UTF-8 string and counting all the bytes that aren't continuation bytes (i.e., it's getting the length of the string in characters, as advertised). See the Wikipedia article on UTF-8 for more detail, and Joel Spolsky's excellent article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" for a primer.
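Putting it together, a minimal version of such a counting function might look like the sketch below (utf8_strlen is a name I've made up, and it assumes valid, NUL-terminated UTF-8 input):

    #include <stddef.h>

    /* Count the code points in a NUL-terminated UTF-8 string by
       counting every byte whose top two bits are not 10. */
    size_t utf8_strlen(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; s++)
            if (((unsigned char)*s & 0xc0) != 0x80)
                count++;
        return count;
    }

Called on the three-byte sequence for € above, it returns 1, because only the lead byte 0xe2 passes the test.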