If I have a byte array that contains UTF8 content, how would I go about parsing it? Are there delimiter bytes that I can split off to get each character?
Take a look here...
http://en.wikipedia.org/wiki/UTF-8
If you're looking to identify the boundary between characters, what you need is in the table in the "Description" section of that article.
The only bytes with a high bit of zero are the ASCII subset 0..127, each encoded as a single byte. For every non-ASCII codepoint, the 2nd byte onwards has "10" in the highest two bits. The leading byte of a codepoint never looks like that: its high bits indicate the number of bytes in the sequence. There's some redundancy, though, so you could equally just watch for the next byte that doesn't start with "10" to find the start of the next codepoint (see the sketch after the list below).
0xxxxxxx : ASCII
10xxxxxx : 2nd, 3rd or 4th byte of code
11xxxxxx : 1st byte of code, further high bits indicating number of bytes
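Here's a minimal sketch of that boundary scan in Python, assuming the input is valid UTF-8 that starts on a codepoint boundary (the function name split_utf8_codepoints is just for illustration):

    def split_utf8_codepoints(data: bytes) -> list[bytes]:
        """Split valid UTF-8 bytes into one chunk per codepoint using only
        the lead-byte vs. continuation-byte bit patterns listed above."""
        chunks: list[bytearray] = []
        for byte in data:
            if (byte & 0xC0) != 0x80:          # not 10xxxxxx: start of a new codepoint
                chunks.append(bytearray([byte]))
            else:                               # 10xxxxxx: continuation of the previous one
                chunks[-1].append(byte)
        return [bytes(c) for c in chunks]

    encoded = "héllo".encode("utf-8")           # 6 bytes, 5 codepoints
    print(split_utf8_codepoints(encoded))       # [b'h', b'\xc3\xa9', b'l', b'l', b'o']
    print([c.decode("utf-8") for c in split_utf8_codepoints(encoded)])  # ['h', 'é', 'l', 'l', 'o']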
A codepoint in Unicode isn't necessarily the same as a character, though. There are combining codepoints (such as combining accents), for instance.
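For example (a quick Python illustration; unicodedata is in the standard library):

    import unicodedata

    decomposed = "e\u0301"                     # "e" + combining acute accent
    print(len(decomposed))                     # 2 codepoints, but renders as one character
    print(decomposed.encode("utf-8"))          # b'e\xcc\x81' -> 3 bytes
    print(decomposed == "\u00e9")              # False: precomposed "é" is a different codepoint
    print(unicodedata.normalize("NFC", decomposed) == "\u00e9")  # True after normalization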
Bytes whose most significant bit is 0 are plain ASCII characters. Bytes whose most significant bit is 1 are part of a multi-byte UTF-8 character.
The first byte of every multi-byte UTF-8 character also has its second-most-significant bit set to 1, so it starts with the bits 11. Each following byte that belongs to the same character starts with 10 instead.
The first byte additionally tells you how many bytes the character occupies: the number of leading bits set to 1 in that byte equals the total number of bytes in the sequence.
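A sketch of that check in Python, reading the count of leading 1 bits with bit masks (utf8_sequence_length is just an illustrative name, not a library function):

    def utf8_sequence_length(lead: int) -> int:
        """Total number of bytes in the UTF-8 sequence that starts with this lead byte."""
        if lead & 0x80 == 0x00:    # 0xxxxxxx -> single-byte ASCII
            return 1
        if lead & 0xE0 == 0xC0:    # 110xxxxx -> 2-byte sequence
            return 2
        if lead & 0xF0 == 0xE0:    # 1110xxxx -> 3-byte sequence
            return 3
        if lead & 0xF8 == 0xF0:    # 11110xxx -> 4-byte sequence
            return 4
        raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")

    print(utf8_sequence_length("é".encode("utf-8")[0]))  # 2
    print(utf8_sequence_length("€".encode("utf-8")[0]))  # 3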
For more details, see the Wikipedia page for UTF-8.