If I have a byte array that contains UTF-8 content, how would I go about parsing it? Are there delimiter bytes that I can split on to get each character?

+8  A: 

Take a look here...

http://en.wikipedia.org/wiki/UTF-8

If you're looking to identify the boundary between characters, what you need is in the table in "Description".

A high bit of zero only occurs in the ASCII subset 0..127, where each codepoint is encoded in a single byte. Every non-ASCII codepoint is encoded as two to four bytes, and each byte from the 2nd onwards has "10" in its highest two bits. The leading byte of a codepoint never has that pattern: its high bits indicate the number of bytes in the sequence. There's some redundancy here, so you could equally watch for the next byte that doesn't start with "10" to find the start of the next codepoint.

0xxxxxxx : ASCII (single byte)
10xxxxxx : 2nd, 3rd or 4th byte of a multi-byte codepoint
11xxxxxx : 1st byte of a multi-byte codepoint, further high bits indicating number of bytes
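As a rough illustration of the "watch for the next byte that doesn't start with 10" approach, here is a minimal Python sketch. The function name `split_code_points` is made up for this example, and it assumes the input is already valid UTF-8 (it does no validation).

```python
def split_code_points(data: bytes) -> list:
    """Split UTF-8 bytes into one chunk per codepoint.

    Assumes `data` is valid UTF-8; malformed input is not detected.
    """
    chunks = []
    start = 0
    for i in range(1, len(data)):
        # Any byte that is NOT of the form 10xxxxxx starts a new codepoint.
        if (data[i] & 0xC0) != 0x80:
            chunks.append(data[start:i])
            start = i
    if data:
        chunks.append(data[start:])
    return chunks


# 'a', e-acute, euro sign, and an emoji: 1, 2, 3 and 4 bytes respectively.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
for chunk in split_code_points(sample):
    print(len(chunk), chunk)
```

In practice you would normally just decode the whole array with the language's UTF-8 decoder; the sketch is only meant to show where the boundaries fall.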

A codepoint in unicode isn't necessarily the same as a character. There are modifier codepoints (such as accents), for instance.
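To make the codepoint-versus-character distinction concrete, here is a small Python sketch using the standard `unicodedata` module. The two strings below are just sample data chosen for the example: both display as the same accented letter, but one is a single codepoint and the other is a base letter followed by a combining accent.

```python
import unicodedata

composed = "\u00e9"      # e-acute as one precomposed codepoint
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Same visible character, different number of codepoints and bytes.
print(len(composed), len(composed.encode("utf-8")))
print(len(decomposed), len(decomposed.encode("utf-8")))

# A non-zero combining class marks a codepoint that modifies the previous one.
print(unicodedata.combining(decomposed[1]) != 0)

# Normalizing to NFC folds the pair into the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

So splitting on codepoint boundaries can still cut a user-visible character in two; whether that matters depends on what you do with the pieces.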

Steve314
A: 

Bytes that have the first bit set to 0 are normal ASCII characters. Bytes that have their first bit set to 1 are part of a UTF-8 character.

The first byte in every UTF-8 character has its second bit set to 1, so that the byte has the most significant bits 11. Each following byte that belongs to the same UTF-8 character starts with 10 instead.

The first byte of each UTF-8 character additionally indicates how many of the following bytes belong to the character, depending on the number of bits that are set to 1 in the most significant bits of that byte.
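The rule above could be sketched in Python roughly as follows. The names `sequence_length` and `iter_code_points` are invented for this example, and the code only inspects the leading byte; it does not check that the following bytes really start with 10 or reject overlong encodings.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes the sequence starting with `lead` occupies."""
    if (lead & 0x80) == 0x00:   # 0xxxxxxx: plain ASCII
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx: one byte follows
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx: two bytes follow
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx: three bytes follow
        return 4
    # 10xxxxxx (a continuation byte) or an invalid pattern such as 11111xxx.
    raise ValueError(f"not a valid leading byte: 0x{lead:02X}")


def iter_code_points(data: bytes):
    """Yield the bytes of each encoded sequence in turn (no validation)."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        yield data[i:i + n]
        i += n


sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print([len(chunk) for chunk in iter_code_points(sample)])
```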

For more details, see the Wikipedia page for UTF-8.

sth
"UTF-8 character" is a misnomer. You seem to be referring to a sequence of two to four bytes which represents a non-ASCII character. When it comes to understanding Unicode, I believe getting the vocabulary right is half the battle.
Alan Moore