views: 47

answers: 2
I already searched for answers to this sort of question here, and have found plenty of them -- but I still have this nagging doubt about the apparent triviality of the matter.

I have read this very interesting and helpful article on the subject: http://www.joelonsoftware.com/articles/Unicode.html, but it left me wondering how one would go about identifying individual glyphs given a buffer of Unicode data.

My questions are:

How would I go about parsing a Unicode string, say UTF-8?

Assuming I know the byte order, what happens when I encounter the beginning of a glyph that is supposed to be represented by 6 bytes?

That is, if I interpreted the method of storage correctly.

This is all related to a text display system I am designing to work with OpenGL. I am storing glyph data in display lists and I need to translate the contents of a string to a sequence of glyph indexes, which are then mapped to display list indices (since, obviously, storing the entire glyph set in graphics memory is not always practical).
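To sketch what I mean (all of the names below are made up, and it assumes the string has already been converted to a sequence of code points), the lookup itself would be something like:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>
#include <GL/gl.h>

// Hypothetical: maps a Unicode code point to the display list holding that glyph.
typedef std::unordered_map<unsigned int, GLuint> GlyphTable;

// Render a string that has already been converted to a sequence of code points.
void draw_glyphs(const std::vector<unsigned int>& code_points, const GlyphTable& glyphs)
{
    for (std::size_t i = 0; i < code_points.size(); ++i) {
        GlyphTable::const_iterator it = glyphs.find(code_points[i]);
        if (it != glyphs.end())
            glCallList(it->second);  // draw the glyph; the list can also advance the pen
        // else: the glyph isn't resident in graphics memory and would be loaded on demand
    }
}
```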

Having to represent every string as an array of shorts would require a significant amount of storage, considering everything I need to display.

Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.

A: 

Well, I think this answers it:

http://en.wikipedia.org/wiki/UTF-8

Why it didn't show up the first time I went searching, I have no idea.

i_photon
+2  A: 

How would I go about parsing a Unicode string, say UTF-8?

I'm assuming that by "parsing", you mean converting to code points.

Often, you don't have to do that. For example, you can search for a UTF-8 string within another UTF-8 string without needing to care about what characters those bytes represent.
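For instance, a plain byte-level search already does the right thing on UTF-8 data, because no character's encoding can appear in the middle of another character's encoding. A minimal sketch, assuming the escaped literals below hold UTF-8 bytes:

```cpp
#include <iostream>
#include <string>

int main() {
    const std::string haystack = "un caf\xC3\xA9 au lait";  // "un café au lait" as UTF-8 bytes
    const std::string needle   = "caf\xC3\xA9";             // "café" as UTF-8 bytes

    // std::string::find compares raw bytes; since UTF-8 is self-synchronizing,
    // a byte-level hit is a real character-level match.
    std::cout << (haystack.find(needle) != std::string::npos ? "found" : "not found") << '\n';
}
```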

If you do need to convert to code points (UTF-32), then:

  1. Check the first byte to see how many bytes are in the character.
  2. Look at the trailing bytes of the character to ensure that they're in the range 80-BF. If not, report an error.
  3. Use bit masking and shifting to convert the bytes to the code point.
  4. Report an error if the byte sequence you got was longer than the minimum needed to represent the character.
  5. Increment your pointer by the sequence length and repeat for the next character.
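A minimal sketch of those steps in C++ (the function name, the signature, and the choice to return U+FFFD for invalid input are just one way to structure it):

```cpp
#include <cstddef>
#include <cstdint>

// Decode one UTF-8 sequence starting at s[*pos]; the buffer is len bytes long
// and the caller must ensure *pos < len. On success, returns the code point and
// advances *pos past the sequence. On error, returns U+FFFD (REPLACEMENT
// CHARACTER) and advances *pos by one byte.
static std::uint32_t decode_utf8(const unsigned char* s, std::size_t len, std::size_t* pos)
{
    const unsigned char b0 = s[*pos];

    // Step 1: the first byte tells us how many bytes are in the sequence.
    std::size_t seq_len = 0;
    std::uint32_t cp = 0;
    if (b0 < 0x80)      { seq_len = 1; cp = b0; }          // 0xxxxxxx: ASCII
    else if (b0 < 0xC0) { seq_len = 0; }                   // stray trailing byte
    else if (b0 < 0xE0) { seq_len = 2; cp = b0 & 0x1Fu; }  // 110xxxxx
    else if (b0 < 0xF0) { seq_len = 3; cp = b0 & 0x0Fu; }  // 1110xxxx
    else if (b0 < 0xF8) { seq_len = 4; cp = b0 & 0x07u; }  // 11110xxx
    else                { seq_len = 0; }                   // 5/6-byte forms are no longer valid

    bool ok = (seq_len != 0) && (*pos + seq_len <= len);

    // Steps 2 and 3: verify trailing bytes are in 80..BF and shift their payload bits in.
    for (std::size_t i = 1; ok && i < seq_len; ++i) {
        const unsigned char b = s[*pos + i];
        if ((b & 0xC0u) != 0x80u) ok = false;
        else cp = (cp << 6) | (b & 0x3Fu);
    }

    // Step 4: reject overlong encodings, plus surrogates and out-of-range values.
    static const std::uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    if (ok && (cp < min_cp[seq_len] || cp > 0x10FFFF ||
               (cp >= 0xD800 && cp <= 0xDFFF)))
        ok = false;

    if (!ok) { *pos += 1; return 0xFFFD; }

    // Step 5: advance past the sequence.
    *pos += seq_len;
    return cp;
}
```

A caller would just loop `while (pos < len)` and hand each returned code point to whatever does the glyph lookup.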

Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.

It's not. Unicode was originally intended to be a fixed-width 16-bit encoding. It was later decided that 65,536 characters wasn't enough, so UTF-16 was created, and Unicode was redefined to use code points between 0 and 1,114,111.

If you want a fixed-width encoding, you need 21 bits. But there aren't many languages that have a 21-bit integer type, so in practice you need 32 bits.
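To make the arithmetic concrete, a tiny sketch of the 16-bit/21-bit/32-bit boundaries:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

int main() {
    // The highest Unicode code point is U+10FFFF (1,114,111 in decimal).
    assert(0x10FFFF == 1114111);
    assert(0x10FFFF > 0xFFFF);      // does not fit in 16 bits...
    assert(0x10FFFF < (1L << 21));  // ...but does fit in 21 bits.

    // In practice, code points are kept in a 32-bit type
    // (C++11 also provides char32_t for exactly this purpose).
    std::vector<std::uint32_t> code_points;
    code_points.push_back(0x10FFFF);
    return 0;
}
```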

dan04
Thanks for the reply! Based on what I've read, it seems I ought to work with UTF-8: I can iterate through a string in a byte-order-agnostic manner and assemble the individual code points as they appear, like variable-length structures. Which brings me to wonder why wchar_t and the nastiness surrounding its manipulation was such a brilliant idea in the first place.
i_photon
`wchar_t` is intended to be big enough to store any character. That makes manipulation easier, not harder. You can increment a `wchar_t*` once to get the next character, something which is much harder when you have a `char*` pointing to a multibyte string. There's a small VC++/Windows bug where they use `wchar_t` for UTF-16 strings, but you can't blame C++ in general for that. On Linux for instance, it's just UTF-32 and things work as intended.
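A minimal sketch, assuming a platform where `wchar_t` is 32 bits:

```cpp
#include <cstdio>

int main() {
    // With a 32-bit wchar_t, each element holds one whole code point,
    // so ++p moves to the next character, even past U+FFFF.
    const wchar_t* p = L"\u00E9\u4E2D\U0001D11E";  // é, 中, MUSICAL SYMBOL G CLEF
    while (*p) {
        std::printf("U+%04lX\n", (unsigned long)*p);
        ++p;
    }
    return 0;
}
```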
MSalters
The nastiness I was referring to was the fact that I have to keep track of how big it is and what the byte ordering is, and design code that can deal with an "atomic" data type of a somewhat unpredictable size and storage (not unlike "byte" vs. "char"). UTF-8 is annoying to iterate through, but my intention was to use it for storage. If wchar_t were UTF-32 regardless of the compiler, life would be a little easier (despite the irritation of byte order and the inelegance it forces "portable" serialization code to exhibit).
i_photon