views: 41
answers: 1

This is more an MBCS question than a Unicode question. I need to create an API that returns a list of structs, where each instance holds a Unicode character as one of its members. This is in .NET, so you'd think I'd want UTF-16, but then for some Asian characters two UTF-16 code units would likely be required. What's the best practice when returning Unicode characters?

  1. Use an array of two UTF-16 chars, testing the first char to see whether it's a surrogate, or carrying a count?
  2. Ignore the surrogate issue and leave it to the caller to figure out when a glyph's encoding spans structs?
  3. Use a string instead, so I don't care whether it's one or two chars in length? (See the sketch after this list.)
  4. Use UTF-32?
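
For comparison, here is a minimal C# sketch (mine, not part of the original question) of what options 1 and 3 could look like; the struct names and members are hypothetical:

```csharp
using System;

// Option 1 (hypothetical shape): expose raw UTF-16 code units and make the
// caller deal with surrogate pairs.
public struct Utf16Cell
{
    public char First;    // first (or only) UTF-16 code unit
    public char Second;   // only meaningful when First is a high surrogate

    public int Length => char.IsHighSurrogate(First) ? 2 : 1;

    public int ToCodePoint() =>
        Length == 2 ? char.ConvertToUtf32(First, Second) : First;
}

// Option 3 (hypothetical shape): hold a string and stop caring how many
// UTF-16 code units the character takes.
public struct StringCell
{
    public string Value;  // one "character", one or two chars long

    public static StringCell FromCodePoint(int codePoint) =>
        new StringCell { Value = char.ConvertFromUtf32(codePoint) };
}
```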

What do people normally do for UTF-8? I'm guessing they never deal with individual characters and everything is held in a string (for example, searching for a character in a string is really done by looking for a sub-string). Maybe it's the C++ programmer in me, but a string seems so heavy-handed.
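
That intuition holds in .NET too; a quick illustration (mine, with U+20000 as an arbitrary supplementary-plane example):

```csharp
using System;

class SubstringSearchDemo
{
    static void Main()
    {
        // U+20000 is a supplementary-plane CJK ideograph; as a .NET string it
        // occupies two UTF-16 code units (a surrogate pair).
        string needle = char.ConvertFromUtf32(0x20000);
        Console.WriteLine(needle.Length);  // 2

        string haystack = "abc" + needle + "def";

        // "Finding the character" is really an ordinal substring search.
        Console.WriteLine(haystack.IndexOf(needle, StringComparison.Ordinal));  // 3
    }
}
```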

I think I'm going to do #3. What have others done?

+1  A: 

You are right about using strings. In Unicode, because even a single character might require multiple code points (each of which takes some number of bytes depending on the encoding), you can't really ever work on anything less than a string. Even functions like isUpper should take a string and only work on its first element.
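
As a concrete illustration (mine, not the answerer's; the helper name is made up), a string-taking classifier that looks only at the first code point could be sketched like this:

```csharp
using System.Globalization;

static class CharClass
{
    // A string-taking wrapper (hypothetical helper): callers never hand us a
    // lone char that might be half of a surrogate pair.
    public static bool FirstCodePointIsUpper(string s)
    {
        if (string.IsNullOrEmpty(s)) return false;

        // ConvertToUtf32 combines the surrogate pair at index 0, if any.
        int codePoint = char.ConvertToUtf32(s, 0);
        return CharUnicodeInfo.GetUnicodeCategory(codePoint)
               == UnicodeCategory.UppercaseLetter;
    }
}
```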

The reason a character might require multiple code points is typically combining characters, used for accents and such.
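
A short sketch (my illustration) of how .NET groups a base character and its combining marks into one "text element" via System.Globalization.StringInfo:

```csharp
using System;
using System.Globalization;

class TextElementDemo
{
    static void Main()
    {
        // "q" followed by U+0301 COMBINING ACUTE ACCENT: two code points,
        // but a single user-perceived character.
        string s = "q\u0301X";
        Console.WriteLine(s.Length);  // 3 UTF-16 code units

        TextElementEnumerator te = StringInfo.GetTextElementEnumerator(s);
        while (te.MoveNext())
        {
            // First iteration yields "q\u0301" as one two-char element,
            // the second yields "X".
            string element = te.GetTextElement();
            Console.WriteLine(element.Length);  // 2, then 1
        }
    }
}
```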

See this question in the Unicode FAQ.

Adam Goode
At first I had convinced myself accents wouldn't be a problem, but I think they actually are. I was assuming there'd be a normalization form that would make it all fit into a single code point. In my case, I'd want to treat the glyph plus any number of accents as a single 'character'.
Tony Lee
Yeah, only some accented characters will fit into a single code point, typically ones that come from a pre-Unicode character set.
Adam Goode
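
To make that concrete, a small sketch (mine, with arbitrary example code points): NFC composition folds a base-plus-accent sequence into one precomposed code point only where Unicode actually defines one.

```csharp
using System;
using System.Text;

class NormalizationDemo
{
    static void Main()
    {
        // "e" + U+0301 COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
        // so NFC collapses it to a single code point.
        string e = "e\u0301";
        Console.WriteLine(e.Normalize(NormalizationForm.FormC).Length);  // 1

        // "q" + U+0301 has no precomposed form, so it remains two code
        // points even after NFC normalization.
        string q = "q\u0301";
        Console.WriteLine(q.Normalize(NormalizationForm.FormC).Length);  // 2
    }
}
```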