I'm using TinyXML (http://www.grinninglizard.com/tinyxml/) to parse/build XML files. Now according to the documentation (http://www.grinninglizard.com/tinyxmldocs/) this library supports multibyte character sets through UTF-8. So far so good, I think. But the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16-bit character (unless it uses a code page, which would negate the 'supports unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over 2 chars in a std::string, but that wouldn't transform the std::string into a 'unicode' string; it would make it invalid for most purposes and might only accidentally work when written to a file and read back in by another program.

So, can somebody explain to me how a library can offer an '8-bit interface' (std::string/const char*) and still support 'unicode' strings?

(I probably mixed up some unicode-terminology here, sorry about any confusion coming from that).

+2  A: 

UTF-8 is backward-compatible with 7-bit ASCII. If the value of a byte is larger than 127, it means a multibyte character starts. The value of that first byte tells you how many bytes the character will take: 2 to 4 bytes including the first byte (technically, 5- or 6-byte sequences are also possible, but they are not valid UTF-8). Here is a good resource about UTF-8: the UTF-8 and Unicode FAQ; the Wikipedia page on UTF-8 is also very informative. Since UTF-8 is char-based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count, but not necessarily the character count.
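
For example, here is a minimal sketch (assuming the std::string holds well-formed UTF-8) of counting code points rather than bytes; continuation bytes always look like 10xxxxxx, so you only count the bytes that do not match that pattern:

    #include <cstddef>
    #include <iostream>
    #include <string>

    // Count code points in a UTF-8 string by skipping continuation bytes
    // (bytes of the form 10xxxxxx). Assumes the input is valid UTF-8.
    std::size_t utf8_length(const std::string& s)
    {
        std::size_t count = 0;
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(s[i]);
            if ((c & 0xC0) != 0x80)   // not a continuation byte, so a new code point starts here
                ++count;
        }
        return count;
    }

    int main()
    {
        std::string s = "na\xC3\xAFve";        // "naïve": 'ï' (U+00EF) is the two bytes 0xC3 0xAF
        std::cout << s.size() << '\n';         // 6 (byte count)
        std::cout << utf8_length(s) << '\n';   // 5 (code point count)
    }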

quinmars
A: 

By using between 1 and 4 chars to encode one Unicode code point.
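
For example (just an illustration, not anything TinyXML-specific), the euro sign U+20AC is a single code point but occupies three chars in a std::string:

    #include <iostream>
    #include <string>

    int main()
    {
        // U+20AC (the euro sign) encoded in UTF-8 is the three bytes 0xE2 0x82 0xAC;
        // a std::string simply stores those bytes.
        std::string euro = "\xE2\x82\xAC";

        std::cout << euro.size() << " bytes:" << std::hex;
        for (std::string::size_type i = 0; i < euro.size(); ++i)
            std::cout << ' ' << static_cast<int>(static_cast<unsigned char>(euro[i]));
        std::cout << '\n';   // prints "3 bytes: e2 82 ac"
    }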

Nemanja Trifunovic
+5  A: 

First, UTF-8 is stored in const char * strings, as @quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 are always encoded in a single byte as themselves); it's furthermore careful that bytes with those values are never used as part of the encoding of multibyte values for code points >= 128. So if you see a byte == 60 (0x3C), it's a '<' character, and so on. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char * or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know about UTF-8.
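
Here is a rough sketch of why that works (only an illustration, not TinyXML's actual parser): searching for the ASCII metachar bytes can never land in the middle of a multibyte sequence, so the extracted text is still valid UTF-8.

    #include <iostream>
    #include <string>

    int main()
    {
        // "<name>José</name>", with 'é' (U+00E9) encoded as the two bytes 0xC3 0xA9.
        std::string xml = "<name>Jos\xC3\xA9</name>";

        // Naive extraction of the element text, using only the ASCII metachars.
        std::string::size_type start = xml.find('>') + 1;
        std::string::size_type end   = xml.find('<', start);
        std::string text = xml.substr(start, end - start);

        std::cout << text.size() << '\n';   // 5 bytes ("Jos" plus the two-byte 'é')
        std::cout << text << '\n';          // prints "José" on a UTF-8 terminal
    }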

Further (not specific to XML, but rather clever), even more complex things generally just work (tm). For example, if you sort UTF-8 strings lexicographically by bytes, you get the same answer as sorting them lexicographically by code points, despite the variation in the number of bytes used, because the prefix bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values.
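
A small illustration of that property (again assuming well-formed UTF-8): std::string's operator< compares byte values, yet the resulting order matches the code point order.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        // UTF-8 encodings of U+20AC ('€'), U+0061 ('a') and U+00E9 ('é');
        // their first bytes are 0xE2, 0x61 and 0xC3 respectively.
        std::vector<std::string> v;
        v.push_back("\xE2\x82\xAC");   // U+20AC
        v.push_back("a");              // U+0061
        v.push_back("\xC3\xA9");       // U+00E9

        // std::sort with std::string's operator< compares bytes, but the result
        // is also ordered by code point: U+0061 < U+00E9 < U+20AC.
        std::sort(v.begin(), v.end());

        for (std::vector<std::string>::size_type i = 0; i < v.size(); ++i)
            std::cout << v[i] << '\n';   // a, é, € on a UTF-8 terminal
    }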

puetzk
Ok thanks, it's getting clearer, but then still: using std::string to represent UTF-8 data this way, isn't that semantically wrong? You'll never be able to rely on the contents of that string; there won't even be a way to know how long it is (in characters)!
Roel
And even for the const char* version, you'd still have to use another library to work with the string reliably.
Roel
More undefined than wrong. std::string's methods (concatenation, iterator slicing, find_*, etc) still work. length() is only defined as == size() anyway. There's a new precondition that offsets be at a char boundary. If std::string made any promises about encoding it would be wrong, but it doesn't.
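
A boundary check is cheap, by the way (a sketch, not something std::string provides itself): a byte of the form 10xxxxxx is a continuation byte, so an offset is at a code point boundary exactly when the byte there does not match that pattern.

    #include <string>

    // True if offset i is the start of a code point (or one past the end),
    // i.e. does not point into the middle of a multibyte sequence.
    // Assumes s holds valid UTF-8.
    bool is_char_boundary(const std::string& s, std::string::size_type i)
    {
        if (i >= s.size())
            return i == s.size();
        unsigned char c = static_cast<unsigned char>(s[i]);
        return (c & 0xC0) != 0x80;   // continuation bytes look like 10xxxxxx
    }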
puetzk