How does Microsoft handle the fact that UTF-16 is a variable length encoding in their C++ standard library implementation

+8 Q:

Having a variable length encoding is indirectly forbidden in the standard.

So I have several questions:

How is the following part of the standard handled?

17.3.2.1.3.3 Wide-character sequences

A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.

The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.

Questions:

basic_string<wchar_t>

How is operator[] implemented and what does it return?
- standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
Does size() return the number of elements or the length of the string?
- standard: Returns: a count of the number of char-like objects currently in the string.
How does resize() work?
- unrelated to standard, just what does it do
How are the positions in insert(), erase() and others handled?

cwctype

Pretty much everything in here. How is the variable encoding handled?

cwchar

getwchar() obviously can't return a whole platform-character, so how does this work?

Plus all the rest of the character functions (the theme is the same).

Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.

Edit: This is starting to get pointless. This is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those; UTF-8 encoded text will still be stored as UTF-16 once read into the string, and the same goes for output), the rest simply contradict each other. :-/

+7 A:

The STL treats a string as simply a wrapper around an array of characters. Therefore size() or length() on an STL string tells you how many char or wchar_t elements it contains, which is not necessarily the number of printable characters the string represents.

CashCow 2010-10-26 16:05:11

So, they have implemented basic_string in such a way that it can handle variable encoding (I will check for the collision I mentioned and report back) and ignore raw strings?

Let_Me_Be 2010-10-26 17:18:26

The absolutely correct OO way to handle variable-character-length strings would be to have a class represent a variable-length character and then have a vector of them. That would be a horribly inefficient implementation though. A better implementation is to store the array of elements in its raw form and then, if you need it, keep another array of offsets to each character, so you can find the nth character in constant time once the string has been parsed through once. You can make this a wrapper, with state, including a state that says the string is actually one-element-per-character throughout.
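
A rough sketch of the raw-units-plus-offsets idea described above. Everything here is hypothetical: the class name `IndexedUtf16` and its members are invented for illustration, and `std::u16string` is assumed as a portable stand-in for a `wstring` with 16-bit `wchar_t`. It is not how any shipping library does it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper: raw UTF-16 code units plus a lazily built table
// of the offsets at which each code point starts.
class IndexedUtf16 {
public:
    explicit IndexedUtf16(std::u16string units) : units_(std::move(units)) {}

    // Number of code points (not code units).
    std::size_t length() const { build(); return starts_.size(); }

    // Code point at position n; O(1) once the index has been built.
    char32_t at(std::size_t n) const {
        build();
        assert(n < starts_.size());
        const std::size_t i = starts_[n];
        const char16_t u = units_[i];
        if (is_high(u) && i + 1 < units_.size() && is_low(units_[i + 1])) {
            return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                           + (char32_t(units_[i + 1]) - 0xDC00);
        }
        return u;  // BMP unit, or an unpaired surrogate passed through as-is
    }

    // The "one element per character throughout" state.
    bool fixed_width() const { build(); return starts_.size() == units_.size(); }

private:
    static bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
    static bool is_low(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

    // Single pass over the raw units; runs at most once.
    void build() const {
        if (built_) return;
        for (std::size_t i = 0; i < units_.size(); ++i) {
            starts_.push_back(i);
            if (is_high(units_[i]) && i + 1 < units_.size() && is_low(units_[i + 1])) ++i;
        }
        built_ = true;
    }

    std::u16string units_;
    mutable std::vector<std::size_t> starts_;
    mutable bool built_ = false;
};
```

The index costs one `size_t` per code point, so for mostly-BMP text a real implementation would probably skip the table whenever `fixed_width()` holds.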

CashCow 2010-10-27 08:41:36

@Cash I'm not really interested in the correct way. For me the correct way is to use variable encoding on input and output only. What I'm asking is how Microsoft handles the fact that they actually have a variable length encoding for `wchar_t` and consequently for `wstring`.

Let_Me_Be 2010-10-27 10:22:54

+6 A:

Assuming that you're talking about the wstring type, there would be no handling of the encoding - it just deals with wchar_t elements without knowing anything about the encoding. It's just a sequence of wchar_t's. You'll need to deal with encoding issues using functionality of other functions.

Michael Burr 2010-10-26 16:06:34

AFAIK this is right. Not sure whether UTF-16 offers multiple ways to encode the same code point, in the way UTF-8 does. Even if it does, as long as functions like `wstring::operator==` don't "unfold" the encoding, and so return true if and only if the strings consist of the same sequence of `wchar_t`, the implementation is compliant. It's as if the encoding was UCS-2. The standard doesn't say anything about whether functions like `fopen` are allowed to treat "different" wide strings as representing the same file, so they're allowed to treat the name as variable-length encoded.

Steve Jessop 2010-10-26 16:22:49

You need to deal with the fact that a wchar_t != one platform character.

Let_Me_Be 2010-10-26 17:20:26

@Steve Jessop: UTF-16 uses surrogate pairs (two 16-bit values) for any character outside the BMP (U+10000 and above). You're not allowed to use surrogate pairs for BMP characters. Essentially, it enforces a canonical representation in the same way as UTF-8 does, by mandating the shortest possible encoding. Besides, the strings used by `fopen` are open to platform interpretation anyway. C and C++ don't even specify whether they're case-sensitive.
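
For reference, the surrogate-pair arithmetic can be sketched like this. The function name `encode_utf16` is made up for illustration; the constants are the ones defined by the UTF-16 encoding form.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as one or two UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    // Surrogate code points and values above U+10FFFF cannot be encoded.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // BMP: exactly one code unit, no alternative encoding exists.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Since the two branches cover disjoint ranges, each code point has exactly one encoding, which is the canonical-representation point made above.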

MSalters 2010-10-27 10:01:43

@Let_Me_Be: deal with it how? As far as the C++ language is concerned, the execution wide character set is `int16_t` (or `uint16_t`, I can't remember whether MS's `wchar_t` is signed). Implementation-defined meanings of strings are a whole separate issue, and it's only here that MS comes in and says, "it's UTF-16". Can you give an example of some code which behaves differently with Microsoft's definition of a wide character, from what the standard allows?

Steve Jessop 2010-10-27 10:18:33

@Steve The standard actually doesn't care about the underlying type. What I'm searching for is what MSalters posted in his answer. If it's true, then I need to know how to work around these semantics.

Let_Me_Be 2010-10-27 10:26:44

@Let_Me_Be: AFAIK everything in MSalters' answer is true. It should be easy enough for you to confirm that by running a few tests, if you doubt it. I would say that the way to deal with it is to treat wide strings as an output format (to Windows APIs), and not to modify them in that format, but I'm not a Windows programmer so I can't swear to you that MS doesn't provide a better way.

Steve Jessop 2010-10-27 10:48:26

@Steve I would if I could. I don't have a Windows machine anywhere near me. Actually I do, but I don't have admin rights on Faculty machines.

Let_Me_Be 2010-10-27 10:51:55

@Let_Me_Be: "The standard actually doesn't care" - OK, I suppose what I really meant was, "as far as the language is concerned, the execution wide character set has the range of values that `int16_t` has". You're right, it doesn't care what the type "really is", but it cares what the values are. So, what I mean is what MSalters says - on MS compilers, the language "thinks" that a high surrogate is a character.

Steve Jessop 2010-10-27 10:55:37

@Steve Yeah, thanks. I just want to get all of this totally 100% clear.

Let_Me_Be 2010-10-27 10:58:37

+5 A:

Two things:

There is no "Microsoft STL implementation". The C++ Standard Library shipped with Visual C++ is licensed from Dinkumware.
The current C++ Standard knows nothing about Unicode and its encoding forms. std::wstring is merely a container for wchar_t units which happen to be 16-bit on Windows. In practice, if you want to store a UTF-16 encoded string into a wstring, just take into account that you are really storing code units and not code points.

Nemanja Trifunovic 2010-10-26 16:33:14

There certainly *is* a Microsoft STL implementation. It may be licensed, but it's also modified, so it's not the original Dinkumware implementation anymore (and even if it was, it's still part of the Visual Studio product). Also, the current C++ standard *does* know something about Unicode: ISO 10646 is a normative reference, and universal character names are interpreted as Unicode. It's also reasonable to assume that C99's `__STDC_ISO_10646__`, if defined, has the same meaning in a C++ implementation (even though C++98 does not mention it).

Martin v. Löwis 2010-10-31 09:21:14

+9 A:

Here's how Microsoft's STL implementation handles the variable-length encoding:

basic_string<wchar_t>::operator[]() can return a low or a high surrogate, in isolation.

basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.

basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.

basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.

basic_string<wchar_t>::erase() can erase either half of a surrogate pair.

In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
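
A minimal sketch of the behaviour listed above. It assumes `std::u16string` as a portable stand-in for `std::wstring` with a 16-bit `wchar_t` (as on Windows); with a 32-bit `wchar_t` a `wstring` would not show any of this. The function name `surrogate_demo` is made up.

```cpp
#include <cassert>
#include <string>

// Each assert documents one of the points above.
void surrogate_demo() {
    // U+1D11E MUSICAL SYMBOL G CLEF is encoded as the pair D834 DD1E.
    const std::u16string s = u"a\U0001D11Eb";

    assert(s.size() == 4);     // four code units, although only three code points
    assert(s[1] == 0xD834);    // operator[] hands back a lone high surrogate
    assert(s[2] == 0xDD1E);    // ... or a lone low surrogate

    std::u16string t = s;
    t.resize(2);               // truncates in the middle of the pair
    assert(t.back() == 0xD834);

    std::u16string u = s;
    u.insert(2, 1, u'x');      // lands between the two halves
    assert(u[1] == 0xD834 && u[2] == u'x' && u[3] == 0xDD1E);

    std::u16string v = s;
    v.erase(1, 1);             // removes only the high half
    assert(v.size() == 3 && v[1] == 0xDD1E);
}
```

None of these calls fail or throw; the string simply stops being well-formed UTF-16.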

MSalters 2010-10-27 10:08:44

How do you determine the string length if size() returns number of `wchar_t` objects? How do you safely insert(), erase(), etc? The "C" part of the standard library isn't supported in Windows for wide characters?

Let_Me_Be 2010-10-27 10:20:47

@Let_Me_Be: What would you want to do with that string length anyway? As for safely inserting, it's your responsibility anyway to find the correct place to insert text. Not only are there UTF-16 rules, but (human) languages add spelling and grammar rules that must be obeyed.
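
If a code-point count or a safe insertion index is really needed, it has to be computed by hand. A sketch, again assuming `std::u16string` in place of a 16-bit `wstring`; the helper names are hypothetical, not part of any standard or Microsoft API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points; an unpaired surrogate is counted as one.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        // Skip the low half of a well-formed pair.
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) ++i;
    }
    return n;
}

// Move pos back by one if it points at the low half of a well-formed pair,
// so that insert()/erase() at the result cannot split the pair.
std::size_t snap_to_code_point(const std::u16string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos > 0 && pos < s.size() &&
        is_low_surrogate(s[pos]) && is_high_surrogate(s[pos - 1])) {
        return pos - 1;
    }
    return pos;
}
```

This only protects surrogate pairs. It knows nothing about combining sequences or the language-level rules mentioned above.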

MSalters 2010-10-28 11:31:20

@Let_Me_Be: If you consider the presence of combining marks in Unicode, you can only realize that the whole of Unicode is actually a variable-length encoding. What would you expect to happen if you have a string with the Unicode codepoints U+0041 U+0308 (LATIN CAPITAL LETTER A and COMBINING DIAERESIS) and you try to insert the codepoint U+0030 (DIGIT ZERO) between them? All of these codepoints are representable with a single element in UTF-16 and UCS-4.
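
The same experiment as a sketch, using `std::u32string` so that every element is a full code point (the function name is invented for illustration):

```cpp
#include <cassert>
#include <string>

// Builds "A" + COMBINING DIAERESIS, then inserts DIGIT ZERO between them.
std::u32string split_combining_sequence() {
    std::u32string s = U"A\u0308";  // two code points, rendered as one "Ä"
    assert(s.size() == 2);
    s.insert(1, 1, U'0');           // perfectly legal at the code-point level
    return s;                       // now A, 0, U+0308
}
```

The insert is valid at the code-point level, yet the diaeresis now follows the digit instead of the letter, so a fixed-width encoding does not remove the problem.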

Bart van Ingen Schenau 2010-10-29 11:25:19

@Bart You are talking about external encoding, that is irrelevant for me. I'm only talking about internal encoding.

Let_Me_Be 2010-10-29 12:57:24

@Let_Me_Be: No, I am talking about *internal* coding. Even if your internal coding is UCS-4, it remains a variable-length coding due to the presence of combining marks in Unicode. There is just no way to encode every character (as a non-programmer would define such a thing) in a single codepoint.

Bart van Ingen Schenau 2010-10-29 13:56:07

@Bart No, you are using the word character in the external meaning *rendered character*.

Let_Me_Be 2010-10-29 14:03:42

@Let_Me_Be: Then please enlighten me. What would the internal representation be for the character that is externally visible as an A with a line above it (LATIN CAPITAL A with COMBINING OVERLINE)?

Bart van Ingen Schenau 2010-10-29 14:11:25

@Bart Well, obviously that's not one Unicode character.

Let_Me_Be 2010-10-29 14:13:58

@Let_Me_Be: And why would that make a difference for a user of a text-manipulation program? To the user it appears as a single glyph, so the program should treat it as a single character.

Bart van Ingen Schenau 2010-10-29 14:29:36

@Bart Of course it matters to him. But this is high-level semantics. How the heck do you want to work with glyphs when you can't even work with Unicode characters? Plus, glyphs are something that has to be supported by the rendering platform, and if they are, that platform will obviously report correctly that the user is now inserting at position X+2 (because the last glyph consisted of two Unicode characters) and not X+1.

Let_Me_Be 2010-10-29 14:40:00

@MSalters: Did you perhaps mean "the STL does *not* assume that a std::string is UTF-16"?

Martin v. Löwis 2010-10-31 09:14:15
