The more I work with C++ locale facets, the more I understand that they are broken.

  • std::time_get -- is not symmetric with std::time_put (as strftime/strptime are in C) and does not allow easy parsing of times with AM/PM marks.
  • I discovered recently that simple number formatting may produce illegal UTF-8 under certain locales (like ru_RU.UTF-8).
  • std::ctype -- is very simplistic: it assumes that to-upper/to-lower conversion can be done on a per-character basis (case conversion may change the number of characters, and it is context dependent).
  • std::collate -- does not support collation strength (case sensitive or insensitive).
  • There is no way to specify a timezone different from the global timezone in time formatting.
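To illustrate the std::ctype point, here is a minimal sketch (the helper name `upper_each` is made up for this example). The facet maps one char to exactly one char, so the result always has the same length as the input. A full case mapping such as German "ß" to "SS" cannot be expressed through this interface.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: upper-case a string through std::ctype<char>.
// Assumption for this sketch: the classic "C" locale is used by default.
std::string upper_each(std::string s,
                       const std::locale& loc = std::locale::classic()) {
    const std::ctype<char>& ct = std::use_facet<std::ctype<char>>(loc);
    const std::size_t before = s.size();
    for (char& c : s) {
        c = ct.toupper(c);  // one char in, exactly one char out
    }
    // The interface gives no way to grow or shrink the text.
    assert(s.size() == before);
    return s;
}
```

Calling `upper_each("strasse")` gives "STRASSE", but there is no hook where a one-to-many or context-dependent mapping could be plugged in.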

And much more...

  • Does anybody know whether any changes are expected in standard facets in C++0x?
  • Is there any way to raise awareness of the importance of such changes?

Thanks.

EDIT: Clarifications in case the link is not accessible:

std::numpunct defines the thousands separator as a single char. So when the separator is U+2002, a different kind of space, it cannot be represented as a single char in UTF-8, only as a multi-byte sequence.

In the C API, struct lconv defines the thousands separator as a string and does not suffer from this problem. So, when you try to format numbers through C++ streams with separators outside of ASCII in a UTF-8 locale, invalid UTF-8 is produced.
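A small sketch of why the string-typed separator helps. The helper names `group_digits` and `c_locale_thousands_sep` are made up for this example, and the grouping is fixed at three digits, which is an assumption (real locales describe grouping separately).

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical helper: insert `sep` between groups of three digits.
// `sep` is a string, so a multi-byte UTF-8 sequence is inserted whole.
std::string group_digits(unsigned long value, const std::string& sep) {
    const std::string digits = std::to_string(value);
    std::string out;
    for (std::size_t i = 0; i < digits.size(); ++i) {
        if (i != 0 && (digits.size() - i) % 3 == 0) {
            out += sep;
        }
        out += digits[i];
    }
    return out;
}

// Hypothetical helper: the separator of the current C locale.
// lconv::thousands_sep is a char*, not a single char.
std::string c_locale_thousands_sep() {
    const std::lconv* lc = std::localeconv();
    assert(lc != nullptr);
    return lc->thousands_sep;
}
```

With U+2002 encoded as the bytes E2 80 82, `group_digits(1234, "\xE2\x80\x82")` keeps all three bytes of the separator together.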

To reproduce this bug, write 1234 to a std::ostream imbued with the ru_RU.UTF-8 locale.
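If ru_RU.UTF-8 is not installed, the effect can be imitated without it. The sketch below uses a made-up facet (`TruncatedSep`) that stands in for a UTF-8 locale whose separator is U+2002. Since do_thousands_sep() returns one char, only the first byte (0xE2) can be handed back. The validator checks structure only (lead and continuation bytes), not overlong forms.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: pretends the separator is U+2002 (bytes E2 80 82),
// but the char-typed interface only has room for the first byte.
struct TruncatedSep : std::numpunct<char> {
    char do_thousands_sep() const override { return '\xE2'; }
    std::string do_grouping() const override { return "\3"; }
};

std::string format_with_truncated_sep(long value) {
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new TruncatedSep));
    out << value;
    assert(out.good());
    return out.str();
}

// Structural UTF-8 check: every lead byte must be followed by the
// right number of continuation bytes (10xxxxxx).
bool is_structurally_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;  // stray continuation or invalid lead byte
        if (i + extra >= s.size()) return false;  // sequence cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

Formatting 1234 this way yields the bytes '1', 0xE2, '2', '3', '4', which the validator rejects.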

EDIT2: I must admit that the POSIX C localization API works much more smoothly:

  • There is an inverse of strftime -- strptime (strftime does the same job as std::time_put::put)
  • No problems with number formatting because of the point I mentioned above.
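A sketch of the strftime/strptime round trip with an AM/PM mark. It assumes a POSIX system (strptime is not part of ISO C or C++) and the default "C" locale. The helper names `format_12h` and `parse_12h` are made up for this example.

```cpp
#include <cassert>
#include <string>
#include <time.h>  // strptime comes from POSIX

// Hypothetical helper: format hour and minute with an AM/PM mark.
std::string format_12h(const tm& t) {
    char buf[32];
    const size_t n = strftime(buf, sizeof buf, "%I:%M %p", &t);
    assert(n > 0);  // the buffer is large enough for this pattern
    return std::string(buf, n);
}

// Hypothetical helper: parse the same pattern back.
// Returns true only if the whole input was consumed.
bool parse_12h(const std::string& text, tm& out) {
    out = tm();
    const char* end = strptime(text.c_str(), "%I:%M %p", &out);
    return end != nullptr && *end == '\0';
}
```

The same format string is used in both directions, which is the symmetry that std::time_put and std::time_get lack in C++03.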

However, it is still far from being perfect.

EDIT3: According to the latest notes about C++0x, std::time_get::get is now similar to strptime and is the opposite of std::time_put::put.
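A sketch of the format-string parsing as it ended up in C++11, using the std::get_time and std::put_time manipulators, which forward to std::time_get::get and std::time_put::put. The helper names are made up, and only %H and %M are used here. How well a given standard library handles other specifiers such as %p is something to check rather than assume.

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: parse "HH:MM" via std::get_time.
bool parse_hm(const std::string& text, std::tm& out) {
    out = std::tm();
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    in >> std::get_time(&out, "%H:%M");
    return !in.fail();
}

// Hypothetical helper: format with the same pattern via std::put_time.
std::string format_hm(const std::tm& t) {
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << std::put_time(&t, "%H:%M");
    assert(out.good());
    return out.str();
}
```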

+1  A: 

std::numpunct is a template. All specializations try to return the decimal separator character. Obviously, in any locale where that is a wide character, you should use std::numpunct<wchar_t>, as the <char> specialization can't do that.
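As a rough sketch of the wide-character route: the made-up facet `WideSep` below returns U+2002 as the thousands separator. U+2002 is inside the BMP, so it fits in one wchar_t even where wchar_t is 16 bits wide.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: U+2002 as the thousands separator, groups of three.
struct WideSep : std::numpunct<wchar_t> {
    wchar_t do_thousands_sep() const override { return L'\u2002'; }
    std::string do_grouping() const override { return "\3"; }
};

std::wstring format_wide(long value) {
    std::wostringstream out;
    out.imbue(std::locale(std::locale::classic(), new WideSep));
    out << value;
    assert(out.good());
    return out.str();
}
```

Here `format_wide(1234)` produces five wide characters with the separator intact. Converting that to UTF-8 for output is a separate step that this sketch does not cover.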

That said, C++0x is pretty much done. However, if good improvements continue, the C++ committee is likely to start C++1x. The ISO C++ committee is very likely to accept your help, if offered through your national ISO member organization. I see that Pavel Minaev suggested a Defect Report. That's technically possible, but the problems you describe are in general design limitations. In that case, the most reliable course of action is to design a Boost library for this, have it pass the Boost review, submit it for inclusion in the standard, and participate in the ISO C++ meetings to deal with any issues cropping up there.

MSalters
"you should use std::numpunct<wchar_t>" -- wchar_t is only one of the ways to represent a Unicode code point. What happens if such a code point lies outside of the BMP and sizeof(wchar_t)==2? What if the separator consists of more than one character? This is exactly the same issue! Also, when you use a UTF-8 locale you should expect that characters may be wider than 1 byte. The correct solution is to return (CharT const *) instead of CharT. In any case, when you write a simple program that prints numbers, you expect it to handle Unicode properly, as the C localization API does.
Artyom
The design of `wchar_t` is such that a single `wchar_t` can hold any character supported by the implementation. For that reason, an implementation with a 16-bit wchar_t cannot support all Unicode 5.0 characters. It would need to pick a supported subset, such as the BMP. There is no such thing in ISO C++ as a "multi-wchar_t string". However, an implementation is free to define a `__char16` or a `__char32` and specialize `std::numpunct<>` for them.
MSalters
"an implementation with a 16-bit wchar_t cannot support all Unicode 5.0" -- It cannot even support all of Unicode 2.0, where surrogates were first introduced. "There is no such thing in ISO C++ as a "multi-wchar_t string"" -- What about UTF-16? `wchar_t const *` would work perfectly well. Take a look here: http://linux.die.net/man/7/locale. The thousands separator is represented as `char *` in `struct lconv`, so there is no problem representing any Unicode character in a UTF-8 locale.
Artyom
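For reference, a sketch of the surrogate mechanism discussed above: a code point outside the BMP takes two 16-bit units in UTF-16, so it cannot be returned as a single 16-bit character. The helper name `to_utf16` is made up, and char16_t/char32_t are the C++0x types.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
std::u16string to_utf16(char32_t cp) {
    // Precondition: a valid scalar value (not a surrogate, not > U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);  // BMP: one unit
    } else {
        cp -= 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

U+2002 encodes as one unit, while U+1D11E encodes as the pair D834 DD1E.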
Unicode 5 is the current version, and I think the one referenced by C++0x. UTF-16 is a multi-word encoding of Unicode, and not a proper character set for `wchar_t`. UTF-8 is a valid encoding set for `char`, because `char` unlike `wchar_t` _does_ allow multi-byte encodings. However, multi-byte encodings are not desirable for string processing. With that in mind, it makes sense that C++ has limited support for them.
MSalters
+2  A: 

I agree with you, C++ is lacking proper i18n support.

Does anybody know whether any changes are expected in standard facets in C++0x?

It is too late in the game, so probably not.

Is there any way to raise awareness of the importance of such changes?

I am very pessimistic about this.

When asked directly, Stroustrup claimed that he does not see any problems with the current status. And another one of the big C++ guys (book author and all) did not even realize that wchar_t can be one byte, if you read the standard.

And some threads in Boost (which seems to drive the future direction) show so little understanding of how this works that it is outright scary.

C++0x barely added some Unicode character data types, late in the game and after a lot of struggle. I am not holding my breath for more too soon.

I guess the only chance to see something better is if someone really good/respected in the i18n and C++ worlds gets directly involved with the next version of the standard. No clue who that might be though :-(

Mihai Nita