ansaurus

Question

Answer 1

+22 A:

char* does not imply ASCII. You could use an 8-bit Unicode encoding such as UTF-8. char could also be 16 bits or wider, in which case you could use UTF-16.

TheFogger 2010-09-21 13:34:40

+1: There is a common misunderstanding about encoding.

ereOn 2010-09-21 13:40:34

The additional benefit with going the UTF-8 path is that STL et al exception text strings already are valid UTF-8. The problem is that it's a bit cumbersome to handle once you pass the 7-bit code points. At that point you'll either need custom output routines for UTF-8 or a conversion routine to an 8- or 16-bit code page all of which may or may not be something you want to do in your exception handler.

Andreas Magnusson 2010-09-21 13:49:45

@Andreas: There's two problems when using `std::string` for UTF-8: One is that in UTF-8, there's a difference between the number of characters and the number of bytes in a string. The other is that it's very easy to confuse system-encoded strings (which every application will continue to need) and UTF-8-encoded ones, resulting in funny text to be shown to the users. I found it better to use, say, `std::basic_string<signed char>` for UTF-8-encoded strings. That eliminates at least the second problem, because it makes the compiler bark at you when you confuse the encoding.

sbi 2010-09-21 14:15:43
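A rough sketch of the "make the compiler bark" idea from the comment above. It uses a thin wrapper class instead of `std::basic_string<signed char>`, because the standard only specifies `std::char_traits` for `char`, `wchar_t` and the `charN_t` types. `Utf8String` is a made-up name for illustration, not a standard type.

```cpp
#include <string>
#include <utility>

// Hypothetical wrapper: a distinct type for UTF-8 text, so that passing a
// system-encoded std::string where UTF-8 is expected fails to compile.
class Utf8String {
public:
    // explicit: the caller must state that these bytes are UTF-8
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Raw UTF-8 bytes, e.g. for writing to a file or socket
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};
```

With this, a function declared as `void show(const Utf8String&)` cannot be called with a plain `std::string` by accident; the conversion has to be spelled out at the call site.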

How prevalent are system-encoded strings that use characters outside the ASCII subset? If system-encoded strings can be restricted to the ASCII subset, then UTF-8 can be used without funny text. As for string length, I like using `std::string` because I can get a byte count from it and can calculate the number of characters in O(n). Basically, if you want the string to think in characters, you have to subclass `std::basic_string<signed char>`, change its iterator (and maybe demote it from being a random-access iterator), and add a byte count method.

Mike DeSimone 2010-09-21 14:24:21
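For what it's worth, the O(n) character count mentioned above can be sketched like this. It assumes the string is already well-formed UTF-8 and counts code points, not user-perceived characters; `utf8_length` is just an illustrative name.

```cpp
#include <cstddef>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or single-byte (ASCII) character
        }
    }
    return count;
}
```

The byte count is still available through `s.size()`, so both numbers can be had from a plain `std::string`.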

+1: I need to learn more about UTF-8

John Dibling 2010-09-21 14:31:33

@sbi: I think you misunderstood me, what I meant was that the text string returned from `what()` for the stdlib exceptions already are valid UTF-8 strings since they are ASCII and ASCII is a subset of UTF-8. Further I created one big "cumbersome problem" out of your two problems since all problems with UTF-8 begin when you move outside the ASCII subset. Speaking of solutions I quite like the accepted answer posted in the thread posted by `ybungalobill` below.

Andreas Magnusson 2010-09-21 14:49:21

Answer 2

+2 A:

The standard doesn't specify the encoding of the string returned by what(), nor is there any de facto standard. In my projects I just encode the message as UTF-8 and return it from what(). Of course, this may be incompatible with other libraries.

See also: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful for why UTF-8 is a good choice.

ybungalobill 2010-09-21 13:35:57

Answer 3

+3 A:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Edit: Made CW, commenters may edit in why this link is relevant if they wish

Dustin Getz 2010-09-21 13:39:39

+1: Great link.

ereOn 2010-09-21 13:41:22

old and nearly only windows

David Feurle 2010-09-21 13:42:57

-1: I think adding a link (a great link, btw.) without any explanation of how it relates to C++ exceptions does *nothing* to help *answer* the question. (It might help contextualize some encoding issues, but that's what comments are for, no?) This is especially true if the OP actually needs to read the link.

Martin 2010-09-21 14:19:51

Moreover, I've already read the link and it does not address my question.

John Dibling 2010-09-21 14:32:08

To the contrary, I think this link provides *great* insight as to why using `char const*` has nothing to do with character encoding.

Alexandre C. 2010-09-21 14:37:21

@Alexandre: but for a reader here on SO, there's no indication of *why* I should read this long article on an external site. As @Martin said, don't just post links, post a short summary and/or an explanation of *why* the link is relevant.

jalf 2010-09-21 15:06:17

Answer 4

+2 A:

A const char* doesn't have to point to an ASCII string; it can be in a multi-byte encoding such as UTF-8. One option is to use wcstombs() and friends to convert wstrings to strings, but you may have to convert the result of what() back to wstring before printing. It also involves more copying and memory allocation than you may be comfortable with in an exception handler.
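A minimal sketch of the wcstombs() route mentioned above. The function name `narrow` is made up for illustration, and the result is in whatever multibyte encoding the current C locale uses, so it is only UTF-8 if the locale is.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Converts a wide string to the current C locale's multibyte encoding.
// Returns an empty string if a character cannot be represented.
std::string narrow(const std::wstring& ws) {
    // Each wide character needs at most MB_CUR_MAX bytes, plus a terminator.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        return std::string();  // conversion failed
    }
    return std::string(buf.data(), n);
}
```

Note the allocation and the copy: this is the overhead the answer warns about when it happens inside an exception handler.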

I usually just define my own base exception class, which uses wstring instead of string in the constructor and returns a const wstring& from what(). It's not that big of a deal. The lack of a standard one is a pretty big oversight.
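Such a base class might look roughly like the following. This is only a sketch: `WideException` is a hypothetical name, and it deliberately does not derive from std::exception, since `std::exception::what()` returns `const char*` and cannot be overridden with a different return type.

```cpp
#include <string>
#include <utility>

// Hypothetical wide-string exception base class.
class WideException {
public:
    explicit WideException(std::wstring msg) : msg_(std::move(msg)) {}
    virtual ~WideException() = default;

    // Returns the stored message; the reference is valid for the
    // lifetime of the exception object.
    virtual const std::wstring& what() const noexcept { return msg_; }

private:
    std::wstring msg_;
};
```

The trade-off is that `catch (const std::exception&)` handlers will not see these exceptions, so a separate catch clause for the wide base class is needed.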

Another valid opinion is that exception strings should never be presented to the user, so localizing them isn't necessary and so you don't have to worry about any of the above.

Steve M 2010-09-21 13:47:32

Answer 5

+3 A:

Returning UTF-8 is an obvious choice. If the application that uses your exceptions uses a different multibyte encoding, it might have a hard time displaying the string, though. (It can't know it's UTF-8, can it?) On the other hand, for the 8-bit ISO-8859-* encodings (Western European, Cyrillic, etc.), displaying a UTF-8 string will "just" show some gibberish for the non-ASCII characters, and you (or your user) might be fine with that if you cannot distinguish between a char* in the locale character set and one in UTF-8.

Personally, I think only low-level error messages should go into what() strings, and I think these should be in English anyway. (Maybe combined with some error number or whatnot.)

The worst problem I see with what() is that it is not uncommon to include some contextual details in the message, for example a filename. Filenames quite often contain non-ASCII characters, so you are left with no choice but to use UTF-8 as the what() encoding.

Note also that your exception class (that's derived from std::exception) can obviously provide any access methods you like and so it might make sense to add an explicit what_utf8() or what_utf16() or what_iso8859_5().

Edit: Regarding John's comment on how to return UTF-8:

If you have a const char* what() function, it essentially returns a bunch of bytes. On a Western European Windows platform these bytes would usually be encoded as Windows-1252, but on a Russian Windows they might just as well be Windows-1251.

What the returned bytes signify depends on their encoding, and their encoding depends on where they "came from" (and who is interpreting them). A string literal's encoding is defined at compile time, but at runtime it's still up to the application how to interpret the bytes.

So, to have your exception return UTF-8 strings with what() (or what_utf8()) you have to make sure that:

The input message to your exception has a well-defined encoding.
You have a well-defined encoding for the string member you use to hold the message.
You appropriately convert the encoding when what() is called.

Example:

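#include <stdexcept>
#include <string>

// Assumed helper, defined elsewhere: converts ISO-8859-1 bytes to UTF-8.
std::string convert_iso8859_1_to_utf8(const std::string& s);

// std::exception has no standard constructor taking a message,
// so derive from std::runtime_error, which stores it for us.
struct MyExc : public std::runtime_error {
  explicit MyExc(const char* msg)
  : std::runtime_error(msg)
  { }
  // const, so it can be called through the const reference in a catch clause
  std::string what_utf8() const {
    return convert_iso8859_1_to_utf8( what() );
  }
};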

// In an ISO-8859-1 encoded source file
const char* my_err_msg = "ISO-8859-1 ... äöüß ...";
...
throw MyExc(my_err_msg);
...
catch(MyExc const& e) {
  std::string iso8859_1_msg = e.what();
  std::string utf_msg = e.what_utf8();
...

The conversion could also be placed in the (overridden) what() member function of MyExc, or you could define the exception to take an already UTF-8 encoded string, or you could convert (from an expected input encoding, maybe wchar_t/UTF-16) in the ctor.
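The example above leaves convert_iso8859_1_to_utf8() undefined. One possible sketch is shown below; it relies on the fact that ISO-8859-1 byte values map one-to-one onto the first 256 Unicode code points, so every byte becomes either one or two UTF-8 bytes. It is not a general-purpose converter: any other source encoding would need a library such as iconv or ICU.

```cpp
#include <string>

// Sketch: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input really is ISO-8859-1; this is not checked.
std::string convert_iso8859_1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range is identical in both encodings
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two bytes, 110000xx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Since the output can be up to twice as long as the input, the function allocates; if that is a concern inside an exception handler, the conversion could be done once in the constructor instead.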

Martin 2010-09-21 14:15:37

"Returning UTF-8 is an obvious choice." This seems to follow the arc of current thought. Now the only question is, how do I return UTF-8? :)

John Dibling 2010-09-21 14:33:33

@John Dibling: If the text of your messages is all in English and can be expressed in standard ASCII, you have already done enough, because ASCII and the first 128 characters of UTF-8 are identical. If you are using characters with codes above 127, you'll need to convert from their encoding to UTF-8. There must be a standard C++ library function to do that by now. If not, libiconv can do the trick.

JeremyP 2010-09-21 15:51:06

@JeremyP: we use ICU where I work to handle Unicode. It's certainly not perfect (C interface...) but it does the work and handles the quirks of Unicode / internationalization / localization.

Matthieu M. 2010-09-21 18:27:35

@Matthieu M: Thanks for that. I was looking for a C-compatible Unicode library. I could have used libiconv but its licence is more restrictive.

JeremyP 2010-09-22 08:53:21

@JeremyP: glad to be of help :)

Matthieu M. 2010-09-22 20:33:56

Answer 6

+2 A:

what() is generally not meant to display a message to a user. Among other things the text it returns is not localizable (even if it was Unicode). I'd just use what() to display something of value to you as the developer (like the source file and line number of the place where the exception was raised) and for that sort of text, ASCII is usually more than enough.

Nemanja Trifunovic 2010-09-21 14:22:28

This is your opinion, and while I respect your opinion I don't share it. Even if the `what()` output is only stored to a log file it is on some level "presented to the user" and needs to not be gibberish.

John Dibling 2010-09-21 14:35:56

I am not saying it should be gibberish. I am saying that what() is not suitable to hold "international" text not because it can't hold Unicode (it can) but because it is not localizable.

Nemanja Trifunovic 2010-09-21 14:42:04

Certainly the exception text may not need to be "internationalized" in the same way as text that the users normally see. But I can imagine times where a piece of Unicode text would still be very relevant and one would want it included with the exception. For example, a file name or path could have Unicode characters. Leaving that out would make the exception handling or logging less useful.

TheUndeadFish 2010-09-21 17:40:24

@Nemanja: why can't you internationalize it? Can't you access the locale within `what`?

Matthieu M. 2010-09-21 18:26:05

Answer 7

+2 A:

The first question is what do you intend to do with the what() string?

Do you plan to log the information somewhere?

If so, you should not be using the content of the what() string directly; you should be using that string as a key to look up the correct locale-specific logging message. So to me the content of what() is not for logging purposes (or any form of display); it is a means of looking up the actual logging string (which can be any Unicode string).

Now, it can be useful for the what() string to contain a human-readable message for the developers to help in quick debugging (but for this, highly readable polished text is not required). As a result there is no reason to support anything more than ASCII. Obey the KISS principle.

Martin York 2010-09-21 15:24:48

In response to your questions. I'd like to use the `what()` string in order to generate two levels of diagnostics. The lower level is a developer- or technician-centric diagnostic that would be displayed in log files. But at a higher level I'd like these strings to be used to construct a diagnostic that is actionable by a normal human being. As you seem to imply, the `what()` return could simply be a lookup value in to a table of more humane messages, but some components of the string (or at least the exception) would need to be human-readable, such as " File blah.txt could not be found."

John Dibling 2010-09-21 15:38:26

John Dibling 2010-09-21 15:39:11

Most localization libraries take an input string and convert it to the locale-specific string via resources. So if you say that the first part of the string, up to a colon, is used to look up localized strings, you can then do this: `File could not be found: blah.txt`. The part `File could not be found:` can then be used to look up the locale-specific translation.

Martin York 2010-09-21 16:30:22
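The colon convention described above could be sketched as follows. Everything here is hypothetical: `localize` is an illustrative name, and the table stands in for whatever resource mechanism the application actually uses.

```cpp
#include <map>
#include <string>

// Looks up the part of `what` before the first colon in a translation
// table and keeps the rest (e.g. the filename) unchanged.
// Hypothetical helper; the table is assumed to hold UTF-8 text.
std::string localize(const std::string& what,
                     const std::map<std::string, std::string>& table) {
    std::string::size_type colon = what.find(':');
    if (colon == std::string::npos) {
        return what;  // no key part, pass the message through
    }
    auto it = table.find(what.substr(0, colon));
    if (it == table.end()) {
        return what;  // unknown key, fall back to the original text
    }
    // Translated prefix plus the untouched ": blah.txt" tail
    return it->second + what.substr(colon);
}
```

The what() string itself stays plain ASCII; only the table entries and any contextual tail (such as a filename) need a defined encoding.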


Exceptions with Unicode what()
