Or, "how do Russians throw exceptions?"

The definition of std::exception is:

namespace std {
  class exception {
  public:
    exception() throw();
    exception(const exception&) throw();
    exception& operator=(const exception&) throw();
    virtual ~exception() throw();
    virtual const char* what() const throw();
  };
}

A popular school of thought for designing exception hierarchies is to derive from std::exception:

Generally, it's best to throw objects, not built-ins. If possible, you should throw instances of classes that derive (ultimately) from the std::exception class. By making your exception class inherit (ultimately) from the standard exception base-class, you are making life easier for your users (they have the option of catching most things via std::exception), plus you are probably providing them with more information (such as the fact that your particular exception might be a refinement of std::runtime_error or whatever).

But in the face of Unicode, it seems to be impossible to design an exception hierarchy that achieves both of the following:

  • Derives ultimately from std::exception for ease of use at the catch site
  • Provides Unicode compatibility so that diagnostics are not sliced or gibberish

Coming up with an exception class that can be constructed with Unicode strings is simple enough. But the standard dictates that what() must return a const char*, so at some point the input strings must be converted to ASCII. Whether that is done at construction time or when what() is called (if the source string uses characters not representable by 7-bit ASCII), it might be impossible to format the message without loss of fidelity.

How do you design an exception hierarchy that combines the seamless integration of a std::exception-derived class with lossless Unicode diagnostics?

+22  A: 

char* does not mean ASCII. You could use an 8-bit Unicode encoding like UTF-8. char could also be 16 bits or more, in which case you could use UTF-16.
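
A minimal sketch of this idea, assuming the project simply documents that every message is UTF-8. The names `utf8_error`, `open_or_throw`, `caught_message`, and `utf8_error_example` are hypothetical and used only for illustration; the non-ASCII character is written as explicit bytes so the example does not depend on the source file encoding.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical exception type: the message is documented to be UTF-8.
// std::runtime_error stores the bytes and returns them unchanged.
class utf8_error : public std::runtime_error {
public:
  explicit utf8_error(const std::string& utf8_msg)
    : std::runtime_error(utf8_msg) {}
};

// Throws with a message containing U+00E4, spelled as the UTF-8 bytes C3 A4.
inline void open_or_throw() {
  throw utf8_error("cannot open \xC3\xA4.txt");
}

// Catches through the std::exception base and returns the raw what() bytes.
inline std::string caught_message() {
  try {
    open_or_throw();
  } catch (const std::exception& e) {
    return e.what();
  }
  return std::string();
}

// Usage: the UTF-8 bytes survive the trip through std::exception::what().
inline void utf8_error_example() {
  assert(caught_message() == "cannot open \xC3\xA4.txt");
}
```

Whether the bytes are displayed correctly still depends on the catch site treating them as UTF-8.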

TheFogger
+1: There is a common misunderstanding about encoding.
ereOn
The additional benefit with going the UTF-8 path is that STL et al exception text strings already are valid UTF-8. The problem is that it's a bit cumbersome to handle once you pass the 7-bit code points. At that point you'll either need custom output routines for UTF-8 or a conversion routine to an 8- or 16-bit code page all of which may or may not be something you want to do in your exception handler.
Andreas Magnusson
@Andreas: There's two problems when using `std::string` for UTF-8: One is that in UTF-8, there's a difference between the number of characters and the number of bytes in a string. The other is that it's very easy to confuse system-encoded strings (which every application will continue to need) and UTF-8-encoded ones, resulting in funny text to be shown to the users. I found it better to use, say, `std::basic_string<signed char>` for UTF-8-encoded strings. That eliminates at least the second problem, because it makes the compiler bark at you when you confuse the encoding.
sbi
How prevalent are system-encoded strings that use characters outside the ASCII subset? If system-encoded strings can be restricted to the ASCII subset, then UTF-8 can be used without funny text. As for string length, I like using `std::string` because I can get a byte count from it and can calculate the number of characters in O(n). Basically, if you want the string to think in characters, you have to subclass `std::basic_string<signed char>`, change its iterator (and maybe demote it from being a random-access iterator), and add a byte count method.
Mike DeSimone
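
The O(n) count mentioned in the comment above could look roughly like this. It is a sketch that assumes the input is valid UTF-8 and counts code points, which is not always the same as user-perceived characters; `utf8_code_points` and `utf8_code_points_example` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes
// (bytes of the form 10xxxxxx). Assumes the input is valid UTF-8.
inline std::size_t utf8_code_points(const std::string& s) {
  std::size_t n = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++n;
  }
  return n;
}

// Usage: U+00E4 takes two bytes, so byte count and code point count differ.
inline void utf8_code_points_example() {
  const std::string s = "\xC3\xA4" "bc";  // 4 bytes, 3 code points
  assert(s.size() == 4);
  assert(utf8_code_points(s) == 3);
}
```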
+1: I need to learn more about UTF-8
John Dibling
@sbi: I think you misunderstood me, what I meant was that the text string returned from `what()` for the stdlib exceptions already are valid UTF-8 strings since they are ASCII and ASCII is a subset of UTF-8. Further I created one big "cumbersome problem" out of your two problems since all problems with UTF-8 begin when you move outside the ASCII subset. Speaking of solutions I quite like the accepted answer posted in the thread posted by `ybungalobill` below.
Andreas Magnusson
+2  A: 

The standard doesn't specify the encoding of the string returned by what(), nor is there any de facto standard. In my projects, I just encode it as UTF-8 and return that from what(). Of course, this may be incompatible with other libraries.

See also: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful for why UTF-8 is a good choice.

ybungalobill
+3  A: 

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Edit: Made CW, commenters may edit in why this link is relevant if they wish

Dustin Getz
+1: Great link.
ereOn
old and nearly only windows
David Feurle
-1 : I think adding a link (a great link, btw.) without any explanation on how this would relate to C++ exceptions does *nothing* to help *answer* the question. (It might help contextualize some encoding issues, but that's what comments are for, no?) This is especially true if the OP actually needs to read the link.
Martin
Moreover, I've already read the link and it does not address my question.
John Dibling
To the contrary, I think this link provides *great* insight as to why using `char const*` has nothing to do with character encoding.
Alexandre C.
@Alexandre: but for a reader here on SO, there's no indication of *why* I should read this long article on an external site. As @Martin said, don't just post links, post a short summary and/or an explanation of *why* the link is relevant.
jalf
+2  A: 

A const char* doesn't have to point to an ASCII string; it can be in a multi-byte encoding such as UTF-8. One option is to use wcstombs() and friends to convert wstrings to strings, but you may have to convert the result of what() back to wstring before printing. It also involves more copying and memory allocation than you may be comfortable with in an exception handler.

I usually just define my own base exception class, which uses wstring instead of string in the constructor and returns a const wstring& from what(). It's not that big of a deal. The lack of a standard one is a pretty big oversight.
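
One way to combine the two approaches above is sketched here: a class that is constructed from a wstring, keeps the wide text, and still derives from std::exception by exposing a UTF-8 copy through what(). This is an illustration under stated assumptions, not the answerer's class: `wide_error`, `wwhat`, `to_utf8`, and `append_utf8` are hypothetical names, the conversion is hand-written rather than wcstombs() so that it does not depend on the current locale, and wchar_t is assumed to hold UTF-16 code units or UTF-32 code points.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>

// Encodes one code point as UTF-8 and appends it to out.
inline void append_utf8(std::string& out, unsigned long cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Converts a wide string to UTF-8, combining UTF-16 surrogate pairs if any.
inline std::string to_utf8(const std::wstring& w) {
  std::string out;
  for (std::size_t i = 0; i < w.size(); ++i) {
    unsigned long cp = static_cast<unsigned long>(w[i]);
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
      const unsigned long lo = static_cast<unsigned long>(w[i + 1]);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        ++i;  // the low surrogate has been consumed
      }
    }
    append_utf8(out, cp);
  }
  return out;
}

// Hypothetical base class: wide text for callers that want it,
// UTF-8 bytes for anyone catching std::exception.
class wide_error : public std::exception {
public:
  explicit wide_error(const std::wstring& msg)
    : wide_(msg), utf8_(to_utf8(msg)) {}
  const char* what() const noexcept override { return utf8_.c_str(); }
  const std::wstring& wwhat() const noexcept { return wide_; }
private:
  std::wstring wide_;
  std::string utf8_;
};

// Usage: U+00E4 and U+20AC come back from what() as UTF-8 bytes.
inline void wide_error_example() {
  const wide_error e(L"\u00E4\u20AC");
  assert(std::string(e.what()) == "\xC3\xA4\xE2\x82\xAC");
}
```

Doing the conversion in the constructor keeps what() allocation-free, at the cost of string members whose copy can itself throw.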

Another valid opinion is that exception strings should never be presented to the user, so localizing them isn't necessary and so you don't have to worry about any of the above.

Steve M
+3  A: 

Returning UTF-8 is an obvious choice. If the application that uses your exceptions uses a different multibyte encoding, it might have a hard time displaying the string though. (It can't know it's UTF-8, can it?) On the other hand, for ISO-8859-* 8-bit encodings (Western European, Cyrillic, etc.) displaying a UTF-8 string will "just" display some gibberish, and you (or your user) might be fine with that if you cannot disambiguate between a char* in the locale character set and UTF-8.

Personally, I think only low-level error messages should go into what() strings, and I think these should be in English anyway. (Maybe combined with some error number or whatnot.)

The worst problem I see with what() is that it is not uncommon to include some contextual details in the what() message, for example a filename. Filenames are non-ASCII rather often, so you are left with no choice but to use UTF-8 as the what() encoding.

Note also that your exception class (that's derived from std::exception) can obviously provide any access methods you like and so it might make sense to add an explicit what_utf8() or what_utf16() or what_iso8859_5().

Edit: Regarding John's comment on how to return UTF-8:

If you have a const char* what() function, this function essentially returns a bunch of bytes. On a Western European Windows platform, these bytes would usually be encoded as Windows-1252, but on a Russian Windows it might as well be Windows-1251.

What the returned bytes signify depends on their encoding, and their encoding depends on where they "came from" (and who is interpreting them). A string literal's encoding is defined at compile time, but at runtime it's still up to the application how to interpret these bytes.

So, to have your exception return UTF-8 strings with what() (or what_utf8()) you have to make sure that:

  • The input message to your exception has a well defined encoding
  • You have a well defined encoding for the string member you use to hold the message.
  • You appropriately convert the encoding when what() is called

Example:

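#include <stdexcept>
#include <string>

// Hypothetical helper, defined elsewhere: ISO-8859-1 bytes to UTF-8.
std::string convert_iso8859_1_to_utf8(const std::string& in);

// std::exception has no standard constructor taking a message,
// so derive from std::runtime_error, which stores one.
struct MyExc : virtual public std::runtime_error {
  explicit MyExc(const char* msg)
  : std::runtime_error(msg)
  { }
  // const, so it can be called on the const reference in the catch clause
  std::string what_utf8() const {
    return convert_iso8859_1_to_utf8( what() );
  }
};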

// In an ISO-8859-1 encoded source file
const char* my_err_msg = "ISO-8859-1 ... äöüß ...";
...
throw MyExc(my_err_msg);
...
catch(MyExc const& e) {
  std::string iso8859_1_msg = e.what();
  std::string utf_msg = e.what_utf8();
...

The conversion could also be placed in the (overridden) what() member function of MyExc() or you could define the exception to take an already UTF-8 encoded string or you could convert (from an expected input encoding, maybe wchar_t/UTF-16) in the ctor.
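
The example leaves `convert_iso8859_1_to_utf8` undefined. A possible implementation is sketched below; it relies on the fact that ISO-8859-1 byte values coincide with the first 256 Unicode code points, so it would not be correct for Windows-1252 or any other 8-bit code page. The signature and the `convert_example` helper are assumptions for illustration.

```cpp
#include <cassert>
#include <string>

// ISO-8859-1 maps each byte value to the Unicode code point with the same
// number, so bytes below 0x80 are copied and the rest become two UTF-8 bytes.
inline std::string convert_iso8859_1_to_utf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}

// Usage: 0xE4 (a-umlaut in ISO-8859-1) becomes the UTF-8 bytes C3 A4.
inline void convert_example() {
  assert(convert_iso8859_1_to_utf8("\xE4") == "\xC3\xA4");
}
```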

Martin
"Returning UTF-8 is an obvious choice." This seems to follow the arc of current thought. Now the only question is, how do I return UTF-8? :)
John Dibling
@John Dibling: If the text of your messages is all in English and can be expressed in standard ASCII, you have already done enough because ASCII and the first 128 characters of UTF-8 are identical. If you are using characters and an encoding above 127 you'll need to convert the encoding to UTF-8. There must be a standard C++ library function to do that by now. If not, libiconv can do the trick.
JeremyP
@JeremyP: we use ICU where I work to handle Unicode, certainly not perfect (C-interface...) but it does the work and handles the quirks of Unicode / Internationalization / Localization.
Matthieu M.
@Matthieu M: Thanks for that. I was looking for a C compatible Unicode library. I could have used libiconv but its licence is more restrictive.
JeremyP
@JeremyP: glad to be of help :)
Matthieu M.
+2  A: 

what() is generally not meant to display a message to a user. Among other things the text it returns is not localizable (even if it was Unicode). I'd just use what() to display something of value to you as the developer (like the source file and line number of the place where the exception was raised) and for that sort of text, ASCII is usually more than enough.
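
A developer-facing message of this kind could be produced as sketched below. `located_error`, `THROW_LOCATED`, and `located_error_example` are hypothetical names chosen for the illustration, and the "file:line: message" layout is just one possible convention.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical developer-facing exception: what() is "file:line: message".
class located_error : public std::runtime_error {
public:
  located_error(const char* file, int line, const std::string& msg)
    : std::runtime_error(std::string(file) + ":" + std::to_string(line) +
                         ": " + msg) {}
};

// Hypothetical convenience macro that captures the throw site.
#define THROW_LOCATED(msg) throw located_error(__FILE__, __LINE__, (msg))

// Usage: the location is part of the text returned by what().
inline void located_error_example() {
  try {
    throw located_error("io.cpp", 42, "read failed");
  } catch (const std::exception& e) {
    assert(std::string(e.what()) == "io.cpp:42: read failed");
  }
}
```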

Nemanja Trifunovic
This is your opinion, and while I respect your opinion I don't share it. Even if the `what()` output is only stored to a log file it is on some level "presented to the user" and needs to not be gibberish.
John Dibling
I am not saying it should be gibberish. I am saying that what() is not suitable to hold "international" text not because it can't hold Unicode (it can) but because it is not localizable.
Nemanja Trifunovic
Certainly the exception text may not need to be "internationalized" in the same way as text that the users normally see. But I can imagine times where a piece of Unicode text would still be very relevant and one would want it included with the exception. For example, a file name or path could have Unicode characters. Leaving that out would make the exception handling or logging less useful.
TheUndeadFish
@Nemanja: why can't you internationalize it? Can't you access the locale within `what`?
Matthieu M.
+2  A: 

The first question is what do you intend to do with the what() string?

Do you plan to log the information somewhere?

If so, you should not be using the content of the what() string directly; you should be using that string as a key to look up the correct locale-specific logging message. So to me the content of what() is not for logging purposes (or any form of display); it is a means of looking up the actual logging string (which can be any Unicode string).

Now, it can be useful for the what() string to contain a human-readable message for the developers to help in quick debugging (but for this, highly readable, polished text is not required). As a result there is no reason to support anything more than ASCII. Obey the KISS principle.

Martin York
In response to your questions: I'd like to use the `what()` string in order to generate two levels of diagnostics. The lower level is a developer- or technician-centric diagnostic that would be displayed in log files. But at a higher level I'd like these strings to be used to construct a diagnostic that is actionable by a normal human being. As you seem to imply, the `what()` return could simply be a lookup value into a table of more humane messages, but some components of the string (or at least the exception) would need to be human-readable, such as "File blah.txt could not be found."
John Dibling
Most localization tools take an input string and convert it to the locale-specific string via resources. So if you say the first part of the string up to a colon is used to look up locale-specific strings, you can then do this: `File could not be found: blah.txt`. The part `File could not be found:` can then be used to look up the locale-specific translation.
Martin York
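
The colon-prefix lookup described in the comment above might be sketched like this. It assumes the text before the first ": " is the lookup key and uses an in-memory std::map in place of real resource files; `message_table`, `localize`, and `localize_example` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical translation table: the key is the text before the first ": ".
using message_table = std::map<std::string, std::string>;

// Splits "key: detail", replaces the key with its translation, and keeps the
// detail (for example a file name) unchanged. Falls back to the original
// text when there is no separator or no translation.
inline std::string localize(const std::string& what,
                            const message_table& table) {
  const std::string::size_type pos = what.find(": ");
  if (pos == std::string::npos) return what;
  const auto it = table.find(what.substr(0, pos));
  if (it == table.end()) return what;
  return it->second + what.substr(pos);
}

// Usage: only the fixed part of the message is translated.
inline void localize_example() {
  const message_table de = {
    {"File could not be found", "Datei nicht gefunden"}};
  assert(localize("File could not be found: blah.txt", de) ==
         "Datei nicht gefunden: blah.txt");
}
```

With this split, what() itself can stay ASCII while the translated text and the detail can be in any encoding the table and the caller agree on.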