I'm working on writing some libraries that will be used both internally and by customers, and I was wondering what the best method of supporting both Unicode and ASCII is. It looks like Microsoft (in the MFC libraries) writes both the Unicode and ASCII classes and does something similar to this in the header files using macros:

#ifdef _UNICODE
#define CString CStringW
#else
#define CString CStringA
#endif

While I'm not a huge fan of macros, it does the job. If I'm writing libraries using the STL, does it make sense to write headers that contain things like this:

#ifdef _UNICODE
#define GetLastErrorString GetLastErrorStringW
#else
#define GetLastErrorString GetLastErrorStringA
#endif

std::string GetLastErrorStringA();
std::wstring GetLastErrorStringW();
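One way I could avoid maintaining two parallel implementations would be to write only the wide version for real and make the narrow one a thin thunk. A rough sketch of what I mean (the wide body below is just a placeholder, and the narrowing is deliberately ASCII-only):

```cpp
#include <string>

// Placeholder wide implementation -- a real library would look the
// message up from the OS; the literal here is just a stand-in.
std::wstring GetLastErrorStringW()
{
    return L"The operation completed successfully.";
}

// Narrow thunk: delegate to the wide version and narrow the result.
// This deliberately naive narrowing handles ASCII only; anything
// outside that range is replaced with '?'.
std::string GetLastErrorStringA()
{
    const std::wstring wide = GetLastErrorStringW();
    std::string narrow;
    narrow.reserve(wide.size());
    for (wchar_t wc : wide)
        narrow.push_back(wc < 0x80 ? static_cast<char>(wc) : '?');
    return narrow;
}
```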

Or should I just release separate libraries, one for ASCII and one for Unicode?

Just wondering what people think is the best thing to do in this situation.

UPDATE: Addressing some comments and questions:

  • These will be C++ class libraries.
  • I believe I will need to use UTF-16 encoding as I would like to support Asian character sets.
  • My reasons for implementing Unicode are twofold: 1) All new SDKs support Unicode and I'm not confident that future SDKs or third party libraries will be supporting separate ASCII versions in the future. 2) While we will not be completely internationalizing our application, it would be nice if we could handle user input (like names) and files loading from paths that contain Asian characters.
+3  A: 

I would make the library entirely Unicode internally. Then I would provide a set of C++ adapter classes for ASCII that thunk to the Unicode implementation.
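For example (class names are hypothetical, and the narrowing shown is deliberately ASCII-only): the Unicode class holds the real logic, and the ASCII adapter owns one and converts only at the boundary.

```cpp
#include <string>
#include <utility>

// The real implementation works in wide characters throughout.
class PathW {
public:
    explicit PathW(std::wstring path) : path_(std::move(path)) {}
    const std::wstring& Str() const { return path_; }
private:
    std::wstring path_;
};

// ASCII adapter: owns a PathW and converts at the boundary only.
// The narrowing is naive (ASCII-only); non-ASCII becomes '?'.
class PathA {
public:
    explicit PathA(const std::string& path)
        : impl_(std::wstring(path.begin(), path.end())) {}
    std::string Str() const {
        std::string s;
        for (wchar_t wc : impl_.Str())
            s.push_back(wc < 0x80 ? static_cast<char>(wc) : '?');
        return s;
    }
private:
    PathW impl_;
};
```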

Michael
The question is whether it is necessary to use a Unicode string type internally at all, since depending on the encoding a plain std::string might do the trick.
Matthieu M.
A: 

The question is a bit imprecise but...

First you have to specify the encoding. Unicode is just a representation of the characters (each being associated with a code point); when it comes to dealing with Unicode in an application, you have to choose how the code points are going to be represented. If you can go with UTF-8, you won't have to worry about wide chars, and you can store the data in a plain std::string :)

Then you have to clarify your problem:

  • do you want to support input in both Unicode and ASCII?
  • or are you talking about the output?
  • is there any way that you could use std::locale to know which encoding you should output in?

I am working on an internationalized application (a website, with a C++ backend...) and we simply use std::string internally. Whether the output is ASCII or UTF-8 depends on a translation file, but the data representation does not vary by an iota (except for counting characters, see my post on this topic).
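On counting characters: with UTF-8 in a std::string, size() reports bytes, not characters. A minimal sketch of a code-point count, assuming well-formed UTF-8 (the helper name is mine):

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8.
std::size_t Utf8Length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    return count;
}
```

So "héllo" is six bytes but five characters, and std::string happily stores it either way.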

Really, I am definitely not a fan of macros either. UTF-8 was meant to be backward compatible with ASCII, so if you can choose your own encoding, you're saved!

Matthieu M.
A: 

You can store Unicode strings in a std::string if you convert them to UTF-8 first.

You only need wstring when interfacing with UTF-16 calls, like the Windows API. If that is the case you can convert your strings to wstrings locally where needed. This can be a bit burdensome, but it's not that bad.
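A sketch of such a local conversion, assuming well-formed UTF-8 input (the function name is mine; on Windows you would more likely call MultiByteToWideChar with CP_UTF8 instead):

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 -> UTF-16 decoder. Assumes well-formed UTF-8; code points
// above U+FFFF are encoded as surrogate pairs.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char c = utf8[i];
        char32_t cp;
        std::size_t len;
        if      (c < 0x80)           { cp = c;        len = 1; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; len = 2; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; len = 3; }
        else                         { cp = c & 0x07; len = 4; }
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + j]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```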

StackedCrooked
A: 

I t-h-i-n-k you're asking about code "understandability" rather than using ASCII, UTF-8, 16 or 32 bit characters.

If so, I prefer making the blocks of code as large as possible: that would dispose one to use the "gate" (the _UNICODE symbolic constant) to select either separate files or, at least, big chunks of code. Code that changes its spots every other line, or, heaven forbid, within a statement, is difficult to comprehend.

I would counsel against using the gate to select inclusions of separate files

#ifdef _UNICODE
#include "myUniLib.h"
#else
#include "myASCIILib.h"
#endif

since that would entail two, and maybe even three, files (the Unicode file, the 646US (ASCII) file, and perhaps your nexus file containing the above code). That's three times the possibility of something being lost and a resultant build failure.

Instead, use the gate within a file to select large blocks of code:

#ifdef _UNICODE
   ...lotsa code...
#else
   ...lotsa code...
#endif

OK, say you're doing the opposite: wondering about plain char (ASCII) versus char (UTF-8) versus the W and A variants. How universal do you want to be? The CStrings you mention are for the Windows world only. If you want to be Mac and UNIX (OK, Linux) compatible, you are in for a rough ride.

BTW: ASCII is ...not... a recognized standard any more. There's ASCII and then there's... ASCII. If you mean the seven-bit "standard" from the old days of UNIX, the closest I have found is ISO-646US. The Unicode equivalent is ISO/IEC 10646.

Some folks have had luck with encoding the characters as URLs: just ASCII letters, digits, and the percent sign. While you have to encode and decode all of the time, the storage is really predictable. A little strange, yes, but definitely innovative.

There are some linguistic pitfalls. For example, do not depend on case conversion to round-trip. In German, the lower-case ß becomes SS when converted to upper case; SS, however, when lower-cased, morphs to ss, not ß. Turkish has something similar with its dotted and dotless i. When designing your application, don't assume that case conversions can help you.
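To illustrate with UTF-8 in a std::string: byte-wise std::toupper cannot perform the length-changing ß to SS mapping, so "straße" does not become "STRASSE" (the function name is mine):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Byte-wise uppercasing: fine for ASCII, but it cannot turn the two
// UTF-8 bytes of U+00DF (ß) into "SS" -- they pass through unchanged.
std::string NaiveUpper(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}
```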

Also, remember that grammatical ordering is different across languages. A mere, "Hello, Jim! How is your Monday going?" can end up being "Hello! Your, Monday, it goes well, Jim?"

Finally, a warning: avoid stream IO (std::cin >> and std::cout <<) for building messages. It traps you into embedding your message generators in such a way that localizing them becomes very difficult.
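A sketch of the alternative: keep the whole sentence in one translatable template with positional placeholders, so a translator can reorder the arguments freely (Format is a hypothetical helper, simplified to two arguments and first-occurrence replacement):

```cpp
#include <string>

// Substitute {0} and {1} in a translatable message template. Because the
// template is a single string, a translator can reorder the placeholders
// to match the target language's grammar.
std::string Format(std::string tmpl,
                   const std::string& arg0, const std::string& arg1)
{
    auto replace = [&tmpl](const std::string& key, const std::string& value) {
        const std::string::size_type pos = tmpl.find(key);
        if (pos != std::string::npos)
            tmpl.replace(pos, key.size(), value);
    };
    replace("{0}", arg0);
    replace("{1}", arg1);
    return tmpl;
}
```

With stream insertions the word order is baked into the code; with a template, swapping the English string for a translated one is all that's needed.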

You're asking the right questions. You have an adventure ahead of you! Best!