I'm working on writing some libraries that will be used both internally and by customers, and I was wondering what the best method of supporting both Unicode and ASCII is. It looks like Microsoft (in the MFC libraries) writes both the Unicode and ASCII classes and does something similar to this in the header files using macros:

#ifdef _UNICODE
#define CString CStringW
#else
#define CString CStringA
#endif

While I'm not a huge fan of macros, it does the job. If I'm writing libraries using the STL, does it make sense to write headers that contain things like this:

#ifdef _UNICODE
#define GetLastErrorString GetLastErrorStringW
#else
#define GetLastErrorString GetLastErrorStringA
#endif

std::string GetLastErrorStringA();
std::wstring GetLastErrorStringW();
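One way I could avoid maintaining two parallel implementations would be to write only the wide version for real and make the narrow one a thin thunk. A rough sketch of what I mean (the wide body below is just a placeholder, and the narrowing is deliberately ASCII-only):

```cpp
#include <string>

// Placeholder wide implementation -- a real library would look the
// message up from the OS; the literal here is just a stand-in.
std::wstring GetLastErrorStringW()
{
    return L"The operation completed successfully.";
}

// Narrow thunk: delegate to the wide version and narrow the result.
// This deliberately naive narrowing handles ASCII only; anything
// outside that range is replaced with '?'.
std::string GetLastErrorStringA()
{
    const std::wstring wide = GetLastErrorStringW();
    std::string narrow;
    narrow.reserve(wide.size());
    for (wchar_t wc : wide)
        narrow.push_back(wc < 0x80 ? static_cast<char>(wc) : '?');
    return narrow;
}
```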

Or should I just release separate libraries, one for ASCII and one for Unicode?

Just wondering what people think is the best thing to do in this situation.

UPDATE: Addressing some comments and questions:

  • These will be C++ class libraries.
  • I believe I will need to use UTF-16 encoding as I would like to support Asian character sets.
  • My reasons for implementing Unicode are twofold: 1) All new SDKs support Unicode and I'm not confident that future SDKs or third party libraries will be supporting separate ASCII versions in the future. 2) While we will not be completely internationalizing our application, it would be nice if we could handle user input (like names) and files loading from paths that contain Asian characters.
+3  A: 

I would make the library entirely Unicode internally. Then I would provide a set of C++ adapter classes for ASCII that thunk to the Unicode implementation.
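For example (class names are hypothetical, and the narrowing shown is deliberately ASCII-only): the Unicode class holds the real logic, and the ASCII adapter owns one and converts only at the boundary.

```cpp
#include <string>
#include <utility>

// The real implementation works in wide characters throughout.
class PathW {
public:
    explicit PathW(std::wstring path) : path_(std::move(path)) {}
    const std::wstring& Str() const { return path_; }
private:
    std::wstring path_;
};

// ASCII adapter: owns a PathW and converts at the boundary only.
// The narrowing is naive (ASCII-only); non-ASCII becomes '?'.
class PathA {
public:
    explicit PathA(const std::string& path)
        : impl_(std::wstring(path.begin(), path.end())) {}
    std::string Str() const {
        std::string s;
        for (wchar_t wc : impl_.Str())
            s.push_back(wc < 0x80 ? static_cast<char>(wc) : '?');
        return s;
    }
private:
    PathW impl_;
};
```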

Michael
The question is whether it is necessary to use a Unicode string type internally at all, since depending on the encoding a plain std::string might do the trick.
Matthieu M.
A: 

The question is a bit imprecise but...

First you have to specify the encoding. Unicode is just a representation of the characters (each being associated with a code point); when it comes to dealing with Unicode in an application, you have to choose how the code points are going to be represented. If you can go with UTF-8, you won't have to worry about wide chars, and you can store the data in a plain std::string :)

Then you have to clarify your problem:

  • do you want to support input in both Unicode and ASCII?
  • or are you talking about the output?
  • is there any way that you could use std::locale to know which encoding you should output in?

I am working on an internationalized application (a website, with a C++ backend...) and we simply use std::string internally. Whether the output is ASCII or UTF-8 depends on a translation file, but the data representation does not vary by an iota (except for counting characters, see my post on this topic).
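On counting characters: with UTF-8 in a std::string, size() reports bytes, not characters. A minimal sketch of a code-point count, assuming well-formed UTF-8 (the helper name is mine):

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8.
std::size_t Utf8Length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    return count;
}
```

So "héllo" is six bytes but five characters, and std::string happily stores it either way.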

Really, I am definitely not a fan of macros either. UTF-8 was meant to be backward compatible with ASCII, so if you can choose your own encoding, you're saved!

Matthieu M.
A: 

You can store Unicode strings in a std::string if you convert them to UTF-8 first.

You only need wstring when interfacing with UTF-16 calls, like the Windows API. If that is the case you can convert your strings to wstrings locally where needed. This can be a bit burdensome, but it's not that bad.
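A sketch of such a local conversion, assuming well-formed UTF-8 input (the function name is mine; on Windows you would more likely call MultiByteToWideChar with CP_UTF8 instead):

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 -> UTF-16 decoder. Assumes well-formed UTF-8; code points
// above U+FFFF are encoded as surrogate pairs.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char c = utf8[i];
        char32_t cp;
        std::size_t len;
        if      (c < 0x80)           { cp = c;        len = 1; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; len = 2; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; len = 3; }
        else                         { cp = c & 0x07; len = 4; }
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + j]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```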

StackedCrooked
A: 

I t-h-i-n-k you're asking about code "understandability" rather than using ASCII, UTF-8, 16 or 32 bit characters.

If so, I prefer making the blocks of code as large as possible: that would dispose one to use the "gate" (the _UNICODE symbolic constant) to select either separate files or, at least, big chunks of code. Code that changes its spots every other line, or, heaven forbid, within a statement, is difficult to comprehend.

I would counsel against using the gate to select inclusions of separate files

#ifdef _UNICODE
#include "myUniLib.h"
#else
#include "myASCIILib.h"
#endif

since that would entail two, and maybe even three, files (the Unicode file, the 646US (ASCII) file, and perhaps your nexus file containing the above code). That's three times the possibility of something being lost and a resultant build failure.

Instead, use the gate within a file to select large blocks of code:

#ifdef _UNICODE
   ...lotsa code...
#else
   ...lotsa code...
#endif

OK, say you're doing the opposite: wondering about plain char (ASCII) versus char (UTF-8) versus the W and A variants. How universal do you want to be? The CStrings you mention are for the Windows world only. If you want to be Mac and UNIX (OK, Linux) compatible, you are in for a rough ride.

BTW: ASCII is ...not... a recognized standard any more. There's ASCII and then there's... ASCII. If you mean the seven-bit "standard" from the old days of UNIX, the closest I have found is ISO-646US. The Unicode equivalent is ISO/IEC 10646.

Some folks have had luck with encoding the characters as URLs: just ASCII letters, digits, and the percent sign. While you have to encode and decode all of the time, the storage is really predictable. A little strange, yes, but definitely innovative.

There are some linguistic pitfalls. For example, do not depend on case conversion to round-trip. In German, the lower-case ß becomes SS when converted to upper case; SS, however, when lower-cased, morphs to ss, not ß. Turkish has something similar with its dotted and dotless i. When designing your application, don't assume that case conversions can help you.
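To illustrate with UTF-8 in a std::string: byte-wise std::toupper cannot perform the length-changing ß to SS mapping, so "straße" does not become "STRASSE" (the function name is mine):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Byte-wise uppercasing: fine for ASCII, but it cannot turn the two
// UTF-8 bytes of U+00DF (ß) into "SS" -- they pass through unchanged.
std::string NaiveUpper(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}
```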

Also, remember that grammatical ordering is different across languages. A mere, "Hello, Jim! How is your Monday going?" can end up being "Hello! Your, Monday, it goes well, Jim?"

Finally, a warning: avoid stream IO (std::cin >> and std::cout <<) for building messages. It traps you into embedding your message generators in such a way that localizing them becomes very difficult.
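A sketch of the alternative: keep the whole sentence in one translatable template with positional placeholders, so a translator can reorder the arguments freely (Format is a hypothetical helper, simplified to two arguments and first-occurrence replacement):

```cpp
#include <string>

// Substitute {0} and {1} in a translatable message template. Because the
// template is a single string, a translator can reorder the placeholders
// to match the target language's grammar.
std::string Format(std::string tmpl,
                   const std::string& arg0, const std::string& arg1)
{
    auto replace = [&tmpl](const std::string& key, const std::string& value) {
        const std::string::size_type pos = tmpl.find(key);
        if (pos != std::string::npos)
            tmpl.replace(pos, key.size(), value);
    };
    replace("{0}", arg0);
    replace("{1}", arg1);
    return tmpl;
}
```

With stream insertions the word order is baked into the code; with a template, swapping the English string for a translated one is all that's needed.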

You're asking the right questions. You have an adventure ahead of you! Best!