views: 228
answers: 6

I want to write a program in C++ that should work on Unix and Windows. This program should be able to work in both Unicode and non-Unicode environments, and its behavior should depend only on the environment settings.

One of the nice features I want is to be able to manipulate file names read from directories. These can be Unicode... or not.

What is the easiest way to achieve that?

+1  A: 

You have to decide which Unicode encoding you want to use, e.g. UTF-8, ISO-8859-1, etc. Then you should take this into consideration in all your C++ string manipulation. E.g. take a look at wchar_t and std::wstring. In a non-Unicode environment, I assume you mean that the input will be ASCII only?

Yes, only ASCII in the non-Unicode one. The problem with two versions of the program is that I have to provide both and decide which to run. I'd rather have one program and just run it.
Simon
ISO-8859-1 is not a Unicode encoding.
Martin York
@Simon: ASCII and UTF-8 are backwards compatible, so all ASCII characters are also valid UTF-8 (no change required). But note that ASCII is only 0-127. Once you get above 127 you are talking about ISO-8859-*, which defines what the codes 128-255 mean.
Martin York
@Simon: What you could do is use character arrays in your program for the strings. All input would be converted to UTF-8 (byte by byte). The ASCII chars below 128 would remain the same, while the rest would be converted according to the scheme. What you have to do, though, is allocate enough space in the array for the UTF-8: a char array is an array of single bytes, but a UTF-8 character may take 2, 3 or 4 bytes.
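
A minimal sketch of that byte-by-byte conversion, assuming the non-Unicode input is ISO-8859-1 (Latin-1); latin1_to_utf8 is just an illustrative name:

#include <string>

// Convert an ISO-8859-1 (Latin-1) string to UTF-8.
// Bytes 0-127 are copied unchanged; bytes 128-255 expand to two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);                          // worst case: every byte doubles
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);                 // plain ASCII, unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // leading byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return out;
}
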
A: 

The best way I've seen is to have typedefs and a few macros defined based on conditional compilation. For example:

#include <string>

#ifdef UNICODE
#define mychar wchar_t
#define s(a) L ## a                       // s("foo") expands to L"foo"
typedef std::wstring mystringa;
#else
#define mychar char
#define s(a) a                            // s("foo") stays "foo"
typedef std::string mystringa;
#endif
typedef std::basic_string<mychar> mystringb;

and so on. You would then write string literals as s("foo") and strings as mystringa(s("foo")). I've shown two ways to create a string type; either should work.

David Thornley
You've got your `typedef` and `#define` syntax mixed up.
dan04
This is a sensible solution. One thing that I would do in addition is let Windows' macro and type names dominate; i.e. `TCHAR` instead of `mychar`, `_UNICODE` instead of `UNICODE`, and `TEXT` instead of `s`.
Daniel Trebbien
`TCHAR` is very Windows-specific. Sure, you *could* define it on Unix, but it's not really useful unless (1) you have a library that's overloaded with `char` and `wchar_t` versions of everything, and (2) you actually bother to build both versions.
dan04
@Daniel: Except that names with leading underscores followed by capital letters belong to the implementation. That means that _UNICODE and _T() are technically out. And thanks for the edit; I don't know what I was thinking.
David Thornley
@dan04: The original question was about being able to switch easily between ASCII and Unicode on Windows and Linux. That implies the libraries are available, or at least that the correct one will be available on the specific Unix being targeted.
David Thornley
@David: Or it could just be an assumption made by a Windows programmer used to having both "ANSI" and "Unicode" functions without realizing that other platforms don't have that.
dan04
+2  A: 

You have to decide how you represent the text internally.
This should be constant no matter what else you choose.

Then whenever you read any input you must transcode from the input format into the internal format, and then from the internal format to the output format on the way out. If you happen to use the same format internally and externally, this becomes an identity operation.

UTF-8 is great for storage and transmission, as it is compact.
But I don't like it as an internal representation because it has variable length.

UTF-16: Was supposed to be the savior of all mankind,
but was quickly superseded by UTF-32.

UTF-32: Fixed width, therefore great for internal representation and manipulation.
Easy to convert to/from UTF-8.
Very bulky (each character takes 4 bytes).

Most OSes have either already converted to a UTF string representation or are heading that way. So using an old, obsolete format like ISO-8859 internally just means that calls to the OS will cause extra work as the string is converted to/from UTF. As a result this seems like a waste of time (to me).
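
A minimal sketch of that in/out transcoding, using std::wstring_convert (C++11; deprecated in C++17, but it illustrates the idea): read UTF-8, hold UTF-32 internally, write UTF-8 back out.

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // UTF-8 <-> UTF-32 converter from the standard library.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::string    utf8_in  = "caf\xC3\xA9";              // "café" encoded as UTF-8
    std::u32string internal = conv.from_bytes(utf8_in);   // transcode into the internal format

    internal += U'!';                                      // fixed-width manipulation is easy

    std::string utf8_out = conv.to_bytes(internal);        // transcode back on the way out
    return utf8_out.size() == 6 ? 0 : 1;                   // "café!" is 6 bytes in UTF-8
}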

Martin York
Quick note: UTF-16 and UTF-32 are subject to endianness issues. UTF-8 and UTF-16 make it hard to know the number of code points... but since the number of code points differs from the number of graphemes anyway, it doesn't really matter.
Matthieu M.
+1  A: 

The locale identifier "" (empty string) specifies an implementation-specific default locale. So if you set the global locale to std::locale("") then, in theory, you will get a default locale that is initialized from the environment's locale settings. That is about as much help as standard C++ gives you.

This has some major limitations on Windows, where MSVC doesn't provide any std::locale with a UTF-8 encoding. And Mac OS X doesn't provide any std::locale other than the culture-neutral "C" locale.

In practice it's common to standardize on UTF-8-encoded std::string everywhere inside your app. Then, in those specific cases where you need to interact with the OS, do the code conversion as necessary. For example, you'll use a const char * encoded as UTF-8 to name a file on Unix, but a wchar_t * encoded as UTF-16 to name a file on Windows.

UTF-8 is a widely recommended internal character set for applications that are intended to be portable. UTF-16 has the same variable-width encoding problems as UTF-8, plus it uses more space for a lot of languages; it also adds a byte-ordering issue and has relatively little support on Unix. UTF-32 is the simplest encoding to work with, but it uses the most space and has no native support on Windows.
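
A minimal sketch of that boundary conversion for opening a file, assuming UTF-8 std::string internally (open_utf8 is an illustrative name, not a library function):

#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Open a file whose name is held internally as UTF-8.
// On Windows the name is converted to UTF-16 for the wide API;
// on Unix the UTF-8 bytes are passed through unchanged.
std::FILE* open_utf8(const std::string& utf8_name, const char* mode)
{
#ifdef _WIN32
    // Ask how many UTF-16 code units are needed (including the terminator), then convert.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1, NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1, &wide[0], len);

    std::wstring wmode(mode, mode + std::char_traits<char>::length(mode));
    return _wfopen(wide.c_str(), wmode.c_str());
#else
    return std::fopen(utf8_name.c_str(), mode);
#endif
}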

karunski
+3  A: 

I want to write a program in C++ that should work on Unix and Windows.

First, make sure you understand the difference between how Unix supports Unicode and how Windows supports Unicode.

In the pre-Unicode days, both platforms were similar in that each locale had its own preferred character encodings. Strings were arrays of char. One char = one character, except in a few East Asian locales that used double-byte encodings (which were awkward to handle due to being non-self-synchronizing).

But they approached Unicode in two different ways.

Windows NT adopted Unicode in the early days when Unicode was intended to be a fixed-width 16-bit character encoding. Microsoft wrote an entirely new version of the Windows API using 16-bit characters (wchar_t) instead of 8-bit char. For backwards-compatibility, they kept the old "ANSI" API around and defined a ton of macros so you could call either the "ANSI" or "Unicode" version depending on whether _UNICODE was defined.

In the Unix world (specifically, Plan 9 from Bell Labs), developers decided it would be easier to expand Unix's existing East Asian multi-byte character support to handle 3-byte characters, and created the encoding now known as UTF-8. In recent years, Unix-like systems have been making UTF-8 the default encoding for most locales.

Windows theoretically could expand their ANSI support to include UTF-8, but they still haven't, because of hard-coded assumptions about the maximum size of a character. So, on Windows, you're stuck with an OS API that doesn't support UTF-8 and a C++ runtime library that doesn't support UTF-8.

The upshot of this is that:

  • UTF-8 is the easiest encoding to work with on Unix.
  • UTF-16 is the easiest encoding to work with on Windows.

This creates just as much complication for cross-platform code as it sounds. It's easier if you just pick one Unicode encoding and stick to it.

Which encoding should that be?

See UTF-8 or UTF-16 or UTF-32 or UCS-2

In summary:

  • UTF-8 lets you keep the assumption of 8-bit code units.
  • UTF-32 lets you keep the assumption of fixed-width characters.
  • UTF-16 sucks, but it's still around because of Windows and Java.

wchar_t

is the standard C++ "wide character" type. But its encoding is not standardized: it's UTF-16 on Windows and UTF-32 on Unix, except on those platforms that use locale-dependent wchar_t encodings as a legacy from East Asian programming.

If you want to use UTF-32, use uint32_t or an equivalent typedef to store characters, or use wchar_t where __STDC_ISO_10646__ is defined (which guarantees that wchar_t holds ISO 10646 code points).

The new C++ standard will have char16_t and char32_t, which will hopefully clear up the confusion on how to represent UTF-16 and UTF-32.
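
A small sketch of that type choice, picking a 32-bit code-unit type from whatever the implementation offers (u32char and u32text are illustrative names):

#include <stdint.h>
#include <vector>

// Pick a 32-bit code-unit type for UTF-32 text.
#if __cplusplus >= 201103L
typedef char32_t u32char;            // the new standard: a dedicated UTF-32 type
#elif defined(__STDC_ISO_10646__)
typedef wchar_t u32char;             // wchar_t holds ISO 10646 code points here
#else
typedef uint32_t u32char;            // otherwise, fall back to a plain 32-bit integer
#endif

// A simple UTF-32 "string": one element per code point.
typedef std::vector<u32char> u32text;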

TCHAR

is a Windows typedef for wchar_t (assumed to be UTF-16) when _UNICODE is defined and char (assumed to be "ANSI") otherwise. It was designed to deal with the overloaded Windows API mentioned above.

In my opinion, TCHAR sucks. It combines the disadvantages of having platform-dependent char with the disadvantages of platform-dependent wchar_t. Avoid it.

The most important consideration

Character encodings are about information interchange. That's what the "II" stands for in ASCII. Your program doesn't exist in a vacuum. You have to read and write files, which are more likely to be encoded in UTF-8 than in UTF-16.

On the other hand, you may be working with libraries that use UTF-16 (or more rarely, UTF-32) characters. This is especially true on Windows.

My recommendation is to use the encoding form that minimizes the amount of conversion you have to do.

This program should be able to work in both Unicode and non-Unicode environments

It would be much better to have your program work entirely in Unicode internally and only deal with legacy encodings for reading legacy data (or writing it, but only if explicitly asked to).
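
On the Unix side, a minimal sketch of that legacy-input conversion using POSIX iconv, assuming the legacy data is ISO-8859-1 (error handling trimmed; on Windows you would reach for MultiByteToWideChar instead):

#include <iconv.h>
#include <string>
#include <vector>

// Convert legacy ISO-8859-1 input to UTF-8 for internal use.
std::string legacy_to_utf8(const std::string& latin1)
{
    if (latin1.empty()) return std::string();

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");        // (to, from)

    std::vector<char> out(latin1.size() * 4);               // generous output buffer
    char*  in_ptr   = const_cast<char*>(latin1.data());
    size_t in_left  = latin1.size();
    char*  out_ptr  = &out[0];
    size_t out_left = out.size();

    iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);      // convert the whole buffer
    iconv_close(cd);

    return std::string(&out[0], out.size() - out_left);
}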

dan04
UTF-16 is also the internal, native string format of MacOS X and iOS (Cocoa API). Not just Windows and Java. Unix is the odd one out, in fact.
Seva Alekseyev
One important note: The "wide character" Windows API works with UCS-2, not UTF-16.
Daniel Trebbien
There are at least some contexts in which surrogate pairs are supported: http://msdn.microsoft.com/en-us/library/dd374069%28VS.85%29.aspx
dan04
@Daniel: The Windows API works with UTF-16 (including surrogates) at least since XP.
Nemanja Trifunovic
@Nemanja: Do you have a reference?
Daniel Trebbien
@Daniel: From Michael Kaplan himself: http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx . Also in MSDN: http://msdn.microsoft.com/en-us/library/dd374069(VS.85).aspx
Nemanja Trifunovic
+1  A: 

Personally, I would go down a different road.

Whatever format you choose, it should accommodate Unicode; that's a given. However, you certainly do not have to feel restricted to using an existing encoding.

A specific encoding is meant to make communication easy, but since Unix defaults to UTF-8 and Windows to UTF-16, there is no universal choice. Therefore I would simply suggest using your own internal representation and applying suitable conversions depending on the OS you are targeting, done through a common interface to the functions you need and one implementation per OS/encoding.

Also note that you should be able to change the encoding/decoding on the fly regardless of the platform you are on (e.g., you might be asked to use UTF-32 on Unix for a specific file), which is one more reason NOT to hard-wire a given encoding.
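
A rough sketch of what that common interface could look like (the names are illustrative; concrete encoders are chosen at runtime, which also covers switching encodings on the fly):

#include <string>

// Internal representation: UTF-32, one element per code point.
typedef std::u32string internal_text;

// Common interface; one implementation per OS/encoding pair lives behind it.
class Transcoder {
public:
    virtual ~Transcoder() {}
    virtual internal_text decode(const std::string& external) const = 0; // external -> internal
    virtual std::string   encode(const internal_text& text)   const = 0; // internal -> external
};

// Example implementations, selected from the locale, the command line, or per file.
class Utf8Transcoder  : public Transcoder { /* ... */ };
class Utf16Transcoder : public Transcoder { /* ... */ };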

To sum it up:

  • ICU is great
  • if you implement it yourself and wish to be somewhat "standard", use UTF-32 (4 bytes per code point)
  • if you are tight on memory, 21 bits (< 3 bytes) are sufficient to encode all existing code points

Conversion may seem compute-intensive, but:

  • you can do it stream-wise (see the sketch below)
  • it's much faster than I/O
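
A minimal sketch of the stream-wise approach: a hand-rolled decoder that pulls one UTF-8 code point at a time from an input stream (no validation of malformed input):

#include <istream>
#include <stdint.h>
#include <string>

// Read one code point from a UTF-8 byte stream; returns 0xFFFFFFFF at end of stream.
uint32_t next_code_point(std::istream& in)
{
    int first = in.get();
    if (first == std::char_traits<char>::eof()) return 0xFFFFFFFFu;

    uint32_t cp;
    int extra;                                              // continuation bytes to read
    if      ((first & 0x80) == 0x00) { cp = first;        extra = 0; }  // 1-byte sequence
    else if ((first & 0xE0) == 0xC0) { cp = first & 0x1F; extra = 1; }  // 2-byte sequence
    else if ((first & 0xF0) == 0xE0) { cp = first & 0x0F; extra = 2; }  // 3-byte sequence
    else                             { cp = first & 0x07; extra = 3; }  // 4-byte sequence

    while (extra-- > 0)
        cp = (cp << 6) | (in.get() & 0x3F);                 // fold in each continuation byte
    return cp;
}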

My 2 cts, as they say :)

Matthieu M.