I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in as UTF-8. Based on feedback, it seems Windows usually provides three versions of each function:

  • fooA() ANSI encoded strings
  • fooW() Unicode encoded strings
  • foo() string encoding depends on the UNICODE define

Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable from the library and by the user and this complicates the situation a little.

I would like to keep support for UTF-8 strings, as the library is usable on multiple operating systems.

+4  A: 

The foo functions without the suffix are in fact macros that expand to either fooA or fooW. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar and the CP_UTF8 code page).
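To illustrate the macro mechanism, here is a minimal sketch of how a library could mimic the A/W naming. The names mylib_lengthA, mylib_lengthW, mylib_length and MYLIB_TEXT are made up for this example and are not part of any real API; the two functions just count code units so that there is something to call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical library function in two flavours, mirroring the Windows
   fooA/fooW naming. Both return the number of code units in the string. */
size_t mylib_lengthA(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

size_t mylib_lengthW(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* The unsuffixed name is only a macro; which function it expands to is
   decided at compile time by the UNICODE define. */
#ifdef UNICODE
#  define mylib_length  mylib_lengthW
#  define MYLIB_TEXT(s) L##s   /* turns "abc" into the wide literal L"abc" */
#else
#  define mylib_length  mylib_lengthA
#  define MYLIB_TEXT(s) s
#endif
```

A caller then writes mylib_length(MYLIB_TEXT("abc")) and gets whichever variant matches the build settings. Note that this only selects between existing functions; it does not save you from writing both of them.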

For the public interface of your library, stick to exactly one encoding: UTF-16, UTF-32 or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers. You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt or the Java standard libraries use UTF-16 on all systems.

I think the choice between the three Unicode encodings depends on which OS you expect the library to be used on most. If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions. On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit code unit), but see the note above regarding libraries. OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).

Some notes on terminology: the words "ANSI" and "Unicode" as used in the Microsoft documentation do not match what the international standards say. When Microsoft speaks of "Unicode" or "wide characters", they mean UTF-16 or (historically) the BMP subset thereof, where every code point takes exactly one code unit. "ANSI" in Microsoft parlance means some locale-dependent legacy encoding, which is completely obsolete in all modern versions of Windows.
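The difference between code units and code points is easy to see in code. The following is only a sketch: the function names are invented for this example, and uint16_t is used as the UTF-16 code unit type so that it does not depend on the platform's wchar_t size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit code units in a NUL-terminated UTF-16 string. */
size_t utf16_code_units(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points: a high surrogate (D800..DBFF) followed by a low
   surrogate (DC00..DFFF) forms one code point outside the BMP.
   Unpaired surrogates are counted as one code point each. */
size_t utf16_code_points(const uint16_t *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; s[i] != 0; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

For a string made of 'A' followed by U+1F600 (encoded as the pair D83D DE00), the first function reports three code units and the second reports two code points. For text that stays inside the BMP, both numbers are the same, which is the historical assumption mentioned above.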

If you want a definitive recommendation, go for UTF-16 and the ICU library.

Philipp
+1  A: 

Since your library already requires UTF-8 encoded strings, it is already fully Unicode enabled, as UTF-8 is a lossless Unicode encoding. If you want to use your library in an environment that normally uses UTF-16 or even UTF-32 strings, the calling code can simply encode to, and decode from, UTF-8 when talking to your library. Otherwise, your library would have to expose extra UTF-16/32 functions that do those encoding/decoding operations internally.
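As a rough sketch of that second option, a UTF-16 entry point can convert its argument and delegate to the existing UTF-8 function. Everything named here is hypothetical: mylib_byte_length_utf8 stands in for one of your real UTF-8 functions, and utf16_to_utf8 is a hand-written converter used so the example has no platform dependencies (on Windows you could use WideCharToMultiByte instead).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for an existing UTF-8 entry point of the library (hypothetical). */
size_t mylib_byte_length_utf8(const char *utf8)
{
    return strlen(utf8);
}

/* Convert a NUL-terminated UTF-16 string to a newly allocated UTF-8 string.
   Returns NULL on allocation failure or on an unpaired surrogate.
   The caller must free() the result. */
char *utf16_to_utf8(const uint16_t *in)
{
    assert(in != NULL);
    size_t n = 0;
    while (in[n] != 0)
        n++;
    /* Each UTF-16 code unit needs at most 3 UTF-8 bytes
       (a surrogate pair is 2 units and 4 bytes). */
    unsigned char *out = malloc(3 * n + 1);
    if (out == NULL)
        return NULL;
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            /* high surrogate */
            uint32_t lo = in[i + 1];                   /* in[n] is the NUL */
            if (lo < 0xDC00 || lo > 0xDFFF) { free(out); return NULL; }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     /* stray low surrogate */
            free(out);
            return NULL;
        }
        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = 0;
    return (char *)out;
}

/* UTF-16 flavour of the library call: convert, delegate, clean up.
   Returns (size_t)-1 if the input could not be converted. */
size_t mylib_byte_length_utf16(const uint16_t *utf16)
{
    char *utf8 = utf16_to_utf8(utf16);
    if (utf8 == NULL)
        return (size_t)-1;
    size_t result = mylib_byte_length_utf8(utf8);
    free(utf8);
    return result;
}
```

Each extra function costs one conversion and one allocation per call, so the core of the library stays UTF-8 only. Functions that return strings would need the reverse conversion as well.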

Remy Lebeau - TeamB