Hi all,

When porting my Visual C++ project to GCC, I found out that the wchar_t datatype is 4-byte UTF-32 by default. I could override that with a compiler option, but then the whole wcs* (wcslen, wcscmp, etc.) part of the RTL is rendered unusable, since it assumes 4-byte wide strings.

For now, I've reimplemented 5-6 of these functions from scratch and #defined my implementations in. But is there a more elegant option - say, a build of the GCC RTL with a 2-byte wchar_t quietly sitting somewhere, waiting to be linked?

The specific flavors of GCC I'm after are Xcode on Mac OS X, Cygwin, and the one that comes with Debian Linux Etch.

+1  A: 

Look at the ICU library. It is a portable library with a UTF-16 API.
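
Something like this, for instance - a rough sketch using ICU's C++ API; UnicodeString stores 16-bit code units regardless of what the platform's wchar_t is:

    #include <unicode/unistr.h>   // icu::UnicodeString
    #include <iostream>

    int main() {
        // Build a UTF-16 string from UTF-8 input; the storage is 16-bit
        // code units on every platform, independent of wchar_t.
        icu::UnicodeString s = icu::UnicodeString::fromUTF8("\xC3\xA9l\xC3\xA8ve"); // "élève"
        std::cout << "code units: " << s.length()         // UTF-16 code units
                  << ", code points: " << s.countChar32()  // actual characters
                  << std::endl;
        return 0;
    }

(Link against the common library, e.g. -licuuc.)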

bmargulies
Rewriting all my MSVC wide string code is not what I'm looking for, sorry. I want source compatibility with the UCS-2 RTL.
Seva Alekseyev
*shrug* my employer sells such a library. I'm reasonably sure that ICU is the closest free alternative.
bmargulies
+1  A: 

As you've noticed, wchar_t is implementation-defined. There is no way to work portably with that data type.

Linux systems in general had the advantage of gaining Unicode support later, after the whole UCS-2 approach had been recognized as a not-so-great idea, and they use UTF-8 as the encoding. All system APIs still operate on char* and are Unicode-safe.

Your best bet is to use a library that manages this for you: Qt, ICU, etc.

Note that Cygwin features a 2-byte wchar_t to make meshing with Windows easier.
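
For illustration, a rough sketch of the difference - the printed size is 2 under MSVC and Cygwin, 4 under glibc and on the Mac, and the locale name below is an assumption (any installed UTF-8 locale will do):

    #include <clocale>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t)); // 2 or 4, depending on platform

        // On Linux the system APIs take UTF-8 char*; conversion to wchar_t
        // happens only at the boundary, via the locale machinery.
        std::setlocale(LC_ALL, "en_US.UTF-8");       // assumed locale name
        const char *utf8 = "\xC3\xA9l\xC3\xA8ve";    // "élève" as UTF-8 bytes
        wchar_t wide[16];
        std::size_t n = std::mbstowcs(wide, utf8, 16);
        if (n != (std::size_t)-1)
            std::printf("decoded %zu wide characters\n", n);
        return 0;
    }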

Yann Ramin
+1  A: 

But is there a more elegant option - say, a build of the GCC RTL with a 2-byte wchar_t quietly sitting somewhere, waiting to be linked?

No. This is a platform-specific issue, not a GCC issue.

That is to say, the Linux platform ABI specifies that wchar_t is 32 bits wide, so you either have to use a whole new library (for which ICU is a popular choice) or port your code to handle a 4-byte wchar_t. All libraries that you might link to will also assume a 4-byte wchar_t, and will break if you use GCC's -fshort-wchar.

But on Linux specifically, nearly everyone has standardized on UTF-8 for all multibyte encodings.
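
If you do try -fshort-wchar anyway, it is worth making the mismatch fail the build rather than corrupt strings at run time. A sketch relying on GCC's predefined __SIZEOF_WCHAR_T__ macro (present in newer GCC releases):

    // glibc's wcslen/wcscmp/etc. are compiled for a 4-byte wchar_t; calling
    // them from code built with -fshort-wchar silently misreads the strings,
    // so refuse to compile that combination.
    #if defined(__GNUC__) && defined(__linux__) && defined(__SIZEOF_WCHAR_T__)
    #if __SIZEOF_WCHAR_T__ != 4
    #error "Do not build this code with -fshort-wchar on Linux"
    #endif
    #endif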

greyfade
Point taken. For the record, any nontrivial string processing in UTF-8 sucks plastic bags. Iterating to the i-th character (not byte) in the string is an O(i) operation, oh my.
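
Roughly, getting at the i-th code point means walking over every continuation byte before it - a sketch:

    #include <cstddef>

    // Advance to the i-th code point of a NUL-terminated UTF-8 string: O(i),
    // since every earlier character has to be scanned. Continuation bytes
    // all match the bit pattern 10xxxxxx.
    const char *utf8_index(const char *s, std::size_t i)
    {
        while (*s && i > 0) {
            ++s;
            while ((*s & 0xC0) == 0x80)  // skip continuation bytes
                ++s;
            --i;
        }
        return s;
    }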
Seva Alekseyev
But that's the locale that's configured on almost all Linux systems these days, so it's something you have to deal with.
greyfade
@Seva: UTF-16 has the same issue. Don't confuse UCS-2 (pre-Win2k) with UTF-16 (Win2k+).
Yann Ramin
The characters I'm working with are limited to the Basic Multilingual Plane by design. So, for practical purposes, it's all UCS-2.
Seva Alekseyev
And besides, who said anything about Linux? I first encountered this on Mac, where unsigned short (AKA "unichar") is the OS-level native character format, just like in Win32.
Seva Alekseyev
@theatrus: demonstrably WRONG. I've just constructed an NSString with a character from a higher Unicode plane (an exotic Chinese character). Such a string is two unichars long, but one Unicode character long. And guess what, [length] returns 2. The system would even let me take a substring with *half a character*. So maybe it's UTF-16-correct on display, but for the purposes of programmatic processing it's still UCS-2. Can't blame them, frankly.
Seva Alekseyev
A: 

Reimplemented 5-6 of the more common wcs* functions and #defined my implementations in.
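
Roughly along these lines - a sketch, where the my_* names are placeholders and the project is assumed to be built with -fshort-wchar, so wchar_t is 2 bytes:

    #include <cstddef>

    // Drop-in replacements for the handful of wcs* functions the code needs.
    // Compiled with the rest of the project under -fshort-wchar, so wchar_t
    // here is the same 2-byte type the MSVC code expects.
    static std::size_t my_wcslen(const wchar_t *s)
    {
        const wchar_t *p = s;
        while (*p) ++p;
        return static_cast<std::size_t>(p - s);
    }

    static int my_wcscmp(const wchar_t *a, const wchar_t *b)
    {
        while (*a && *a == *b) { ++a; ++b; }
        return static_cast<int>(*a) - static_cast<int>(*b);
    }

    // Route the standard names to the replacements.
    #define wcslen my_wcslen
    #define wcscmp my_wcscmp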

Seva Alekseyev