To my understanding, the representations of size_t and wchar_t are completely platform/compiler specific. For instance, I have read that wchar_t on Linux is now usually 32-bit, but on Windows it is 16-bit. Is there any way that I can standardize these to a set size (int, long, etc.) in my own code, while still maintaining backwards compatibility with the existing standard C libraries and functions on both platforms?

My goal is essentially to do something like typedef them so they are a set size. Is this possible without breaking something? Should I do this? Is there a better way?

UPDATE: The reason I'd like to do this is so that my string encoding is consistent across both Windows and Linux.

Thanks!

+4  A: 

You don't want to redefine those types. Instead, you can use typedefs like int32_t or int16_t (signed 32-bit and 16-bit), which are defined in <stdint.h>, part of the C standard library since C99.
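As a rough illustration of the fixed-width approach, here is a minimal sketch in C. The names utf32_unit, utf32_len and wchar_matches_utf32 are hypothetical and not part of any standard library; the point is only that the code unit has the same width on every platform that provides uint32_t, whereas wchar_t may or may not match it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical name: a code unit that is 32 bits wide wherever uint32_t exists. */
typedef uint32_t utf32_unit;

/* Hypothetical helper, analogous to wcslen but for utf32_unit strings.
   Returns the number of code units before the terminating zero. */
size_t utf32_len(const utf32_unit *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Reports whether wchar_t happens to have the same width as utf32_unit
   on the platform this is compiled for (typically true on Linux,
   false on Windows). */
int wchar_matches_utf32(void)
{
    return sizeof(wchar_t) == sizeof(utf32_unit);
}
```

Note that a utf32_unit string still cannot be handed to wcslen or other wchar_t functions directly; it would need an explicit conversion on platforms where the widths differ.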

If you're using C++, C++0x will add char16_t and char32_t, which are new types (not just typedefs for integral types) intended for UTF-16 and UTF-32.

For wchar_t, an alternative is to just use a library like ICU which implements Unicode in a platform-independent way. Then, you can just use the UChar type, which will always be UTF-16; you do still need to be careful about endianness. ICU also provides converters to and from UChar (UTF-16).

Matthew Flaschen
If I create an int32_t 'string' will it still work with all wchar_t based functions/methods? I'd like to know that on all platforms my encoding is UTF-32 (as an example)
Tyler
@Tyler, no. You have to make sure anything you pass to a function requiring a `wchar_t` string can safely be converted to it. So for instance, passing a pointer to int32_t to a Windows wchar_t function will fail.
Matthew Flaschen
In ICU you can use u_strToWCS() and u_strFromWCS() to convert between UChar and your platform's Unicode wchar_t (assuming wchar_t is Unicode). Then just use UChar* everywhere for your string. ICU provides plenty of functions to work with the UChar* string.
Steven R. Loomis
+6  A: 

Sounds like you're looking for the <stdint.h> header from C99, or <cstdint> from C++0x. These define types like uint8_t and int64_t.

You can use Boost's cstdint.hpp if you don't have those headers.

GMan
A: 

wchar_t is probably going to be a stickier wicket than size_t. For size_t, you could assume a maximum size (8 bytes, say) and cast all values to that before writing them to a file (or socket). Keep in mind that you will also have byte-ordering issues if you are trying to write or read any sort of binary representation. wchar_t, on the other hand, may hold a UTF-32 encoding on one system (I believe Linux does this) and a UTF-16 encoding on another (Windows does this). If you are trying to create a standard format between platforms, you are going to have to resolve all of these issues.
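The size_t part of this can be sketched as follows, assuming you settle on 8 bytes in little-endian order for the on-disk or on-wire format. The function names put_size_le64 and get_size_le64 are hypothetical. Writing the bytes one at a time with shifts sidesteps the host's byte order entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a size_t as exactly 8 bytes, least significant byte first,
   regardless of the host's size_t width or endianness. */
void put_size_le64(unsigned char out[8], size_t value)
{
    uint64_t v = (uint64_t)value;
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        out[i] = (unsigned char)((v >> (8 * i)) & 0xFF);
    }
}

/* Read the 8 bytes back. Returns 0 on success, or -1 if the stored
   value does not fit in this platform's size_t (for example a large
   value written on a 64-bit machine and read on a 32-bit one). */
int get_size_le64(const unsigned char in[8], size_t *value)
{
    uint64_t v = 0;
    assert(in != NULL && value != NULL);
    for (int i = 0; i < 8; i++) {
        v |= (uint64_t)in[i] << (8 * i);
    }
    if (v > SIZE_MAX) {
        return -1;
    }
    *value = (size_t)v;
    return 0;
}
```

The range check on reading matters because the reader may have a narrower size_t than the writer.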

Jon Trauntvein
A: 

Just work with UTF-8 internally, and convert to UTF-16 just-in-time when passing arguments to Windows functions that require it. UTF-32 is probably never needed. Since it's usually wrong (in a Unicode sense) to process individual characters instead of strings, capitalizing or normalizing a UTF-8 string is no more difficult than doing the same to a UTF-32 string.
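For illustration, here is a minimal portable sketch of the UTF-8 to UTF-16 step. The function name utf8_to_utf16 and its error convention are assumptions made for this example; on Windows itself you would more likely call MultiByteToWideChar with CP_UTF8 rather than hand-roll the conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16 code units.
   cap is the capacity of dst in code units, including the terminator.
   Returns the number of units written (excluding the terminator), or
   (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    /* Smallest code point allowed for each sequence length,
       used to reject overlong encodings. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*s != 0) {
        uint32_t cp;
        int extra;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        s++;

        for (int i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80) {
                return (size_t)-1;    /* also catches a premature NUL */
            }
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            return (size_t)-1;        /* overlong, out of range, or surrogate */
        }

        if (cp < 0x10000) {
            if (n + 1 >= cap) return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP become a surrogate pair. */
            if (n + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return (size_t)-1;
    dst[n] = 0;
    return n;
}
```

The output uses uint16_t rather than wchar_t so the result is the same on every platform; on Windows, where wchar_t is 16 bits, the buffer can then be passed to wide-character APIs after a cast.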

R..
+1  A: 

No. The fundamental problem with trying to use a typedef to "fix" a character type is that you end up with something that is consistent with the built-in functions and with wide character literals on some platforms, and not on others.

If you want a string format which is the same on all platforms, you could just pick a size and signedness. You want unsigned 8-bit "characters", or signed 64-bit "characters"? You can have them on any platform which has an integer type of the appropriate size (not all do). But they're not really characters as far as the language is concerned, so don't expect to be able to call strlen or wcslen on them, or to have a nice syntax for literals. A string literal is (well, converts to) a char*, not a signed char* or an unsigned char*. A wide string literal is (converts to) a wchar_t*, and wchar_t is equivalent to some other integer type, but not necessarily the one you want it to be.

So, you have to pick an encoding, use that internally, define and implement your own versions of the string functions you need, then convert to/from the platform's encoding as necessary for non-string functions that take strings. UTF-8 is a decent option because most of the C string functions still "work", in the sense that they do something fairly useful even if it isn't entirely correct.
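To make the "still work, but not entirely correct" point concrete, here is a small sketch of one such replacement function. The name utf8_count is hypothetical. On a UTF-8 string, strlen returns the number of bytes, while this counts code points by skipping continuation bytes (those of the form 10xxxxxx); it assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated, valid UTF-8 string.
   Every byte that is not a continuation byte starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" ("héllo" in UTF-8), strlen reports 6 bytes while utf8_count reports 5 code points. Byte-oriented functions such as strcpy and strstr keep working unchanged, because no valid UTF-8 sequence contains a zero byte or a partial match that starts in the middle of another character.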

Steve Jessop