ansaurus

Question

Unicode vs Multi-byte

Answer 1

A:

Say I'm compiling my program in Unicode (but ultimately, I want a solution that is independent of the character set used).

This is going to depend on your language - as in programming language rather than human-spoken language. What do you mean by 'compiling my program in Unicode'?

Will all 'char' be interpreted as wide characters?
- It depends on the language and the options chosen. For example, Java uses 16-bit characters (originally holding UCS-2; since J2SE 5.0 they hold UTF-16 code units). In C, you will have to work rather hard to get the basic 'char' type interpreted as anything other than an 8-bit quantity, at least with the Unix-based compilers.
If I have a simple printf statement, i.e. printf("Hello World\n"); with no character strings, can I just leave it be without using _tprintf and _T("...")? If the printf statement includes a character string, then I should use _tprintf and _T("..."), i.e. _tprintf("Hello %s\n", name); ?
- This requires some understanding of the platform you are working on, since it is far from being standard. I suspect this is MSVC...which makes it more difficult for me to be authoritative since I don't use MSVC. However, the ISO C99 standard (which is signally not supported by MSVC) provides functions such as fwprintf() to print strings of wide characters. If you need information about your specific compiler, tag your question with the correct information.
If I have a text file (saved in the default format, i.e. without changing the default character set used) that I want to read into a buffer, can I still use char instead of TCHAR? Especially if I'm reading it character by character, i.e. by incrementing the character pointer?
- Again, TCHAR is not standard - it is highly specific to MSVC. In standard C, a file stream acquires an 'orientation' (wide-oriented or byte-oriented) when you apply appropriate functions to it. It stays in that orientation until it is closed (or reopened with freopen()).
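
The orientation rule can be observed with the standard fwide() function. This is only a sketch, assuming a hosted C99 library where tmpfile() works; the helper names orientation() and orientation_after() are made up for the illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Classify fwide()'s result: -1 byte-oriented, 0 not yet oriented, +1 wide-oriented. */
static int orientation(FILE *fp)
{
    int r = fwide(fp, 0);   /* mode 0 only queries; it does not set an orientation */
    return (r > 0) - (r < 0);
}

/* Open a scratch stream, optionally perform one write, and report the orientation.
   write_kind: 0 = no I/O, 1 = narrow write (fputs), 2 = wide write (fputws).
   Returns -2 if the scratch stream could not be created. */
int orientation_after(int write_kind)
{
    FILE *fp = tmpfile();
    int result;

    if (fp == NULL)
        return -2;
    if (write_kind == 1)
        fputs("narrow\n", fp);      /* first I/O is a byte function */
    else if (write_kind == 2)
        fputws(L"wide\n", fp);      /* first I/O is a wide function */
    result = orientation(fp);
    fclose(fp);
    return result;
}
```

A freshly opened stream has no orientation; the first narrow or wide I/O call fixes it, which is why mixing printf() and wprintf() on the same stream is not a good idea.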

Jonathan Leffler 2010-02-09 03:37:38

Answer 2

+1 A:

First, if you're compiling with UNICODE/_UNICODE and don't intend to target other platforms, you can avoid using the TCHAR business and use WCHAR (or wchar_t) and W functions everywhere.

1) Will all 'char' be interpreted as wide characters?

char in C is--by definition--1 byte. (This doesn't technically preclude it from being a "wide character" on platforms where wchar_t is also 1 byte, but given that you're using MSVC and are targeting Windows platforms, that's not going to be the case.)

So for practical purposes, the answer to this is: no.
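
A quick way to convince yourself (a sketch in standard C; the function names are invented for this example): a narrow literal always occupies one byte per char, while the size of a wide literal depends on the platform's wchar_t.

```c
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by the narrow literal "Hi", including the terminating '\0'. */
size_t narrow_literal_bytes(void)
{
    return sizeof("Hi");
}

/* Bytes occupied by the wide literal L"Hi", including the terminating L'\0'.
   The value depends on sizeof(wchar_t), which is implementation-defined. */
size_t wide_literal_bytes(void)
{
    return sizeof(L"Hi");
}
```

Defining UNICODE changes neither result; it only changes what TCHAR and _T("...") expand to.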

2) If I have a simple printf statement, i.e. printf("Hello World\n"); with no character strings, can I just leave it be without using _tprintf and _T("...")? If the printf statement includes a character string, then I should use _tprintf and _T("..."), i.e. _tprintf("Hello %s\n", name); ?

If you're printing ASCII string literals, you can continue using printf.

If you're printing arbitrary strings that could lie outside of the ASCII range, you should use _tprintf (or wprintf).

3) If I have a text file (saved in the default format, i.e. without changing the default character set used) that I want to read into a buffer, can I still use char instead of TCHAR? Especially if I'm reading it character by character, i.e. by incrementing the character pointer?

What is "the default format"?

When you're reading in an external file, you should read in the first few bytes first to check for a UTF-16 or UTF-8 BOM, and then base your decisions around that.
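
A minimal sketch of such a check, assuming the first bytes of the file have already been read into a buffer in binary mode. The names bom_kind and detect_bom are hypothetical, and UTF-32 BOMs are deliberately ignored (a UTF-32 LE file would be misreported as UTF-16 LE here).

```c
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Inspect the first bytes of a buffer and report which BOM, if any, it starts with. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;        /* EF BB BF */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;    /* FF FE */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;    /* FE FF */
    return BOM_NONE;            /* no BOM: the encoding is still unknown */
}
```

BOM_NONE only means that no marker was found; the file may still be UTF-8, a legacy code page, or anything else.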

jamesdlin 2010-02-09 04:16:04

On the printf question: You can use printf on wchar_t strings by applying the "%ls" format specifier. It's not what you print, but what type of output you want from the printf family that dictates which you use.

joveha 2010-02-09 12:29:53

@joveha: Using `printf` with `%ls` on a `wchar_t` string that is not representable in the current encoding wouldn't quite work. Your point is well-taken though, and IMO we're both right.

jamesdlin 2010-02-09 12:44:47
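
The `%ls` point from the two comments above can be sketched in standard C as follows. The helper name format_greeting is made up for the example, and snprintf is used instead of printf only so that the result can be inspected.

```c
#include <stdio.h>
#include <wchar.h>

/* Format a greeting for a wide-character name into a narrow (multibyte) buffer.
   Returns snprintf's result: the number of bytes the full output needs
   (excluding the '\0'), or a negative value if the wide string cannot be
   converted to the current locale's encoding. */
int format_greeting(char *out, size_t cap, const wchar_t *name)
{
    return snprintf(out, cap, "Hello %ls\n", name);
}
```

With an ASCII-only name this works in any locale. For a name containing characters that the current locale's encoding cannot represent, the conversion is expected to fail, which is the caveat raised in the reply.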

UTF-8 does not need a BOM; MS systems may put it there anyway.

Jonathan Leffler 2010-03-23 01:11:44

Answer 3

A:

1) Will all 'char' be interpreted as wide characters?

No. But when UNICODE is defined, all TCHARs will be interpreted as wchar_ts.

Consider how winnt.h would probably specify this:

#ifdef UNICODE
 typedef WCHAR TCHAR;
#else
 typedef CHAR TCHAR;
#endif

When you call SomeApi(), it resolves at compile time to either SomeApiA(char *arg) or SomeApiW(wchar_t *arg). (In reality the arguments are declared with TCHAR-based types, but you get the point.)

So your source code will be "independent" in the sense that it can be compiled into either an "ANSI" or Widechar version. For this to work you need to use TCHAR's instead of the primitive types.
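
The same switch can be imitated in portable C. This is only a sketch: DEMO_UNICODE, DEMO_TCHAR, DEMO_T and the demo_strlen functions are hypothetical stand-ins, not the real Windows definitions.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* "A" and "W" variants of the same hypothetical API. */
size_t demo_strlen_a(const char *s)    { return strlen(s); }
size_t demo_strlen_w(const wchar_t *s) { return wcslen(s); }

/* One compile-time switch selects the character type, the literal prefix
   and the function variant, in the spirit of UNICODE / TCHAR / _T(). */
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_T(s)   L##s
#define demo_strlen demo_strlen_w
#else
typedef char DEMO_TCHAR;
#define DEMO_T(s)   s
#define demo_strlen demo_strlen_a
#endif

/* Caller code is written once against the generic names. */
size_t greeting_length(void)
{
    const DEMO_TCHAR *greeting = DEMO_T("Hello");
    return demo_strlen(greeting);
}
```

Compiling with -DDEMO_UNICODE switches every generic name to its wide variant without touching greeting_length(); either way it returns 5.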

2) If I have a simple printf statement, i.e. printf("Hello World\n"); with no character strings, can I just leave it be without using _tprintf and _T("...")? If the printf statement includes a character string, then I should use _tprintf and _T("..."), i.e. _tprintf("Hello %s\n", name); ?

I don't know the _tprintf family well, but I'd speculate that it works the same way as the defines above. That is, _tprintf takes TCHARs as arguments and, depending on the UNICODE/_UNICODE setting, treats them as either chars or wchar_ts.

3) If I have a text file (saved in the default format, i.e. without changing the default character set used) that I want to read into a buffer, can I still use char instead of TCHAR? Especially if I'm reading it character by character, i.e. by incrementing the character pointer?

The character encoding of a file's contents is a property of the file itself and has nothing to do with TCHARs. TCHARs are for filenames and similar strings that you pass to Win32 API calls.

joveha 2010-02-09 13:01:50
