As we know, on a Windows system we can set the language for non-Unicode programs in "Control Panel\Clock, Language, and Region". But what does this locale language mean for an application? To my understanding, an application is a compiled binary executable file that contains only machine code instructions and no data, so how does the character encoding affect how it runs?

One guess is that if the executable file contains some literal strings in its code segment, it uses some internal charset to encode them, and if that charset is not Unicode, it will display garbage. But isn't the internal charset a fixed one? Just like in Java, where the spec defines the internal encoding as UTF-16.

Hope someone can answer my questions,

Thanks.

A: 

A non-Unicode application is one that primarily uses a multi-byte encoding, where strings are represented by char*, not wchar_t*:

char* myString;     /* multi-byte ("ANSI") string */
wchar_t* myWString; /* wide-character (Unicode) string, for contrast */

By changing the encoding used, you change the character set available to the application.
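
To make that concrete, here is a minimal sketch (assuming the Win32 headers; code pages 1252 and 1255 are just example choices) showing that the same byte maps to different characters depending on which code page interprets it:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *bytes = "\xE4";  /* a single non-ASCII byte */
    wchar_t out[2];

    /* Interpreted as Windows-1252 (Western European): U+00E4, 'ä' */
    MultiByteToWideChar(1252, 0, bytes, -1, out, 2);
    printf("as code page 1252: U+%04X\n", (unsigned)out[0]);

    /* Interpreted as Windows-1255 (Hebrew): U+05D4, 'ה' */
    MultiByteToWideChar(1255, 0, bytes, -1, out, 2);
    printf("as code page 1255: U+%04X\n", (unsigned)out[0]);
    return 0;
}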

And most applications contain both instructions and data.

Alexander Rafferty
@Amigable Clark Kant: No, "multi-byte" is correct for the ANSI API and for using `char`. For instance, see the `MultiByteToWideChar` API, where `MultiByte` means non-Unicode and `WideChar` means Unicode.
RichieHindle
Answers and comments should explain that this is incorrect terminology created by Microsoft. The primary encoding for Unicode is UTF-8, a multibyte encoding, and there exist systems where wide character encoding is not Unicode. In fact, one could argue that it's not Unicode on Windows since Windows' `wchar_t` is too small to store arbitrary Unicode codepoints...
R..
@Alexander Rafferty: So for the data segment, what is the internal encoding used in ANSI C? Is it not defined by C, or can we change it?
Guoqin
@RichieHindle: MultiByte means multibyte, and WideChar means wide char. There are lots of systems out there using utf-8 for multibyte characters, and there's nothing in the C standard specifying that wide chars should be Unicode or ISO/IEC 10646.
ninjalj
@Guoqin: I hope you're not confusing ANSI C (roughly equivalent to ISO 9899, ISO C) with the Windows ANSI API, so called because some of the codepages used by Windows were based on drafts of ANSI standards.
ninjalj
@ninjalj: One could argue that the C standard does imply `wchar_t` *should* be Unicode via specifying the `__STDC_ISO_10646__` macro which is predefined when `wchar_t` is Unicode.
R..
+3  A: 

Windows has two APIs through which programs can talk to it, known as the "ANSI API" and the "Unicode API", and a "non-Unicode application" is one that talks to Windows via the ANSI API rather than the Unicode API.

What that means is that any string that the application passes to Windows is just a sequence of bytes, not a sequence of Unicode characters. Windows has to decide which characters that sequence of bytes corresponds with, and the Control Panel setting you're talking about is how it does that.

So, for example, a non-Unicode program that outputs a byte with the value 0xE4 on a PC set to use Windows Western (code page 1252) will display the character ä, whereas one set up for Hebrew (code page 1255) will display the character ה.
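
A minimal sketch of that example (assuming the Win32 headers; GetACP() simply reports the ANSI code page selected by that Control Panel setting):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Reports the active ANSI code page, e.g. 1252 (Western) or 1255 (Hebrew). */
    printf("Active ANSI code page: %u\n", GetACP());

    /* The single byte 0xE4 is displayed as ä under code page 1252,
       but as ה under code page 1255. */
    MessageBoxA(NULL, "\xE4", "Non-Unicode demo", MB_OK);
    return 0;
}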

RichieHindle
And in the "ANSI API", *one byte* means *one character* on screen. In Unicode, a character on screen can be represented by more than one byte.
Amigable Clark Kant
@Amigable Clark Kant: Not always true - "double-byte character sets" (see http://msdn.microsoft.com/en-us/library/dd317794%28VS.85%29.aspx) still use the ANSI API. Otherwise there could have been no Chinese version of Windows before Unicode!
RichieHindle
It should also be noted that Microsoft could easily add UTF-8 as a supported multibyte character set and make the whole problem go away, but they *refuse to do so*.
R..
@RichieHindle: Nice explanation. As you said, when an application calls the Windows API it just passes in "a sequence of bytes". So is that sequence of bytes in the same encoding as the source code? I mean, if the source code is written in UTF-8, the bytes are UTF-8; if the source code is in GBK, the sequence of bytes is in GBK. Which would mean ANSI C does not have a fixed internal encoding the way Java does (UTF-16).
Guoqin
@Guoqin: No, C does not define a standard encoding for its source code, or for string literals. A string literal output by a non-Unicode program will consist of the same bytes that were present in the source code, whatever encoding it used.
RichieHindle
@RichieHindle: Actually, the compiler has to translate from the _source character set_ to the _execution character set_, so technically, a string literal output by a non-Unicode program _doesn't need to_ consist of the same bytes present in the source code.
ninjalj
@Guoqin: the character set (and the encoding!) of C source doesn't need to be the same as the character set used in object files. In fact, properly internationalizable C source for Win32 ANSI will typically be pure ASCII (i.e. 0-127), and characters outside ASCII will appear only in resource files.
ninjalj
@ninjalj: Then how is a string literal in the source converted when it is compiled into the object file? How is the execution character set decided?
Guoqin
@Guoqin: the "source character set" and "execution character set", as far as the compiler is concerned, usually only include the subset of ASCII which is mandated to exist (but not necessarily with ASCII encoding) by the standard. Since this will have the same (ASCII) encoding regardless of what locale/codepage junk is selected, it's largely irrelevant. Source/execution character set differences would only come into play if you had a cross-compiler running on an ASCII-based machine compiling binaries for an EBCDIC-based machine, or vice-versa.
R..
A: 

RichieHindle correctly explains that there are two variants of most APIs: a *W (Unicode) variant and an *A (ANSI) variant. But after that he's slightly wrong.

It's important to know that the *A variants (such as MessageBoxA) are just wrappers for the *W versions (such as MessageBoxW). They take the input strings and convert them to Unicode; they take the output strings and convert them back.

In the Windows SDK, for all such A/W pairs there is an #ifdef UNICODE block, so that MessageBox() is a macro that expands to either MessageBoxA() or MessageBoxW(). Because all of these macros use the same condition, many programs use either 100% *A functions or 100% *W functions. "Non-Unicode" applications are then those that have not defined UNICODE and therefore use the *A variants exclusively.
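
The SDK headers follow this pattern (simplified sketch):

/* Simplified from the Windows SDK headers (winuser.h): */
#ifdef UNICODE
#define MessageBox  MessageBoxW
#else
#define MessageBox  MessageBoxA
#endif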

However, there is no reason why you can't mix and match *A and *W functions. Would programs that mix *A and *W functions be considered "Unicode", "non-Unicode", or something else? Actually, the answer is also mixed. As far as that Clock, Language, and Region setting is concerned, an application is considered a Unicode application while it's making a *W call and a non-Unicode application while it's making an *A call - the setting controls how the *A wrappers translate to *W calls. And in multi-threaded programs, you can therefore be both at the same time (!)

So, to come back to RichieHindle's example, if you call an *A function with the value (char)0xE4, the wrapper will forward it to the *W function as either L'ä' or L'ה', depending on this setting. If you then call the *W function directly with the value (WCHAR)0x00E4, no translation happens.
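
A hypothetical sketch of that behaviour (MyMessageBoxA is an invented stand-in for roughly what the real *A wrapper does inside Windows):

#include <windows.h>

/* Invented illustration: roughly what a *A wrapper does before
   handing off to the *W function. The real wrappers are more careful. */
static int MyMessageBoxA(HWND hwnd, const char *text, const char *caption, UINT type)
{
    wchar_t wtext[256], wcaption[256];

    /* CP_ACP is the ANSI code page chosen by the "language for
       non-Unicode programs" setting; this is where (char)0xE4
       becomes either L'\x00E4' (ä) or L'\x05D4' (ה). */
    MultiByteToWideChar(CP_ACP, 0, text, -1, wtext, 256);
    MultiByteToWideChar(CP_ACP, 0, caption, -1, wcaption, 256);

    return MessageBoxW(hwnd, wtext, wcaption, type);
}

int main(void)
{
    MyMessageBoxA(NULL, "\xE4", "ANSI path", MB_OK);       /* translated via CP_ACP */
    MessageBoxW(NULL, L"\x00E4", L"Unicode path", MB_OK);  /* no translation at all */
    return 0;
}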

MSalters