This is quite a low-level (low in the sense of "closer to the metal") question.

I was wondering if any of you could point me to documentation or explanations of how a console, upon receiving a Unicode character (or any character code, though I'm particularly interested in the Unicode Standard), looks up the corresponding glyph, and where it looks. Specifically, I mean the console in Windows, good ol' cmd.exe (using, say, code page 65001), and xterm on Linux started with, say, LC_CTYPE=en_US.UTF-8.

I know this may be harder to find out for Windows, but I can't really find much information either way.

Thank you.

+2  A: 

As far as I can tell, cmd.exe is bound to whatever 256-character code page you defined as the "codepage for non-Unicode programs" or whatever it was called.

To elaborate, if I set the above setting to Japanese, cmd.exe suddenly replaces backslashes with yen signs (as does every other non-Unicode app on the system) and correctly interprets Shift JIS codes, for example. Setting it to Dutch gives me an accented I (I forget which), while another code page would give a half-filled vertical block for the same character code.

Not Unicode. Unicode would let me do all three at the same time.
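To make the "same code, different glyph" point concrete, here is a minimal sketch. It assumes byte 0xDE and the code pages cp850 and cp437 purely for illustration; the answer above does not say which byte or which code pages were involved.

```python
import unicodedata

# Assumption: 0xDE is an illustrative byte, not necessarily the one
# the answer had in mind.
RAW = bytes([0xDE])


def glyph_for(raw: bytes, codepage: str) -> str:
    """Decode a single byte under the given legacy 8-bit code page."""
    return raw.decode(codepage)


western_oem = glyph_for(RAW, "cp850")  # Western European OEM code page
us_oem = glyph_for(RAW, "cp437")       # original IBM PC code page

# Print code points and names only, so the output stays ASCII
# regardless of what the current console can display.
for name, ch in (("cp850", western_oem), ("cp437", us_oem)):
    print(f"{name}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte is identical in both cases; only the table used to interpret it changes.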

Kawa
+1  A: 

The console uses a TextWriter with an encoding created from the codepage. That means that the characters written are encoded into bytes using the specific Encoding object for the codepage.
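The encoding step described here can be sketched in a language-neutral way. This is not the .NET API; it is a small Python illustration, and the helper name `encode_for_codepage` is hypothetical.

```python
def encode_for_codepage(text: str, codepage: str) -> bytes:
    """Turn characters into the bytes a writer bound to `codepage` would emit.

    Characters the code page cannot represent become '?' here
    (an assumption about the fallback; real writers may differ).
    """
    return text.encode(codepage, errors="replace")


# The same character becomes different bytes under different encodings.
as_cp437 = encode_for_codepage("é", "cp437")
as_cp1252 = encode_for_codepage("é", "cp1252")
as_utf8 = encode_for_codepage("é", "utf-8")

# A character outside the code page's repertoire is lost.
lost = encode_for_codepage("日", "cp437")

for label, data in (("cp437", as_cp437), ("cp1252", as_cp1252),
                    ("utf-8", as_utf8), ("cp437 (unencodable)", lost)):
    print(f"{label}: {data.hex()}")
```

Whatever reads those bytes has to assume the same encoding to get the original characters back.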

Guffa
He's quite specifically talking about cmd.exe, which is not, last I checked, a .NET application, so it logically does not use TextWriter. Unless there is another TextWriter I don't know about.
Kawa
Yeah, I just checked. It doesn't exactly show up in yellow in Process Explorer.
Kawa
Well, if we narrow it down to the console itself, it doesn't support Unicode characters at all. If the current encoding isn't UTF-8 and you try to display a UTF-8 file, it will decode the file using the current encoding instead, which of course makes a mess of anything outside the ASCII character range. If the current encoding is UTF-8, it still doesn't support Unicode characters as such, only characters encoded as UTF-8.
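The "makes a mess" case is easy to reproduce. A minimal sketch, assuming cp437 as the console's current encoding (the comment does not name one) and a hypothetical helper `show_as`:

```python
def show_as(raw: bytes, assumed_encoding: str) -> str:
    """Decode bytes the way a display would if it assumed `assumed_encoding`."""
    return raw.decode(assumed_encoding, errors="replace")


text = "café"
utf8_bytes = text.encode("utf-8")      # 5 bytes: 'é' takes two

garbled = show_as(utf8_bytes, "cp437")  # wrong assumption
correct = show_as(utf8_bytes, "utf-8")  # right assumption

# ascii() keeps the output printable on any console.
print("bytes:   ", utf8_bytes.hex())
print("as cp437:", ascii(garbled))
print("as utf-8:", ascii(correct))
```

The ASCII part survives because both encodings agree on bytes below 0x80; the two-byte sequence for the accented letter turns into two unrelated characters.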
Guffa
A: 

the console doesn't support Unicode. :)

CoDeR
Yeah, that's what I implied earlier.
Kawa
That's not true on recent Linux systems.
Joachim Sauer
For a definition of recent which goes back at least to 1996 for the console. If you meant terminal emulators, support for UTF-8 was added to XTerm in 1999 and was already present in some other terminal emulators before.
AProgrammer
Terminal emulators were faster, but the console (as in: the environment you usually see if X is not running) learned it only in the last few years.
Joachim Sauer
I'm pretty sure I played with that on the console before I moved in January 1998, and the related documentation is present in kernel 2.0.1. If my memory is correct, I didn't have to compile the kernel with a special option, just send the correct escape sequences. What could have happened more recently is the switch to activating it by default in common distributions (ISTR that they switched globally to UTF-8 around 2005).
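For reference, a small sketch of what "send the correct escape sequences" can look like. It assumes the sequences meant are the ones listed in the Linux console_codes(4) man page (ESC % G to select UTF-8, ESC % @ to return to the default); the comment itself does not name them, and the helper `utf8_mode_sequence` is hypothetical.

```python
ESC = "\x1b"

# Per console_codes(4): ESC % G selects UTF-8,
# ESC % @ selects the default character set (ISO 646 / ISO 8859-1).
SELECT_UTF8 = ESC + "%G"
SELECT_DEFAULT = ESC + "%@"


def utf8_mode_sequence(enabled: bool) -> str:
    """Return the escape sequence that switches UTF-8 mode on or off."""
    return SELECT_UTF8 if enabled else SELECT_DEFAULT


# Usage on a Linux virtual console (not inside a terminal emulator,
# which may ignore or interpret these differently):
#     import sys
#     sys.stdout.write(utf8_mode_sequence(True))
#     sys.stdout.flush()
print(repr(utf8_mode_sequence(True)))
print(repr(utf8_mode_sequence(False)))
```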
AProgrammer