Hello,
there are already a few questions relating to this problem. I think my question is a bit different because I don't have an actual problem; I'm only asking out of academic interest. I know that Windows's implementation of UTF-16 sometimes contradicts the Unicode standard (e.g. collation) or is closer to the old UCS-2 than to UTF-16, but I'll keep the “UTF-16” terminology here for simplicity.
Background: In Windows, everything is UTF-16. Regardless of whether you're dealing with the kernel, the graphics subsystem, the filesystem or whatever, you're passing UTF-16 strings. There are no locales or charsets in the Unix sense. For compatibility with medieval versions of Windows, there is a thing called “codepages” that is obsolete but nonetheless supported. AFAIK, there is only one correct and non-obsolete function to write strings to the console, namely `WriteConsoleW`, which takes a UTF-16 string. A similar discussion applies to input streams, which I'll ignore here.
However, I think this represents a design flaw in the Windows API: there is a generic function called `WriteFile` that can be used to write to all stream objects (files, pipes, consoles…), but this function is byte-oriented and doesn't accept UTF-16 strings. The documentation suggests using `WriteConsoleW` for console output, which is text-oriented, and `WriteFile` for everything else, which is byte-oriented. Since both console streams and file objects are represented by kernel object handles and console streams can be redirected, every write to a standard output stream has to go through a function that checks whether the handle represents a console stream or a file, which breaks polymorphism. OTOH, I do think that Windows's separation between text strings and raw bytes (which is mirrored in many other systems like Java or Python) is conceptually superior to Unix's `char*` approach that ignores encodings and doesn't distinguish between strings and byte arrays.
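To make the per-write check concrete, here is a minimal sketch in C of what I mean. The names `write_text` and `utf16_to_utf8` are made up for illustration, and choosing UTF-8 as the byte encoding for files and pipes is my own assumption, not something the API prescribes. `GetConsoleMode` is used as the console test because, as far as I know, it fails for handles that are not console handles. The Windows-specific part is guarded by `_WIN32`, so the conversion helper can be compiled anywhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: encode n UTF-16 code units as UTF-8.
   out must have room for 3 * n bytes. Returns the number of bytes written.
   Unpaired surrogates are replaced by U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* Valid surrogate pair: combine into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* lone surrogate */
        }
        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}

#ifdef _WIN32
/* Hypothetical wrapper: the check that every write has to go through.
   Partial writes are not retried in this sketch. */
BOOL write_text(HANDLE h, const uint16_t *s, DWORD n)
{
    DWORD mode, done;
    if (GetConsoleMode(h, &mode))                 /* console: text-oriented path */
        return WriteConsoleW(h, s, n, &done, NULL);

    /* File or pipe: byte-oriented path, encoded as UTF-8 (assumption). */
    unsigned char *buf = malloc(3 * (size_t)n + 1);
    if (!buf)
        return FALSE;
    DWORD len = (DWORD)utf16_to_utf8(s, n, buf);
    BOOL ok = WriteFile(h, buf, len, &done, NULL);
    free(buf);
    return ok;
}
#endif
```

A caller would pass `GetStdHandle(STD_OUTPUT_HANDLE)` as `h`; the point is only that the caller, not the OS, has to make this distinction on every write.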
So my questions are: What to do in this situation? And why isn't this problem solved even in Microsoft's own libraries? Both the .NET Framework and the C and C++ libraries seem to adhere to the obsolete codepage model. How would you design the Windows API or an application framework to circumvent this issue?
I think that the general problem (which is not easy to solve) is that all libraries assume that all streams are byte-oriented, and implement text-oriented streams on top of that. However, we see that Windows does have special text-oriented streams on the OS level, and the libraries are unable to deal with this. So in any case we must introduce significant changes to all standard libraries. A quick and dirty way would be to treat the console as a special byte-oriented stream that accepts only one encoding. This would still require circumventing the C and C++ standard libraries because they don't implement the `WriteFile`/`WriteConsoleW` switch. Is that correct?
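The quick and dirty variant could look roughly like the following C sketch, assuming the single accepted encoding is UTF-8 (my choice for illustration). `utf8_to_utf16` and `console_write_utf8` are hypothetical names. The decoder is deliberately simplified: it replaces malformed or truncated sequences with U+FFFD but does not reject overlong forms, and it does not carry an incomplete sequence over to the next call, which a real byte-stream shim would have to do.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: decode n bytes of UTF-8 into UTF-16 code units.
   out must have room for n code units. Returns the number of units written.
   Malformed or truncated sequences become U+FFFD. */
size_t utf8_to_utf16(const unsigned char *in, size_t n, uint16_t *out)
{
    size_t o = 0, i = 0;
    while (i < n) {
        unsigned char b = in[i++];
        uint32_t cp;
        int extra; /* continuation bytes still expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out[o++] = 0xFFFD; continue; } /* stray continuation or invalid lead */

        while (extra > 0 && i < n && (in[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | (in[i++] & 0x3Fu);
            extra--;
        }
        if (extra > 0 || cp > 0x10FFFF) { out[o++] = 0xFFFD; continue; }

        if (cp < 0x10000) {
            out[o++] = (uint16_t)cp;
        } else {
            /* Outside the BMP: emit a surrogate pair. */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return o;
}

#ifdef _WIN32
/* Hypothetical shim: a byte-oriented front end for the console that accepts
   only UTF-8 and forwards the decoded text to WriteConsoleW. */
BOOL console_write_utf8(HANDLE console, const unsigned char *bytes, DWORD n)
{
    uint16_t *buf = malloc(sizeof *buf * ((size_t)n + 1));
    if (!buf)
        return FALSE;
    DWORD done;
    DWORD units = (DWORD)utf8_to_utf16(bytes, n, buf);
    BOOL ok = WriteConsoleW(console, buf, units, &done, NULL);
    free(buf);
    return ok;
}
#endif
```

With something like this, a library could keep its byte-oriented stream model and only swap the lowest-level write call when the handle is a console, but it would still sit outside what the C and C++ runtimes do by default.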