The Windows _setmbcp function allows any valid code page...

(except UTF-7 and UTF-8, which are not supported)

OK, not supporting UTF-7 makes sense: characters can have non-unique representations, which introduces complexity and security risks.

But why not UTF-8?

As I understand it, the "ANSI" versions of the Windows API functions convert their string arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output back to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
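Roughly, that manual conversion looks like the sketch below; the Utf8ToUtf16/Utf16ToUtf8 helpers and the SetWindowTextUtf8/GetWindowTextUtf8 wrappers are just illustrative names, not existing API functions:

    #include <windows.h>
    #include <string>

    // UTF-8 <-> UTF-16 conversion via the Win32 API (illustrative helpers).
    std::wstring Utf8ToUtf16(const std::string& s)
    {
        if (s.empty()) return std::wstring();
        int len = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), nullptr, 0);
        std::wstring out(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], len);
        return out;
    }

    std::string Utf16ToUtf8(const std::wstring& s)
    {
        if (s.empty()) return std::string();
        int len = WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), nullptr, 0, nullptr, nullptr);
        std::string out(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], len, nullptr, nullptr);
        return out;
    }

    // Example wrappers: convert, call the "W" function, convert the result back.
    void SetWindowTextUtf8(HWND hwnd, const std::string& text)
    {
        SetWindowTextW(hwnd, Utf8ToUtf16(text).c_str());
    }

    std::string GetWindowTextUtf8(HWND hwnd)
    {
        int len = GetWindowTextLengthW(hwnd);
        std::wstring buf(len + 1, L'\0');            // room for the terminator
        int copied = GetWindowTextW(hwnd, &buf[0], len + 1);
        buf.resize(copied);
        return Utf16ToUtf8(buf);
    }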

+3  A: 

The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.

Basically, when the ANSI code page machinery was originally designed, UTF-8 hadn't even been invented, so support for multi-byte encodings was rather haphazard (most ANSI code pages are single-byte, with the exception of some East Asian code pages, which use one or two bytes per character). Adding support for "proper" multi-byte encodings was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.

Dean Harding
I agree that all new development should be in *Unicode*. But I had reasons to propose using UTF-8 instead of UTF-16. (1) My team wrote a million lines of non-Unicode-aware code before anyone gave a damn about it, and it would now be a massive effort to change all those char-based strings to wchar_t-based ones. (2) We have plans to port our product to Linux, where UTF-8 tends to be preferred.
dan04
A: 

_setmbcp() is a VC++ runtime library function, not a Win32 API function. It only affects how the runtime library interprets strings; it has no effect whatsoever on the Win32 API "A" functions. When the "A" functions call their "W" counterparts internally, they always use MultiByteToWideChar() and WideCharToMultiByte() with code page 0 (CP_ACP), i.e. the system default ANSI code page, for the conversions.
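To make that concrete, an "A" function behaves roughly like the following sketch. This is illustrative only: MySetWindowTextA is a made-up name, and the real wrapper lives inside Windows.

    #include <windows.h>
    #include <string>

    // Rough sketch of what an "A" wrapper does internally: the narrow-to-wide
    // conversion is hard-wired to CP_ACP, so _setmbcp() never enters the picture.
    // (Illustrative only; not the actual Windows implementation.)
    BOOL MySetWindowTextA(HWND hwnd, const char* text)
    {
        // Always CP_ACP, never CP_UTF8, no matter what _setmbcp() was set to.
        int len = MultiByteToWideChar(CP_ACP, 0, text, -1, nullptr, 0);
        if (len == 0) return FALSE;

        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_ACP, 0, text, -1, &wide[0], len);

        return SetWindowTextW(hwnd, wide.c_str());
    }

So if you want UTF-8 in, you have to convert with CP_UTF8 yourself and call the "W" function directly, which is exactly the manual approach described in the question.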

Remy Lebeau - TeamB