I'm trying to gain a basic understanding of what is meant by a Windows code page. I kind of get the feeling it's a translation between a given 8 bit value and some 'abstraction' for a given character graphic.

I ran the following experiment. I created a "üü" string literal containing two versions of the letter u with an umlaut: one entered with ALT 129 (which uses code page 437) and one entered with ALT 0252 (which uses code page 1252). When I examined the literal, both characters had the value 252.

Is 252 the universal 8 bit abstraction for u with an umlaut? Is it the Unicode value?

Aside from keyboard input are there any library routines or system calls that use code pages? For example is there a function to translate a string using a given code table (as above for the ALT 129 value)?

A: 

A Windows code page is similar to a code set such as ISO 8859-1. It maps certain numbers (how characters are stored on disk) to certain characters (abstract identities such as 'ü', independent of how they are drawn on the screen). It does not correspond to a font directly, though a font may support a given code set or code page. For example, both Courier New and Times New Roman can be used to display CP1252 text, and the text looks different on the screen even though the data on disk is the same.

The first 256 code points of Unicode are the same as the code points of ISO 8859-1. In ISO 8859-1, code point 252 (0xFC) is LATIN SMALL LETTER U WITH DIAERESIS (colloquially, u-with-umlaut, or 'ü').
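As a quick check of the numbers above, here is a minimal Python sketch. It assumes Python's built-in `cp437` and `cp1252` codecs as stand-ins for the Windows code pages, and the variable names are purely illustrative. It decodes the byte produced by ALT 129 under code page 437 and the byte produced by ALT 0252 under code page 1252, then looks at the resulting Unicode code point.

```python
import unicodedata

# Byte 129 (0x81), as typed with ALT 129, interpreted under OEM code page 437.
from_alt_129 = bytes([129]).decode("cp437")
# Byte 252 (0xFC), as typed with ALT 0252, interpreted under Windows-1252.
from_alt_0252 = bytes([252]).decode("cp1252")

# Two different byte values, one abstract character.
same_character = from_alt_129 == from_alt_0252
code_point = ord(from_alt_129)
character_name = unicodedata.name(from_alt_129)

print(same_character, code_point, hex(code_point), character_name)
# prints: True 252 0xfc LATIN SMALL LETTER U WITH DIAERESIS
```

So 252 is the Unicode (and ISO 8859-1, and Windows-1252) value, while 129 is the value of the same character in code page 437.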

There are code set conversion functions; ICU supports some. There are Windows-specific code set converters too, I have no doubt; I just don't know what their names are. It will depend, in part, on which language(s) you are using.
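To illustrate what such a conversion function does, here is a rough Python sketch. The helper name `transcode` is hypothetical, and Python's codecs are used in place of ICU or a Windows API. The idea is the usual one: decode the bytes to Unicode using the source code page, then encode the result using the target code page.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one code page to another, going through Unicode."""
    text = data.decode(source)   # bytes -> Unicode characters
    return text.encode(target)   # Unicode characters -> bytes


# 0x81 is 'ü' in code page 437; the same character is 0xFC in Windows-1252.
converted = transcode(b"\x81", "cp437", "cp1252")
print(converted)  # prints: b'\xfc'
```

A character that has no equivalent in the target code page makes `encode` raise `UnicodeEncodeError`, so a real converter has to decide whether to fail, substitute, or drop such characters.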

Jonathan Leffler
A: 

A Windows code page is a means of translating an 8-bit value to a character. Most Windows computers in the US use Windows-1252.

Newer Windows programs typically use UTF-8 to store text files and internally use wide strings which are UTF-16. This eliminates code page issues, so a text file written in Hungary will look the same when opened in the US.
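A small Python sketch of the problem this avoids, assuming a Hungarian machine that writes with code page 1250 and a US machine that reads with code page 1252 (the variable names are illustrative only):

```python
text = "ő"  # Hungarian o with double acute, U+0151

# Written with the Central European code page, read with the Western one.
legacy_bytes = text.encode("cp1250")
misread = legacy_bytes.decode("cp1252")

# Written and read as UTF-8: the result does not depend on the reader's code page.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(legacy_bytes, misread, utf8_bytes, round_trip)
# prints: b'\xf5' õ b'\xc5\x91' ő
```

The single byte 0xF5 means 'ő' in one code page and 'õ' in the other, whereas the UTF-8 bytes carry the same meaning everywhere.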

Stephen Nutt
+1  A: 

Windows code pages are a relic of pre-Unicode days, when languages with different characters would still attempt to represent them using one byte (or two, in the case of East Asian languages). This is where the concept of a character set comes into play. English-language Windows installations, for instance, default to "windows-1252". The various code pages can be installed through the Regional & Language Options control panel. A list of code pages can be found here - http://msdn.microsoft.com/en-us/goglobal/bb964654.aspx

Within .NET, code pages are accessed through the System.Text.Encoding class, which provides methods for converting from one encoding to another. For instance, to convert text from windows-1252 to UTF-8 (admittedly usually a fairly pointless exercise), you could use this code:

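using System;
using System.Text;

public static class CodePageExample {
     public static string GetUtf8StringFromDefaultEncoding(string target, string codePage) {
          // On .NET Core and later, first call
          // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
          Encoding windows = Encoding.GetEncoding(codePage);
          // Encode the text as bytes in the source code page.
          byte[] windowsBytes = windows.GetBytes(target);
          // Re-encode as UTF-8; decoding the code page bytes directly as UTF-8
          // would corrupt any non-ASCII character.
          byte[] utf8Bytes = Encoding.Convert(windows, Encoding.UTF8, windowsBytes);
          return Encoding.UTF8.GetString(utf8Bytes);
     }

     public static void Main() {
          Console.Out.WriteLine(GetUtf8StringFromDefaultEncoding("Hello World",
                                "windows-1252"));
     }
}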
John Christensen
Are there any Windows system routines or library functions callable from C++ to work with code pages?
Mike D
I'm not entirely sure, but a quick look at the MSDN site suggests this link - http://msdn.microsoft.com/en-us/library/dd374085%28VS.85%29.aspx
John Christensen
+1  A: 

Here is a must-read explanation of Unicode and Character Sets (including code pages) from Joel Spolsky.

PabloG
+1 for the excellent Spolsky link. That really is the minimal information every programmer should know, presented in an amusing manner. And the simplifications don't really amount to lies, as often happens with simplifications.
Adrian McCarthy