views: 73

answers: 4

My ASP.NET application imports CSV files. They are mostly saved from a spreadsheet application or Notepad, which asks for a 'character set', for example ISO-8859-2, Windows-1250, DOS-852 or Unicode (UTF-8).

Wikipedia says UTF-8 is a character encoding, but Windows-1250 and ISO-8859-2 are code pages. Are these terms interchangeable?

.NET reads files saved as UTF-8 fine. Does it detect the encoding by itself?

+1  A: 

Quotes from wiki:

"Code page is another name for character encoding. It consists of a table of values that describes the character set for a particular language."

http://en.wikipedia.org/wiki/Code_page

and:

"Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows systems from the 1980s and 1990s."

lasseespeholt
+2  A: 

You might want to check out Joel Spolsky's article and this post here.

nonnb
+1 Thanks for the article link!
jdv
+1  A: 

I think it is largely a historical distinction, but there is one clear difference. A code page is a look-up table: one particular byte maps to a specific character, and different code pages use different mappings. In the olden days those mappings weren't actually performed; the raw byte values went straight to the display, which required you to also have fonts whose glyphs matched the code page. That's still a problem today, by the way: console windows have a code page.
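
For example (the byte value here is just one I picked, not something from your files), the same byte comes out as a different character under each code page:

    using System;
    using System.Text;

    class CodePageDemo
    {
        static void Main()
        {
            // .NET Framework ships these code pages; on .NET Core/5+ you must first
            // register them: Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            byte[] raw = { 0xA5 };  // one byte taken from a code-page encoded file

            // The same byte indexes into a different look-up table each time.
            Console.WriteLine(Encoding.GetEncoding("iso-8859-2").GetString(raw)); // Ľ
            Console.WriteLine(Encoding.GetEncoding(1250).GetString(raw));         // Ą (Windows-1250)
            Console.WriteLine(Encoding.GetEncoding(852).GetString(raw));          // ą (DOS-852)
        }
    }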

There is no such mapping in a Unicode encoding; it merely needs to squeeze a 32-bit value into an efficient byte format. Different Unicode encodings use different ways to squeeze the bits, but the character always has the same fixed value (a codepoint, in Unicode speak).
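
Quick sketch with an arbitrary character: the codepoint never changes, only the bytes produced by each encoding do:

    using System;
    using System.Text;

    class UnicodeDemo
    {
        static void Main()
        {
            string s = "Ł";  // always codepoint U+0141, no matter how it is encoded

            // Each Unicode encoding just packs that fixed value into bytes differently.
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));    // C5-81
            Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s))); // 41-01 (UTF-16 LE)
            Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(s)));   // 41-01-00-00
        }
    }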

UTF-encoded text files should have a BOM, allowing the reader to autodetect the encoding. No such convention exists for text files that were encoded with a code page; getting good text out of them is a bit of a crap shoot. It's an evil that should die already :)
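
In .NET terms, that autodetection is what StreamReader's detectEncodingFromByteOrderMarks flag does; rough sketch (the file name is made up):

    using System;
    using System.IO;
    using System.Text;

    class BomSniff
    {
        static void Main()
        {
            // The fallback encoding is only used when no BOM is found; if the file
            // starts with a UTF-8/16/32 BOM, StreamReader switches to that encoding.
            using (var reader = new StreamReader("import.csv", Encoding.Default,
                                                 detectEncodingFromByteOrderMarks: true))
            {
                reader.Peek();  // force the first read so the BOM is examined
                Console.WriteLine(reader.CurrentEncoding.WebName);  // e.g. "utf-8"
            }
        }
    }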

Hans Passant
Although UTF-8, UTF-16, and UTF-32 are purely algorithmic, there exist Unicode encodings like GB18030 and UTF-EBCDIC that do include mapping tables. Also, a BOM is NOT required or recommended for UTF-8.
dan04
Brrr, those still look like bit encodings to me, similar to how UTF-8 favors ASCII. Yes, a BOM is not required; it is merely incredibly stupid not to include it. The point is that there's a well-defined standard if you *do* include it, as opposed to having *no* standard for code-page encoded text.
Hans Passant
Come to think of it, that also highlights the natural state of things: using a weirdo encoding or intentionally omitting a BOM is a 'competitive advantage'.
Hans Passant
A BOM is very useful for UTF-16. It's unneeded for UTF-8 and UTF-32, which can be detected by validation.
dan04
There are about a billion Chinese users who don't think much of that idea. "Bush hid the facts" is legendary.
Hans Passant
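
As dan04 notes, well-formed UTF-8 can usually be recognized just by attempting a strict decode; a minimal sketch of that idea (the helper name is made up):

    using System.Text;

    static class Utf8Check
    {
        // Returns true if the bytes decode as valid UTF-8; code-page text
        // containing accented characters will normally fail this test.
        public static bool LooksLikeUtf8(byte[] data)
        {
            var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
            try
            {
                strict.GetString(data);
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }
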
A: 

.NET classes such as StreamReader default to UTF-8 encoding; no, it's not magically detected.
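
So for a CSV saved with one of those code pages you have to pass the encoding in yourself; minimal sketch, where the path and code page are assumptions rather than anything from the question:

    using System.IO;
    using System.Text;

    class CsvImport
    {
        static string ReadCsv(string path)
        {
            // With no encoding argument, StreamReader assumes UTF-8
            // (apart from recognizing a BOM if one is present).
            // For a file saved as, say, Windows-1250, pass the encoding explicitly:
            using (var reader = new StreamReader(path, Encoding.GetEncoding(1250)))
            {
                return reader.ReadToEnd();
            }
        }
    }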

Jerome