views: 73

answers: 4

My ASP.NET application imports CSV files. They are mostly saved from a spreadsheet application or Notepad, which asks for a 'character set', for example ISO-8859-2, Windows-1250, DOS-852 or Unicode (UTF-8).

Wikipedia says UTF-8 is a character encoding, but Windows-1250 and ISO-8859-2 are code pages. Are these terms interchangeable?

.NET reads files saved as UTF-8 fine. Does it detect the encoding by itself?

+1  A: 

Quotes from wiki:

"Code page is another name for character encoding. It consists of a table of values that describes the character set for a particular language."

http://en.wikipedia.org/wiki/Code_page

and:

"Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows systems from the 1980s and 1990s."

lasseespeholt
+2  A: 

You might want to check out Joel Spolsky's article and this post here.

nonnb
+1 Thanks for the article link!
jdv
+1  A: 

I think it is largely a historical distinction, but there is one clear difference. A code page is a look-up table: one particular byte maps to a specific character, and different code pages use different mappings. In the olden days those mappings weren't actually performed; the raw byte values went straight to the display, which required you to also have fonts whose glyphs matched the code page. That's still a problem today, by the way: console windows have a code page.
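
For example (the byte value here is just one I picked, not something from your files), the same byte comes out as a different character under each code page:

    using System;
    using System.Text;

    class CodePageDemo
    {
        static void Main()
        {
            // .NET Framework ships these code pages; on .NET Core/5+ you must first
            // register them: Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            byte[] raw = { 0xA5 };  // one byte taken from a code-page encoded file

            // The same byte indexes into a different look-up table each time.
            Console.WriteLine(Encoding.GetEncoding("iso-8859-2").GetString(raw)); // Ľ
            Console.WriteLine(Encoding.GetEncoding(1250).GetString(raw));         // Ą (Windows-1250)
            Console.WriteLine(Encoding.GetEncoding(852).GetString(raw));          // ą (DOS-852)
        }
    }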

There is no such mapping in a Unicode encoding; it merely needs to squeeze a 32-bit value into an efficient byte format. Different Unicode encodings use different ways to squeeze the bits, but the character always has the same fixed value (a codepoint, in Unicode speak).
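
Quick sketch with an arbitrary character: the codepoint never changes, only the bytes produced by each encoding do:

    using System;
    using System.Text;

    class UnicodeDemo
    {
        static void Main()
        {
            string s = "Ł";  // always codepoint U+0141, no matter how it is encoded

            // Each Unicode encoding just packs that fixed value into bytes differently.
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));    // C5-81
            Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s))); // 41-01 (UTF-16 LE)
            Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(s)));   // 41-01-00-00
        }
    }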

UTF-encoded text files should have a BOM, allowing the reader to autodetect the encoding. No such convention exists for text files that were encoded with a code page; getting good text out of them is a bit of a crap shoot. It's an evil that should die already :)
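
In .NET terms, that autodetection is what StreamReader's detectEncodingFromByteOrderMarks flag does; rough sketch (the file name is made up):

    using System;
    using System.IO;
    using System.Text;

    class BomSniff
    {
        static void Main()
        {
            // The fallback encoding is only used when no BOM is found; if the file
            // starts with a UTF-8/16/32 BOM, StreamReader switches to that encoding.
            using (var reader = new StreamReader("import.csv", Encoding.Default,
                                                 detectEncodingFromByteOrderMarks: true))
            {
                reader.Peek();  // force the first read so the BOM is examined
                Console.WriteLine(reader.CurrentEncoding.WebName);  // e.g. "utf-8"
            }
        }
    }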

Hans Passant
Although UTF-8, UTF-16, and UTF-32 are purely algorithmic, there exist Unicode encodings like GB18030 and UTF-EBCDIC that do include mapping tables. Also, a BOM is NOT required or recommended for UTF-8.
dan04
Brrr, those still look like bit encodings to me, similar to how UTF-8 favors ASCII. Yes, a BOM is not required; it is merely incredibly stupid not to include it. The point is that there's a well-defined standard if you *do* include it, as opposed to having *no* standard for code-page encoded text.
Hans Passant
Come to think of it, that also highlights the natural state of things: using a weirdo encoding or intentionally omitting a BOM is a 'competitive advantage'.
Hans Passant
A BOM is very useful for UTF-16. It's unneeded for UTF-8 and UTF-32, which can be detected by validation.
dan04
There are about a billion Chinese users who don't think much of that idea. "Bush hid the facts" is legendary.
Hans Passant
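
As dan04 notes, well-formed UTF-8 can usually be recognized just by attempting a strict decode; a minimal sketch of that idea (the helper name is made up):

    using System.Text;

    static class Utf8Check
    {
        // Returns true if the bytes decode as valid UTF-8; code-page text
        // containing accented characters will normally fail this test.
        public static bool LooksLikeUtf8(byte[] data)
        {
            var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
            try
            {
                strict.GetString(data);
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }
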
A: 

.NET classes such as StreamReader default to UTF-8 encoding; no, it's not magically detected.
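
So for a CSV saved with one of those code pages you have to pass the encoding in yourself; minimal sketch, where the path and code page are assumptions rather than anything from the question:

    using System.IO;
    using System.Text;

    class CsvImport
    {
        static string ReadCsv(string path)
        {
            // With no encoding argument, StreamReader assumes UTF-8
            // (apart from recognizing a BOM if one is present).
            // For a file saved as, say, Windows-1250, pass the encoding explicitly:
            using (var reader = new StreamReader(path, Encoding.GetEncoding(1250)))
            {
                return reader.ReadToEnd();
            }
        }
    }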

Jerome