tags:
views: 787
answers: 6

Hello, I have a requirement to produce text files with ASCII encoding. I have a database full of Greek, French, and German characters, with umlauts and accents. Is this even possible?

string reportString = report.makeReport();
Dictionary<string, string> replaceCharacters = new Dictionary<string, string>(); // currently unused
// Characters outside the 7-bit range are replaced with '?' by the default ASCII encoder.
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Response.BufferOutput = false;
Response.ContentType = "text/plain";
Response.AddHeader("Content-Disposition", "attachment;filename=" + reportName + ".txt");
Response.OutputStream.Write(encodedReport, 0, encodedReport.Length);
Response.End();

When I get the reportString back, the characters are represented faithfully. When I save the text file, I get '?' in place of the special characters.

As I understand it, the ASCII standard is for American English only, and something like UTF-8 would be for an international audience. Is this correct?

I'm going to make the statement that if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly.

Or, am I way off and doing/saying something stupid?

Thanks for your help.

+6  A: 

You cannot represent accents and umlauts in an ASCII encoded file simply because these characters are not defined in the standard ASCII charset.
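
A minimal sketch of that behavior, assuming .NET as in the question: the default ASCII encoder silently substitutes '?' for anything outside the 7-bit range, which is exactly what you are seeing in the saved file.

```csharp
using System;
using System.Text;

class AsciiFallbackDemo
{
    static void Main()
    {
        // The default ASCII encoder replaces any character above U+007F with '?'.
        byte[] bytes = Encoding.ASCII.GetBytes("Müller café αβγ");
        string roundTripped = Encoding.ASCII.GetString(bytes);
        Console.WriteLine(roundTripped); // M?ller caf? ???
    }
}
```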

Darin Dimitrov
Right. So if I say something like "I can't give you your special characters because I have a requirement that these reports be ASCII encoded," I'm making a true statement.
jim
Yes, absolutely.
Darin Dimitrov
Just ensure that the people who gave you that requirement understand what "ASCII encoded" really means. A typical non-Unicode-knowledgeable person might consider "ASCII encoded" to mean "text file".
Lasse V. Karlsen
Thanks Lasse, I think my umlauts and accents broke their import process, so I'm almost sure they mean ASCII and not anything else.
jim
+1  A: 

You are correct.

  • Pure US ASCII is a 7-bit encoding, featuring English characters only.
  • You need a different encoding to capture characters from other alphabets. UTF-8 is a good choice.
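
As a minimal sketch (assuming .NET, as in the question) of why UTF-8 is a good choice: it round-trips the accented and Greek characters without loss.

```csharp
using System;
using System.Text;

class Utf8RoundTripDemo
{
    static void Main()
    {
        string text = "Größe café αβγ";
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        // Unlike Encoding.ASCII, nothing is replaced with '?'.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text); // True
    }
}
```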
unwind
+1  A: 

UTF-8 is backward compatible with ASCII, so if you encode your files as UTF-8, then ASCII clients can read whatever is in their character set, and Unicode clients can read all the extended characters.

There's no way to get all the accents you want in ASCII; however, some accented characters (like ü) are available in the 8-bit "extended ASCII" character sets.

Aaronaught
Is there a way to make Encoding.ASCII use the 8-bit version instead of the 7-bit one?
jim
There is always a way. The encoding you probably want is ANSI 1252 or Windows-1252, which you can get to using Encoding.GetEncoding(1252). This is the standard "Windows" encoding.
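
A minimal sketch of that, assuming .NET; note that on .NET Core and later the code-page encodings must be registered before Encoding.GetEncoding(1252) will work.

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+, register the code-page encodings first
        // (requires the System.Text.Encoding.CodePages package):
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding win1252 = Encoding.GetEncoding(1252);
        byte[] bytes = win1252.GetBytes("Müller café");
        // 'ü' becomes the single byte 0xFC rather than '?': one byte per character.
        Console.WriteLine(bytes[1].ToString("X2")); // FC
    }
}
```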
Aaronaught
+2  A: 

The ASCII character set only contains A-Z in upper and lower case, digits, and some punctuation. No Greek characters, no umlauts, no accents.

You can use a character set from the group that is sometimes referred to as "extended ASCII", which uses 256 characters instead of 128.

The problem with using a character set other than ASCII is that you have to use the correct one, i.e. the one that the receiving party is expecting, or it will fail to interpret any of the extended characters correctly.

You can use Encoding.GetEncoding(...) to create an extended encoding. See the reference for the Encoding class for a list of possible encodings.
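
As a sketch of why using the correct character set matters, assuming .NET and picking code pages 1252 (Western European) and 1253 (Greek) purely for illustration: the same bytes decoded with a different code page come back as different characters.

```csharp
using System;
using System.Text;

class MismatchDemo
{
    static void Main()
    {
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed on .NET Core / .NET 5+

        Encoding sender = Encoding.GetEncoding(1252);   // Western European
        Encoding receiver = Encoding.GetEncoding(1253); // Greek

        byte[] bytes = sender.GetBytes("fünf");         // 'ü' is sent as the byte 0xFC
        string misread = receiver.GetString(bytes);

        // The receiver maps 0xFC to a Greek letter, not to 'ü'.
        Console.WriteLine(misread == "fünf");           // False
    }
}
```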

Guffa
Thanks Guffa, GetEncoding is interesting; there just isn't any way of telling what they are using on the other end.
jim
+3  A: 

Before Unicode, this was handled by "code pages". You can think of a code page as a mapping between Unicode characters and the 256 values that can fit into a single byte (obviously, every code page is missing most of the Unicode characters).

The original ASCII code page includes only English letters - but it's unlikely someone really wants the original 7-bit code page; they probably call any 8-bit character set ASCII.

The Western European code page known as Latin-1 is ISO-8859-1 or Windows-1252 (the first is the ISO standard, the second is the closest code page supported by Windows).

To support characters not in Latin-1 you need to encode using different code pages, for example:

874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese

UTF-8 is something completely different: it encodes the entire Unicode character set using a variable number of bytes per character. Digits and English letters are encoded the same as in ASCII (and Windows-1252); most other languages are encoded at 2 to 4 bytes per character.

UTF-8 is mostly compatible with ASCII systems because English is encoded the same as ASCII and there are no embedded nulls in the strings.

Converting between .NET strings (UTF-16LE) and other encodings is done with the System.Text.Encoding class.
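
A short sketch of those variable byte lengths, using the System.Text.Encoding class:

```csharp
using System;
using System.Text;

class Utf8LengthDemo
{
    static void Main()
    {
        // UTF-8 byte counts grow with the code point:
        Console.WriteLine(Encoding.UTF8.GetByteCount("A")); // 1 (same as ASCII)
        Console.WriteLine(Encoding.UTF8.GetByteCount("ü")); // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("α")); // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("€")); // 3

        // .NET strings are UTF-16 internally; Encoding converts on the way out.
        byte[] utf8 = Encoding.UTF8.GetBytes("über");
        Console.WriteLine(utf8.Length); // 5: two bytes for 'ü', one each for b, e, r
    }
}
```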

IMPORTANT NOTE: the system on the receiving end must use the same code page as the system on the sending end, otherwise you will get gibberish.

Nir
+1  A: 

Several of the encodings mentioned in the other answers can be loosely described as "extended ASCII".

When your users are asking for ASCII encoding, they are probably asking for one of these.

A statement like "if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly" risks sounding pedantic to a non-technical user. An alternative is to get a sample of what they want (probably either the ANSI or OEM code page of their PC), determine the appropriate code page, and specify that.

Joe