tags:
views: 787
answers: 6

Hello, I have a requirement to produce text files with ASCII encoding. I have a database full of Greek, French, and German characters, with umlauts and accents. Is this even possible?

string reportString = report.makeReport();
Dictionary<string, string> replaceCharacters = new Dictionary<string, string>(); // currently unused
// Characters outside the 7-bit range are replaced with '?' by the default ASCII encoder.
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Response.BufferOutput = false;
Response.ContentType = "text/plain";
Response.AddHeader("Content-Disposition", "attachment;filename=" + reportName + ".txt");
Response.OutputStream.Write(encodedReport, 0, encodedReport.Length);
Response.End();

When I get the reportString back, the characters are represented faithfully. When I save the text file, I get '?' in place of the special characters.

As I understand it, the ASCII standard is for American English only, and something like UTF-8 would be for an international audience. Is this correct?

I'm going to make the statement that if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly.

Or, am I way off and doing/saying something stupid?

Thanks for your help.

+6  A: 

You cannot represent accents and umlauts in an ASCII encoded file simply because these characters are not defined in the standard ASCII charset.
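
A minimal sketch of that behavior, assuming .NET as in the question: the default ASCII encoder silently substitutes '?' for anything outside the 7-bit range, which is exactly what you are seeing in the saved file.

```csharp
using System;
using System.Text;

class AsciiFallbackDemo
{
    static void Main()
    {
        // The default ASCII encoder replaces any character above U+007F with '?'.
        byte[] bytes = Encoding.ASCII.GetBytes("Müller café αβγ");
        string roundTripped = Encoding.ASCII.GetString(bytes);
        Console.WriteLine(roundTripped); // M?ller caf? ???
    }
}
```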

Darin Dimitrov
Right. So if I say something like "I can't give you your special characters because I have a requirement that these reports be ASCII encoded," I'm making a true statement.
jim
Yes, absolutely.
Darin Dimitrov
Just ensure that the people who gave you that requirement understand what "ASCII encoded" really means. A typical non-Unicode-knowledgeable person might consider "ASCII encoded" to mean "text file".
Lasse V. Karlsen
Thanks Lasse, I think my umlauts and accents broke their import process, so I'm almost sure they mean ASCII and not anything else.
jim
+1  A: 

You are correct.

  • Pure US ASCII is a 7-bit encoding, featuring English characters only.
  • You need a different encoding to capture characters from other alphabets. UTF-8 is a good choice.
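
As a minimal sketch (assuming .NET, as in the question) of why UTF-8 is a good choice: it round-trips the accented and Greek characters without loss.

```csharp
using System;
using System.Text;

class Utf8RoundTripDemo
{
    static void Main()
    {
        string text = "Größe café αβγ";
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        // Unlike Encoding.ASCII, nothing is replaced with '?'.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text); // True
    }
}
```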
unwind
+1  A: 

UTF-8 is backward compatible with ASCII, so if you encode your files as UTF-8, then ASCII clients can read whatever is in their character set, and Unicode clients can read all the extended characters.

There's no way to get all the accents you want in ASCII; however, some accented characters (like ü) are available in the 8-bit "extended ASCII" character sets.

Aaronaught
Is there a way to make Encoding.ASCII use the 8-bit version instead of the 7-bit one?
jim
There is always a way. The encoding you probably want is ANSI 1252 or Windows-1252, which you can get to using Encoding.GetEncoding(1252). This is the standard "Windows" encoding.
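
A minimal sketch of that, assuming .NET; note that on .NET Core and later the code-page encodings must be registered before Encoding.GetEncoding(1252) will work.

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+, register the code-page encodings first
        // (requires the System.Text.Encoding.CodePages package):
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding win1252 = Encoding.GetEncoding(1252);
        byte[] bytes = win1252.GetBytes("Müller café");
        // 'ü' becomes the single byte 0xFC rather than '?': one byte per character.
        Console.WriteLine(bytes[1].ToString("X2")); // FC
    }
}
```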
Aaronaught
+2  A: 

The ASCII character set only contains A-Z in upper and lower case, digits, and some punctuation. No Greek characters, no umlauts, no accents.

You can use a character set from the group that is sometimes referred to as "extended ASCII", which uses 256 characters instead of 128.

The problem with using a character set other than ASCII is that you have to use the correct one, i.e. the one that the receiving party is expecting, or it will fail to interpret any of the extended characters correctly.

You can use Encoding.GetEncoding(...) to create an extended encoding. See the reference for the Encoding class for a list of possible encodings.
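
As a sketch of why using the correct character set matters, assuming .NET and picking code pages 1252 (Western European) and 1253 (Greek) purely for illustration: the same bytes decoded with a different code page come back as different characters.

```csharp
using System;
using System.Text;

class MismatchDemo
{
    static void Main()
    {
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed on .NET Core / .NET 5+

        Encoding sender = Encoding.GetEncoding(1252);   // Western European
        Encoding receiver = Encoding.GetEncoding(1253); // Greek

        byte[] bytes = sender.GetBytes("fünf");         // 'ü' is sent as the byte 0xFC
        string misread = receiver.GetString(bytes);

        // The receiver maps 0xFC to a Greek letter, not to 'ü'.
        Console.WriteLine(misread == "fünf");           // False
    }
}
```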

Guffa
Thanks Guffa, GetEncoding is interesting; there just isn't any way of telling what they are using on the other end.
jim
+3  A: 

Before Unicode, this was handled by "code pages". You can think of a code page as a mapping between Unicode characters and the 256 values that can fit into a single byte (obviously, every code page is missing most of the Unicode characters).

The original ASCII code page includes only English letters - but it's unlikely someone really wants the original 7-bit code page; they probably call any 8-bit character set ASCII.

The Western European code page known as Latin-1 is ISO-8859-1 or Windows-1252 (the first is the ISO standard, the second is the closest code page supported by Windows).

To support characters not in Latin-1 you need to encode using different code pages, for example:

874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese

UTF-8 is something completely different: it encodes the entire Unicode character set using a variable number of bytes per character. Digits and English letters are encoded the same as in ASCII (and Windows-1252); most other languages are encoded at 2 to 4 bytes per character.

UTF-8 is mostly compatible with ASCII systems because English is encoded the same as ASCII and there are no embedded nulls in the strings.

Converting between .NET strings (UTF-16LE) and other encodings is done with the System.Text.Encoding class.
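
A short sketch of those variable byte lengths, using the System.Text.Encoding class:

```csharp
using System;
using System.Text;

class Utf8LengthDemo
{
    static void Main()
    {
        // UTF-8 byte counts grow with the code point:
        Console.WriteLine(Encoding.UTF8.GetByteCount("A")); // 1 (same as ASCII)
        Console.WriteLine(Encoding.UTF8.GetByteCount("ü")); // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("α")); // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("€")); // 3

        // .NET strings are UTF-16 internally; Encoding converts on the way out.
        byte[] utf8 = Encoding.UTF8.GetBytes("über");
        Console.WriteLine(utf8.Length); // 5: two bytes for 'ü', one each for b, e, r
    }
}
```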

IMPORTANT NOTE: the system on the receiving end must use the same code page as the system on the sending end, otherwise you will get gibberish.

Nir
+1  A: 

Several of the encodings mentioned in the other answers can be loosely described as "extended ASCII".

When your users are asking for ASCII encoding, they are probably asking for one of these.

A statement like "if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly" risks sounding pedantic to a non-technical user. An alternative is to get a sample of what they want (probably either the ANSI or OEM code page of their PC), determine the appropriate code page, and specify that.

Joe