ansaurus

Question

manually converting between ASCII and .NET characters

Answer 1

A:

I get question marks for all three of those in a console app (.NET 3.5 SP1). They should all be equivalent, as far as I know. John Knoeller is correct regarding ASCII vs ANSI.

Have you tried using one of the Encoding classes' GetBytes() on the original string and iterating through, removing (by copying "good" bytes to another buffer) the values you don't want?

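e.g. (using LINQ):

// Needs: using System.Linq; using System.Text;
Encoding cp1252 = Encoding.GetEncoding(1252);
byte[] original = cp1252.GetBytes(badString);
// Drop bytes 145-148, the curly quotes in Windows code page 1252.
byte[] clean = (from b in original where b < 145 || b > 148 select b).ToArray();
string cleanString = cp1252.GetString(clean);

Encoding.ASCII is the wrong one to use here, to be honest: its GetBytes() replaces every character above 127 with ? (byte 63), so the filter would never see the values 145-148. Even with code page 1252, if the original text is arbitrary Unicode this could conceivably do bad things, because any character that code page cannot represent is lost in the round trip.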

technophile 2010-02-05 19:38:28

Answer 2

+3 A:

.NET uses Unicode (UTF-16), which is the same as ASCII only for values below 128.

ASCII doesn't define values above 127.

I think you may be thinking of ANSI, which defines values above 127 as (mostly) letters needed for most European languages, or of OEM (the original IBM PC character set), which defines characters above 127 as (mostly) symbols.

How the characters above 127 are interpreted depends on the code page, or encoding (hence System.Text.Encoding). So you could probably get test 3 working if you used a different encoding, perhaps System.Text.Encoding.Default.

Edit: Ok, now that we know that the encoding you want is ANSI, it's clearer what is happening.

The rule for character conversions is to replace any character that can't be represented in the target encoding with some other character, usually a box. But ASCII has no box character, so it uses a ? instead. This explains test 3.

Tests 1 and 2 both use Convert.ToChar with an integer constant, which interprets the input as a Unicode character, not an ANSI character, so no conversion is applied. Unicode character 147 is a non-printing character.
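Both behaviours can be checked with a small console sketch. The class name is illustrative only, and the original test code is not shown in this thread, so this is an approximation of what tests 1 to 3 exercise rather than a copy of them.

```csharp
using System;
using System.Text;

class CharConversionDemo
{
    static void Main()
    {
        // Convert.ToChar(int) treats the number as a UTF-16 code unit,
        // so 147 becomes U+0093, a control character, not a curly quote.
        char c = Convert.ToChar(147);
        Console.WriteLine("U+{0:X4}, control: {1}", (int)c, char.IsControl(c));

        // ASCII cannot represent anything above 127; by default .NET
        // substitutes '?' (byte 63) when encoding such a character.
        byte[] ascii = Encoding.ASCII.GetBytes("\u201C");
        Console.WriteLine(ascii[0]); // 63
    }
}
```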

John Knoeller 2010-02-05 19:38:59

Answer 3

+7 A:

Character 147 is U+0093 SET TRANSMIT STATE. Like all the Unicode characters in the range 0-255, it is the same as the ISO-8859-1 character of the same number. ISO-8859-1 assigns 147 to this invisible control code.

What you are thinking of is not ‘ASCII’ or even ‘ISO-8859-1’, but Windows code page 1252. This is a non-standard encoding that is like 8859-1, but assigns the characters 128-159 to various typographical extensions such as smart quotes instead of the largely-useless control codes. In code page 1252, character 147 is “, aka U+201C LEFT DOUBLE QUOTATION MARK.

If you want to convert Windows code pages (often misleadingly known as ‘ANSI’) to Unicode characters you will need to specify the code page you want, for example:

System.Text.Encoding.GetEncoding(1252).GetChars(new byte[] { 147 })
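As a rough illustration of the difference between the two encodings, the sketch below decodes the same byte with code page 1252 and with ISO-8859-1. The class and variable names are made up for the example. It assumes the .NET Framework; on .NET Core and later, code page 1252 is, as far as I know, only available after the provider registration shown in the comment.

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ this line is needed first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] input = { 147 };

        // Code page 1252 maps byte 147 to the left curly quote.
        Console.WriteLine("U+{0:X4}", (int)cp1252.GetChars(input)[0]); // U+201C

        // ISO-8859-1 maps it to the control character of the same number.
        Console.WriteLine("U+{0:X4}", (int)latin1.GetChars(input)[0]); // U+0093
    }
}
```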

System.Text.Encoding.Default will give you the default encoding on your server. For a server in the Western European locale, that'll be 1252. Elsewhere, it won't be. It's generally not a good idea to have a dependency on the locale's default code page in a server application.

In any case, you should not be getting bytes like 147 representing a “ in the input to a web application. That will only happen if your page itself is in code page 1252 encoding (and just to confuse and mislead even more, when you say your page is in ISO-8859-1 format, browsers will silently use code page 1252 instead). Your page may also be in 1252 if you've failed to specify any encoding for it (the browser guesses; other locales will guess different code pages so it'll all be a big mess).

Make sure you use UTF-8 for all encodings in your web app, and mark your pages as such. Today, all web apps should be using UTF-8.

bobince 2010-02-05 19:57:13

@bobince - Great information, thank you very much. I don't suppose you would have any links to documentation about this kind of stuff? I'm just trying to learn as much as possible about this issue before putting a fix into place.

Justin C 2010-02-05 20:03:20

The Spolsky article usually gets wheeled out at this point! (http://www.joelonsoftware.com/articles/Unicode.html)... I have my reservations about some of the material in this, but I suppose it's a reasonable enough primer.

bobince 2010-02-05 20:43:10

@bobince - Is there any chance that a user copying and pasting from a word processor would send these values in to the web interface? This is a pretty rare problem, but each user I have interviewed said they were copying and pasting from their word processor on their Mac.

Justin C 2010-02-05 20:54:05

oh, and the page is tagged UTF-8

Justin C 2010-02-05 20:58:00

A web browser DOM's whole content model, including `input.value`, is natively Unicode based. A `“` character pasted into an input field will always be submitted encoded according to the page's declared `charset`, so as byte 0x93 if the page is encoded in cp1252, or as bytes 0xE2, 0x80, 0x9C in UTF-8. Whilst it is technically possible to submit a real character 147 from a UTF-8-encoded page (as sequence 0xC2 0x93), it is very unlikely anyone would input a character 147.
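The UTF-8 byte sequences mentioned here can be reproduced on the server side with a short sketch (the class name is hypothetical; only standard `System.Text` calls are used):

```csharp
using System;
using System.Text;

class Utf8BytesDemo
{
    static void Main()
    {
        // U+201C LEFT DOUBLE QUOTATION MARK takes three bytes in UTF-8.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u201C"))); // E2-80-9C

        // A real character 147 (U+0093) takes two bytes in UTF-8.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u0093"))); // C2-93
    }
}
```

So if the bytes arriving from a UTF-8 page are E2 80 9C, the user pasted a genuine curly quote and nothing needs converting.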

bobince 2010-02-05 20:59:43

If a broken copy-and-paste facility did somehow manage to insert a character 147 into an input (and I've never seen that happen with Office), then the user would already not be able to see the character 147 in the input field (it'd be an invisible control character already), so it should be no surprise to them when it doesn't appear at all in the output either.

bobince 2010-02-05 21:00:58

@bobince - I just tested it using an iBook G4 running Firefox. I copied and pasted a regular " along with some other text into my webapp and hit save. When I went back to the form, the web app was showing curly quotes, not the regular straight ones. My users all want to copy and paste content from Word, so it seems I need to manually scrub the data and switch these characters myself.

Justin C 2010-02-06 01:47:55

I prefer to leave any pasted ‘smart quotes’ or other typographical niceties (like –, —) as they are. It's generally a bad idea to try to automatically convert straight-quotes to smart quotes, because it's not a process that can be done reliably. I'd turn that feature off. Anyway, you don't get anything from ‘scrubbing’ those particular characters; if your app can't handle them, it probably can't handle any other non-ASCII characters either, which is something that would need fixing.

bobince 2010-02-06 12:44:05
