views:

49

answers:

1

I've just stumbled over another question in which someone suggested using new ASCIIEncoding().GetBytes(someString) to convert a string to bytes. It was obvious to me that this shouldn't work for non-ASCII characters, but as it turns out, ASCIIEncoding happily replaces invalid characters with '?'. I'm very confused about this because it breaks the principle of least surprise. In Python, the equivalent would be u"some unicode string".encode("ascii"), and the conversion is strict by default, so non-ASCII characters would raise an exception in this example.
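
Here's a minimal example of what I mean (assuming a using System.Text; directive; the string literal is just an illustration):

byte[] bytes = new ASCIIEncoding().GetBytes("h\u00e9llo");
// The 'é' (U+00E9) is silently replaced with '?' (0x3F) instead of causing an error.
Console.WriteLine(BitConverter.ToString(bytes));   // prints "68-3F-6C-6C-6F"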

Two questions:

  1. How can strings be strictly converted to another encoding (like ASCII or Windows-1252), so that an exception is thrown if invalid characters occur? By the way, I don't want a foreach loop that converts each Unicode code point to a byte and then checks the 8th bit. This is supposed to be done by a great framework like .NET (or Python ^^).
  2. Any ideas on the rationale behind this default behavior? To me, it makes more sense to do strict conversions by default, or at least to offer a parameter for this purpose (Python allows "replace", "ignore", "strict").
+7  A: 

.NET offers the option of throwing an exception if the encoding conversion fails. You'll need to use the EncoderExceptionFallback class (which throws an EncoderFallbackException if an input character cannot be converted to an encoded output byte sequence) when creating the encoding. The following code is from the documentation for that class:

Encoding ae = Encoding.GetEncoding(
              "us-ascii",
              new EncoderExceptionFallback(), 
              new DecoderExceptionFallback());

Then use that encoding to perform the conversion:

// The input string consists of the Unicode characters LEFT-POINTING 
// DOUBLE ANGLE QUOTATION MARK (U+00AB), 'X' (U+0058), and RIGHT-POINTING 
// DOUBLE ANGLE QUOTATION MARK (U+00BB). 
// The encoding can only encode characters in the US-ASCII range of U+0000 
// through U+007F. Consequently, the characters bracketing the 'X' character
// cause an exception.

string inputString = "\u00abX\u00bb";
byte[] encodedBytes = new byte[ae.GetMaxByteCount(inputString.Length)];
int numberOfEncodedBytes = 0;
try
{
    numberOfEncodedBytes = ae.GetBytes(inputString, 0, inputString.Length, 
                                       encodedBytes, 0);
}
catch (EncoderFallbackException)
{
    Console.WriteLine("bad conversion");
}
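
If you don't need to manage the output buffer yourself, the string overload of GetBytes also works (a quick sketch, not part of the MSDN sample); it returns an exactly-sized array and throws the same exception:

try
{
    // Still throws because of the EncoderExceptionFallback configured above.
    byte[] encodedInput = ae.GetBytes(inputString);
}
catch (EncoderFallbackException)
{
    Console.WriteLine("bad conversion");
}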

This MSDN page, "Character Encoding in the .NET Framework", discusses to some degree the rationale behind the default conversion behavior. In summary, they didn't want to disturb legacy applications that depend on this behavior. They do recommend overriding the default, though.
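
As a rough sketch of that recommendation: the predefined instances such as Encoding.ASCII are read-only, so clone one before assigning the fallback properties.

// Clone() returns a writable copy of the read-only built-in instance,
// so its fallbacks can be swapped for the exception-throwing ones.
Encoding strictAscii = (Encoding)Encoding.ASCII.Clone();
strictAscii.EncoderFallback = new EncoderExceptionFallback();
strictAscii.DecoderFallback = new DecoderExceptionFallback();
// strictAscii.GetBytes(inputString) now throws an EncoderFallbackException.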

Michael Petrotta
Great explanation. I had seen the sentence "You might want to consider having your application set EncoderFallback or DecoderFallback to EncoderExceptionFallback or DecoderExceptionFallback to prevent sequences with the 8th bit set." in the documentation, but it wasn't obvious to me that it could be used for strict conversions.
AndiDog