I've just stumbled across another question in which someone suggested using `new ASCIIEncoding().GetBytes(someString)` to convert a string to bytes. It was obvious to me that this shouldn't work for non-ASCII characters, but as it turns out, `ASCIIEncoding` happily replaces invalid characters with `'?'`. I'm very confused by this because it kind of breaks the rule of least surprise. In Python, it would be `u"some unicode string".encode("ascii")`, and the conversion is strict by default, so non-ASCII characters in this example would lead to an exception.
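To make the surprise concrete, here is a minimal sketch of the behaviour I'm describing; the string `"café"` is just a made-up example containing one non-ASCII character:

```csharp
using System;
using System.Text;

class AsciiDemo
{
    static void Main()
    {
        // "é" (U+00E9) is not representable in ASCII, yet GetBytes does not throw.
        string someString = "café";
        byte[] bytes = new ASCIIEncoding().GetBytes(someString);

        // Prints "caf?" - the invalid character was silently replaced with '?'.
        Console.WriteLine(Encoding.ASCII.GetString(bytes));
    }
}
```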
Two questions:
- How can strings be strictly converted to another encoding (like ASCII or Windows-1252), so that an exception is thrown if invalid characters occur? By the way, I don't want a foreach loop that converts each Unicode code point to a byte and then checks the 8th bit. This is supposed to be done by a great framework like .NET (or Python ^^).
- Any ideas on the rationale behind this default behavior? For me, it makes more sense to do strict conversions by default, or at least to offer a parameter for this purpose (Python allows "strict", "replace", and "ignore").