ansaurus

Question

Answer 1

+1 A:

The XML string must not (!) contain a BOM. A BOM is only allowed in byte data (e.g. streams) encoded with UTF-8, because a string is not an encoded representation but already a sequence of Unicode characters.

It therefore seems that you are loading the string incorrectly, in code that you unfortunately didn't provide.
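One way a BOM can end up inside a string is sketched below. It assumes the XML was written to a MemoryStream using Encoding.UTF8 and the resulting bytes were then decoded with Encoding.UTF8.GetString; the class and method names are hypothetical and are not taken from the question.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class BomInStringDemo
{
    // Hypothetical helper: writes a tiny document to a byte stream, then decodes the bytes.
    public static string WriteThenDecode()
    {
        using (var stream = new MemoryStream())
        {
            // Encoding.UTF8 has a three-byte preamble (EF BB BF) that the writer puts at the start of the stream.
            var settings = new XmlWriterSettings { Encoding = Encoding.UTF8 };
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartElement("root");
                writer.WriteEndElement();
            }
            // GetString decodes every byte, including the preamble, so the
            // preamble becomes the character U+FEFF at index 0 of the string.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

With this sketch, the first character of the returned string is U+FEFF rather than '<', which is the kind of input a reader working on a string is not expecting.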

Edit:

Thanks for posting the serialization code.

You should not write the data to a MemoryStream, but rather to a StringWriter which you can then convert to a string with ToString. Since this avoids passing through a byte representation it is not only faster but also avoids such problems.

Something like this:

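// Requires: using System.IO; using System.Xml; using System.Xml.Serialization;
private static string SerializeResponse(Response response)
{
    // Characters go straight into the StringWriter; no bytes are involved, so no BOM is produced.
    using (var output = new StringWriter())
    {
        using (var writer = XmlWriter.Create(output))
        {
            new XmlSerializer(typeof(Response)).Serialize(writer, response);
        }   // disposing the XmlWriter flushes any buffered output into the StringWriter
        return output.ToString();
    }
}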

Lucero 2010-06-23 17:49:40

I've made exactly that change, and it works perfectly. Thanks!

arootbeer 2010-06-23 18:07:36

Answer 2

A:

The BOM shouldn't be in the string in the first place.
BOMs are used to detect the encoding of a raw byte array; they have no business being in an actual string.

Where does the string come from?
You're probably reading it with the wrong encoding.

SLaks 2010-06-23 17:49:54

I made sure I was at least using the right encoding :) I've added the serialization code to my question.

arootbeer 2010-06-23 18:03:08

Answer 3

A:

Strings in C# are encoded as UTF-16, so the BOM would be wrong. As a general rule, always encode XML to byte arrays and decode it from byte arrays.

Stephen Cleary 2010-06-23 17:50:47

This is not exactly true. While the memory format is usually similar to UTF-16, strings are an "abstract" sequence of characters with a specific number of characters. Note that there have been discussions in the CLR team about changing strings to another in-memory representation in order to make them more efficient. Anyway, since it is an abstract view and not a byte sequence, a BOM must not exist in the string.

Lucero 2010-06-23 17:53:53

I've added the serialization code. I am already using UTF-8 explicitly.

arootbeer 2010-06-23 17:58:37

@Stephen, I think the thing with alternative in-memory string representations was in the following Channel 9 video: http://channel9.msdn.com/shows/Going+Deep/Vance-Morrison-CLR-Through-the-Years/

Lucero 2010-06-23 19:04:54

@Lucero: the [String class documentation](http://msdn.microsoft.com/en-us/library/system.string.aspx) clearly states that it uses UTF-16 encoding. You can get the sequence of Unicode characters via `StringInfo.GetTextElementEnumerator`; the `Char` values in a `string` may contain surrogate pairs.

Stephen Cleary 2010-06-23 19:07:04

@Stephen, the docs say: "A string is a sequential collection of Unicode characters that is used to represent text." and later "Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object." The point is that the string is not a serialized representation but a sequence of Unicode characters made up of UTF-16 code points. It's a character sequence abstraction.

Lucero 2010-06-23 19:19:08

(cont.) The BOM is used to detect the binary (byte) serialization of a Unicode string. Since this is an abstraction of characters using code points, you never come across a byte representation, which also means that a BOM is neither used nor supported for the internal string representation. Note that BOMs are mostly used to distinguish little-endian from big-endian UTF-16 in byte sequences; their use in UTF-8 is less prominent outside the Microsoft world and only serves to "tag" a byte sequence as being UTF-8 as opposed to ASCII or ANSI.

Lucero 2010-06-23 19:22:19

@Lucero: As you quoted, the string class does use UTF-16 encoding. If it was intended to be an abstraction of characters, it is a very, very leaky abstraction, since iterating over the string yields UTF-16 bytes.

Stephen Cleary 2010-06-23 20:18:25

@Stephen, that's the part which you got wrong: it does not yield bytes, but (16 bit) characters which are endian-invariant. This is a very important difference.

Lucero 2010-06-23 21:00:32
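The distinction drawn in the comments above, between 16-bit characters in memory and bytes in a stream, can be illustrated with a small sketch. The class and method names are hypothetical; only the System.Text.Encoding members are real API.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // The BOM bytes an encoding would place at the start of a byte stream.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // The byte serialization of a string in the given encoding (GetBytes adds no preamble).
    public static string BytesHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

For example, PreambleHex gives "EF-BB-BF" for Encoding.UTF8, "FF-FE" for Encoding.Unicode (UTF-16 little-endian) and "FE-FF" for Encoding.BigEndianUnicode, while BytesHex("A", ...) gives "41-00" and "00-41" for the two UTF-16 byte orders. The char "A"[0] has the value 0x41 in both cases; byte order only appears once the string is turned into bytes.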

@Lucero: good catch with the endianness! But I still interpret the docs as declaring UTF-16 encoding (just with unspecified endianness).

Stephen Cleary 2010-06-24 13:22:40

@Stephen, the endianness is only meaningful when loading integer entities larger than a byte, for instance into a processor register. Basically it defines whether the most or the least significant byte comes first for anything larger than a byte. So since we're dealing with 16-bit entities already, the endianness is meaningless here and consequently a BOM has no function here. See also http://unicode.org/faq/utf_bom.html#BOM - "What should I do with U+FEFF in the middle of a file?" (note that the discussion in the FAQ is about *data streams*, not code point sequences as we have them in memory).

Lucero 2010-06-24 13:45:51

@Lucero: I agree that the BOM should not be in the string. However, endianness is not meaningless with UTF-16; there are LE and BE UTF-16 encodings, and when written to a byte stream these *require* a BOM.

Stephen Cleary 2010-06-24 13:49:09

@Stephen, sorry, but you're completely wrong here. LE and BE are predefined in their endianness when written to a byte stream, and therefore don't use the BOM. As soon as you deal with 16-bit codes which have already been loaded from a byte representation, the endianness is meaningless. See the aforementioned FAQ, "Is Unicode a 16 bit encoding", "What is a UTF?" and "What are some of the differences between the UTFs?".

Lucero 2010-06-24 14:05:04

@Lucero: I refer you to the [XML spec](http://www.w3.org/TR/REC-xml/#charencoding), which clearly states that an XML document in a UTF-16 encoding *requires* a BOM.

Stephen Cleary 2010-06-24 14:20:58

@Stephen: Yes, an XML document (which is read from a byte stream) requires a BOM when UTF-16 is the encoding. But don't confuse the regular UTF-16 with UTF-16BE or UTF-16LE - those *must not* have a BOM (and are seldom used for XML files)! See also http://www.ietf.org/rfc/rfc3023.txt page 14.

Lucero 2010-06-24 15:11:34

Answer 4

+2 A:

In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

So you want to prevent the BOM from being added as part of your serialization process. Unfortunately, you don't provide what your serialization logic is.

What you should do is create a UTF8Encoding instance via the UTF8Encoding(bool) constructor, passing false to disable generation of the BOM, and pass this Encoding instance to whichever methods you're using to generate your intermediate string.
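A minimal sketch of that suggestion follows. It assumes the serialization goes through an XmlWriter over a MemoryStream, which is a guess at the unseen code; the Response class and the BomFreeSerializer helper are hypothetical stand-ins.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for the question's Response type.
public class Response
{
    public string Message { get; set; }
}

public static class BomFreeSerializer
{
    public static byte[] SerializeToUtf8Bytes(Response response)
    {
        // false = do not emit the UTF-8 identifier (BOM) as a preamble.
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                new XmlSerializer(typeof(Response)).Serialize(writer, response);
            }
            // The bytes now start with '<' instead of EF BB BF.
            return stream.ToArray();
        }
    }
}
```

Decoding these bytes with Encoding.UTF8.GetString then yields a string that starts with the XML declaration, with no U+FEFF in front of it.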

jonp 2010-06-23 17:57:07

Thanks! I'd come across that bit of wisdom during my research, but I couldn't find any explicit directions on including or excluding the BOM.

arootbeer 2010-06-23 18:00:51


XmlReader breaks on UTF-8 BOM
