I have the following XML Parsing code in my application:

    public static XElement Parse(string xml, string xsdFilename)
    {
        // Validate against the supplied XSD and treat any validation problem as fatal.
        var readerSettings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = new XmlSchemaSet()
        };
        readerSettings.Schemas.Add(null, xsdFilename);
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessSchemaLocation;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
        readerSettings.ValidationEventHandler +=
            (o, e) => { throw new Exception("The provided XML does not validate against the request's schema."); };

        var readerContext = new XmlParserContext(null, null, null, XmlSpace.Default, Encoding.UTF8);

        return XElement.Load(XmlReader.Create(new StringReader(xml), readerSettings, readerContext));
    }

I am using it to parse strings sent to my WCF service into XML documents, for custom deserialization.

It works fine when I read in files and send them over the wire (the request); I've verified that the BOM is not sent across. In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

System.Xml.XmlException : Data at the root level is invalid. Line 1, position 1.

In the research I've done over the last hour or so, it appears that XmlReader should honor the BOM. If I manually remove the BOM from the front of the string, the response XML parses fine.
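
By "manually remove the BOM" I mean something along these lines (just a rough sketch; StripBom is a throwaway helper, and the character being stripped is U+FEFF):

    // Hypothetical helper: drop a leading U+FEFF that was decoded into the string.
    private static string StripBom(string xml)
    {
        return xml.Length > 0 && xml[0] == '\uFEFF' ? xml.Substring(1) : xml;
    }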

Am I missing something obvious, or at least something insidious?

EDIT: Here is the serialization code I'm using to return the response:

    private static string SerializeResponse(Response response)
    {
        // Serialize the response into an in-memory byte stream.
        var output = new MemoryStream();
        var writer = XmlWriter.Create(output);
        new XmlSerializer(typeof(Response)).Serialize(writer, response);

        // Decode the raw bytes back into a string.
        var bytes = output.ToArray();
        var responseXml = Encoding.UTF8.GetString(bytes);
        return responseXml;
    }

If it's just a matter of the XML string incorrectly containing the BOM, then I'll switch to

    var responseXml = new UTF8Encoding(false).GetString(bytes);

but it was not clear at all from my research that the BOM was illegal in the actual XML string; see e.g. http://stackoverflow.com/questions/581318/c-detect-xml-encoding-from-byte-array

+1  A: 

The XML string must not (!) contain the BOM; the BOM is only allowed in byte data (e.g. streams) encoded as UTF-8. This is because the string representation is not encoded but is already a sequence of Unicode characters.

It therefore seems that you are loading the string incorrectly, in code which you unfortunately didn't provide.
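
A quick sketch to illustrate (hypothetical snippet, assumes `using System; using System.Linq; using System.Text;`):

    // The three BOM bytes EF BB BF are legal at the start of a UTF-8 *byte* stream,
    // but decoding them yields a U+FEFF character at position 0 of the *string*,
    // which XmlReader then rejects as data before the root element.
    byte[] utf8Bytes = Encoding.UTF8.GetPreamble()                 // EF BB BF
        .Concat(Encoding.UTF8.GetBytes("<root/>")).ToArray();
    string decoded = Encoding.UTF8.GetString(utf8Bytes);
    Console.WriteLine((int)decoded[0]);                            // 65279, i.e. U+FEFF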

Edit:

Thanks for posting the serialization code.

You should not write the data to a MemoryStream, but rather to a StringWriter, which you can then convert to a string with ToString(). Since this avoids passing through a byte representation, it is not only faster but also avoids such problems.

Something like this:

    private static string SerializeResponse(Response response)
    {
        var output = new StringWriter();
        var writer = XmlWriter.Create(output);
        new XmlSerializer(typeof(Response)).Serialize(writer, response);
        // No byte encoding is involved, so no BOM can sneak in. The XML declaration
        // will say encoding="utf-16" because StringWriter.Encoding is UTF-16, which
        // is correct for XML held in an in-memory string.
        return output.ToString();
    }
Lucero
I've made exactly that change, and it works perfectly. Thanks!
arootbeer
A: 

The BOM shouldn't be in the string in the first place.
BOMs are used to detect the encoding of a raw byte array; they have no business being in an actual string.

Where does the string come from?
You're probably reading it with the wrong encoding.

SLaks
I made sure I was at least using the right encoding :) I've added the serialization code to my question.
arootbeer
A: 

Strings in C# are encoded as UTF-16, so the BOM would be wrong. As a general rule, always encode XML to byte arrays and decode it from byte arrays.
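
A rough sketch of that rule applied to the code in the question (method names are illustrative; Response is the question's type, and the schema validation settings from the Parse method are omitted for brevity):

    // Serialize straight to bytes; the writer's default UTF-8 encoding may include a BOM,
    // but that is fine in a byte stream.
    private static byte[] SerializeToBytes(Response response)
    {
        using (var output = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(output))
            {
                new XmlSerializer(typeof(Response)).Serialize(writer, response);
            }
            return output.ToArray();
        }
    }

    // Parse from bytes; XmlReader detects and consumes the BOM as part of the stream.
    private static XElement ParseBytes(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        {
            return XElement.Load(XmlReader.Create(stream));
        }
    }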

Stephen Cleary
This is not exactly true. While the in-memory format is usually similar to UTF-16, strings are an "abstract" sequence of characters with a specific number of characters. Note that there have been discussions in the CLR team about changing strings to another in-memory representation in order to make them more efficient. Anyway, since it is an abstract view and not a byte sequence, a BOM must not exist in the string.
Lucero
I've added the serialization code. I am already using UTF-8 explicitly.
arootbeer
@Stephen, I think the thing with alternative in-memory string representations was in the following Channel 9 video: http://channel9.msdn.com/shows/Going+Deep/Vance-Morrison-CLR-Through-the-Years/
Lucero
@Lucero: the [String class documentation](http://msdn.microsoft.com/en-us/library/system.string.aspx) clearly states that it uses UTF-16 encoding. You can get the sequence of Unicode characters via `StringInfo.GetTextElementEnumerator`; the `Char` values in a `string` may contain surrogate pairs.
Stephen Cleary
@Stephen, the docs say: "A string is a sequential collection of Unicode characters that is used to represent text." and later "Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object." The point is that the string is not a serialized representation but a sequence of Unicode characters made up of UTF-16 code points. It's a character sequence abstraction.
Lucero
(cont.) The BOM is used to detect the binary (byte) serialization of a Unicode string. Since this is an abstraction of characters using code points, you never come across a byte representation, which also means that a BOM is neither used nor supported for the internal string representation. Note that BOMs are mostly used to detect little- and big-endianness in UTF-16 byte sequences; the usage in UTF-8 is less prominent outside the Microsoft world and only serves to "tag" a byte sequence as being UTF-8 as opposed to ASCII or ANSI.
Lucero
@Lucero: As you quoted, the string class does use UTF-16 encoding. If it was intended to be an abstraction of characters, it is a very, very leaky abstraction, since iterating over the string yields UTF-16 bytes.
Stephen Cleary
@Stephen, that's the part which you got wrong: it does not yield bytes, but (16 bit) characters which are endian-invariant. This is a very important difference.
Lucero
@Lucero: good catch with the endianness! But I still interpret the docs as declaring UTF-16 encoding (just with unspecified endianness).
Stephen Cleary
@Stephen, the endianness is only meaningful when loading integer entities larger than a byte, for instance into a processor register. Basically it defines whether the most or the least significant byte comes first for anything larger than a byte. Since we're dealing with 16-bit entities already, the endianness is meaningless here, and consequently a BOM has no function here. See also http://unicode.org/faq/utf_bom.html#BOM - "What should I do with U+FEFF in the middle of a file?" (note that the discussion in the FAQ is about *data streams*, not the code point sequences we have in memory).
Lucero
@Lucero: I agree that the BOM should not be in the string. However, endianness is not meaningless with UTF-16; there are LE and BE UTF-16 encodings, and when written to a byte stream these *require* a BOM.
Stephen Cleary
@Stephen, sorry, but you're completely wrong here. LE and BE are predefined in their endianness when written to a byte stream, and therefore don't use the BOM. As soon as you deal with 16-bit codes which have already been loaded from a byte representation, the endianness is meaningless. See the aforementioned FAQ: "Is Unicode a 16-bit encoding?", "What is a UTF?" and "What are some of the differences between the UTFs?".
Lucero
@Lucero: I refer you to the [XML spec](http://www.w3.org/TR/REC-xml/#charencoding), which clearly states that an XML document in a UTF-16 encoding *requires* a BOM.
Stephen Cleary
@Stephen: Yes, an XML document (which is read from a byte stream) requires a BOM when UTF-16 is the encoding. But don't confuse regular UTF-16 with UTF-16BE or UTF-16LE - those *must not* have a BOM (and are seldom used for XML files)! See also http://www.ietf.org/rfc/rfc3023.txt page 14.
Lucero
+2  A: 

In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

So you want to prevent the BOM from being added as part of your serialization process. Unfortunately, you haven't shown what your serialization logic is.

What you should do is provide a UTF8Encoding instance created via the UTF8Encoding(bool) constructor to disable generation of the BOM, and pass this Encoding instance to whichever methods you're using to generate your intermediate string.
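
Applied to the serialization code in the question, that might look roughly like this (a sketch only; it keeps the MemoryStream approach and just gives the writer a BOM-free UTF8Encoding):

    private static string SerializeResponse(Response response)
    {
        // UTF8Encoding(false) means "do not emit the UTF-8 BOM",
        // so no U+FEFF ends up at the front of the decoded string.
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var output = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(output, settings))
            {
                new XmlSerializer(typeof(Response)).Serialize(writer, response);
            }
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }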

jonp
Thanks! I'd come across that bit of wisdom during my research, but I couldn't find any explicit directions on including or excluding the BOM.
arootbeer