ansaurus

Question

Answer 1

+4 A:

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point. Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

Martin v. Löwis 2009-08-23 04:48:34

XDocument.Parse does not have an overload that accepts a byte array.I find the statement "you did something wrong" condescending. I would have expected DownloadString to detect the BOM and select the correct encoding.

TrueWill 2009-08-23 16:42:39

I think you can get the XDocument also through .Load, passing an XmlReader, which you can get by passing a Stream, for which you can use a MemoryStream. I didn't mean to be condescending; I only tried to point out that the intermediate result that you got is seemingly incorrect, so that the real problem is not that you have to strip those characters, but that they are present in the first place. Perhaps it is the case that there is a flaw in DownloadString, in which case you shouldn't be using it. Perhaps the flaw is in the web server reporting the wrong charset.

Martin v. Löwis 2009-08-23 21:38:57

OK, thanks. I did find I didn't have the client Encoding set correctly for DownloadString, which gave me a single code point (as you mentioned). It's somewhat moot at this point, as the company providing the "REST" service decided to remove the redundant (for XML in utf-8) BOM.

TrueWill 2009-08-24 18:14:29

@Martin good call. Using XDocument.Load worked out quite well for me. It's not necessary to use the XmlReader, though, as XDocument.Load takes a stream for an argument.

Steven Oxley 2010-10-27 22:10:20

Answer 2

+2 A:

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer AS a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, unicode characters will probably be misinterpreted, resulting in a corrupted string.

Edit: Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

Andrew Arnott 2009-08-23 04:49:04

Thank you for your response; unfortunately this did not work. I used DownloadData and that worked; however, Encoding.UTF8.GetString(byte[]) did not strip the BOM. I tried variants with new UTF8Encoding(false) and (true) without success.Please note that this is UTF-8 data - encoding="utf-8" is specified in the XML header, and it parses correctly once the BOM is removed.

TrueWill 2009-08-23 16:47:23

Answer 3

+1 A:

I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:

private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

TrueWill 2009-08-23 18:38:04

Answer 4

+1 A:

This works as well

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}

Vivek Ayer 2010-07-19 16:22:54

+1 for ingenuity!

TrueWill 2010-07-19 17:23:36

Answer 5

+1 A:

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF8 encoding with a byte mark at the beginning of it. Then, this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

Steven Oxley 2010-10-27 22:15:15

ansaurus

tags:

views:

answers:

Strip Byte Order Mark from string in C#

related questions