Hi, I've read similar posts on this and they don't answer my question.

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

Thank you!

+4  A: 

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point. Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

Martin v. Löwis
XDocument.Parse does not have an overload that accepts a byte array. I find the statement "you did something wrong" condescending. I would have expected DownloadString to detect the BOM and select the correct encoding.
TrueWill
I think you can get the XDocument also through .Load, passing an XmlReader, which you can get by passing a Stream, for which you can use a MemoryStream. I didn't mean to be condescending; I only tried to point out that the intermediate result that you got is seemingly incorrect, so that the real problem is not that you have to strip those characters, but that they are present in the first place. Perhaps it is the case that there is a flaw in DownloadString, in which case you shouldn't be using it. Perhaps the flaw is in the web server reporting the wrong charset.
Martin v. Löwis
OK, thanks. I did find I didn't have the client Encoding set correctly for DownloadString, which gave me a single code point (as you mentioned). It's somewhat moot at this point, as the company providing the "REST" service decided to remove the redundant (for XML in utf-8) BOM.
TrueWill
@Martin good call. Using XDocument.Load worked out quite well for me. It's not necessary to use the XmlReader, though, as XDocument.Load takes a stream for an argument.
Steven Oxley
+2  A: 

Download the raw bytes with DownloadData and pass them to Encoding.UTF8.GetString(byte[]) to get the string, rather than downloading the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you decode the bytes properly, as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.

Edit: Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

Andrew Arnott
Thank you for your response; unfortunately this did not work. I used DownloadData and that worked; however, Encoding.UTF8.GetString(byte[]) did not strip the BOM. I tried variants with new UTF8Encoding(false) and (true) without success. Please note that this is UTF-8 data - encoding="utf-8" is specified in the XML header, and it parses correctly once the BOM is removed.
TrueWill
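The behaviour reported in the comment above is expected: Encoding.GetString decodes every byte it is given, so a leading EF BB BF comes back as the single character U+FEFF, whichever UTF8Encoding constructor argument is used. A StreamReader, by contrast, skips a leading BOM. The following is only a sketch to illustrate the difference; the class and method names (BomDecodingDemo, DecodeWithGetString, DecodeWithStreamReader) and the sample payload are made up for this example and are not from the thread.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomDecodingDemo
{
    // Encoding.GetString does not interpret the preamble:
    // a leading EF BB BF is decoded to the single character U+FEFF.
    public static string DecodeWithGetString(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // StreamReader detects a leading byte order mark and skips it.
    public static string DecodeWithStreamReader(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical payload: UTF-8 BOM followed by a tiny XML document.
        byte[] body = Encoding.UTF8.GetBytes("<root/>");
        byte[] data = Encoding.UTF8.GetPreamble().Concat(body).ToArray();

        string viaGetString = DecodeWithGetString(data);
        string viaReader = DecodeWithStreamReader(data);

        Console.WriteLine("GetString keeps U+FEFF: " + (viaGetString[0] == '\uFEFF'));
        Console.WriteLine("StreamReader result: " + viaReader);
    }
}
```

So if a string is needed at all, reading the downloaded bytes through a StreamReader is one way to get it without the BOM.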
+1  A: 

I had some incorrect test data, which caused me some confusion. Based on "How to avoid tripping over UTF-8 BOM when reading files", I found that this worked:

private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

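    // Ordinal comparison (needs "using System;"): a culture-sensitive
    // StartsWith can ignore U+FEFF and so match strings that have no BOM.
    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))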
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

TrueWill
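The stripping step can also be written as a small standalone helper that looks at the first character directly. This sidesteps culture-sensitive string comparison, where U+FEFF can be treated as an ignorable character. This is a minimal sketch that assumes the string was decoded as UTF-8, so the BOM is the single character U+FEFF; the names BomText and StripBom are invented for illustration.

```csharp
using System;

public static class BomText
{
    // After correct UTF-8 decoding, the byte order mark is this one character.
    private const char ByteOrderMark = '\uFEFF';

    // Returns the text without a leading BOM; other input is returned unchanged.
    public static string StripBom(string text)
    {
        if (!string.IsNullOrEmpty(text) && text[0] == ByteOrderMark)
        {
            return text.Substring(1);
        }

        return text;
    }

    public static void Main()
    {
        Console.WriteLine(StripBom("\uFEFF<root/>"));
        Console.WriteLine(StripBom("<root/>"));
    }
}
```

The result of StripBom can then be passed to XDocument.Parse in place of the raw download.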
+1  A: 

This works as well

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index);
}
Vivek Ayer
+1 for ingenuity!
TrueWill
+1  A: 

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);
Steven Oxley