I know this is probably simple and has probably been asked before, but I'm having trouble coming up with a solution.

I am parsing some RSS feeds which include HTML as CDATA blocks. One example is here: http://g.msn.com/1ewenus50/news2

The feed changes a lot, but there are almost always some extended characters in it. For example, if I write a simple console app, call WebClient.DownloadString, and look at the result, I see things like:

"learned of the alleged attempted Flight 253 bomber’s extremist links while he was mid-flight on Christmas Day. NBC’s Savannah Guthrie reports. (Today Show)"

However those weird characters should be apostrophes, quote marks, em dashes, etc.

What is the trick for getting these to decode correctly?

If it wasn't clear, I'm using C# / .NET for this. In the end this content will be rendered in Silverlight, but I'm seeing the issue in the full .NET 3.5 runtime as well.

A: 

Download it in binary form and parse it as XML. That should get it right - the XML document should be self-describing in terms of the encoding, but I wouldn't put it past some webservers to advertise it (in headers) as having a different encoding, which would confuse DownloadString.

In general, when XML is involved it's worth doing as much as possible within an XML API rather than with the raw data.
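One way to sketch this, assuming .NET 3.5 and LINQ to XML: hand the raw bytes to an XmlReader so the parser picks the encoding from the byte order mark or the XML declaration, instead of relying on the HTTP headers. The names FeedLoader, ParseFeed and DownloadFeed are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;
using System.Xml.Linq;

static class FeedLoader
{
    // Parses raw feed bytes. No text decoding happens here: the XML reader
    // detects the encoding from the BOM or the <?xml ... encoding="..."?> declaration.
    public static XDocument ParseFeed(byte[] rawBytes)
    {
        using (var stream = new MemoryStream(rawBytes))
        using (var reader = XmlReader.Create(stream))
        {
            return XDocument.Load(reader);
        }
    }

    // Downloads the feed in binary form (no charset guessing by WebClient)
    // and passes the bytes straight to the XML parser.
    public static XDocument DownloadFeed(string url)
    {
        using (var client = new WebClient())
        {
            return ParseFeed(client.DownloadData(url));
        }
    }
}
```

Calling FeedLoader.DownloadFeed("http://g.msn.com/1ewenus50/news2") would then return an XDocument whose text nodes are already decoded, provided the feed's own declaration is accurate.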

Jon Skeet
There you go. Thanks. This works:
byte[] bar = w.DownloadData(new Uri("http://g.msn.com/1ewenus50/news2"));
string baz = new UTF8Encoding().GetString(bar);
var x = XDocument.Parse(baz);
Josh Santangelo
A: 

You are probably using the wrong text encoding... I'm not sure which one you are using or which is the right one, but this might put you on the path.
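To illustrate what a wrong text encoding does, here is a small sketch that decodes the same UTF-8 bytes two ways. It assumes the feed is UTF-8 and uses ISO-8859-1 as a stand-in for whatever single-byte encoding was actually applied; EncodingDemo and its methods are hypothetical names.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Decodes raw bytes with the given encoding.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw);
    }

    // Returns the UTF-8 bytes for a phrase containing a curly apostrophe (U+2019),
    // which UTF-8 stores as three bytes: E2 80 99.
    public static byte[] SampleBytes()
    {
        return Encoding.UTF8.GetBytes("bomber\u2019s");
    }

    // Decoding with the right encoding restores the original 8 characters.
    public static string DecodedCorrectly()
    {
        return Decode(SampleBytes(), Encoding.UTF8);
    }

    // A single-byte encoding turns each of the 10 bytes into its own character,
    // so the apostrophe becomes three junk characters.
    public static string DecodedWrongly()
    {
        return Decode(SampleBytes(), Encoding.GetEncoding("iso-8859-1"));
    }
}
```

If the cause is the same here, setting the WebClient.Encoding property to Encoding.UTF8 before calling DownloadString may be enough, though how that interacts with a charset in the response headers is not something this sketch verifies.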

Michael Bray