Hi, I am trying to fetch data from an RSS feed (feed location is http://www.bgsvetionik.com/rss/ ) in a C# WinForms application. Take a look at the following code:

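// Requires: using System.IO; using System.Net; using System.Xml;
public static XmlDocument FromUri(string uri)
{
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.XmlResolver = null; // do not resolve external DTDs or entities

    // Dispose the client, stream, and reader once the document is loaded.
    using (WebClient webClient = new WebClient())
    using (Stream rssStream = webClient.OpenRead(uri))
    using (XmlTextReader reader = new XmlTextReader(rssStream))
    {
        xmlDoc.Load(reader);
    }
    return xmlDoc;
}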

Although xmlDoc.InnerXml contains an XML declaration with UTF-8 encoding, I get &scaron; instead of š, etc.

How can I solve it?

+2  A: 

The feed's data is incorrect. The &scaron; is inside a CDATA section, so it isn't being treated as an entity by the XML parser.

If you look at the source XML, you'll find that there's a mixture of entities and "raw" characters, e.g. čišćenja in the middle of the first title.

If you need to correct that, you'll have to do it yourself with a Replace call - the XML parser is doing exactly what it's meant to.

EDIT: For the replacement, you could get hold of all the HTML entities and replace them one by one, or just find out which ones are actually being used. Then do:

string text = element.Value.Replace("&scaron;", "š")
                           .Replace(...);

Of course, this means that anything which is actually correctly escaped and should really be that text will get accidentally replaced... but such is the problem with broken data :(
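As an alternative to listing the entities by hand, the sketch below decodes every HTML entity in one pass. This is a different technique from the chained Replace calls above: it relies on System.Net.WebUtility.HtmlDecode, which assumes .NET 4.0 or later (on older frameworks, HttpUtility.HtmlDecode in System.Web plays a similar role). The class name FeedTextFixer, the helper DecodeEntities, and the sample XML are hypothetical stand-ins for the real feed.

```csharp
using System;
using System.Net;
using System.Xml;

public static class FeedTextFixer
{
    // Hypothetical helper (not from the original post): decodes HTML
    // entities such as &scaron; that arrive as literal text inside CDATA.
    public static string DecodeEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        // Made-up sample that mixes an entity and raw characters,
        // as the real feed is described as doing.
        var doc = new XmlDocument();
        doc.LoadXml("<item><title><![CDATA[&scaron;kola i čišćenje]]></title></item>");

        // The parser hands back the CDATA content untouched.
        string raw = doc.SelectSingleNode("/item/title").InnerText;

        Console.WriteLine(raw);                 // still contains the entity text
        Console.WriteLine(DecodeEntities(raw)); // entity replaced by the character
    }
}
```

The same caveat applies as with Replace: text that was correctly escaped and really meant to read as an entity will be decoded too.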

Jon Skeet
@Jon Skeet Great answer; you just beat me to it ;-). Pardon my doing a bit of SO.meta, here (we can remove these comments in a few minutes), but I'm wondering why you reply in community wiki. I'm new to SO and would like to know the difference / accepted practice in this area.
mjv
If it wasn't in a CDATA section, would it not simply error anyway, since XML has no idea what that entity refers to? As far as I was aware, XML only understands a very limited subset of the entities that work in HTML. It's not uncommon for RSS feeds to abuse the description element by including HTML content in it.
AnthonyWJones
+1, it's what Hanselman calls "angle-bracket-delimited" data and not XML at all. BTW any reason why this is community wiki?
MarkJ
@AnthonyWJones: I haven't checked whether the entity is being declared or not - but yes, I agree it's probably just a badly written feed. @mjv/MarkJ: I'm having a "rep holiday" until Monday, making all my posts CW. Pay no attention to that :)
Jon Skeet
There are no entities declared anywhere in the feed document. Furthermore, it is declared to be in windows-1250 encoding (in the XML declaration).
VoidPointer
@VoidPointer: Not sure what feed you were looking at but there were definitely entities in the feed I looked at.
AnthonyWJones
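
To illustrate the point raised in these comments, here is a minimal sketch of how XmlDocument treats the same entity text in three positions. It assumes no DTD declares the entity; the class name EntityDemo and the helper Parses are hypothetical and exist only for this demonstration.

```csharp
using System;
using System.Xml;

public static class EntityDemo
{
    // Returns true if the XML string parses, false if the parser rejects it.
    public static bool Parses(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Undeclared entity in ordinary character data: a well-formedness error.
        Console.WriteLine(Parses("<t>&scaron;</t>"));

        // Same characters inside CDATA: literal text, so the document parses.
        Console.WriteLine(Parses("<t><![CDATA[&scaron;]]></t>"));

        // One of XML's five predefined entities: always allowed.
        Console.WriteLine(Parses("<t>&amp;</t>"));
    }
}
```

XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;. Anything else, including the HTML named entities, has to be declared in a DTD or it is rejected outside CDATA.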
A: 

Thanks Jon! So, the only way to solve it is to write some Replacer() method that replaces all the entities in the CDATA sections?

Nikolan
@Nikolan: Since this is a question directed at a specific answer, you should use the "add comment" feature under that answer. SO will let the originator know that there are outstanding comments on their answer, so you are more likely to get a response to your supplementary question.
AnthonyWJones