I have an XML reader on this XML string:

<?xml version="1.0" encoding="UTF-8" ?>
<story id="1224488641nL21535800" date="20 Oct 2008" time="07:44">
<title>PRESS DIGEST - PORTUGAL - Oct 20</title>
<text>
<p>    LISBON, Oct 20 (Reuters) - Following are some of the main
 stories in Portuguese newspapers on Monday. Reuters has not
verified these stories and does not vouch for their accuracy. </p>
<p>More HTML stuff here</p>
</text>
</story>

I created an XSD and a corresponding class for deserialization.

[System.Xml.Serialization.XmlRootAttribute(Namespace="", IsNullable=false)]
public class story {
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string id;
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string date;
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string time;
    public string title;
    public string text;
}

I then create an instance of the class using the Deserialize method of XmlSerializer.

XmlSerializer ser = new XmlSerializer(typeof(story));
return (story)ser.Deserialize(xr);

Now, the text member of story is always null. How do I change my story class so that the XML is parsed as expected?

EDIT:

Using an XmlText does not work and I have no control over the XML I'm parsing.

A: 

It looks to me as though the XML is the problem. Because HTML tags are used inside the text tag, they are interpreted as XML elements rather than as string content. The content should be wrapped in CDATA, or < and > should be escaped, for it to be read as plain text.

Sani Huttunen
I don't have any control over the way the XML is put together.
Sklivvz
+1  A: 

I found a very unsatisfactory solution.

Change the class like this (ugh!)

// ...
[XmlElement("HACK - this should never match anything")]
public string text;
// ...

And change the calling code like this (yuck!)

XmlSerializer ser = new XmlSerializer(typeof(story));
string text = string.Empty;
ser.UnknownElement += delegate(object sender, XmlElementEventArgs e) {
    if (e.Element.Name != "text")
        throw new XmlException(
            string.Format(CultureInfo.InvariantCulture,
                "Unknown element '{0}' cannot be deserialized.",
                e.Element.Name));
    text += e.Element.InnerXml;
};

story result = (story)ser.Deserialize(xr);
result.text = text;
return result;

This is a really bad way of doing it because it breaks encapsulation. Is there a better way of doing it?

Sklivvz
A: 

Since you do not have control over the XML, you could use StreamReader instead. XmlReader interprets the HTML tags as XML, which is not what you want.

XmlSerializer will however strip the HTML tags within the text tag.

Sani Huttunen
+1  A: 

If the text tag only ever contains p tags, the following suggestion may be useful in the short term.

Instead of story having the text field as a string, you could have it as an array of strings. You could then use the right XmlArray attributes (can't remember the exact names, something like XmlArrayItemAttribute), with the right parameters to make it look like:

<text>
   <p>blah</p>
   <p>blib</p>
</text>

Which is a step closer, but not completely what you need.

Another option is to make a class like:

public class Text //Obviously a bad name for a class...
{
   public string[] p;
   public string[] pre;
}

Again, use the XmlArray attributes to get it to look right. I'm not sure they are that configurable, because I've only used them for simple types before.

Edit:

Using:

[System.Xml.Serialization.XmlRootAttribute(Namespace = "", IsNullable = false)]
public class story
{
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string id;
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string date;
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string time;
    public string title;

    [XmlArrayItem("p")]
    public string[] text;
}

This works well with the supplied XML, but the version with the separate Text class is more complicated. It ends up as something similar to:

<text>
   <p>
      <p>qwertyuiop</p>
      <p>asdfghjkl</p>
   </p>
   <pre>
      <pre>stuff</pre>
      <pre>nonsense</pre>
   </pre>
</text>

which is obviously not what is desired.

Carl
A: 

Perhaps using the XmlAnyElement attribute instead of handling the UnknownElement event may be more elegant.
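
A minimal sketch of that idea, not tested against the real feed: the text element is mapped to a small helper class whose only member is marked with XmlAnyElement, so the serializer hands over the child elements as XmlElement objects instead of discarding them. The names RawText, Elements, textNode and Demo are made up for this illustration; only the attributes and XmlElement come from the framework. Text placed directly inside the text tag, outside any child element, would not be captured by this version.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper: holds the children of <text> as raw XML elements.
public class RawText
{
    [XmlAnyElement]
    public XmlElement[] Elements;

    // Joins the captured child elements back into one markup string.
    public override string ToString()
    {
        return Elements == null
            ? string.Empty
            : string.Concat(Elements.Select(e => e.OuterXml));
    }
}

[XmlRoot("story", Namespace = "", IsNullable = false)]
public class story
{
    [XmlAttribute] public string id;
    [XmlAttribute] public string date;
    [XmlAttribute] public string time;
    public string title;

    // The <text> element is deserialized into the helper above.
    [XmlElement("text")]
    public RawText textNode;

    // Convenience view for callers; ignored by the serializer.
    [XmlIgnore]
    public string text
    {
        get { return textNode == null ? null : textNode.ToString(); }
    }
}

public static class Demo
{
    // Hypothetical helper that deserializes a story from an XML string.
    public static story Parse(string xml)
    {
        var ser = new XmlSerializer(typeof(story));
        using (var reader = new StringReader(xml))
            return (story)ser.Deserialize(reader);
    }

    public static void Main()
    {
        story s = Parse(
            "<story id=\"1\"><title>T</title><text><p>a</p><p>b</p></text></story>");
        Console.WriteLine(s.text);
    }
}
```

The calling code stays the plain Deserialize call from the question, and no UnknownElement handler is needed.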

Santiago Palladino
+1  A: 

You could implement IXmlSerializable for your class and handle the inner elements there. This keeps the code for deserializing your data inside the target class (thus avoiding your problem with encapsulation). It's a simple enough data type that the code should be trivial to write.
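
A rough sketch of what that could look like, assuming the element and attribute names from the XML in the question. ReadXml reads the three attributes, then walks the child elements and keeps the content of text as a raw markup string via XmlReader.ReadInnerXml. The Demo class and its Parse method are hypothetical helpers added only to show the call; the WriteXml side is a guess at a sensible round trip, not something the question requires.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

[XmlRoot("story", Namespace = "")]
public class story : IXmlSerializable
{
    public string id, date, time, title, text;

    // Returning null is the documented convention for this method.
    public XmlSchema GetSchema() { return null; }

    public void ReadXml(XmlReader reader)
    {
        reader.MoveToContent();
        id = reader.GetAttribute("id");
        date = reader.GetAttribute("date");
        time = reader.GetAttribute("time");

        // An empty <story/> has no children to read.
        if (reader.IsEmptyElement) { reader.Read(); return; }

        reader.ReadStartElement(); // consume <story>
        while (reader.MoveToContent() == XmlNodeType.Element)
        {
            switch (reader.LocalName)
            {
                case "title":
                    title = reader.ReadElementContentAsString();
                    break;
                case "text":
                    // Keeps the <p> markup as a string and moves past </text>.
                    text = reader.ReadInnerXml();
                    break;
                default:
                    reader.Skip(); // ignore anything unexpected
                    break;
            }
        }
        reader.ReadEndElement(); // consume </story>
    }

    public void WriteXml(XmlWriter writer)
    {
        writer.WriteAttributeString("id", id ?? string.Empty);
        writer.WriteAttributeString("date", date ?? string.Empty);
        writer.WriteAttributeString("time", time ?? string.Empty);
        writer.WriteElementString("title", title ?? string.Empty);
        writer.WriteStartElement("text");
        writer.WriteRaw(text ?? string.Empty); // assumes text is well-formed markup
        writer.WriteEndElement();
    }
}

public static class Demo
{
    // Hypothetical helper that deserializes a story from an XML string.
    public static story Parse(string xml)
    {
        var ser = new XmlSerializer(typeof(story));
        using (var reader = new StringReader(xml))
            return (story)ser.Deserialize(reader);
    }

    public static void Main()
    {
        story s = Parse(
            "<story id=\"1\"><title>T</title><text><p>a</p><p>b</p></text></story>");
        Console.WriteLine(s.text);
    }
}
```

With this in place the text member stays a plain string, and the parsing logic lives in the class instead of in an event handler at the call site.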

Simon Steele
A: 

Have you tried xsd.exe? It lets you create XSDs from XML documents and then generate classes from the XSD that should be ready for XML deserialization.

Peter Walke
A: 

Please also take a look at a similar question I asked... it might help answer your question

Peter Walke