Hi,

I have an XML file from which I am extracting HTML using LINQ to XML. This is a sample of the file:

<?xml version="1.0" encoding="utf-8" ?>
<tips>
    <tip id="0">
        This is the first tip.
    </tip>
    <tip id="1">
        Use <b>Windows Live Writer</b> or <b>Microsoft Word 2007</b> to create and publish content.
    </tip>
    <tip id="2">
        Enter a <b>url</b> into the box to automatically screenshot and index useful webpages.
    </tip>
    <tip id="3">
        Invite your <b>colleagues</b> to the site by entering their email addresses.  You can then share the content with them!
    </tip>
</tips>

I am using the following query to extract a 'tip' from the file:

Tip tip = (from t in tipsXml.Descendants("tip")
           where t.Attribute("id").Value == nextTipId.ToString()
           select new Tip()
           {
               TipText = t.Value,
               TipId = nextTipId
           }).First();

The problem I have is that the HTML elements are being stripped out. I was hoping for something like InnerHtml to use instead of Value, but that doesn't seem to be there.

Any ideas?

Thanks all in advance,

Dave

+4  A: 

Call t.ToString() instead of Value. That will return the XML as a string. You may want to use the overload taking SaveOptions to disable formatting. I can't check right now, but I suspect it will include the outer element's own start and end tags, so you would need to strip these off.

Note that if your HTML isn't valid XML, you will end up with an invalid overall XML file.

Is the format of the XML file completely out of your control? It would be nicer for any HTML inside to be XML-encoded.
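To illustrate the XML-encoded option, here is a minimal sketch, assuming you can change the file so that the markup inside each tip is either entity-escaped or wrapped in CDATA. The class and method names are made up for the example. In both cases Value hands back the decoded text, tags included, so the original query would work unchanged.

```csharp
using System;
using System.Xml.Linq;

public static class EncodedTipDemo
{
    // The HTML is stored as escaped text, so the <b> tag is not an XML element.
    public static string ReadEscapedTip()
    {
        XElement tip = XElement.Parse(
            "<tip id=\"1\">Use &lt;b&gt;Windows Live Writer&lt;/b&gt; to publish.</tip>");
        return tip.Value; // entity references are decoded on read
    }

    // The same content wrapped in a CDATA section instead of being escaped.
    public static string ReadCDataTip()
    {
        XElement tip = XElement.Parse(
            "<tip id=\"1\"><![CDATA[Use <b>Windows Live Writer</b> to publish.]]></tip>");
        return tip.Value; // CDATA content is returned as plain text
    }

    public static void Main()
    {
        Console.WriteLine(ReadEscapedTip());
        Console.WriteLine(ReadCDataTip());
    }
}
```

The trade-off is that the tips are no longer structured XML, so you cannot query the nested HTML elements with LINQ to XML.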

EDIT: One way of avoiding getting the outer part might be to do something like this (in a separate method called from your query, of course):

StringBuilder builder = new StringBuilder();
foreach (XNode node in element.Nodes())
{
    builder.Append(node.ToString());
}

That way you'll get HTML elements with their descendants and interspersed text nodes. Basically it's the equivalent of InnerXml, I strongly suspect.
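Put together as a separate method and called from the query, that might look like the sketch below. The InnerXml helper and the TipXml class are names invented for this example, and the query selects a plain string rather than the Tip class from the question so that the sample is self-contained.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TipXml
{
    // Hypothetical helper: concatenates the serialized child nodes of an
    // element, which leaves out the element's own start and end tags.
    public static string InnerXml(XElement element)
    {
        return string.Concat(
            element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        XDocument tipsXml = XDocument.Parse(
            "<tips><tip id=\"1\">Use <b>Windows Live Writer</b> to publish.</tip></tips>");
        int nextTipId = 1;

        // Same shape as the query in the question, with Value swapped for the helper.
        string tipText = (from t in tipsXml.Descendants("tip")
                          where (int)t.Attribute("id") == nextTipId
                          select InnerXml(t)).First();

        Console.WriteLine(tipText);
    }
}
```

Text nodes are serialized as XML, so characters such as & come back escaped, which is usually what you want when the result is going to be emitted as HTML.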

Jon Skeet
heh, snap on the edit. Encoding HTML inside XML is common and convenient for this kind of case; the alternative would be to use valid XHTML, declaring the XHTML xmlns as default and putting the tip/tips elements in a different namespace to avoid confusing the two.
bobince
A: 

TipText= t.Value,

XElement.Value returns only the concatenated text content of the element and its descendants. The tags of nested elements - HTML or otherwise - will not be included, and of course any &-entity-references will appear in their decoded form.

If you want the content as a string with markup you could call XElement.ToString(), possibly with SaveOptions.DisableFormatting. But note this includes the wrapping <tip> element - that is, in web browser DOM terms, it's the outerHTML not the innerHTML. To get the innerHTML you would have to join together the ToString()s of all the nodes returned by XElement.Nodes().
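A quick sketch of the difference between the two calls, using a cut-down tip element. The class and method names are only for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class ValueVersusToString
{
    // Text content only: the <b> tags are dropped, the text inside them is kept.
    public static string TextOnly(XElement element)
    {
        return element.Value;
    }

    // Full serialization: markup is kept, including the wrapping element itself.
    public static string OuterXml(XElement element)
    {
        return element.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        XElement tip = XElement.Parse("<tip id=\"2\">Enter a <b>url</b> into the box.</tip>");

        Console.WriteLine(TextOnly(tip));
        Console.WriteLine(OuterXml(tip));
    }
}
```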

bobince
A: 

Thanks both for the replies! :D

David Gouge