I'm writing a utility to export Evernote notes into Outlook on a schedule. The Outlook APIs need plain text, and Evernote outputs an XHTML document version of the plain-text note. What I need is to strip out all the tags and unescape the source XHTML document embedded in the Evernote export file.

Basically I need to turn:

<note>
  <title>Test Sync Note 1</title> 
  <content>
  <![CDATA[ <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">

<en-note bgcolor="#FFFFFF">
<div>Test Sync Note 1</div>
<div>This i has some text in it</div>
<div>&nbsp;</div>
<div>&nbsp;</div>
<div>and a second line</div>
</en-note>

  ]]> 
  </content>
  <created>20081028T045727Z</created> 
  <updated>20081028T051346Z</updated> 
  <tag>Test</tag> 
</note>

Into


    Test Sync Note 1
    This i has some text in it


    and a second line

I can easily parse out the CDATA section and get just the four lines of text, but I need a reliable way to strip the divs, unescape the entities, and deal with any extra HTML that might have snuck in there.

I'm assuming that there's some MS API combo that will do the job, but I don't know it.

A: 

As far as I know, there isn't anything that does that specific job, but you might want to look at using XSLT or walking through an IXPathNavigable.

Andrew Kennan
A: 
        // Requires: using System; using System.IO; using System.Xml.XPath;
        string xml = @"<note>
          <title>Test Sync Note 1</title> 
          <content>
          <![CDATA[ <?xml version=""1.0"" encoding=""UTF-8""?>
           <!DOCTYPE en-note SYSTEM ""http://xml.evernote.com/pub/enml.dtd"">

        <en-note bgcolor=""#FFFFFF"">
        <div>Test Sync Note 1</div>
        <div>This i has some text in it</div>
        <div> </div>
        <div> </div>
        <div>and a second line</div>
        </en-note>

          ]]> 
          </content>
          <created>20081028T045727Z</created> 
          <updated>20081028T051346Z</updated> 
          <tag>Test</tag> 
        </note>
        ";
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // Compile a standard XPath expression

        XPathExpression expr;
        expr = nav.Compile("/note/content");
        XPathNodeIterator iterator = nav.Select(expr);

        // Iterate on the node set

        try
        {
            while (iterator.MoveNext())
            {
                //Get the XML in the CDATA
                XPathNavigator nav2 = iterator.Current.Clone();
                XPathDocument doc2 = new XPathDocument(new StringReader(nav2.Value.Trim()));

                //Parse the XML in the CDATA
                XPathNavigator nav3 = doc2.CreateNavigator();
                expr = nav3.Compile("/en-note");
                XPathNodeIterator iterator2 = nav3.Select(expr);
                iterator2.MoveNext();
                XPathNavigator nav4 = iterator2.Current.Clone();

                //Output the value directly, does not preserve the formatting
                Console.WriteLine("Direct Try:");
                Console.WriteLine(nav4.Value);

                //This works, but is ugly
                Console.WriteLine("Ugly Try:");
                Console.WriteLine(nav4.InnerXml.Replace("<div>","").Replace("</div>",Environment.NewLine));
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
Vinko Vrsalovic
Yes. Since asking, I've worked out that I can use HttpUtility.HtmlDecode to unescape the CDATA section, and I'll probably just walk all the <div> nodes and use the InnerText.
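
For anyone landing here later, a minimal sketch of that decode-then-walk idea might look like the following. The names EnmlText and FromCData are made up for illustration, and WebUtility.HtmlDecode is swapped in for HttpUtility.HtmlDecode so that no System.Web reference is needed. It assumes the note body is a flat list of <div> elements under <en-note>, as in the sample above. Note that decoding before parsing also turns &lt; and &amp; into raw characters, so a note containing those would probably fail to parse.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Xml;

public static class EnmlText
{
    // Hypothetical helper: takes the raw CDATA payload and returns
    // plain text with one output line per <div>.
    public static string FromCData(string cdata)
    {
        // Decode HTML-only entities such as &nbsp; that a plain XML
        // parser does not know. Trim so the XML declaration comes first.
        string decoded = WebUtility.HtmlDecode(cdata.Trim());

        var settings = new XmlReaderSettings
        {
            // Skip the DOCTYPE instead of fetching the Evernote DTD.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        var doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(new StringReader(decoded), settings))
        {
            doc.Load(reader);
        }

        var sb = new StringBuilder();
        foreach (XmlNode div in doc.SelectNodes("/en-note/div"))
        {
            // Non-breaking spaces become ordinary spaces; a div that held
            // only &nbsp; ends up as an empty line.
            sb.AppendLine(div.InnerText.Replace('\u00A0', ' ').Trim());
        }
        return sb.ToString();
    }
}
```

Calling EnmlText.FromCData(contentValue), where contentValue is the string value of the <content> element, should give the note text with the empty divs preserved as blank lines.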
A: 

I would use a regular expression to strip out all the HTML tags. This one is pretty basic, but I'm sure you can tweak it if it doesn't work exactly as you want.

Regex.Replace("<div>your html in here</div>",@"<(.|\n)*?>",string.Empty);
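
A slightly fuller sketch of the same idea, in case it helps: it turns each closing </div> into a line break first, strips the remaining tags with the pattern above, and then decodes entities. The class and method names (TagStripper, Strip) are just placeholders, and the use of WebUtility.HtmlDecode is my own addition. Like any regex approach, it will likely trip over a '>' inside an attribute value or a comment.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: crude tag stripping for simple, flat markup.
    public static string Strip(string html)
    {
        // Each closing </div> becomes a newline so adjacent divs
        // do not run together once the tags are gone.
        string withBreaks = Regex.Replace(html, @"</div\s*>\s*", "\n", RegexOptions.IgnoreCase);

        // Remove every remaining tag, including <?xml ...?> and <!DOCTYPE ...>.
        string noTags = Regex.Replace(withBreaks, @"<(.|\n)*?>", string.Empty);

        // Decode entities (&amp;, &nbsp;, ...) and normalise non-breaking spaces.
        return WebUtility.HtmlDecode(noTags).Replace('\u00A0', ' ').Trim();
    }
}
```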

Xian
+1  A: 

You can also use an XSLT transformation to convert the XML into a text document.
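
A rough sketch of what that could look like with XslCompiledTransform. The stylesheet and the names EnmlXslt and ToText are illustrative only. It assumes the input is the inner <en-note> document with a flat list of <div> children and that HTML-only entities such as &nbsp; have already been decoded, since the DOCTYPE is ignored here rather than resolved.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class EnmlXslt
{
    // Illustrative stylesheet: emit the text of each <div>, then a newline.
    private const string Stylesheet = @"<?xml version=""1.0""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""/en-note"">
    <xsl:for-each select=""div"">
      <xsl:value-of select="".""/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical helper: transforms the en-note document into plain text.
    public static string ToText(string enml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        // Ignore the DOCTYPE so the Evernote DTD is never downloaded.
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        using (XmlReader input = XmlReader.Create(new StringReader(enml), settings))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

The upside over string replacement is that nested markup inside a div is flattened to its text content by the value-of, without having to enumerate tag names.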

Rune Grimstad
+1  A: 

You can use HTML Agility Pack.

Sunny