I'm writing a utility to export Evernote notes into Outlook on a schedule. The Outlook APIs need plain text, but Evernote exports each plain-text note as an XHTML document. What I need is to strip out all the tags and unescape the XHTML document embedded in the Evernote export file.
Basically I need to turn:
<note>
<title>Test Sync Note 1</title>
<content>
<![CDATA[ <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">
<en-note bgcolor="#FFFFFF">
<div>Test Sync Note 1</div>
<div>This i has some text in it</div>
<div> </div>
<div> </div>
<div>and a second line</div>
</en-note>
]]>
</content>
<created>20081028T045727Z</created>
<updated>20081028T051346Z</updated>
<tag>Test</tag>
</note>
into:
Test Sync Note 1 This i has some text in it and a second line
I can easily parse out the CDATA section and get just the 4 lines of text, but I need a reliable way to strip the divs, unescape any entities, and deal with any extra HTML that might have snuck in there.
I'm assuming there's some combination of Microsoft APIs that will do the job, but I don't know what it is.
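To make the target behaviour concrete, here is a rough sketch of the transformation in Python, using only the standard library's `html.parser`. It is not the Microsoft API being asked about, only an illustration of the expected result. The names `_TextExtractor` and `enml_to_text` are made up for this example, and the sketch assumes the CDATA payload has already been extracted as a string.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, the XML declaration and the DOCTYPE."""

    def __init__(self):
        # convert_charrefs=True unescapes entities such as &amp; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def enml_to_text(enml):
    """Hypothetical helper: markup string in, single line of plain text out."""
    parser = _TextExtractor()
    parser.feed(enml)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (including non-breaking spaces) to one space
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample = """ <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">
<en-note bgcolor="#FFFFFF">
<div>Test Sync Note 1</div>
<div>This i has some text in it</div>
<div> </div>
<div> </div>
<div>and a second line</div>
</en-note>
"""

print(enml_to_text(sample))
```

For the sample note above this prints the single line shown earlier. The parser is lenient about stray or unbalanced tags, which is the behaviour wanted for "extra HTML that might have snuck in"; an equivalent on the Microsoft side would need the same tolerance.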