ansaurus

Question

C# code to convert XHTML doc to plain text

Answer 1

A:

As far as I know, nothing built in does that specific job, but you could use XSLT or walk the document through an IXPathNavigable.

Andrew Kennan 2008-10-28 05:55:02
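
The IXPathNavigable suggestion above could look roughly like the following. This is only a minimal sketch: the class and method names (`NavigableText`, `Extract`) are made up for illustration, and it assumes the markup is well-formed XML so that `XPathDocument` or `XmlDocument` can load it.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.XPath;

// Hypothetical helper, not part of the original answer.
public static class NavigableText
{
    // Walks every text node under the source and returns the
    // non-empty ones, one per line.
    public static string Extract(IXPathNavigable source)
    {
        XPathNavigator nav = source.CreateNavigator();
        StringBuilder text = new StringBuilder();

        XPathNodeIterator nodes = nav.Select("//text()");
        while (nodes.MoveNext())
        {
            string value = nodes.Current.Value.Trim();
            if (value.Length > 0)
            {
                text.Append(value).Append('\n');
            }
        }
        return text.ToString();
    }
}
```

Both `XPathDocument` and `XmlDocument` implement `IXPathNavigable`, so either can be passed in, for example `NavigableText.Extract(new XPathDocument(new StringReader(xhtml)))`. Note that this version drops whitespace-only nodes, so empty `<div>` elements do not produce blank lines.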

Answer 2

A:

        using System;
        using System.IO;
        using System.Xml.XPath;

        string xml = @"<note>
          <title>Test Sync Note 1</title> 
          <content>
          <![CDATA[ <?xml version=""1.0"" encoding=""UTF-8""?>
           <!DOCTYPE en-note SYSTEM ""http://xml.evernote.com/pub/enml.dtd"">

        <en-note bgcolor=""#FFFFFF"">
        <div>Test Sync Note 1</div>
        <div>This i has some text in it</div>
        <div> </div>
        <div> </div>
        <div>and a second line</div>
        </en-note>

          ]]> 
          </content>
          <created>20081028T045727Z</created> 
          <updated>20081028T051346Z</updated> 
          <tag>Test</tag> 
        </note>
        ";
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // Compile a standard XPath expression

        XPathExpression expr;
        expr = nav.Compile("/note/content");
        XPathNodeIterator iterator = nav.Select(expr);

        // Iterate on the node set

        try
        {
            while (iterator.MoveNext())
            {
                //Get the XML in the CDATA
                XPathNavigator nav2 = iterator.Current.Clone();
                XPathDocument doc2 = new XPathDocument(new StringReader(nav2.Value.Trim()));

                //Parse the XML in the CDATA
                XPathNavigator nav3 = doc2.CreateNavigator();
                expr = nav3.Compile("/en-note");
                XPathNodeIterator iterator2 = nav3.Select(expr);
                iterator2.MoveNext();
                XPathNavigator nav4 = iterator2.Current.Clone();

                //Output the value directly, does not preserve the formatting
                Console.WriteLine("Direct Try:");
                Console.WriteLine(nav4.Value);

                //This works, but is ugly
                Console.WriteLine("Ugly Try:");
                Console.WriteLine(nav4.InnerXml.Replace("<div>","").Replace("</div>",Environment.NewLine));
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

Vinko Vrsalovic 2008-10-28 06:18:56

Yes. Since asking, I've worked out that I can use HttpUtility.HtmlDecode to unescape the CDATA section, and I'll probably just walk all the <div> nodes and use the InnerText.

2008-10-28 07:11:49
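
The approach described in the comment above (decode the escaped markup, then read the InnerText of each <div>) might be sketched as follows. Several things here are assumptions: `DivTextExtractor` and `Extract` are hypothetical names, `WebUtility.HtmlDecode` from System.Net is swapped in for `HttpUtility.HtmlDecode` so that no System.Web reference is needed, and the input is assumed to have no DOCTYPE and no default XML namespace (otherwise `//div` would not match).

```csharp
using System;
using System.Net;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original thread.
public static class DivTextExtractor
{
    // escapedXhtml: entity-escaped markup such as "&lt;div&gt;...&lt;/div&gt;".
    public static string Extract(string escapedXhtml)
    {
        // Turn "&lt;div&gt;" back into "<div>" before parsing.
        string xhtml = WebUtility.HtmlDecode(escapedXhtml);

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Emit one line per <div>; a whitespace-only <div> becomes a blank line.
        StringBuilder text = new StringBuilder();
        foreach (XmlNode div in doc.SelectNodes("//div"))
        {
            text.Append(div.InnerText.Trim()).Append('\n');
        }
        return text.ToString();
    }
}
```

Unlike reading the navigator's Value directly, this keeps one output line per <div>, so the line structure of the note survives.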

Answer 3

A:

I would use a regular expression to strip out all the HTML tags. This one is pretty basic, but I am sure you can tweak it if it doesn't work exactly as you want.

string text = Regex.Replace("<div>your html in here</div>", @"<(.|\n)*?>", string.Empty);

Xian 2008-10-28 07:15:59
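
One possible way to tweak the regex answer above is to turn closing </div> tags into line breaks before stripping the rest, and to decode entities afterwards. This is just a sketch under those assumptions: `TagStripper` and `Strip` are hypothetical names, and a regex like this will not cope with comments, script blocks, or a literal `>` inside an attribute value.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TagStripper
{
    public static string Strip(string html)
    {
        // Replace each closing </div> with a newline so block boundaries survive.
        string withBreaks = Regex.Replace(html, @"</div\s*>", "\n", RegexOptions.IgnoreCase);

        // Remove every remaining tag, using the same lazy pattern as the answer.
        string noTags = Regex.Replace(withBreaks, @"<(.|\n)*?>", string.Empty);

        // Decode entities such as &amp; into their plain-text characters.
        return WebUtility.HtmlDecode(noTags);
    }
}
```

For example, `TagStripper.Strip("<div>Tom &amp; Jerry</div><div>second line</div>")` yields the two lines "Tom & Jerry" and "second line", each followed by a newline.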

Answer 4

+1 A:

You can also use an XSLT transformation to convert the XML into a text document.

Rune Grimstad 2008-10-28 07:42:24
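
The XSLT suggestion above could be sketched like this, using `XslCompiledTransform` with a stylesheet whose output method is text. The stylesheet, the class name `XsltTextConverter`, and the choice to emit one line per <div> are all assumptions made for illustration; the input is assumed to be well-formed, to contain no default namespace, and to keep its text inside <div> elements, as in the ENML sample earlier in the thread.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not part of the original answer.
public static class XsltTextConverter
{
    // Text-only output: the normalized content of each <div>, then a line feed.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:strip-space elements=""*""/>
  <xsl:template match=""div"">
    <xsl:value-of select=""normalize-space(.)""/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>";

    public static string Convert(string xhtml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(sheet);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(xhtml)))
        using (StringWriter output = new StringWriter())
        {
            // Writing to a TextWriter lets the stylesheet's xsl:output settings apply.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Text outside any <div> falls through to XSLT's built-in rules and is copied as-is, so a real document would likely need extra templates for headings, paragraphs, and so on.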

Answer 5

+1 A:

You can use HTML Agility Pack.

Sunny 2008-10-28 15:36:29
