ansaurus

Question

Answer 1

+2 A:

This will leave you with just the text:

using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        var result = new StringBuilder();

        // XmlReader needs well-formed XML, so this only works on XHTML-style markup.
        using (var reader = XmlReader.Create(new StringReader(sourceDoc)))
        {
            while (reader.Read())
            {
                // Element nodes have an empty Value; text nodes carry the content.
                result.Append(reader.Value);
            }
        }
        Console.WriteLine(result);
    }

    static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";
}

David Silva Smith 2009-06-26 19:25:26

Answer 2

+1 A:

Or you can use a regular expression:

using System.Text.RegularExpressions;

public static string StripHtml(string htmlText)
{
    // replace all tags with spaces so adjacent words do not run together
    htmlText = Regex.Replace(htmlText, @"<[^>]*>", " ");

    // decode non-breaking spaces and ampersands (other entities are left as-is)
    htmlText = htmlText.Replace("&nbsp;", " ");
    htmlText = htmlText.Replace("&amp;", "&");

    // collapse double spaces last, since the steps above can create new ones
    while (htmlText.Contains("  "))
    {
        htmlText = htmlText.Replace("  ", " ");
    }

    return htmlText;
}

ProKiner 2009-06-26 20:09:15

Answer 3

A:

Can you use something like this, which uses lynx and Perl to render the HTML and then convert it to plain text?

2009-06-26 20:12:46

Answer 4

A:

See this answer to a similar question on SO:

How can I Convert HTML to Text in C#

Tim Henigan 2009-06-26 20:16:51

Answer 5

+1 A:

Since you have the XML source, consider writing an XSLT stylesheet that gives you the output you want without the intermediate HTML step. It would be far more reliable than trying to transform the HTML.
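As a rough sketch of that idea, an XSLT stylesheet with text output can be run from C# through XslCompiledTransform. The element names used here (report, title, para) and the XmlToText class are made up for illustration, since the original XML is not shown; the templates would need to match the real source schema.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class XmlToText
{
    // Hypothetical stylesheet: the element names (title, para) are
    // placeholders and must be replaced with those of the real document.
    const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:strip-space elements=""*""/>
  <xsl:template match=""title"">
    <xsl:value-of select="".""/>
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>
  <xsl:template match=""para"">
    <xsl:value-of select="".""/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string sourceXml)
    {
        // Compile the stylesheet from the embedded string.
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        // Run the transform and capture the plain-text result in memory.
        using (var input = XmlReader.Create(new StringReader(sourceXml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling XmlToText.Transform on the source XML returns the text with a blank line after each title and a line break after each paragraph, with no HTML produced along the way.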

ScottSEA 2009-06-26 21:53:03

Answer 6

A:

This is a great use case for XSL-FO and FOP. FOP isn't just for PDF output; one of the other major supported outputs is plain text. You should be able to construct a simple XSLT + FO stylesheet that has the specifications (e.g. line width) that you want.

This solution is a bit more heavyweight than just using xml->xslt->text, as ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indenting), they will be much easier to express in FO than to mock up in XSLT.

I would avoid regexes for extracting the text. That's too low-level and guaranteed to be brittle. If you just want text and 80-character lines, XSLT's built-in template rules already output only the text of each element. Once you have only the text, you can apply whatever text processing is necessary.
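For the 80-character requirement, the text processing step could be a simple greedy word wrap along these lines. This is only a sketch: the TextWrap class and its Wrap method are hypothetical names, and it treats the whole input as a single paragraph, so existing line breaks are not preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class TextWrap
{
    // Greedy word wrap: packs as many whitespace-separated words as fit
    // on each line. A word longer than maxWidth gets a line of its own.
    public static string Wrap(string text, int maxWidth)
    {
        var lines = new List<string>();
        var current = new StringBuilder();

        // A null separator array makes Split break on any whitespace.
        foreach (var word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            // Start a new line if adding this word (plus a space) would overflow.
            if (current.Length > 0 && current.Length + 1 + word.Length > maxWidth)
            {
                lines.Add(current.ToString());
                current.Length = 0;
            }
            if (current.Length > 0) current.Append(' ');
            current.Append(word);
        }
        if (current.Length > 0) lines.Add(current.ToString());

        return string.Join("\n", lines.ToArray());
    }
}
```

With the extracted text in a string, TextWrap.Wrap(text, 80) would give lines of at most 80 characters, as long as no single word is longer than that.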

Incidentally, I work for a company that produces CDAs as part of our product (voice recognition for dictations). I would look into an XSLT that transforms the 3.0 directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route will probably be your easiest bet if what you really want to achieve is conversion between the formats. That's what XSLT was built to do.

16bytes 2009-06-29 17:09:44

Answer 7

A:

Hi,

I am trying to generate an HL7 CDA XML document from entity classes generated from CDA.XSD. Can anybody help me with this?

Thanks, James

james 2010-02-02 20:48:08

Convert XML to Plain Text