My goal is to build an engine that takes the latest HL7 3.0 CDA documents and makes them backward compatible with HL7 2.5, which is a radically different beast.

The CDA document is an XML file which, when paired with its matching XSL file, renders an HTML document fit for display to the end user.

For HL7 2.5 I need to get the rendered text, devoid of any markup, and fold it into a text stream (or similar) that I can write out in 80-character lines to populate the HL7 2.5 message.

So far, my approach is to use XslCompiledTransform to transform my XML document with the XSLT and produce the resulting HTML document.

My next step is to take that document (or perhaps the output of an earlier step) and render the HTML as text. I have searched for a while, but can't figure out how to accomplish this. I'm hoping it's something easy that I'm just overlooking, or that I just can't find the magical search terms. Can anyone offer some help?

FWIW, I've read the 5 or 10 other questions on SO which embrace or admonish using RegEx for this, and I don't think that I want to go down that road. I need the rendered text.

using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Xml.XPath;

public class TransformXML
{

    public static void Main(string[] args)
    {
        try
        {

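            string sourceDoc = "C:\\CDA_Doc.xml";
            string resultDoc = "C:\\Result.html";
            string xsltDoc = "C:\\CDA.xsl";

            XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
            XslCompiledTransform myXslTransform = new XslCompiledTransform();
            myXslTransform.Load(xsltDoc);

            // Write the transformed HTML to disk; 'using' closes the writer even if Transform throws.
            using (XmlTextWriter writer = new XmlTextWriter(resultDoc, null))
            {
                myXslTransform.Transform(myXPathDocument, null, writer);
            }

            // The HTML is reopened here; rendering it as plain text is the missing step.
            using (StreamReader stream = new StreamReader(resultDoc))
            {
                // TODO: convert the HTML in 'stream' to plain text.
            }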

        }

        catch (Exception e)
        {
            Console.WriteLine ("Exception: {0}", e.ToString());
        }
    }
}
+2  A: 

This will leave you with just the text:

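using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";

    static void Main(string[] args)
    {
        StringBuilder result = new StringBuilder();

        // XmlReader needs well-formed markup, so this works for XHTML but not for arbitrary HTML.
        using (XmlReader reader = XmlReader.Create(new StringReader(sourceDoc)))
        {
            while (reader.Read())
            {
                // Keep only character data; comments and declarations also expose a Value.
                if (reader.NodeType == XmlNodeType.Text ||
                    reader.NodeType == XmlNodeType.CDATA ||
                    reader.NodeType == XmlNodeType.Whitespace ||
                    reader.NodeType == XmlNodeType.SignificantWhitespace)
                {
                    result.Append(reader.Value);
                }
            }
        }

        Console.WriteLine(result);
    }
}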
David Silva Smith
+1  A: 

Or you can use a regular expression:

using System.Text.RegularExpressions;

public static string StripHtml(string htmlText)
{
    // Replace every tag with a space.
    htmlText = Regex.Replace(htmlText, @"<(.|\n)*?>", " ");

    // Decode non-breaking spaces first and "&amp;" last, so that "&amp;nbsp;" is not decoded twice.
    htmlText = htmlText.Replace("&nbsp;", " ");
    htmlText = htmlText.Replace("&amp;", "&");

    // Finally collapse runs of spaces, including any introduced by the replacements above.
    while (htmlText.Contains("  "))
    {
        htmlText = htmlText.Replace("  ", " ");
    }

    return htmlText;
}
ProKiner
A: 

Can you use something like this, which uses lynx and Perl to render the HTML and then convert it to plain text?

A: 

See this answer to a similar question on SO:

How can I Convert HTML to Text in C#

Tim Henigan
+1  A: 

Since you have the XML source, consider writing an XSL stylesheet that gives you the output you want without the intermediate HTML step. It would be far more reliable than trying to transform the HTML.
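
As a rough illustration of that idea, here is a minimal sketch of a stylesheet that declares text output and is run through XslCompiledTransform. The class name DirectTextTransform, the ToText helper, and the un-namespaced section/title/paragraph elements are all made up for the example; a real CDA document lives in the urn:hl7-org:v3 namespace and the match patterns would have to account for that.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class DirectTextTransform
{
    // Hypothetical stylesheet: writes each <paragraph> as one line of plain
    // text and suppresses every other text node. Element names are
    // illustrative only, not taken from a real CDA stylesheet.
    const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='text()'/>
  <xsl:template match='paragraph'>
    <xsl:value-of select='normalize-space(.)'/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>";

    public static string ToText(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xsltReader);
        }

        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            // Writing to a TextWriter lets the stylesheet's text output method apply.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        string xml = "<section><title>Note</title>" +
                     "<paragraph>First  line.</paragraph>" +
                     "<paragraph>Second line.</paragraph></section>";
        Console.Write(ToText(xml));
    }
}
```

With the sample input, the title is dropped and each paragraph should come out on its own line with its internal whitespace normalized. No HTML is ever produced, so there is nothing to strip afterwards.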

ScottSEA
A: 

This is a great use case for XSL-FO and FOP. FOP isn't just for PDF output; plain text is one of the other major output formats it supports. You should be able to construct a simple XSLT + FO stylesheet that has the specifications (e.g. line width) that you want.

This solution is a bit more heavyweight than just using xml->xslt->text as ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indenting), they will be much easier to express in FO than to mock up in XSLT.

I would avoid regexes for extracting the text. That's too low-level and guaranteed to be brittle. If you just want text and 80-character lines, the built-in XSLT template rules output only the text content of elements. Once you have only the text, you can apply whatever text processing is necessary.
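
For the "whatever text processing is necessary" part, folding the extracted text into 80-character lines could look roughly like the sketch below. LineFolder and Wrap are names invented for the example, and the greedy word-wrap policy (collapse whitespace, break between words, hard-split any word longer than the line) is an assumption about what the HL7 2.5 side expects, not something taken from the spec.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineFolder
{
    // Greedy word wrap: collapses whitespace runs and packs words into lines
    // of at most 'width' characters. A word longer than 'width' is hard-split
    // so that no line ever exceeds the limit.
    public static List<string> Wrap(string text, int width = 80)
    {
        if (width < 1) throw new ArgumentOutOfRangeException(nameof(width));

        var lines = new List<string>();
        var current = new StringBuilder();

        // Passing a null separator array splits on any whitespace.
        foreach (string rawWord in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = rawWord;

            // Hard-split words that cannot fit on any single line.
            while (word.Length > width)
            {
                if (current.Length > 0)
                {
                    lines.Add(current.ToString());
                    current.Clear();
                }
                lines.Add(word.Substring(0, width));
                word = word.Substring(width);
            }

            // Start a new line if the word (plus a separating space) does not fit.
            int needed = current.Length == 0 ? word.Length : current.Length + 1 + word.Length;
            if (needed > width)
            {
                lines.Add(current.ToString());
                current.Clear();
            }

            if (current.Length > 0) current.Append(' ');
            current.Append(word);
        }

        if (current.Length > 0) lines.Add(current.ToString());
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Wrap("text extracted from the transformed CDA document", 20))
        {
            Console.WriteLine(line);
        }
    }
}
```

Returning a list rather than one joined string leaves it to the caller to decide how each line is placed into the outgoing message.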

Incidentally, I work for a company that produces CDAs as part of our product (voice recognition for dictations). I would look into an XSLT that transforms the 3.0 document directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route will probably be your easiest bet if what you really want to achieve is conversion between the formats. That's what XSLT was built to do.

16bytes
A: 

Hi,

I am trying to generate an HL7 CDA XML document from entity classes generated from CDA.XSD. Can anybody help me with this?

Thanks, James,

james