I have colleagues working on a .NET 1.1 project, where they obtain XML files from an external party and programmatically instruct iTextSharp to generate PDF content based on the XML data.

The tricky part is that this XML contains segments of arbitrary HTML content: HTML that users copied and pasted from their Office applications. It still looks OK in a web browser, but when this HTML is fed into iTextSharp's HTMLWorker object to be parsed and converted into PDF objects, the formatting and alignment run all over the place in the generated PDF document. E.g.

<span id="mceBoundaryType" class="portrait"></span>
<table border="0" cellspacing="0" cellpadding="0" width="636" class="MsoNormalTable"
    style="margin: auto auto auto 4.65pt; width: 477pt; border-collapse: collapse">
    <tbody>
     <tr style="height: 15.75pt">
      <td width="468" valign="bottom" style="padding-right: 5.4pt; padding-left: 5.4pt;
       padding-bottom: 0in; width: 351pt; padding-top: 0in; height: 15.75pt; background-color: transparent;
       border: #ece9d8">
       <p style="margin: 0in 0in 0pt" class="MsoNormal">
        <font face="Times New Roman">&nbsp;</font></p>
      </td>
      <td colspan="3" width="168" valign="bottom" style="padding-right: 5.4pt; padding-left: 5.4pt;
       padding-bottom: 0in; width: 1.75in; padding-top: 0in; height: 15.75pt; background-color: transparent;
       border: #ece9d8">
       <p style="margin: 0in 0in 0pt; text-align: center" class="MsoNormal" align="center">
        <u><font face="Times New Roman">Group</font></u></p>
      </td>
     </tr>

The tags are full of style attributes, and iTextSharp does not support CSS, so it does not interpret those attributes. What alternatives have other iTextSharp users tried to work around this, or what other HTML-to-PDF components are feasible?

A: 

I don't have any solid answers, but I'll give you two directions to explore, both of which I have used before.

1 - use something like HtmlAgilityPack to cleanse your HTML - you can traverse the DOM and remove styles and classes, which could obviously screw up the layout to a certain degree. It is not clear to me whether you need to retain this styling or not. Then you could use iTextSharp or an alternative program like HtmlDoc (which also does not support CSS) to render to PDF. We wrote a simple wrapper with a method that takes a URL and then calls HtmlDoc to generate the PDF.

2 - render the HTML server-side using a WebBrowser control, generate an image from that, then convert the image to PDF using PDFsharp or the library of your choice. This will obviously not give you PDFs that you can search or copy text from. There is some pretty good sample code here for converting the rendered page to an image (note: you can get full-height images, not just what you can see without scrolling).

Edit: I don't think the WebBrowser control is available in .NET 1.1.
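
To make the cleansing step in option 1 concrete, here is a minimal sketch of the idea. It is written in Python with only the standard library, as a stand-in for what you would do with HtmlAgilityPack in .NET; it is not HtmlAgilityPack code. The names `AttributeStripper` and `strip_attributes` are made up for illustration. It re-emits the markup unchanged except that `style` and `class` attributes are dropped.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emits HTML while dropping the attributes listed in `drop`."""

    def __init__(self, drop=("style", "class")):
        # convert_charrefs=False keeps entities such as &nbsp; as written.
        super().__init__(convert_charrefs=False)
        self.drop = {name.lower() for name in drop}
        self.out = []

    def _emit_tag(self, tag, attrs, closing=">"):
        kept = []
        for name, value in attrs:
            if name in self.drop:
                continue  # skip style/class (or whatever was requested)
            # The parser unescapes attribute values, so escape them again.
            kept.append(name if value is None
                        else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join([tag] + kept) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def strip_attributes(html, drop=("style", "class")):
    """Return `html` with the given attributes removed from every tag."""
    parser = AttributeStripper(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `strip_attributes('<td width="468" style="border: #ece9d8">x</td>')` keeps the `width` attribute and drops the `style`. Note that tag and attribute names come back lowercased, and that this only removes the styling; it does nothing to translate CSS into something the PDF renderer understands.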

RedFilter
Yes, I have been trying out the .NET 1.1 edition of HtmlAgilityPack as well, but it has some bugs that remove sections of paragraph content, which I will need to debug another day.
icelava
Yes, the styling has to be retained for the HTML tags - it is what keeps the tables aligned properly in the first place. So removing the styles is much the same as the current situation, where they are being ignored.
icelava
Well, if you can live with images in your PDF, I suggest option 2.
RedFilter
A: 

I have found that .NET 2.0-based components like ExpertPDF and ABCpdf do a fairly good job of interpreting the CSS styles and aligning the tables properly in the PDF. Right now I am suggesting that my colleagues use a separate .NET 2.0 web service that can host such components. The ASP.NET 1.1 web application would tell this service to go ahead and scrape a generated web page that is essentially the report in HTML view.

UPDATE:

I am marking this as the answer, since it is the approach recommended to the application team.

icelava