There is a standard two-pass algorithm described in RFC 1942 (http://www.ietf.org/rfc/rfc1942.txt); however, I haven't seen any good real-world implementations. Does anyone know of any? I haven't been able to find anything useful in the Mozilla or WebKit code bases, but I am not entirely sure where to look.

I guess this might actually be a deeper problem, since it involves rendering HTML (the contents of the table cells), but to keep it simple: a plain-text HTML table as an image. Even an HTML table rendering algorithm that ignores the "as an image" part would help.

A: 

HTML table rendering is non-trivial because of the various ways cell sizes may be specified, tables nested within tables, etc.

If all you want is the image, a simple solution would be the .NET browser control (which is basically the COM component for IE) plus a screen-capture function.

If you want some source to manipulate, the Mozilla source should still be available.

Steven A. Lowe
Thanks! I am looking for something a little more low-level however...this is in C btw.
Christopher Lang
@Christopher Lang: The Mozilla source is in C/C++ and should have what you want in it. http://developer.mozilla.org/en/Download_Mozilla_Source_Code
Steven A. Lowe
Steven, thank you for the link! After more searching through the code I found http://mxr.mozilla.org/mozilla-central/source/layout/tables/nsTablePainter.cpp This is *close* to what I need. I am going to try it from scratch and refer back to their codebase. Thanks again!
Christopher Lang
A: 

I'm not sure if this will meet your constraints or not, but you can try using IE or an IE control with MSHTML and the IHTMLElementRender interface to render the table to a device context.

Gerald
Thanks for the suggestion - unfortunately this solution is going to be cross platform (and mobile!) so I'm trying to get a general purpose algorithm without third party libs. Thanks again!
Christopher Lang
A: 

If a commercial tool is an option, look at:

HtmlCapture ActiveX Control V2.0 (originally named HtmlSnap)

Some features they claim:

  • By calling SnapHtmlString(), you can take a snapshot for a html string.
  • Get snapshot images rendered by either Microsoft IE or Mozilla Firefox.
  • Just by calling SnapUrl() and SaveImage(), you can take a snapshot of a webpage into various images, such as BMP, JPG, JPEG, GIF, PNG, TIF, TGA and PCX.
  • Convert html to vector image format like EMF and WMF.
  • Self contained ActiveX control with no third party dependencies.
  • Support custom gdi output of the resulting image.
  • Support saving resulting image both to file and in memory.
  • Support saving both full-size web page and thumbnail one.
  • Take a snapshot of a whole webpage into one image without scrollbars.
  • Make grayscale or B&W images with efficient algorithms to keep the quality.
  • Support JPEG compression level, compression method selection of TIFF and GIF.
  • Support setting color depth in images while keeping the quality of the image as much as possible.
  • Selectively save activeX, image, java applets, scripts and videos on a web page as you want.
  • Send custom cookies, http headers, credentials in snapshot requests.
  • Take snapshots of webpages via a Proxy server.
  • More than 30 samples written in VC, C#, Delphi, VB, C++ Builder, Java, JScript, Perl, VBScript, ASP, ASP.net and PHP are provided.
micahwittman
Thanks! This is helpful and close. There is a free library I found that does something similar: http://www.terrainformatica.com/htmlayout/main.whtm The only problem is that I'd like to do this at a low level without any external libs, especially not commercial ones. Thank you though!
Christopher Lang
@christopher-lang Good find; HTMLayout looks interesting.
micahwittman
A: 

If you have XHTML, not plain HTML, you should be able to retrieve the content of those cells along with information about the table's structure: colspan, rowspan, etc. Using this information, you can render the table using your own border, padding and margin values.

Things get complex when you also want to render the user defined dimensions. But for retrieving the table data and drawing it, you could use an XML parser. PHP's parser is here: http://ca3.php.net/xml
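As a rough illustration of turning the colspan/rowspan information into something drawable, here is a minimal C sketch that places cells into a fixed-size slot grid. Everything in it (`CellSpan`, `CellOrigin`, `place_cells`, the `MAX_ROWS`/`MAX_COLS` limits) is a hypothetical name made up for this example, not something from a parser or browser. It assumes the spans have already been extracted in source order, clips rowspans at the last row, and does not try to handle overlapping cells.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ROWS 16   /* arbitrary limits chosen for the sketch */
#define MAX_COLS 16

typedef struct { int colspan, rowspan; } CellSpan;   /* as read from the markup */
typedef struct { int row, col; } CellOrigin;         /* top-left slot of a cell */

/* Place each row's cells into the first free slots, left to right.
 * cells      : spans of all cells, in source order
 * row_counts : number of cells found in each source row
 * grid       : receives the index of the cell covering each slot, or -1
 * origins    : receives the top-left slot of each cell
 * Returns the number of columns used, or -1 if the table does not fit. */
int place_cells(const CellSpan *cells, const int *row_counts, int nrows,
                int grid[MAX_ROWS][MAX_COLS], CellOrigin *origins)
{
    int ncols = 0, idx = 0;

    assert(cells != NULL && row_counts != NULL && origins != NULL);
    if (nrows < 0 || nrows > MAX_ROWS)
        return -1;

    for (int r = 0; r < MAX_ROWS; ++r)
        for (int c = 0; c < MAX_COLS; ++c)
            grid[r][c] = -1;

    for (int r = 0; r < nrows; ++r) {
        int col = 0;
        for (int k = 0; k < row_counts[r]; ++k, ++idx) {
            /* Treat missing or invalid spans as 1. */
            int cs = cells[idx].colspan < 1 ? 1 : cells[idx].colspan;
            int rs = cells[idx].rowspan < 1 ? 1 : cells[idx].rowspan;

            /* Skip slots already taken by a rowspan from a row above. */
            while (col < MAX_COLS && grid[r][col] != -1)
                ++col;
            if (col + cs > MAX_COLS)
                return -1;

            /* Clip the rowspan so it never runs past the last row. */
            int last_row = (r + rs < nrows) ? r + rs : nrows;
            for (int rr = r; rr < last_row; ++rr)
                for (int cc = col; cc < col + cs; ++cc)
                    grid[rr][cc] = idx;

            origins[idx].row = r;
            origins[idx].col = col;
            col += cs;
            if (col > ncols)
                ncols = col;
        }
    }
    return ncols;
}
```

With the grid filled in, each cell's rectangle is the union of the slots carrying its index, so borders and padding can be applied per cell once column widths and row heights are known.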

Dimitry Z
Thanks, but getting the actual table information isn't a problem - actually performing the render operation for what the information entails is! Thanks again for the suggestion though!
Christopher Lang
A: 

One tool that comes close is: http://www.terrainformatica.com/htmlayout/main.whtm

This library offers a way to capture rendered HTML to an image, however it is not open source (but free!). Hope it is useful to some!

Unfortunately my app is cross platform, C/C++ with no MFC or platform dependencies (nightmare!). I'm hopefully looking to find a general purpose algorithm for table rendering. I think the 2-pass option from the RFC comes pretty close so I'm probably going to just dig in and work against that. I'll be sure to blog about it and post my eventual solution here if I can!
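For anyone following along, here is a minimal C sketch of how I read the two-pass autolayout idea from the RFC. It is not taken from any browser and is not a finished solution. The assumptions are mine: cells are plain text, widths are counted in monospace characters, and there are no spans, borders, padding, or author-specified widths. The names `ColumnLimits`, `measure_cell`, `collect_limits`, and `assign_widths` are made up for the example. Pass 1 finds each column's minimum width (longest unbreakable word) and maximum width (longest unwrapped line). Pass 2 uses the minimums if the table cannot fit, the maximums if it fits easily, and otherwise gives each column its minimum plus d * W / D, where d is that column's max minus min, W is the available width minus the total minimum, and D is the total maximum minus the total minimum.

```c
#include <assert.h>
#include <stddef.h>

/* Width limits for one column, in character cells (monospace assumption). */
typedef struct {
    int min_width;   /* longest unbreakable word in the column */
    int max_width;   /* longest line if nothing is ever wrapped */
} ColumnLimits;

/* Measure one plain-text cell: words break on spaces and newlines,
 * and an explicit newline always ends a line. */
static void measure_cell(const char *text, int *min_w, int *max_w)
{
    int word = 0, line = 0;
    *min_w = 0;
    *max_w = 0;
    for (const char *p = text; ; ++p) {
        char c = *p;
        int at_end = (c == '\0');
        if (at_end || c == ' ' || c == '\n') {
            if (word > *min_w) *min_w = word;
            word = 0;
        } else {
            ++word;
        }
        if (at_end || c == '\n') {
            if (line > *max_w) *max_w = line;
            line = 0;
        } else {
            ++line;
        }
        if (at_end) break;
    }
}

/* Pass 1: per-column min/max over all rows. cells is row-major, nrows * ncols. */
void collect_limits(const char *const *cells, size_t nrows, size_t ncols,
                    ColumnLimits *cols)
{
    assert(cells != NULL && cols != NULL);
    for (size_t c = 0; c < ncols; ++c) {
        cols[c].min_width = 0;
        cols[c].max_width = 0;
    }
    for (size_t r = 0; r < nrows; ++r) {
        for (size_t c = 0; c < ncols; ++c) {
            int mn, mx;
            measure_cell(cells[r * ncols + c], &mn, &mx);
            if (mn > cols[c].min_width) cols[c].min_width = mn;
            if (mx > cols[c].max_width) cols[c].max_width = mx;
        }
    }
}

/* Pass 2: choose final column widths for the given available width. */
void assign_widths(const ColumnLimits *cols, size_t ncols, int available,
                   int *widths)
{
    long total_min = 0, total_max = 0;
    for (size_t c = 0; c < ncols; ++c) {
        total_min += cols[c].min_width;
        total_max += cols[c].max_width;
    }
    if (total_min >= available) {
        /* Table cannot fit: use minimum widths and let it overflow. */
        for (size_t c = 0; c < ncols; ++c) widths[c] = cols[c].min_width;
    } else if (total_max <= available) {
        /* Table fits without wrapping: use maximum widths. */
        for (size_t c = 0; c < ncols; ++c) widths[c] = cols[c].max_width;
    } else {
        /* Share the spare room in proportion to each column's max - min. */
        long W = available - total_min;
        long D = total_max - total_min;   /* > 0 in this branch */
        for (size_t c = 0; c < ncols; ++c) {
            long d = cols[c].max_width - cols[c].min_width;
            widths[c] = cols[c].min_width + (int)(d * W / D);
        }
    }
}
```

Because of the integer division, the widths in the last case can add up to slightly less than the available width; a real implementation would have to decide where the leftover units go. Row heights would then come from wrapping each cell's text to its assigned column width.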

Christopher Lang
A: 

Take a look at Prince XML. It's a commercial tool that renders CSS-styled XML (including XHTML) documents to PDF, and it conforms to major W3C standards such as XHTML and CSS 2.1. You can try the free demo version from their homepage.

Since you want an image: it shouldn't be a big problem to convert the generated PDFs to images programmatically.

Christoph Schiessl
Thanks Christoph! This looks to be an interesting tool - I wonder how they are capturing the HTML...hmm. Unfortunately it is commercial! I'd like to implement this from scratch or at least use an open-source library I could learn from.
Christopher Lang
Check out this thread for something open source that does a similar job - http://stackoverflow.com/questions/597348/foss-html-to-pdf-in-python-net-or-command-line/597355#597355
Daniel Von Fange