I'm building a desktop application right now that presents its human-readable output as XHTML displayed in a WebBrowser control. Eventually, this output is going to have to be converted from an XHTML file to a document image in an imaging system. Unlike XHTML documents, the document image has to be divided into physical pages; additionally - and this is the part that's killing me - there need to be headers and footers on these pages.

Much as I would like to, I can't simply make the WebBrowser print to a file - the header/footer options it supports aren't anywhere near sophisticated enough. So I'm casting about trying to figure out what the right technology is for generating these images.

It seems likely to me (though it's not mandatory) that what I'll end up doing is producing PDF versions of the HTML documents (so that I can add headers and footers) and then rendering the PDFs as TIFFs, which is the ultimate format that the imaging system wants. So what I'm considering:

  • Use some kind of XHTML-to-PDF conversion software. The problem with this is that without doing a lot of evaluation and testing I can't figure out if the products I've looked at even have the ability to do what I need, which is to take existing XHTML documents, decorate them with headers and footers and paginate them.

  • Use XSL-FO to generate the PDFs. Being a ninja-level XSLT geek helps here (that's how I'm producing the XHTML in the first place), but it still seems like an awkward and slow solution with a lot of moving parts. Also this means I'm sticking a big clunky Java program into the middle of my nice clean .NET system, though I'm certainly enough of a grownup to do that if it's the right answer.

  • Use some other technology that I haven't even thought of yet, like LaTeX. Maybe there's some miraculous page-imaging tool that turns XHTML directly into TIFFs with page headers and footers. That would be ideal.

My primary concerns are:

  • I'm building a commercial product; whatever technology I use needs to be affordable and supportable. It doesn't have to be free.

  • I don't want to disappear down a rabbit hole for three months banging on this stuff to get it to work. This intuitively seems like the kind of problem space where I can lose a lot of time just evaluating and rejecting tools.

  • Whatever solution I adopt needs to be relatively immune to formatting changes in the XHTML. The whole reason I'm using XSLT and producing XHTML in the first place is that the documents I'm producing are being dynamically assembled using business rules that change all the time.

I've spent a lot of time searching for alternatives and haven't found anything that's obviously the answer. But maybe one of you fine people has already solved this problem, and if so, I would like to stand on your shoulders.

+1  A: 

Have you thought about using PostScript?

PS: What kind of headers/footers do you need? Custom ones of your own, to put between pages? If so, PostScript or PDF is probably the best target. But it will be very difficult to write your own XHTML+CSS-to-PDF converter: basically, you would need a library that can parse both XHTML and CSS (plus any embedded objects such as images, Flash, etc.).

dusoft
+1  A: 

PrinceXML is an XHTML/CSS to PDF converter. It seems to have the features you need:

  • Page headers/footers, page numbering and duplex printing.

I realize you'll probably want more extensive answers than this one (I'm sorry, but I haven't evaluated the product), but nevertheless, I hope it helps!

onnodb
This was startlingly easy to implement in my prototype. Pity the server licensing is so pricey.
Robert Rossney
Yeah, I was also taken aback by the high prices. Perhaps you could contact their sales department to see if you can get a special deal? Seems to work sometimes...
onnodb
Yeah, we could conceivably get OEM pricing. But even with a 50% discount, I'm adding $2K to the price of my software (or, more realistically, reducing my profits by $2K.) I'm pretty strongly motivated to find another solution. Though everything else about Prince is perfect.
Robert Rossney
+2  A: 

If tiff is your goal, this might be a free and low risk approach:

  1. Use a component to create an image for a given URL. I'm not sure which tool we used for it, but GIYF: I just stumbled upon SmallSharpTool's WebPreview, which seems to do the job.
  2. Make sure it can create an image of the entire page, i.e. the entire scrollable area.
  3. Use ImageMagick to do all the image manipulation, such as cutting it into multiple pages, adding your own headers, footers and page numbering and conversion to tiff.

I have personally used the above techniques separately in C# projects (console apps and websites) with success so I can almost guarantee this will work.
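
The "cutting it into multiple pages" part of step 3 is mostly arithmetic, so here is a minimal sketch of it in Python, assuming the capture is one tall bitmap and every page reserves a fixed band for the header and the footer. The function name, the `PageSlice` type and the pixel sizes are all made up for illustration; the row ranges it produces would then be handed to whatever image tool does the actual cropping and compositing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSlice:
    number: int   # 1-based page number, e.g. for "page N of M" in the footer
    top: int      # first pixel row of the capture shown on this page
    bottom: int   # one past the last pixel row shown on this page


def paginate(capture_height: int, page_height: int,
             header_height: int, footer_height: int) -> List[PageSlice]:
    """Split a tall capture into slices that fit between header and footer."""
    body_height = page_height - header_height - footer_height
    if body_height <= 0:
        raise ValueError("header and footer leave no room for content")
    slices: List[PageSlice] = []
    top = 0
    while top < capture_height:
        bottom = min(top + body_height, capture_height)
        slices.append(PageSlice(len(slices) + 1, top, bottom))
        top = bottom
    return slices


# Hypothetical sizes: a 7000-row capture on 3300-row pages (11in at 300 dpi)
# with 150 rows reserved for the header and 150 for the footer.
pages = paginate(capture_height=7000, page_height=3300,
                 header_height=150, footer_height=150)
for p in pages:
    print(f"page {p.number} of {len(pages)}: rows {p.top}-{p.bottom}")
```

One caveat with this approach: cutting at fixed row offsets knows nothing about the content, so a page break can land in the middle of a line of text or an image unless the cut points are adjusted afterwards.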

Martin Kool
+4  A: 

Edit (2009-03-29 9:00 AM PST) Posted sample conversion.

Edit (2009-03-23 12:30 PM PST, published to CodePlex) I developed a solution for this and posted it to CodePlex. The published version 2.0 is written using the WPF MVVM pattern. TIFF files (one per page) are output to c:\Temp\XhtmlToTiff. XAML and XPS formats are created as well. A compiled, installable version is available at CricketSoft.com


Have you tried the "Microsoft XPS Document Writer"? This is a software-only printer that generates paged output from a variety of sources, including web pages.

There is an SDK for working with XPS documents and Open XML docs in general. Here is a How-to article by Beth Massi: "Accessing Open XML Document Parts with the Open XML SDK".

+tom

Tom A
I need more control over formatting than I can get by simply redirecting IE's printed output to a driver, unfortunately. Generating the underlying XPS seems, to put it mildly, non-trivial.
Robert Rossney
ah, i may have a bit of help for you here. I decided to code up a sample. Pls hold... (and thx for the "Answered".)
Tom A
Well the "answered" was done automatically when the bounty expired. Not actually what I intended, but the system works the way it works.
Robert Rossney
hmm, there must be more detail on how it works -- the automatic bounty was only 50%.
Tom A
Ah, my temp link didn't take...
Tom A
+1  A: 

It all depends on how important quality is for the generated documents. It also matters what other operations you need to do with the document.

"I'm building a desktop application right now that presents its human-readable output as XHTML displayed in a WebBrowser control. Eventually, this output is going to have to be converted from an XHTML file to a document image in an imaging system."

Looks like your application is a soft-form of sorts. You generate filled-in forms and save them.

"[...] there need to be headers and footers on these pages."

This is the easy part. You can use templates and merge the data with the static header/footer template. You sound as if you are doing VDP. Hm. Let's move on.

"I can't simply make the WebBrowser print to a file - the header/footer options it supports aren't anywhere near sophisticated enough."

Why so? All you need is a capable driver.

"It seems likely to me (though it's not mandatory) that what I'll end up doing is producing PDF versions of the HTML documents"

Again, it is not clear why you would want PDF right away. PDF is a document interchange format, not a PDL (page description language) per se. PostScript is a much better choice. Yes, I know there are things like XPS, PCL and whatnot. However, the rendering control and quality you get with PS are too valuable to give up for a cheaper solution. I say cheaper because you also need to keep in mind the sort of printing available to you: PostScript printers (not the ones with clone RIPs) are costlier in general.

Now, back to your PDF thing. Yes, of course you can generate PDF. It has certain advantages like:

  • Better support for transparency (and in general quality)
  • Archival
  • Interchange
  • Share it across for review
  • Preview/Preflight/Correct
  • Security
  • Stream encryption (for both security and the amount of data you transfer to the printer)
  • Use templates

But remember: do you have any printers that can RIP PDF natively? If not, you are doing a lossy PDF-to-PS/PCL conversion, and you've just lost the game. Which brings me back to PostScript ;)

dirkgently
Interchange and archiving are the most compelling arguments for PDF. I'm not sure how important rendering control and quality are - a lot of the documents this system is replacing are Word documents covered with handwritten amendments, so user expectations are presently pretty low.
Robert Rossney
Does that mean you are taking the Word docs through OCR? In that case, the OCR engine will generate tiffs for you. Or, do you need to generate the different (C,M,Y,K) planes as well?
dirkgently
No, the customer's not presently imaging the Word documents. Producing PDF isn't *really* the requirement at this point - producing TIFFs of the formatted documents is. So I could conceivably use PS. What sort of tools do I need? I'm a babe in the woods with PS.
Robert Rossney
PS drivers come by default with Windows. CUPS (on *nix and Mac) can also generate PS. That's all. Create a virtual minidriver and you're done. Print happily ever after.
dirkgently
+3  A: 

Just my 2p, but if you are an XSLT ninja I'd suggest sticking with that. You can avoid the nasty Java program by looking at nFop, which is a .NET port of the Apache FOP project. What's great is that you can simply take the assembly and use it directly, passing your XML and XSLT to it to get the PDF output you want.

http://sourceforge.net/projects/nfop/

Hope that helps.

Chris Meek
It never occurred to me that some clever person would redo FOP in .Net. I may have to do a little more looking into XSL-FO. I know I can get it to work at least.
Robert Rossney
+1  A: 

You can use PISA for Python. It uses the ReportLab toolkit to generate a PDF from HTML (parsed with html5lib).

jle
It's remarkable how poorly organized the documentation for PISA is. (Like, there's not even a link to it on the PISA site. And never mind getting a complete list of dependencies.) But it does seem to work, eventually.
Robert Rossney
I found an example that took me right through it... I do remember the documentation being a bit skimpy.
jle
I spent an hour and a half yesterday just writing down the procedure my non-technical colleagues would have to follow to get pisa installed. But functionally it's very close to what I need. Wish it supported floating elements. Another hidden cost of table-less layout.
Robert Rossney
+1  A: 

You could also try using PDFCreator and simply printing the document to PDF. PDFCreator acts like any normal printer and uses Ghostscript to convert printer output to PDF, TIFF, JPEG, or whatever you want. I think you can change header and footer items through IE's COM interface and print directly from IE. PDFCreator has examples for different languages in the COM folder of the install directory. I have used it and can vouch for it. Windows only, though.
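
If you end up with a PDF from any of the routes in this thread, the Ghostscript step can also be driven directly to get one TIFF per page. Below is a minimal sketch, assuming Ghostscript is installed and reachable as `gs` (on Windows the console executable is usually named `gswin32c` or `gswin64c`). The file names and the helper functions are hypothetical; `tiffg4` is Ghostscript's black-and-white Group 4 TIFF device, which may or may not be what your imaging system wants.

```python
import subprocess
from typing import List


def ghostscript_tiff_command(pdf_path: str, out_pattern: str,
                             dpi: int = 300, gs_exe: str = "gs") -> List[str]:
    """Build a Ghostscript command line that renders each PDF page to a TIFF."""
    return [
        gs_exe,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # run unattended, then exit
        "-sDEVICE=tiffg4",                   # bilevel TIFF, CCITT Group 4 compression
        f"-r{dpi}",                          # resolution in dots per inch
        f"-sOutputFile={out_pattern}",       # %03d is replaced by the page number
        pdf_path,
    ]


def convert(pdf_path: str, out_pattern: str) -> None:
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ghostscript_tiff_command(pdf_path, out_pattern), check=True)


# Only builds and prints the command; call convert() to actually run it.
cmd = ghostscript_tiff_command("report.pdf", "page-%03d.tif")
print(" ".join(cmd))
```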

jle
An interesting idea, except that IE doesn't give you the ability to (say) define a DIV as your page footer, which is really the level of formatting control I need.
Robert Rossney
You might be able to add that with PDFCreator...
jle
+1  A: 

Do you really need to use XHTML/Web browser?

I have been in this exact dilemma trying to generate good-looking HTML reports, and the solution I found is ... to drop HTML and use a "real" report generator. There are a lot of them out there. They all support every pagination and header/footer option you can think of, they can usually print to PDF, and sometimes directly to images.

HTML is just not the right technology for reports.

Nir
It's not the right technology for reports, agreed. It is without question the right technology for the documents my program's producing.
Robert Rossney
+1  A: 

"Use some other technology that I haven't even thought of yet, like LaTeX."

TeXML, which is LaTeX semantics with XML syntax. To use it, you can write an XSLT stylesheet that decorates your XHTML with TeXML commands (see example).

vartec
That's...daunting. It may be a very good answer for someone who knows LaTeX. I don't, so that's two hills to climb. There's also this: http://www.w3.org/2004/04/xhlt91/.
Robert Rossney
OK, since you mentioned it, I assumed that you knew it. ;-) As for [X]HTML-to-LaTeX tools, most of them create documents that are too plain, often even ugly.
vartec
+1  A: 

ExpertPDF HtmlToPdf Converter (www.html-to-pdf.net) should be able to do exactly what you need. It's really simple to use, just reference the assembly in your project and start using it. I've used this product with great success in a couple of work projects.

sfid
I've already started evaluating this. The great problem with this component is that you have to do a lot of manipulation in code; you can't (for instance) use markup in the document to provide content to headers and footers.
Robert Rossney
A: 

You mentioned your current desktop app exports results in XHTML. Since XHTML is well-formed XML, you should be able to get away with using XSL-FO to export it to PDF.

XML -> XSL-FO -> PDF

Here's a beginner's guide: http://www.devx.com/xml/Article/16430
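
For a sense of what the header/footer part looks like in XSL-FO, here is a minimal sketch of the page scaffold, assuming a Letter-size page. In a real pipeline an XSLT stylesheet would emit this from the XHTML; Python string templating is used here only to keep the example self-contained. The element and attribute names come from the XSL-FO vocabulary, while `build_fo`, the master name `letter` and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# region-before / region-after hold the header and footer; fo:static-content
# is repeated on every page, while fo:flow is paginated by the FO processor.
FO_TEMPLATE = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="letter"
        page-width="8.5in" page-height="11in"
        margin-top="0.5in" margin-bottom="0.5in"
        margin-left="0.5in" margin-right="0.5in">
      <fo:region-body margin-top="0.75in" margin-bottom="0.75in"/>
      <fo:region-before extent="0.5in"/>
      <fo:region-after extent="0.5in"/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="letter">
    <fo:static-content flow-name="xsl-region-before">
      <fo:block text-align="center">{header}</fo:block>
    </fo:static-content>
    <fo:static-content flow-name="xsl-region-after">
      <fo:block text-align="center">{footer} - page <fo:page-number/></fo:block>
    </fo:static-content>
    <fo:flow flow-name="xsl-region-body">
      <fo:block>{body}</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
"""


def build_fo(header: str, footer: str, body: str) -> str:
    """Fill the scaffold, escaping text so the result stays well-formed XML."""
    return FO_TEMPLATE.format(header=escape(header),
                              footer=escape(footer),
                              body=escape(body))


fo_doc = build_fo("Acme & Co. Quarterly Report", "Confidential",
                  "Document text goes here.")
ET.fromstring(fo_doc)  # raises ParseError if the result is not well-formed
print(fo_doc)
```

The resulting document is what you would feed to an FO processor such as FOP to get the PDF.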

My company has used this technique in a Java + Cocoon web application for the Dutch government.

Martin Kool
Right, that's why I listed it as a possibility. I've used XSL-FO before. It works, but it's slow and ungainly.
Robert Rossney
A: 

http://iecapt.sourceforge.net/

Quoting from the website above:

IECapt is a small command-line utility to capture Internet Explorer's rendering of a web page into a BMP, JPEG or PNG image file. The C++ version also has experimental support for Enhanced Metafile vector graphic output. IECapt is available in a C++ and a C# version.

mangokun