Ok, I'm now banging my head against a brick wall with this one.

I have an HTML (not XHTML) document that renders fine in Firefox 3 and IE 7. It uses fairly basic CSS for styling.

I'm now after a way of converting it to PDF. I have tried:

  • DOMPDF: it had huge problems with tables. I factored out my large nested tables and it helped (before that it just consumed up to 128M of memory, which is my memory limit in php.ini, and then died), but it still makes a complete mess of tables and doesn't seem to fetch images. The tables were just basic stuff with some border styles to add lines at various points;
  • HTML2PDF and HTML2PS: I actually had better luck with this. It rendered some of the images (all the images are Google Chart URLs) and the table formatting was much better but it seemed to have some complexity problem I haven't figured out yet and kept dying with unknown node_type() errors. Not sure where to go from here; and
  • Htmldoc: this seems to work fine on basic HTML but has almost no support for CSS whatsoever so you have to do everything in HTML (I didn't realize it was still 2001 in Htmldoc-land...) so it's useless to me.

I tried a Windows app called Html2Pdf Pilot that actually did a pretty decent job but I need something that at a minimum runs on Linux and ideally runs on-demand via PHP on the Webserver.

I really can't believe I'm this stuck. Am I missing something?

+2  A: 

Fine rendering doesn't mean anything. Does it validate?

All browsers do the most they can to just show something on the screen, no matter how bad the input. And of course they do not all do the same thing. If you want the same rendering as Firefox, you could use its rendering engine. There are PDF generators for it. It is an awful lot of work, though.

Stephan Eggermont
Yes it validates.
cletus
+2  A: 

Perhaps you might try running Tidy on the file before handing it to the converter. If one of the renderers chokes on some HTML problem (like an unclosed tag), that might help.
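
As a rough illustration of that pre-cleaning step, here is a minimal Python sketch that shells out to the `tidy` command-line tool. It assumes `tidy` is installed and on the PATH; the helper names and file paths are hypothetical, and the same idea would work from PHP via `exec()`.

```python
import subprocess


def build_tidy_command(source_path, cleaned_path):
    """Return the argument list for a quiet Tidy run that writes to cleaned_path."""
    # -q suppresses non-essential output; -o names the output file.
    return ["tidy", "-q", "-o", cleaned_path, source_path]


def tidy_produced_output(returncode):
    """Tidy exits with 0 when the input is clean, 1 on warnings and 2 on errors."""
    return returncode in (0, 1)


def clean_html(source_path, cleaned_path):
    """Run Tidy and return the cleaned file's path, or raise if Tidy reports errors."""
    result = subprocess.run(
        build_tidy_command(source_path, cleaned_path),
        capture_output=True,
        text=True,
    )
    if not tidy_produced_output(result.returncode):
        raise RuntimeError("tidy reported errors:\n" + result.stderr)
    return cleaned_path
```

The cleaned file, rather than the original, would then be passed to whichever converter is in use.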

PhiLho
Yes, a valid point, but I've thought of this already. There are no unmatched or nonstandard tags in my HTML.
cletus
PhiLho: that remark helped me out today!
jerrygarciuh
+2  A: 

There's a tutorial on Zend's devzone on generating PDFs from PHP (part 1, part 2) without any external libraries. I have never implemented this sort of solution, but since it's all PHP, you might find it more flexible to implement and debug.

yoavf
+2  A: 

Well, if you want to find a perfect XHTML+CSS to PDF converter library, forget it; it's far from possible. It's just like finding a perfect browser (XHTML+CSS rendering engine). Do we have one? IE or FF?

I have had some success with DOMPDF. The thing is that you have to modify your HTML+CSS code to go with the way the library is meant to work. Other than that, I have pretty good results. See below.

Original HTML: http://www.nutquote.com/quote/William_Shakespeare/66/simple

PDF: http://www.converthub.com/htmltopdf.php?html=http://www.nutquote.com/quote/William_Shakespeare/66/simple

kavoir.com
+11  A: 

After some investigation and general hair-pulling the solution seems to be HTML2PDF. DOMPDF did a terrible job with tables, borders and even moderately complex layout. Htmldoc seems reasonably robust but is almost completely CSS-ignorant, and I don't want to go back to doing HTML layout without CSS just for that program.

HTML2PDF looked the most promising but I kept having this weird error about null reference arguments to node_type. I finally found the solution to this. Basically, PHP 5.1.x worked fine with regex replaces (preg_replace_*) on strings of any size. PHP 5.2.1 introduced a php.ini config directive called pcre.backtrack_limit. What this config parameter does is cap how much backtracking the regex engine may do during a match, which in practice limits the size of string that some patterns can handle. Why this was introduced I don't know. The default value was chosen as 100,000. Why such a low value? Again, no idea.

A bug was raised against PHP 5.2.1 for this, which is still open almost two years later.

What's horrifying about this is that when the limit is exceeded, the replace just silently fails. At least if an error had been raised and logged you'd have some indication of what happened, why and what to change to fix it. But no.

So I have a 70k HTML file to turn into PDF. It requires the following php.ini settings:

  • pcre.backtrack_limit = 2000000; # probably more than I need but that's OK
  • memory_limit = 1024M; # yes, one gigabyte; and
  • max_execution_time = 600; # yes, 10 minutes.
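
If the conversion runs as a batch job rather than inside a web request, the same settings can be supplied for a single run with the PHP command-line option `-d` instead of editing php.ini globally. The Python sketch below only illustrates that idea: it assumes the `php` CLI is on the PATH, and the script name `convert.php` is hypothetical.

```python
import subprocess

# The three overrides listed above, expressed as php.ini name/value pairs.
SETTINGS = {
    "pcre.backtrack_limit": "2000000",
    "memory_limit": "1024M",
    "max_execution_time": "600",
}


def build_php_command(script_path, settings):
    """Return the argument list for running a PHP script with per-run ini overrides."""
    command = ["php"]
    for name, value in settings.items():
        # Each -d flag defines one ini entry for this invocation only.
        command += ["-d", f"{name}={value}"]
    command.append(script_path)
    return command


def run_conversion(script_path):
    """Run the (hypothetical) conversion script; raises if PHP exits non-zero."""
    return subprocess.run(build_php_command(script_path, SETTINGS), check=True)
```

Calling `run_conversion("convert.php")` would leave the server-wide limits untouched for every other script.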

Now the astute reader may have noticed that my HTML file is smaller than 100k. The only reason I can guess as to why I hit this problem is that html2pdf does a conversion into xhtml as part of the process. Perhaps that took me over (although nearly 50% bloat seems odd). Whatever the case, the above worked.

Now, html2pdf is a resource hog. My 70k file takes approximately 5 minutes and at least 500-600M of RAM to create a 35-page PDF file. Unfortunately that's not quick enough (by far) for a real-time download, and the ratio of memory used to input size is close to 10,000-to-1 (600M of RAM for a 70k file), which is utterly ridiculous.

Unfortunately, that's the best I've come up with.

cletus
Nice report, cletus. WTG!
Seb
+20  A: 

Have a look at PrinceXML.

It's definitely the best HTML/CSS to PDF converter out there, although it's not free (But hey, your programming is not free either, so if it saves you 10 hours of work, you're home free.)

Oh yeah, did I mention that this is the first (and probably only) HTML2PDF solution that does full ACID2!?

http://princexml.com/samples/

SchizoDuckie
Well, it seems you can only download the desktop version. I'd really like to try the server version. But the desktop version did a superb job (equal to my final html2pdf version but virtually instantaneous). Thanks for the recommendation.
cletus
After more testing... Prince XML is seriously cool. Nuff said.
cletus
I've already used it for a big project. It's a great tool and it comes with support. Just go for it!
Kaaviar
PrinceXML is really awesome. If only it were not so expensive :-(
acme
+3  A: 

Just to bump the thread, I've tried DOMPDF and it worked perfectly. I've used divs and other block-level elements to position everything. I kept it strictly CSS 2.1 and it played nicely.

kRON
+1  A: 

I am using FPDF to produce PDF files with PHP. It's working well for me so far for simple output.

+1  A: 

Check out TCPDF. It has some HTML to PDF functionality that might be enough for what you need. It's also free!

Darryl Hein
+2  A: 

I don't think a PHP class is the best way to render an XHTML page with CSS.

What happens when a new CSS rule comes out? (CSS 3.0 is coming soon...)

The best way to render an HTML page is, obviously, a browser. Firefox 3.0 can natively 'print' to PDF, and torisugary developed an extension (command line print) to use it. Here you'll find it.

Anyway, there are still many problems with running Firefox just as a PDF converter...

At the moment, I think that wkhtmltopdf is the best (it uses WebKit, the engine behind the Safari browser): fast, awesome and, yes, open source as well... Give it a look.

DaNieL
+22  A: 

Have a look at WKHTMLTOPDF. It is free, open source and based on WebKit.

We wrote a small tutorial here.
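
For the on-demand, server-side use the question asks about, a minimal sketch of calling the tool from a script might look like the following. It is written in Python purely for illustration and assumes the `wkhtmltopdf` binary is installed and on the PATH; the helper names, the file paths and the 60-second timeout are hypothetical choices, not part of the tool.

```python
import subprocess


def build_wkhtmltopdf_command(source, pdf_path):
    """Return the argument list: wkhtmltopdf <input URL or file> <output PDF>."""
    return ["wkhtmltopdf", source, pdf_path]


def html_to_pdf(source, pdf_path, timeout_seconds=60):
    """Convert an HTML file or URL to PDF; raises on failure or timeout."""
    subprocess.run(
        build_wkhtmltopdf_command(source, pdf_path),
        check=True,               # raise CalledProcessError on a non-zero exit
        timeout=timeout_seconds,  # avoid a hung conversion tying up the request
    )
    return pdf_path
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when the source is a URL with query parameters.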

Mic
Better than anything else I've used, simple and free.
MGOwen
This one operates on the best premise IMO: bootstrap the conversion off an existing renderer instead of writing one from scratch, which is not a trivial task. Furthermore, WebKit is written in C++ and is therefore much faster and much less of a resource hog than a PHP-based implementation.
Koobz
@Mic Right approach. Perfect results. Tnx!
mac
+1  A: 

Hi friend, why don't you try MPDF version 2.0? I used it for creating PDF documents. It's working fine...

Karthick
+2  A: 

I think that wkhtmltopdf rocks: it's the best, fast and awesome. Yes, open source as well... Try it.

Vaibhav Malushte
A: 

Darryl Hein's mention above of TCPDF (http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=tcpdf) is likely a great idea. Nicola Asuni's code is pretty handy and powerful. The only killer is that if you ever plan on merging existing PDF files with your generated PDF, it doesn't have that feature. You would have to create the PDF and then merge it using something like PDFTK by Sid Steward (www.pdflabs.com/tools/pdftk-the-pdf-toolkit/).

Arachnid