views:

716

answers:

5

I need to have the ability to convert and merge various documents into a single Pdf.

The documents could be of varying types, such as Word, Open Office, Images, Text, Web pages (by URL) and the PDF would usually consist of 2-3 documents.

At the moment, we are using BCL Technologies easyPDF with Microsoft Office installed onto the Server. This handles most documents but we haven't had it doing Open Office ones yet.

We currently produce around 100-1000 of these PDF's per day.

The reason I am asking the question is that performance is a key issue. The PDF is generated for users on the fly and so the waiting times we are currently getting of 30-60 seconds is becoming unacceptable.

We have done some caching around documents when they are intially uploaded so the main tasks that happens when a User requests a Pdf is merging a number of already generated Pdf's.

Does anyone else have any other tools they have used that work reliably for most common document types and above all, quickly? When put like that, it seems like I'm asking a lot!

Edit: Thanks for all the great advice, I'll look into some of these and compare performance.

Just to add to all this, money is not really an object. We're more than happy to pay for different applications to perform each task as well as looking into various hardware options to distribute the load as much as possible.

+1  A: 

We use ABC Pdf. I don't know if it will be fast enough for your needs, but it seems to work for our use.

Kevin
A: 

While I have no PDF conversion suggestions I can say that this problem sounds like one which could be distributed over a number of nodes. Do you find that the PDF generation is CPU-bound or are there other limiting factors? Before expending too much effort on rewriting the PDF library interface you might want to see what the bottlenecks are.

stimms
+1  A: 

I had a very similar issue where we had documents that were already existing in PDF format and needed to allow the user to see them all combined together. We purchased the PDF4NET product which was about $500 from what I recall. It was extremely easy to use and they provide awesome examples of how to use the tools.

O2 Solutions - PDF4NET

Here is the code sample that they provide for merging. The top line looks like it just outputs the file, the second 2 lines allow for streaming the content back to the user.

     PDFFile.MergeFilesToDisk( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
     PDFDocument doc =  PDFFile.MergeFilesToDoc( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
  doc.SaveToStream( stream );
RSolberg
+1  A: 

Merging multiple PDF documents is normally simple enough (as long as they don't need to be merged on the same page) - you could compare your merge performance with something like iTextSharp (.NET version of iText) to be sure it isn't a bottleneck - otherwise the conversion from other formats to PDF is likely the bottleneck.

In almost all cases, the method used to convert X to PDF is to execute the applications print command, targeted at a software PDF printer, to create a temporary PDF file.

This means:

  • The target application (for example Office) is opened and closed
  • The document has to travel through the printing service

In your situation, are you converting arbitrary documents submitted by the users, or do the documents come from a stored library of files? If it's a library, you could make a PDF copy of each file as it is added to the library (instead of when the user makes a request), and then only merge the PDF files.

David
+1  A: 

You say you're using Microsoft Office to open these files, I would imagine this is the bottleneck rather than the actual PDF creation.

Is it possible to distill these documents into a more accessible format (html/xml/database), so that it's not necessary to open office every time a PDF needs to be created?

-MBirchmeier

MBirchmeier