I have a requirement to dynamically generate and compress large batches of PDF files.

I am considering the usual archive formats:

  • Zip
  • Ace
  • Rar

Any other suggestions are welcome.

My question is: which format is likely to give me the smallest file size? Speed and efficiency are also important factors, but size is my primary concern.

Also, does it make a difference whether I have many small files or fewer larger files in each archive?

Most of my processing will be done in PHP, but I'm happy to interface with third-party executables if needed.

Edit:

The documents are primarily invoices and shouldn't contain any images other than the company logo.

+1  A: 

I think 7z is currently the best, with RAR second, but I would recommend trying both to find out what works best for you.

dusoft
+1  A: 

LZMA is the best if you need the smallest file size.

And of course the PDF itself can also be compressed.
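
For example, here is a minimal sketch of driving the 7z command-line tool from PHP. The binary must be installed and on the PATH, and the paths and directory layout are illustrative assumptions:

    // Minimal sketch: build a .7z archive with LZMA from PHP by shelling out
    // to the 7z binary. Paths and names are illustrative assumptions.
    $pdfDir  = '/path/to/invoices';          // directory with the generated PDFs
    $archive = '/path/to/output/batch.7z';   // archive to create

    $cmd = sprintf(
        '7z a -t7z -m0=lzma -mx=9 %s %s',    // -m0=lzma selects LZMA, -mx=9 is maximum compression
        escapeshellarg($archive),
        escapeshellarg($pdfDir)
    );
    exec($cmd, $output, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("7z failed:\n" . implode("\n", $output));
    }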

silent
Thanks, it seems newer 7z versions actually use LZMA.
Neil Aitken
Yes, 7-Zip uses the LZMA method.
silent
+1  A: 

I doubt you'll get much, if any, reduction in file size by compressing PDFs. However, if all you're doing is collecting multiple files into one, why not tar it?
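
If collecting the files into one archive is all that's needed, PHP's built-in PharData class can build the tar without any external tools. A rough sketch, with illustrative paths:

    // Rough sketch: collect PDFs into a tar using PHP's built-in PharData class.
    // Paths are illustrative.
    $tar = new PharData('/path/to/output/batch.tar');

    // Add every .pdf from the source directory (second argument is a filter regex).
    $tar->buildFromDirectory('/path/to/invoices', '/\.pdf$/');

    // Optionally gzip the tar as well, producing batch.tar.gz next to it.
    $tar->compress(Phar::GZ);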

Skilldrick
+1  A: 

We've done this in the past for large (and many) PDFs that store lots of text - Training Packages for Training Organisations in Australia. It's about 96% text (course info etc.) and a few small diagrams. Sizes vary from 1-2 MB to 8 or 9 MB, and they usually come in volumes of 4 or more.

We found that compressing with Zip gave acceptable results; since the PDF format is already heavily compressed, it was more about ease of use for our users, who could download it all as a batch instead of worrying about the file sizes. To give you an idea, a 2.31 MB file - lots of text, several full-page diagrams - compressed to 1.92 MB with ZIP and 1.90 MB with RAR.
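
For plain Zip from PHP, the bundled ZipArchive extension is enough on its own. A minimal sketch, with illustrative paths:

    // Minimal sketch: batch PDFs into a .zip with PHP's ZipArchive extension.
    // Paths are illustrative.
    $zip = new ZipArchive();
    if ($zip->open('/path/to/output/batch.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        throw new RuntimeException('Could not create the archive');
    }

    foreach (glob('/path/to/invoices/*.pdf') as $pdf) {
        // Store each file under its base name inside the archive.
        $zip->addFile($pdf, basename($pdf));
    }

    $zip->close();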

I'd recommend using LZMA to get the best results - but look at the resource usage for compressing and decompressing too.

How big are these files? Get a copy of WinRAR, WinAce and 7-Zip and give it a go.

Thushan Fernando
Thanks for the thorough info. I'm currently playing with the different algorithms to see which one gives good rates. 7z running LZMA seems to be the best so far.
Neil Aitken
+1  A: 

Combine my nifty tool Precomp with 7-Zip. It decompresses the zLib streams inside the PDF so that 7-Zip (or any other compressor) can handle them better. You will get file sizes of about 50% of the original, losslessly. It works especially well for PDF files, but is also nice for other compressed (zLib/LZW) streams such as ZIP/GZip/JAR/GIF/PNG...

For result examples, have a look here or here. Speed can be slow for the precompression (PDF -> PCF) part, but will be very fast for the recompression/reconstruction (PCF -> PDF) part.

For even better results than Precomp + 7-Zip, you can try the lprepaq and prepaq variants, but beware: prepaq in particular is slooww :) - the bright side is that prepaq offers the best (PDF) compression currently available.
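
A rough sketch of how such a pipeline could look when driven from PHP. Everything here is an assumption to adapt: binary locations, output file names, and in particular the Precomp switches (check precomp's own usage output); only the 7z switches are standard.

    // Rough pipeline sketch: Precomp -> 7-Zip, then verify the round trip by
    // restoring the PDF and comparing MD5 hashes. Paths, output names and the
    // Precomp switches are assumptions - check them against your Precomp version.
    $original = '/path/to/invoices/invoice-0001.pdf';
    $pcf      = '/path/to/invoices/invoice-0001.pcf';   // Precomp's output (assumed name)

    // 1. Precompress: expand the zLib streams inside the PDF into a .pcf file.
    exec('precomp ' . escapeshellarg($original));

    // 2. Pack the .pcf with 7-Zip/LZMA (standard 7z switches).
    exec('7z a -t7z -m0=lzma -mx=9 ' . escapeshellarg($pcf . '.7z') . ' ' . escapeshellarg($pcf));

    // 3. Later: reconstruct the PDF from the .pcf (-r is assumed to be the
    //    restore switch) and confirm it matches the original byte for byte.
    exec('precomp -r ' . escapeshellarg($pcf));
    $restored = '/path/to/invoices/invoice-0001_restored.pdf';  // wherever Precomp writes it (assumption)

    if (md5_file($original) !== md5_file($restored)) {
        throw new RuntimeException('Reconstructed PDF does not match the original');
    }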

schnaader
Thanks, I'll look into this. I may have to persuade the bosses to use an unknown tool, though.
Neil Aitken
The current version is still a test version, but it works fine. To be on the safe side, you can make sure that reconstructed PDFs have the same md5sum, or compare them some other way.
schnaader
Good idea. We are using MD5 to validate the integrity of the imports anyway, so storing a hash of the generated file isn't a problem.
Neil Aitken
Interesting - make it bigger in order to make it smaller!
RichardOD
+1  A: 

I have not had much success compressing PDFs. As pointed out, they are already compressed when composed (although some PDF composition tools allow you to specify a 'compression level'). If at all possible, the first approach you should take is to reduce the size of the composed PDFs.

If you keep the PDFs in a single file, they can share any common resources (images, fonts) and so can be significantly smaller. Note that this means one large PDF file, not one large ZIP with multiple PDFs inside.
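
One way to do that merge as part of a PHP batch job is to shell out to Ghostscript. Ghostscript is not mentioned above - it is just one commonly available option - and the sketch assumes the gs binary is installed, with illustrative paths:

    // Sketch: merge a batch of PDFs into one file with Ghostscript so common
    // resources (fonts, the logo) are stored once. Ghostscript is an assumption
    // here, not part of the answer above; paths are illustrative.
    $inputs = array_map('escapeshellarg', glob('/path/to/invoices/*.pdf'));

    $cmd = 'gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite '
         . '-sOutputFile=' . escapeshellarg('/path/to/output/merged.pdf') . ' '
         . implode(' ', $inputs);

    exec($cmd, $output, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("Ghostscript failed:\n" . implode("\n", $output));
    }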

In my experience it is quite difficult to compress the images within PDFs, and images make by far the biggest impact on file size. Ensure that you have optimised the images before you start. It is even worth doing a test run without your images simply to see how much they contribute to the size.

The other component is fonts: if you are using multiple embedded fonts, you are packing more data into the file. Use just one font to keep the size down, or use fonts that are commonly installed so that you don't need to embed them.

Kirk Broadhurst