pyPdf is a great library for splitting and merging PDF files. I'm using it to split PDF documents into single-page documents. pyPdf is pure Python and spends quite a lot of time in the _sweepIndirectReferences() method of the PdfFileWriter object when saving the extracted page. I need something with better performance. I tried multi-threading, but since most of the time is spent in Python code, the GIL prevented any speed gain (it actually ran slower).
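For reference, the per-page split described above looks roughly like the sketch below. It assumes pyPdf is installed and uses its PdfFileReader/PdfFileWriter classes; the helper names `page_filename` and `split_pdf` and the output naming scheme are made up for illustration.

```python
import os


def page_filename(src_path, page_number, out_dir):
    """Hypothetical naming scheme: <out_dir>/<basename>_page0001.pdf."""
    base = os.path.splitext(os.path.basename(src_path))[0]
    return os.path.join(out_dir, "%s_page%04d.pdf" % (base, page_number))


def split_pdf(src_path, out_dir):
    """Write every page of src_path to its own one-page PDF."""
    # Imported lazily so the module loads even where pyPdf is missing.
    from pyPdf import PdfFileReader, PdfFileWriter

    written = []
    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        for index in range(reader.getNumPages()):
            writer = PdfFileWriter()  # a fresh writer per output page
            writer.addPage(reader.getPage(index))
            target = page_filename(src_path, index + 1, out_dir)
            with open(target, "wb") as out:
                # Saving is where the time goes: this is the step that
                # ends up in _sweepIndirectReferences().
                writer.write(out)
            written.append(target)
    return written
```

Because a new writer is created and saved for each page, the sweep cost is paid once per page, which is why large documents hurt so much.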

Is there any library written in C that provides the same functionality? Or does anyone have a good idea for improving performance (other than spawning a new process for each PDF file that I want to split)?

Thank you in advance.

Follow-up: links to a couple of command-line solutions that can sometimes prove faster than pyPdf:

I modified pyPdf's PdfFileWriter class to keep track of how much time has been spent in the _sweepIndirectReferences() method. If it takes too long (right now I use a magic value of 3 seconds), I fall back to Ghostscript by calling it from Python.
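A minimal sketch of that fallback, not the actual patch: it assumes the patched writer is wrapped in a callable (`python_writer`, hypothetical) that calls `check()` periodically from inside its sweep loop, and that Ghostscript is on the PATH as `gs`. The exception and helper names are invented for illustration; the Ghostscript switches are the standard ones for extracting a page range with the pdfwrite device.

```python
import subprocess
import time


class SweepTooSlow(Exception):
    """Raised when the pure-Python path exceeds its time budget."""


def make_deadline_check(budget_seconds, clock=time.monotonic):
    """Return a callable that raises SweepTooSlow once the budget is used up."""
    start = clock()

    def check():
        if clock() - start > budget_seconds:
            raise SweepTooSlow()

    return check


def ghostscript_extract_command(src, page, dst, gs_binary="gs"):
    """Build a Ghostscript command writing one page (1-based) of src to dst."""
    return [
        gs_binary, "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
        "-dFirstPage=%d" % page, "-dLastPage=%d" % page,
        "-sOutputFile=%s" % dst, src,
    ]


def extract_page(src, page, dst, python_writer, budget_seconds=3.0,
                 run=subprocess.check_call):
    """Try the Python writer first; fall back to Ghostscript if it is too slow."""
    check = make_deadline_check(budget_seconds)
    try:
        # python_writer is expected to call check() inside its sweep loop.
        python_writer(src, page, dst, check)
        return "python"
    except SweepTooSlow:
        run(ghostscript_extract_command(src, page, dst))
        return "ghostscript"
```

The time already spent in the Python path is wasted when the fallback kicks in, so the 3 second budget is a trade-off rather than a tuned value.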

Thanks for all your answers. (codelogic's Xpdf reference is the one that made me look for a different approach.)

A: 

pdfLaTeX can do a lot of PDF manipulation and is very fast.

I've used it for some quite complex imposition workflows. The TeX language feels quite alien as a programming language, but it's easy to write a Python script that generates the needed LaTeX layout and processes it.
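As a rough illustration of that approach, the sketch below generates a LaTeX file that pulls a single page out of an existing PDF and builds the command to compile it. It assumes the pdfpages package (its `\includepdf` command with the `pages` and `fitpaper` options) and a `pdflatex` binary on the PATH; the function names are made up for the example, and file names containing characters that are special to TeX would need extra care.

```python
import subprocess

# Minimal LaTeX document that includes one page of an existing PDF.
TEMPLATE = r"""\documentclass{article}
\usepackage{pdfpages}
\begin{document}
\includepdf[pages={%(page)d},fitpaper=true]{%(src)s}
\end{document}
"""


def single_page_latex(src_pdf, page):
    """Return LaTeX source that extracts one page (1-based) of src_pdf."""
    return TEMPLATE % {"page": page, "src": src_pdf}


def pdflatex_command(tex_path, out_dir):
    """Build the pdflatex invocation for the generated file."""
    return [
        "pdflatex", "-interaction=batchmode",
        "-output-directory=%s" % out_dir, tex_path,
    ]


def build(tex_path, out_dir):
    """Compile the generated file; raises CalledProcessError on failure."""
    subprocess.check_call(pdflatex_command(tex_path, out_dir))
```

Writing the string returned by `single_page_latex("input.pdf", 3)` to a `.tex` file and passing that file to `build` should produce a one-page PDF containing page 3 of `input.pdf`.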

Javier
+2  A: 

mbtPdfAsm is a fast, open source command line tool for PDF processing.

Xpdf is also worth mentioning since it's GPL and written in C++. The source code is well modularized and allows for writing command line tools.

codelogic
I need to disassemble the PDF. If I understand correctly, mbtPdfAsm assembles PDFs.
Nathan
It can be used for both assembling and disassembling PDFs.
codelogic
+1  A: 

Have you tried using Psyco with pyPdf?

John Fouhy
Psyco is not available on my 64-bit Ubuntu install.
Nathan
+1  A: 

Does it have to be Python? My pure-Perl library CAM::PDF is pretty fast at appending and deleting PDF document pages. It saves the sweeping for the very end, where possible.

Chris Dolan