pyPdf is a great library to split, merge PDF files. I'm using it to split pdf documents into 1 page documents. pyPdf is pure python and spends quite a lot of time in the _sweepIndirectReferences() method of the PdfFileWriter object when saving the extracted page. I need something with better performance. I've tried using multi-threading but since most of the time is spent in python code there was no speed gain because of the GIL (it actually ran slower).
Is there any library written in c that provides the same functionality? or does anyone have a good idea on how to improve performance (other than spawning a new process for each pdf file that I want to split)
Thank you in advance.
Follow up. Links to a couple of command line solutions, that can prove sometimes faster than pyPDF:
- http://multivalent.sourceforge.net/Tools/pdf/Split.html
- http://www.linuxsolutions.fr/how-to-extract-pages-from-a-pdf/
I modified pyPDF PdfWriter class to keep track of how much time has been spent on the _sweepIndirectReferences() method. If it has been too long (right now I use the magical value of 3 seconds) then I revert to using ghostscript by making a call to it from python.
Thanks for all your answers. (codelogic's xpdf reference is the one that made me look for a different approach)