Is there a library that removes "owner" passwords from PDF documents so that their text can then be extracted programmatically? I'm looking for something like PDF Technologies' Password Recovery tool, but callable from the command line or from Python. A GUI is not much use to me, because there are too many documents to process by hand.

Please, no comments on the legality of the process. The PDFs in question are owned, and the text needs to be extracted in order to form keyword clouds for the document set.

+2  A: 

I do not know of any Python libraries, but for batch removal of passwords from PDF documents, my colleagues have had good experiences with PwdRemover (it is not free, though).

ldigas
This is perfect, thank you. The command-line utility will work best for me.
Mike Cialowicz
+1  A: 

Here are two other (open-source) tools for command-line processing:

QPDF: A Content-Preserving PDF Transformation System:

qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf

pdftk - the pdf toolkit:

pdftk SECURED.pdf input_pw PASSWORD output UNSECURED.pdf
rcs
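
Since the question asks for something callable from Python over a large document set, here is a minimal sketch that wraps the qpdf command quoted above with the standard subprocess module. It assumes qpdf is installed and on the PATH. The function names, the folder names, and the empty default password are illustrative assumptions, not part of qpdf. The empty default rests on the assumption that files protected only by an owner password have an empty user password.

```python
import subprocess
from pathlib import Path


def build_qpdf_command(src, dst, password=""):
    """Build the argument list for the qpdf call quoted above.

    Hypothetical helper: the empty default password is an assumption
    for files that carry only an owner password.
    """
    return ["qpdf", f"--password={password}", "--decrypt", str(src), str(dst)]


def decrypt_folder(in_dir, out_dir, password=""):
    """Run qpdf on every *.pdf in in_dir, writing the results to out_dir.

    Returns the source files for which qpdf exited with a nonzero status,
    so they can be inspected by hand.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    needs_review = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        cmd = build_qpdf_command(src, out_dir / src.name, password)
        # An argument list (no shell) keeps file names with spaces intact.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            needs_review.append(src)
    return needs_review


if __name__ == "__main__":
    # "secured" and "unsecured" are placeholder folder names.
    for path in decrypt_folder("secured", "unsecured"):
        print("qpdf reported a problem with", path)
```

Any nonzero exit status is only flagged for review here, because qpdf may still have written an output file when it reports warnings. The unsecured copies can then be passed to whatever text-extraction step you use.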