I have a huge collection of PDF files that I need to extract text from. The files are encrypted, but I know the password for them. I'm looking for a way to automate the text extraction.

I can manually open a file in Acrobat Professional, remove the security by typing in the password, and then save it as a .txt file. But I can't find a way to automate that as a batch process for all 600 files.

I'm looking for a tool to help with this. I'm good with Perl, so I tried the various PDF-handling modules from CPAN, but they fail to read the encrypted documents. Does anyone have a solution for this?

A: 

If you can't find a decent, purely programmatic way to do it, an alternative is AutoIt.

It is "a freeware BASIC-like scripting language designed for automating the Windows GUI", which can do all that pointing and clicking for you while you go have a cup of coffee.

Deestan
+3  A: 

pdftotext should be able to do that. It comes with the poppler library, and can also be found with xpdf (poppler came from xpdf).
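
As a rough sketch of how the 600 files could be batch-processed with pdftotext, the script below builds one command per PDF and runs it. It assumes pdftotext is on the PATH and that every file shares one password. The function names `pdftotext_command` and `extract_all` are made up for this illustration. pdftotext accepts `-upw` for the user password and `-opw` for the owner password; which one applies depends on how the files were secured.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, password, use_owner_password=False):
    """Build the pdftotext argument list for one encrypted PDF.

    Hypothetical helper for illustration, not part of any library.
    """
    pdf_path = Path(pdf_path)
    # -upw supplies the user password, -opw the owner password.
    flag = "-opw" if use_owner_password else "-upw"
    # Write the text next to the PDF, e.g. report.pdf -> report.txt.
    txt_path = pdf_path.with_suffix(".txt")
    return ["pdftotext", flag, password, str(pdf_path), str(txt_path)]


def extract_all(folder, password):
    """Run pdftotext on every .pdf in folder; return the files that failed."""
    failed = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(pdftotext_command(pdf_path, password))
        if result.returncode != 0:
            # A non-zero exit code usually means a wrong password or a damaged file.
            failed.append(pdf_path)
    return failed
```

Calling `extract_all("pdfs", "foopass")` would then leave a .txt file beside each PDF and return the list of files that need a second look. The folder name and password here are placeholders.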

CesarB
A: 

I concur with Deestan: AutoIt or AutoHotkey can be used to automate any GUI task that cannot be automated by other means. It can be slow, though, and might stop on an unexpected situation. There is also a learning curve, but at least the AutoHotkey forum is very helpful, although one needs to have Acrobat Professional to write a script for it.

And indeed, Xpdf seems to be an interesting tool, including a text extractor and supporting decryption.

PhiLho
+4  A: 

Take a look at pdftk. It's console-based and handles password-secured PDF files.

Andreas Scherer
A: 

CAM::PDF is an open source Perl library that can encrypt and decrypt PDFs. Currently it can only do 40-bit encryption where the owner and user passwords are the same, but just today (coincidentally) a user submitted a patch to allow 128-bit encryption and decryption. I hope to release a new version next week with that enhancement.

CAM::PDF is not very good at extracting text, though.

Chris Dolan
A: 

Try pdftk:

pdftk secured.pdf input_pw foopass output unsecured.pdf
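
To apply the command above to a whole folder, something like the following sketch might work. It assumes pdftk is on the PATH and that all files use the same password. The names `pdftk_command` and `decrypt_all` are invented for this example. The unsecured copies go into a separate folder so the originals stay untouched, and any text extractor can then be run on them.

```python
import subprocess
from pathlib import Path


def pdftk_command(pdf_path, password, out_dir):
    """Build the pdftk argument list that writes an unsecured copy into out_dir.

    Hypothetical helper for illustration, not part of pdftk itself.
    """
    out_path = Path(out_dir) / Path(pdf_path).name
    # Same form as: pdftk secured.pdf input_pw foopass output unsecured.pdf
    return ["pdftk", str(pdf_path), "input_pw", password, "output", str(out_path)]


def decrypt_all(in_dir, out_dir, password):
    """Decrypt every .pdf in in_dir; return the names of files that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        result = subprocess.run(pdftk_command(pdf_path, password, out_dir))
        if result.returncode != 0:
            failed.append(pdf_path.name)
    return failed
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.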

rpilkey