Hello, how can I open a PDF file and read some of its contents with Python (preferred, though Ruby, Perl, or PHP are fine too)? This assumes the file contains recognizable text rather than just a scanned image; if it doesn't, I'd like the tool to report that extraction is impossible without OCR. TIA

Update: thanks for the solutions, I'm sure some of them will suit me fine.

@RichH, I have a PDF file and don't know whether it is image-based or text-based. I'm looking for a tool that helps me find that out and, if it's text-based, extract some of its contents.

+1  A: 

You might find this thread useful.

jkndrkn
A: 

Google is your friend.

http://pybrary.net/pyPdf/

John Smith
That looks like it lets you merge documents and handle page-level operations, but not extract content other than the document info. Am I reading it wrong?
RichH
@RichH, its `PageObject` component has the `extractText` method, so `pdf.getPage(0).extractText()` prints the text on the first page (tried 2 minutes ago) :)
roddik
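For anyone who wants both the extraction and the "is it text or just an image?" check, here is a minimal sketch. It assumes the `pypdf` package (the maintained successor to pyPdf), where the equivalent calls are `PdfReader` and `extract_text()`. The helper names `extract_page_texts` and `looks_text_based` and the `min_chars` threshold are made up for illustration, and the check is only a heuristic: a scanned PDF with a hidden OCR layer will still count as text-based.

```python
import sys
from itertools import islice


def extract_page_texts(path, max_pages=3):
    """Return the extracted text of the first few pages of a PDF."""
    # Imported lazily so the heuristic below can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in islice(reader.pages, max_pages)]


def looks_text_based(page_texts, min_chars=20):
    """Heuristic: True if any sampled page has at least min_chars non-whitespace characters."""
    for text in page_texts:
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) >= min_chars:
            return True
    return False


if __name__ == "__main__" and len(sys.argv) > 1:
    texts = extract_page_texts(sys.argv[1])
    if looks_text_based(texts):
        # Show the start of whatever text was found.
        print("\n".join(texts)[:500])
    else:
        print("No usable text layer found; OCR is probably needed.")
```

Run it as `python check_pdf.py some.pdf` (the script name is arbitrary). The threshold is deliberately low; raise it if stray page numbers or watermarks cause false positives.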
A: 

Parsing a PDF and making something useful out of it is hard because the format is focused on preserving layout. Text can be stored so that each letter is positioned individually, and depending on the font, the text might even be stored as graphics.

Libraries I know of that read PDFs include the Zend Framework, whose PDF component includes a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings for several languages.

johannes
+5  A: 

For Perl, check out these modules:

Ether