How to open PDF and read it?

+1 Q:

Hello, how can I open a PDF file and read some of its contents with Python (preferred, though Ruby, Perl or PHP are fine too)? That is, extract the text if the file actually contains text rather than just a scanned image, or otherwise report that it's impossible without OCR. TIA

Update: thanks for the solutions, I'm sure some of them will suit me fine.

@RichH, I have a PDF file and don't know whether it is image-based or text-based. I'm looking for a tool to help me find that out and, if it's text-based, extract some of its contents.

+1 A:

You might find this thread useful.

jkndrkn 2009-11-08 20:04:49

Google is your friend.

http://pybrary.net/pyPdf/

John Smith 2009-11-08 20:04:57

That looks like it lets you merge documents and handle page-level operations, but not extract content other than the document info. Am I reading it wrong?

RichH 2009-11-08 20:11:13

@RichH, its `PageObject` component has the `extractText` method, so `pdf.getPage(0).extractText()` gives the text on the first page (tried 2 minutes ago) :)

roddik 2009-11-08 20:41:52
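To tie the `extractText` call back to the original question (text-based or image-only?), here is a rough sketch rather than a definitive solution. It assumes the legacy pyPdf API quoted above (`PdfFileReader`, `getNumPages`, `getPage`, `extractText`), which PyPDF2 releases before 3.0 also expose. The helper names `extract_page_texts` and `classify_pages`, and the `min_chars` threshold of 20, are made up for illustration. A page that yields almost no text is treated as probably a scan that needs OCR, which is only a heuristic.

```python
def extract_page_texts(path):
    """Return the extracted text of every page as a list of strings.

    Assumes the legacy pyPdf package; falls back to PyPDF2 (< 3.0),
    which kept the same class and method names.
    """
    try:
        from pyPdf import PdfFileReader
    except ImportError:
        from PyPDF2 import PdfFileReader

    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        return [reader.getPage(i).extractText()
                for i in range(reader.getNumPages())]


def classify_pages(page_texts, min_chars=20):
    """Guess whether a PDF is text-based from its per-page extracted text.

    min_chars is an arbitrary cutoff (an assumption, tune as needed):
    pages with fewer non-whitespace-trimmed characters count as image-only.
    """
    text_pages = sum(1 for t in page_texts if len(t.strip()) >= min_chars)
    if text_pages == 0:
        return "image-only"   # nothing extractable: OCR would be needed
    if text_pages == len(page_texts):
        return "text"
    return "mixed"            # some pages have text, others look scanned
```

Usage would be something like `classify_pages(extract_page_texts("some.pdf"))`, where `some.pdf` is a placeholder path. Keep in mind that extraction can also come back empty or garbled for text-based files with unusual font encodings, so "image-only" here means "no usable text found", not proof that the page is a scan.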

Parsing PDF and making something useful out of it is hard, because the format is focused on preserving layout. Text can be stored with each letter positioned individually, and depending on the font it might even be stored as graphics.

Libraries I know of for reading PDFs include the Zend Framework, whose PDF component includes a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings for several languages.

johannes 2009-11-08 20:18:31

+5 A:

For Perl, check out these modules:

Ether 2009-11-08 20:49:18
