Can anyone help with extracting text from a page in a pdf?

<?php
$pdf = Zend_Pdf::load('example.pdf');
$page = $pdf->pages[0]; // Zend_Pdf exposes its pages through the $pages array

I would assume a page method would exist but I could not find anything to let me extract the contents.

Example: $page->getContents(); $page->toString(); $page->extractText();

...Help!!!! This is driving me crazy!

A: 

From the manual it doesn't appear that this functionality is supported. Also, new text is written using the drawText() function, which appears to write images, not plain "decodable" text.

Andy
drawText() does write text rather than images, but you're certainly correct: at the moment, parts of a PDF can't be extracted or modified.
David Caunt
A: 

I agree with Andy that this does not appear to be supported. As an alternative, take a look at Shaun Farrell's solution to extracting text from a PDF for use with Zend_Search_Lucene. He uses XPDF, which might also meet your needs.
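
For illustration, here is a rough sketch of how the XPDF route might look from PHP. It assumes the `pdftotext` command-line tool that ships with XPDF is installed and on the PATH. The two helper functions are hypothetical names made up for this example and are not part of Zend_Pdf or of Shaun Farrell's code.

```php
<?php
// Hypothetical helper (not part of Zend_Pdf): builds the pdftotext command line.
function buildPdfToTextCommand($pdfPath, $pageNumber)
{
    // pdftotext counts pages from 1; -f and -l select the first and last page.
    // The trailing "-" asks pdftotext to write the text to stdout.
    return sprintf(
        'pdftotext -f %d -l %d %s -',
        $pageNumber,
        $pageNumber,
        escapeshellarg($pdfPath)
    );
}

// Hypothetical helper: runs pdftotext and returns the page text,
// or null if the command fails (for example, if pdftotext is not installed).
function extractPageText($pdfPath, $pageNumber)
{
    $output = array();
    $status = 0;
    exec(buildPdfToTextCommand($pdfPath, $pageNumber), $output, $status);

    return $status === 0 ? implode("\n", $output) : null;
}
```

Calling `extractPageText('example.pdf', 1)` would then correspond to the first page, which is `$pdf->pages[0]` in Zend_Pdf terms, since pdftotext numbers pages from 1 rather than 0.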

Cal Jacobson
xpdf will extract the text from PDFs, as long as your PDFs actually contain text (as opposed to scanned images), of course. You might also try the following: http://www.webcheatsheet.com/php/reading_clean_text_from_pdf.php.
wimvds