answers:

3

To preface this, I know there are discussions on this in various places. Half of what I read is outdated, buggy or simply unrelated to my situation.

This is why I am bringing it to the community that I know will have the answers.

Question: I have a directory (ideally hosted online) of PDF documents totaling around 70,000 pages. Individual documents range from 20 pages to several hundred.

I am looking for a method, script or idea for the easiest way to search these PDFs for products. The PDFs all have a text layer that was created by OCR in Acrobat.

Any ideas, whether they be elaborate or inventive, are more than welcome.

+2  A: 

Use a search engine like Lucene or Sphinx to index and tag the PDFs. The Zend Framework has both a component to read and write PDF files and a Lucene implementation.

Gordon
Sphinx can't index a PDF file directly. You'd still need something to export the text. It could be combined with the pdftotext solution I offered below. I don't know about Lucene though as I haven't used it.
Cfreak
Is there any chance this could be done outside of Zend? CodeIgniter, for instance? Otherwise I could probably take a gander down Zend lane and see how that goes.
gamerzfuse
@gamerzfuse Zend_PDF and Zend_Search_Lucene can be used in isolation. You do not have to use ZF's MVC or any other components. You can use CodeIgniter and the ZF components together.
Gordon
+2  A: 

XPDF has a utility called pdftotext, which is often installed on Linux distributions. I would create a tool that uses it to build an index mapping each word to the documents it appears in. You could store the index in a database and then write a search against that.

It would take a little more space, but it would be simple to store a sentence of context as well to show in the search results.
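To make the idea concrete, here is a minimal sketch in Python, assuming pdftotext is on the PATH and using SQLite as the database. The table layout and the function names (`extract_text`, `open_index`, `index_text`, `index_directory`, `search`) are made up for illustration, and the "context" is simply the first line a word appears on in each document rather than a true sentence.

```python
import re
import sqlite3
import subprocess
from pathlib import Path

WORD_RE = re.compile(r"[a-z0-9]+")


def extract_text(pdf_path):
    """Return the PDF's text layer via pdftotext ('-' writes to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def open_index(db_path=":memory:"):
    """Open (or create) the SQLite database holding the word index."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings (word TEXT, doc TEXT, context TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON postings (word)")
    return conn


def index_text(conn, doc_name, text):
    """Store one row per distinct word, with the first line it occurs on."""
    seen = set()
    for line in text.splitlines():
        for word in WORD_RE.findall(line.lower()):
            if word not in seen:
                seen.add(word)
                conn.execute(
                    "INSERT INTO postings VALUES (?, ?, ?)",
                    (word, doc_name, line.strip()),
                )
    conn.commit()


def index_directory(conn, directory):
    """Index every *.pdf file directly inside the given directory."""
    for pdf in sorted(Path(directory).glob("*.pdf")):
        index_text(conn, pdf.name, extract_text(pdf))


def search(conn, word):
    """Return (document, context) pairs for documents containing the word."""
    rows = conn.execute(
        "SELECT doc, context FROM postings WHERE word = ? ORDER BY doc",
        (word.lower(),),
    )
    return rows.fetchall()
```

You would call `index_directory(conn, "/path/to/pdfs")` once, then `search(conn, "widget")` for each query. This only handles single-word lookups; multi-word product names would need either a query per word with the results intersected, or a proper full-text engine.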

Cfreak
This was one of my original thoughts, but then I figured that it would take up a lot of space and wouldn't suit product catalogs as well as it would informative PDFs that are mostly text.
gamerzfuse
+2  A: 

My recommendation would be Apache Solr (a search server built on Lucene), which is dead simple to use through its RESTful interface. It also integrates Tika, a related Apache project that extracts metadata and structured text content from multiple formats (incl. PDF).
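As a rough sketch of what talking to that interface looks like from Python: the code below assumes a Solr instance at localhost:8983 with a core named `pdfs` and the extracting request handler (the Tika integration) enabled at `/update/extract`. The host, core name, and function names are assumptions for illustration; which field the extracted text ends up in depends on your Solr configuration.

```python
import json
import urllib.parse
import urllib.request

# Assumed host and core name; adjust to your own Solr setup.
SOLR_BASE = "http://localhost:8983/solr/pdfs"


def extract_url(doc_id, base=SOLR_BASE):
    """Build the URL that sends a file through the extracting handler."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{base}/update/extract?{params}"


def select_url(query, rows=10, base=SOLR_BASE):
    """Build a search URL that asks for JSON results."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"


def index_pdf(pdf_path, doc_id, base=SOLR_BASE):
    """POST the raw PDF bytes to Solr and return the HTTP status code."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        extract_url(doc_id, base),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def search(query, base=SOLR_BASE):
    """Run a query and return the list of matching documents."""
    with urllib.request.urlopen(select_url(query, base=base)) as response:
        return json.load(response)["response"]["docs"]
```

Committing after every document (`commit=true`) keeps the example short, but for 70,000 pages you would probably want to commit once per batch instead.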

Mikos
This is definitely something I will look into. It may end up being a very viable option. Thanks!
gamerzfuse