Project Thoughts: Searching Directory of PDFs

+1 Q:

To preface this, I know there are discussions on this in various places. Half of what I read is outdated, buggy or simply unrelated to my situation.

This is why I am bringing it to the community that I know will have the answers.

Question: I have a directory (ideally hosted online) of PDF documents totaling around 70,000 pages. Individual documents range from 20 pages to several hundred.

I am looking for a method, script or idea for the easiest way to search these PDFs for products. The PDFs all have a text layer that was created by OCR in Acrobat.

Any ideas, whether they be elaborate or inventive, are more than welcome.

+2 A:

Use a search engine like Lucene or Sphinx to index and tag the PDFs. The Zend Framework has both a component to read and write PDF files and a Lucene implementation.

Gordon 2010-08-05 15:03:09

Sphinx can't index a PDF file directly. You'd still need something to export the text. It could be combined with the pdftotext solution I offered below. I don't know about Lucene though as I haven't used it.

Cfreak 2010-08-05 15:05:49

Is there any chance this could be done outside of Zend? CodeIgniter, for instance? Otherwise I could probably take a gander down Zend lane and see how that goes.

gamerzfuse 2010-08-05 15:06:59

@gamerzfuse Zend_PDF and Zend_Search_Lucene can be used in isolation. You do not have to use ZF's MVC or any other components. You can use CodeIgniter and the ZF components together.

Gordon 2010-08-05 15:09:11

+2 A:

Xpdf has a utility called pdftotext, which is often installed on Linux distributions. I would create a tool that uses it to build an index mapping each word to the documents it appears in. You could store the index in a database and then write a search against that.

It would take a little more space but it would be simple to include a sentence of context as well to show in the search results.
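As a rough illustration of the idea above, here is a minimal Python sketch, not a finished tool. It assumes pdftotext is on the PATH and uses SQLite as the database. The table name word_index and the function names are made up for this example, and the sentence splitting is deliberately naive.

```python
import re
import sqlite3
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on one file; '-' as the output name sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def create_index(conn):
    """Create the (hypothetical) word_index table: word -> document + context."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_index "
        "(word TEXT, document TEXT, context TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON word_index (word)")


def index_text(conn, document, text):
    """Store one row per distinct word per sentence, keeping the sentence as context."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        sentence = " ".join(sentence.split())  # collapse OCR line breaks
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        conn.executemany(
            "INSERT INTO word_index VALUES (?, ?, ?)",
            [(word, document, sentence) for word in words],
        )


def search(conn, word):
    """Return (document, context sentence) pairs for an exact, case-insensitive word."""
    rows = conn.execute(
        "SELECT DISTINCT document, context FROM word_index "
        "WHERE word = ? ORDER BY document",
        (word.lower(),),
    )
    return rows.fetchall()
```

To index a real file you would call index_text(conn, path, extract_text(path)) for each PDF in the directory. Storing the full sentence for every word is what costs the extra space. A leaner variant would store sentences once and reference them by id.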

Cfreak 2010-08-05 15:04:08

This was one of my original thoughts, but then I figured that it would take up a lot of space and wouldn't be as well suited to product catalogs as it would be to informative PDFs that are mostly text.

gamerzfuse 2010-08-05 15:06:12

+2 A:

My recommendation would be Apache Solr (a search server built on Lucene), which is dead simple to use through its RESTful interface. It also integrates with Apache Tika (via the Solr Cell extraction handler), which extracts metadata and structured text content from multiple formats (incl. PDF).
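To give a feel for the HTTP interface, here is a small Python sketch of a phrase query against Solr's select handler. The host, the core name "pdfs" and the field name "text" are assumptions for illustration only. The actual URL layout and field names depend on your Solr version and schema.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, field, phrase, rows=10):
    """Build a select URL that searches one field for an exact phrase."""
    # Escape backslashes and quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = urllib.parse.urlencode(
        {"q": f'{field}:"{escaped}"', "wt": "json", "rows": rows}
    )
    return f"{base_url}/select?{params}"


def search(base_url, field, phrase, rows=10):
    """Send the query and return the list of matching documents."""
    with urllib.request.urlopen(build_query_url(base_url, field, phrase, rows)) as resp:
        return json.load(resp)["response"]["docs"]
```

With a running server, a call such as search("http://localhost:8983/solr/pdfs", "text", "cordless drill") would return the matching documents as a list of dictionaries.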

Mikos 2010-08-05 15:07:25

This is definitely something I will look into. It may end up being a very viable option. Thanks!

gamerzfuse 2010-08-05 15:16:58
