tags:

views:

2587

answers:

10

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page were people can type in words and perform basic and/or queries then show them the document links of all matching PDF's.

A: 

If you've a Linux server, you could use Beagle to index them, and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the FireFox search box as well.

It automatically indexes files as they're included, and I'd suspect that you'll find it much more efficient to enhance or fix beagle than to write your own search interface to Lucene.

Jamie Love
A: 

Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in a quite readable fashion.

Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant, and handles large document sets in a very scalable manner.

The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions to facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.

erickson
+2  A: 

Take a look at eprints. It includes a workflow for adding new documents, automatically indexes and thumbnails PDF's and has fairly comprehensive full text search functionality. It can also be easily customised and branded.

Why re-invent the wheel. Again.

Guy
Again.... lmmfao.. mod +1 for being right and funny at the same time.
A: 

Having the (imho) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. nice web interface to spotlight, the Mac OS' built-in indexing service.

Kris
+1  A: 

Google Search Appliance http://www.google.com/enterprise/gsa/

Craig Wohlfeil
Why the downvotes?
the_drow
I don't understand the down votes either. A GSA is just what you need. Not only will it index all of your PDF's, it will also index your entire intranet and it will provide much better search results than Lucene will.
GateKiller
+1 downvotes were rather unfair. Except for the implication that the OP may be looking for a "free" solution, GSA is a worthy consideration for this type of application...
mjv
The downvotes where kind of hard. But I think the commenter could give little bit more info then just an url.
Alfred
+7  A: 

I have had good luck with lucene, but it is not click, install and search, it does require a bit of work.
If you need something that yo can download and install and be searching within 10 minutes, look at the free Ominifind Yahoo Edition http://omnifind.ibm.yahoo.net/, it uses Lucene, but is packaged such that it is configured and ready to run upon install, a much easier way to try Lucene.

Tony BenBrahim
+3  A: 

None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in and well written examples on how to roll your own.

Lucene will do pretty much whatever you need it to do, but there is overhead in terms of your time, as Tony said above. Thousands of documents really isn't that many, so you might be able to get away with a lighter weight alternative.

That said, I would still recommend looking at Solr - it's much, much easier to set up than Lucene, has support for backups, replication, etc., as well as a nifty JSON interface which would fit your use case very well: http://wiki.apache.org/solr/SolJSON

Alabaster Codify
Solr 1.4 will parse PDFs and MS Word documents.
+2  A: 

I think you want a system to manage your PDF file. Please try to use dspace system. Dspace is a digital library, it supports Lucene based on. www.dspace.org.

Sorry, I have a mistake, http://www.dspace.org/.
+6  A: 

Nutch + Lucene + Pdf plugin enabled in Nutch is your solution. Nutch allows you to parse pdfs by enabling the pdf plugin.

Lucene will allow you to index the crawled and parsed data and Nutch has servelet which gives you a search interface.

We use the same for our internal lans.

Sumit Ghosh
+1  A: 

A great free search technology you might look at is the IBM Yahoo! free search. I'm not sure whether they followed through on plans to use Lucene under the covers, but it remains one of the really great, east to use free search technologies. It handles up to 500K documents, I believe, and it supports PDF and other non-text formats as well. Graphic user interface; easy to customize search results, and basic search analytics. Basic thesaurus, and powerful API so you can do pretty much whatever you want if the out of the box results are not to your liking. We've suggested this to a number of clients where there were fewer than half a million documents, and they love it.

Miles Kehoe