I'm using Lucene.Net to build a website for searching books, articles, etc., stored as PDFs. I need to be able to filter my search results by author name, for example. Can this be done with just Lucene? Or do I need a DB to store the filter fields for each document?

Also, what's the best way to index my documents? I'll have about 50 documents to start with, and periodically I'll have to add a batch of documents to the index, maybe through a web form. Should I use a DB to store the document paths?

Thanks.

+2  A: 

Lucene ships with several Analyzers that can strip out noise words and do "stemming", which helps with full-text search, but you're still going to need to store the PDF itself somewhere. Lucene.Net is happy to build its index on the file system, and you can add a field to each Document it builds, called something like "PATH", holding the path to the original file.

Andrew Theken
+1  A: 

Here is a list of what you need to do IMO:

  1. Extract the raw text from each PDF - please see this question, which recommends iTextSharp for this purpose.
  2. For each PDF document, create a Lucene.Net document with several fields: author, title, document text and whatever else you want to search on. It is also recommended to have a unique id field per document. I suggest you also store a field with the path to the original PDF document.
  3. After indexing all the documents, you will have a Lucene index you can search by fields.
  4. You can add new documents by repeating step 2. It is easier to do this offline - incremental updates are tough.
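
To make the steps above concrete, here is a minimal plain-Python sketch of the idea. It is not Lucene.Net's API; `TinyFieldIndex`, its method names, and the field names are hypothetical stand-ins, and the tokenizer is just a whitespace split. It only shows the shape of the data: one record per PDF with a unique id, searchable text, an author to filter on, and the stored path to the original file.

```python
from collections import defaultdict


class TinyFieldIndex:
    """Toy stand-in for a fielded search index (NOT Lucene.Net's API)."""

    def __init__(self):
        self.stored = {}                  # doc_id -> stored fields
        self.postings = defaultdict(set)  # token -> ids of docs containing it

    def remove(self, doc_id):
        # Drop the stored fields and every posting that points at this id.
        if self.stored.pop(doc_id, None) is not None:
            for ids in self.postings.values():
                ids.discard(doc_id)

    def add(self, doc_id, author, title, text, path):
        # Re-adding an existing id replaces the old entry, which is why
        # a unique id field per document is useful (steps 2 and 4).
        self.remove(doc_id)
        self.stored[doc_id] = {"author": author, "title": title, "path": path}
        for token in set(text.lower().split()):
            self.postings[token].add(doc_id)

    def search(self, term, author=None):
        # Look up the term, then optionally filter the hits by author.
        hits = []
        for doc_id in sorted(self.postings.get(term.lower(), set())):
            fields = self.stored[doc_id]
            if author is not None and fields["author"].lower() != author.lower():
                continue
            hits.append((doc_id, fields["path"]))
        return hits
```

With a real Lucene.Net index the author filter would be part of the query itself, so no separate database is needed just to filter by a field.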
Yuval F
Excellent answer, thanks for simplifying it. So, there's no need for a DB at all? If I'm going to do step 2 offline and let my users add documents, would it help to send all requests to a DB, have a separate process index the ones that haven't been indexed yet, and use the primary key id as the unique id in the index? Do you think it makes sense to have a DB? If in the future I decide to keep some "related information" or something like that for each document, a DB would help, right?
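
The workflow described in this comment could be sketched roughly as follows, using Python's built-in `sqlite3` module. The table layout, the function names, and the `index_one` callback are all assumptions made for illustration; `index_one` stands in for whatever code actually adds a document to the search index.

```python
import sqlite3


def open_queue(db_path=":memory:"):
    # Hypothetical schema: one row per submitted document.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " path TEXT NOT NULL,"
        " author TEXT NOT NULL,"
        " indexed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def submit(conn, path, author):
    # Called by the web form: record the upload and return its primary key.
    cur = conn.execute(
        "INSERT INTO documents (path, author) VALUES (?, ?)", (path, author)
    )
    conn.commit()
    return cur.lastrowid


def index_pending(conn, index_one):
    # Called by the separate offline process: index every row not yet
    # indexed, passing the primary key along as the unique document id.
    rows = conn.execute(
        "SELECT id, path, author FROM documents WHERE indexed = 0 ORDER BY id"
    ).fetchall()
    for doc_id, path, author in rows:
        index_one(doc_id, path, author)
        conn.execute("UPDATE documents SET indexed = 1 WHERE id = ?", (doc_id,))
        conn.commit()
    return [row[0] for row in rows]
```

Marking each row only after `index_one` returns means a crash mid-batch leaves the remaining rows pending for the next run.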
Prabhu
You will need a DB if you need DB functionality, such as joins or sophisticated selects. This paper: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Search-Engine-versus-DBMS addresses the issue of what to put in a database vs. what to put in a search engine. A DB may be the right place for additional information you only need to display, not search.
Yuval F