I'm developing a Desktop Search Engine in Visual Basic 9 (VS2008) using Lucene.NET (v2.0).

I use the following code to initialize the IndexWriter:

Private writer As IndexWriter
writer = New IndexWriter(indexDirectory, New StandardAnalyzer(), False)
writer.SetUseCompoundFile(True)

If I select the same document folder (containing the files to be indexed) twice, the index ends up with two separate entries for each file in that folder.

I want the IndexWriter to discard any files that are already present in the Index.

What should I do to ensure this?

+2  A: 

To update a Lucene index you need to delete the old entry and write in the new entry. So you need to use an IndexReader to find the current item, use the writer to delete it, and then add your new item. The same is true for multiple entries, which I think is what you are trying to do. Just find all the entries, delete them all, and then write in the new entries.

Steve
Could you point me to a demo that shows the use of IndexReader (to find the current item and) to update the Index?
+2  A: 

If you want to delete all content in the index and refill it, you could use this statement:

writer = New IndexWriter(indexDirectory, New StandardAnalyzer(), True)

The last parameter of the IndexWriter constructor determines whether a new index is created (True, which overwrites any existing index) or whether an existing index is opened for the addition of new documents (False).

splattne
+7  A: 

As Steve mentioned, you need to use an instance of IndexReader and call its DeleteDocuments method. DeleteDocuments accepts an instance of a Term object; the related DeleteDocument method takes Lucene's internal id of the document instead. Using the internal id is generally not recommended, as it can and will change as Lucene merges segments.

The best way is to use a unique identifier that you've stored in the index specific to your application. For example, in an index of patients in a doctor's office, if you had a field called "patient_id" you could create a term and pass that as an argument to DeleteDocuments. See the following example (sorry, C#):

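int patientID = 12;
IndexReader indexReader = IndexReader.Open( indexDirectory );
// Term values are strings, so convert the numeric id before building the term.
indexReader.DeleteDocuments( new Term( "patient_id", patientID.ToString() ) );
// Closing the reader commits the deletion to the index.
indexReader.Close();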

Then you could add the patient record again with an instance of IndexWriter. I learned a lot from this article: http://www.codeproject.com/KB/library/IntroducingLucene.aspx

Hope this helps.

Ryan Ische
+2  A: 

Unless you're modifying only a small number of documents (say, fewer than 10% of the total), it's almost certainly faster to reindex from scratch, although your mileage may vary depending on stored/indexed fields, etc.

That said, I would always index to a temp directory, and then move the new one into place when it's done. That way, there's little downtime while the index is building, and if something goes wrong you still have a good index.
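
The swap step can be sketched with plain System.IO calls, no Lucene involved. This is only a rough sketch under a few assumptions: the class name IndexSwap, the method name Promote and the three directory arguments are made up for illustration, all three directories are on the same volume (Directory.Move may fail across volumes), and every IndexReader/IndexWriter on the live index has been closed before the swap, since open files would block the move on Windows.

```csharp
using System;
using System.IO;

static class IndexSwap
{
    // Hypothetical helper: moves the freshly built index in freshDir into liveDir.
    // The previous live index is parked in backupDir until the swap has succeeded.
    public static void Promote(string freshDir, string liveDir, string backupDir)
    {
        // Clear out any backup left over from an earlier run.
        if (Directory.Exists(backupDir))
            Directory.Delete(backupDir, true);

        // Park the current live index so it can be restored on failure.
        bool hadLive = Directory.Exists(liveDir);
        if (hadLive)
            Directory.Move(liveDir, backupDir);

        try
        {
            // Put the new index into place.
            Directory.Move(freshDir, liveDir);
        }
        catch
        {
            // Something went wrong: restore the old, known-good index.
            if (hadLive)
                Directory.Move(backupDir, liveDir);
            throw;
        }

        // The swap worked, so the old index is no longer needed.
        if (hadLive)
            Directory.Delete(backupDir, true);
    }
}
```

You would build the index in something like "index.new" with create set to true, close the writer, and then call IndexSwap.Promote("index.new", "index", "index.old"). The index is unavailable only for the duration of the two moves, not for the whole rebuild.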

Bob King
+5  A: 

There are many out-of-date examples out there on deleting with an id field. The code below will work with Lucene.NET 2.4.

If you're already using an IndexWriter, it's not necessary to open an IndexReader or to go through IndexSearcher.Reader. You can use IndexWriter.DeleteDocuments(Term), but the tricky part is making sure you've stored your id field correctly in the first place. Be sure to use Field.Index.NOT_ANALYZED as the index setting on your id field when storing the document. This indexes the field without tokenizing it, which is very important, and none of the other Field.Index values will work when used this way:

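// Verbatim string (@"...") so the backslash is not read as an escape sequence.
IndexWriter writer = new IndexWriter(@"\MyIndexFolder", new StandardAnalyzer());
var doc = new Document();
// NOT_ANALYZED stores the id as a single, untokenized term.
var idField = new Field("id", "MyItemId", Field.Store.YES, Field.Index.NOT_ANALYZED);
doc.Add(idField);
writer.AddDocument(doc);
writer.Commit();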

Now you can easily delete or update the document using the same writer:

Term idTerm = new Term("id", "MyItemId");
writer.DeleteDocuments(idTerm);
writer.Commit();
Ashley Tate
Even that IndexWriter constructor signature is obsolete now (and will be removed in Lucene 3.0). The suggested constructor is new IndexWriter(directory, analyzer, maxFieldLength). The parameterless StandardAnalyzer constructor is obsolete as well; the suggested one is new StandardAnalyzer(Version).
mattRo55
A: 

One option is of course to remove a document and then to add the updated version of the document.

Alternatively you can also use the UpdateDocument() method of the IndexWriter class:

writer.UpdateDocument(new Term("patient_id", document.Get("patient_id")), document);

This of course requires you to have a mechanism by which you can locate the document you want to update ("patient_id" in this example).

I have blogged more details with a more complete source code example.

John
A: 

Hi, my question is: after optimizing the writer, can we still update a document?

Deepak