Consider the following assumptions:

  1. I have a Java 5.0 web application for which I'm considering using Lucene 3.0 for full-text search
  2. There will be more than 1000K Lucene documents, each averaging 100 words
  3. New documents must be searchable immediately after they are created (real-time search)
  4. Lucene documents have a frequently updated integer field named quality

Where can I find code examples (simple but as complete as possible) of near real-time search with Lucene 3.0?

Is it possible to obtain query results sorted by a document field (quality) that may be updated frequently for already-indexed documents? Does updating such a field trigger a rebuild of the Lucene index? How does such a rebuild perform? How can it be done efficiently? I need examples or documentation of a complete solution.

If, however, index rebuilding is not needed in this case, how do I sort search results efficiently? Some queries may return a lot of documents (>50K), so I consider it inefficient to fetch them unsorted from Lucene, sort them by the quality field, and then split the sorted list into pages for pagination.

Is Lucene 3.0 my best choice within Java, or should I consider other frameworks/solutions? Maybe the full-text search provided by the database server itself (I'm using PostgreSQL 8.3)?

+3  A: 

The Lucene API is capable of everything you're asking, but it won't be easy. It's a fairly low-level API, and making it do complicated things is quite an exercise in itself.

I can highly recommend Compass, which is a search/indexing framework built on top of Lucene. As well as a much friendlier API, it provides functionality such as object/XML/JSON mapping to Lucene indexes, as well as fully transactional behaviour. It should have no trouble with your requirements, such as realtime sorting of transactionally-updated documents.

Compass 2.2.0 is built upon Lucene 2.4.1, but a Lucene 3.0-based version is in the works. It's sufficiently abstracted from the Lucene API that the transition should be seamless, though.

skaffman
Compass seems interesting, so I will give it a try.
WildWezyr
Where do I find the simplest possible example of adding some objects (resources, documents, etc.) to Compass and then searching them with a specified sort order? I tried it myself based on the documentation (it was not very helpful) and one of the examples from the Compass distribution, but I failed. I don't know how to start or where to learn from...
WildWezyr
The Compass forum is pretty good, I've received good help there in the past.
skaffman
+1  A: 

Near real-time search has been available in Lucene since 2.9. Lucid Imagination has an article about this capability (written before the 2.9 release). The basic idea is that you can now get an IndexReader from the IndexWriter. If you refresh this IndexReader at regular intervals, you get the most up-to-date changes from the IndexWriter.

Update: I haven't seen any code, but here is the broad idea.

All new documents will be written to an IndexWriter, preferably created with a RAMDirectory, which will not be closed frequently. (To persist this in-memory index, you may have to flush it to disk occasionally.)

You will have some indexes on disk, with an individual IndexReader created on each. A MultiReader and a Searcher can be created on top of these readers. One of the readers will come from the in-memory index.

At regular intervals (say, every few seconds), you remove the current reader from the MultiReader, get a new reader from the IndexWriter, and construct the MultiReader/Searcher with the new set of readers.
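The refresh cycle described above can be sketched in plain Java. Note that this is not Lucene code: Snapshot is a hypothetical stand-in for the MultiReader/Searcher pair, the pending list plays the role of the IndexWriter, and the names ReaderSwapDemo, add, refresh and search are made up for illustration. The sketch only shows the swap itself: searches read an immutable snapshot through an AtomicReference, so they can keep running while a refresh publishes a new snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Plain-Java model of the periodic reader swap; no Lucene classes are used.
class ReaderSwapDemo {

    // Hypothetical stand-in for a MultiReader/Searcher: an immutable view of the documents.
    static final class Snapshot {
        private final List<String> docs;

        Snapshot(List<String> docs) {
            // Copy the list so later writes cannot change what this snapshot sees.
            this.docs = Collections.unmodifiableList(new ArrayList<String>(docs));
        }

        int count(String term) {
            int hits = 0;
            for (String doc : docs) {
                if (doc.contains(term)) {
                    hits++;
                }
            }
            return hits;
        }
    }

    // Plays the role of the IndexWriter: it holds every document written so far.
    private final List<String> pending = new ArrayList<String>();

    // The snapshot that searches currently use.
    private final AtomicReference<Snapshot> current =
            new AtomicReference<Snapshot>(new Snapshot(new ArrayList<String>()));

    // Write a document; it is not visible to searches until the next refresh.
    synchronized void add(String doc) {
        pending.add(doc);
    }

    // Build a new snapshot from the writer's state and publish it atomically.
    synchronized void refresh() {
        current.set(new Snapshot(pending));
    }

    // Searches never take the lock; they only read the published snapshot.
    int search(String term) {
        return current.get().count(term);
    }

    public static void main(String[] args) {
        ReaderSwapDemo index = new ReaderSwapDemo();
        index.add("near real time search");
        System.out.println(index.search("search")); // prints 0: not refreshed yet
        index.refresh();
        System.out.println(index.search("search")); // prints 1: visible after the swap
    }
}
```

In a real application, refresh would be called from a timer every few seconds, and the old reader would need to be closed once no search is still using it; both details are left out of this sketch.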

According to the article from Lucid Imagination (linked above), they have tried writing 50 documents per second, without heavy slowdown.

Shashikant Kore
Where do I find code examples for that? How and when exactly must I refresh the IndexReader? How long will it take (performance)? Can I perform searches while the IndexReader is refreshing?
WildWezyr
Thanks for your update. It gives me an overview of the complexity of using near real-time search in Lucene itself. It is as skaffman said: "The Lucene API is capable of everything you're asking, but it won't be easy. It's a fairly low-level API, and making it do complicated things is quite an exercise in itself". Right now I'm looking into Compass as it promises to do the dirty job for me ;-).
WildWezyr
The real-time capabilities were added in Lucene 2.9. If Compass uses an earlier version of Lucene, you probably won't see the real-time goodies.
Shashikant Kore