I've used Lucene.net to implement search functionality (for both database content and uploaded documents) on several small websites with no problem. Now I've got a site where I'm indexing 5000+ documents (mainly PDFs) and the querying is becoming a bit slow.

I'm assuming the best way to speed it up would be to implement caching of some kind. Can anyone give me any pointers or examples on where to start? If you've got any other suggestions aside from caching (e.g. should I be using multiple indexes?) I'd like to hear those too.

Edit:

Dumb user error responsible for the slow querying. I was creating highlights for the entire results set at once, instead of just the 'page' I was displaying. Oops.
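
For anyone hitting the same thing, here is a minimal sketch of the fix: slice the result list down to the current page first, and only then run the expensive highlighting step. The `PagedHighlighter` class and the `highlight` delegate are hypothetical stand-ins for whatever highlighter call you use; nothing here is Lucene.Net API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PagedHighlighter
{
    // Highlights only the hits that fall on the requested page (1-based).
    // `highlight` is a hypothetical stand-in for the expensive highlighter call.
    public static List<string> HighlightPage(
        IReadOnlyList<string> hits, int page, int pageSize,
        Func<string, string> highlight)
    {
        return hits
            .Skip((page - 1) * pageSize) // skip earlier pages without highlighting them
            .Take(pageSize)              // keep just the page being displayed
            .Select(highlight)           // the costly work runs on pageSize items at most
            .ToList();
    }
}
```

With a page size of 10, the highlighter runs at most 10 times per request, however many documents matched.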

A: 

Lucene uses its own internal "caching" mechanism to make index retrieval a fast operation. I don't think caching is your issue here, though.

A 5000-document index sounds trivial in size, but this largely depends on how you're constructing your index, what you're indexing/storing, how you're querying (operationally), document size, etc.

Please fill in the blanks with as much information as you can about your index.

jro
+1  A: 

First, Lucene itself supports an in-memory version of directories:

Lucene.Net.Store.RAMDirectory

You can use it like:

RAMDirectory idx = new RAMDirectory();

// Make a writer to create the index
IndexWriter writer =
    new IndexWriter(idx, new StandardAnalyzer(), true);

If this works for you but uses too much RAM, write a wrapper and expose it as an interface or web service. Or, if you simply want to cache query results and control when entries drop out of the cache, you can write a wrapper around Lucene that caches the most common results, keyed on the search keywords.

I prefer the former. Create a web service or service project that wraps the Lucene store, using RAMDirectory. That way you can move the web service onto another server with lots of RAM if the index is huge, and have near-instant results.
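
As a rough illustration of the second option (a wrapper that caches the most common results by keyword), here is a small least-recently-used cache. It is only a sketch: `QueryResultCache` and the `search` delegate are hypothetical names, and the delegate stands in for the code that actually runs the Lucene query. Cached entries will go stale when the index changes, so you would need to clear the cache after re-indexing.

```csharp
using System;
using System.Collections.Generic;

// Small least-recently-used cache for search results, keyed by the query text.
public sealed class QueryResultCache<TResult>
{
    private readonly int _capacity;
    private readonly Func<string, TResult> _search; // hypothetical: runs the real query
    private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, TResult>>> _map =
        new Dictionary<string, LinkedListNode<KeyValuePair<string, TResult>>>();
    private readonly LinkedList<KeyValuePair<string, TResult>> _order =
        new LinkedList<KeyValuePair<string, TResult>>();
    private readonly object _gate = new object();

    public QueryResultCache(int capacity, Func<string, TResult> search)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
        _search = search ?? throw new ArgumentNullException(nameof(search));
    }

    public TResult Get(string queryText)
    {
        lock (_gate)
        {
            if (_map.TryGetValue(queryText, out var node))
            {
                // Cache hit: move the entry to the front (most recently used).
                _order.Remove(node);
                _order.AddFirst(node);
                return node.Value.Value;
            }

            // Cache miss: run the search (inside the lock, for simplicity) and store it.
            TResult result = _search(queryText);
            _map[queryText] = _order.AddFirst(
                new KeyValuePair<string, TResult>(queryText, result));

            if (_map.Count > _capacity)
            {
                // Evict the least recently used entry from the back of the list.
                var oldest = _order.Last;
                _order.RemoveLast();
                _map.Remove(oldest.Value.Key);
            }
            return result;
        }
    }
}
```

Holding the lock while the search runs keeps the sketch short but serialises all queries; a real wrapper would probably search outside the lock.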

eduncan911
+2  A: 

I'm going to make a big assumption here and assume you're not hanging onto your index searchers in-between calls to query the index.

If that's true, then you should definitely share index searchers for all queries to your index. As the index becomes larger (and it doesn't really have to get very large for this to become a factor), rebuilding the index searcher will become more and more of an overhead. To make this work correctly, you'll need to synchronise access to the query parser class (it isn't thread safe).
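
A minimal sketch of that idea, assuming nothing about your Lucene.Net version: the searcher is opened once and reused, and anything touching the parser goes through a lock. `SharedSearcher`, `openSearcher` and the `parse` delegate are hypothetical names; `TSearcher` stands in for your index searcher type.

```csharp
using System;

// Holds one long-lived searcher and serialises access to a parser that is not thread safe.
public sealed class SharedSearcher<TSearcher>
{
    private readonly Lazy<TSearcher> _searcher;
    private readonly object _parserLock = new object();

    public SharedSearcher(Func<TSearcher> openSearcher)
    {
        // Lazy<T> is thread safe by default, so the factory runs only once
        // even if several requests ask for the searcher at the same time.
        _searcher = new Lazy<TSearcher>(openSearcher);
    }

    // The same searcher instance is handed to every caller.
    public TSearcher Searcher => _searcher.Value;

    // Runs the parsing step under a lock so only one thread uses the parser at a time.
    public TQuery Parse<TQuery>(Func<TQuery> parse)
    {
        lock (_parserLock)
        {
            return parse();
        }
    }
}
```

You would keep a single `SharedSearcher` per index (for example in a static field) and pass it a factory that opens your index searcher.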

BTW, the Java docs are (I've found) just as applicable to the .net version.

For more info on your problem, see here: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Moleski
Nice link, thanks
Nick
No probs. Is everything working OK on your website now?
Moleski
A: 

Hi jro, in your post you mention a kind of "internal caching mechanism" for Lucene. What do you refer to? Do you know if (and how) it can be disabled?

I'm experiencing the following problem: I have two instances of Lucene which share a common index. Even though the index is correctly updated by the two applications, the changes made by one instance are not visible to the other. It seems that Lucene reads the index file once and then does not check for changes anymore.

I'm using the Java version of Lucene, but I hope they use similar basic structures.

Thank you.

Hi Michelle - If you add a comment to jro's answer above he/she will be notified, so you'll be more likely to get an answer.
Nick