I use Lucene.net to index content, documents, etc. on our CMS. This has worked well so far, but now I have to take account of the following additions to web pages:

  1. Publish date
  2. Expiry date
  3. Page 'is active'
  4. User authorisation

So the search results should only show pages that are within the Publish / Expiry window, are 'active' and that the current user is authorised to view.

Should I include the above information in the Lucene index? It will make the queries a little more complicated, but the hits collection will only return 'valid' documents which will make paging the results a lot easier.

On the other hand, I'll be repeating information that is already in the CMS database, so I'll be risking the integrity of my data, and I'll have to update the index whenever anything in the above list changes, as well as when the actual content changes.

Anyone else had this problem? How did you solve it? Thanks.

Edit: Perhaps I need to use a 'FieldCache' (mentioned here) to pass the 'valid' doc IDs into the Lucene search?

A: 

..so the search results should only show pages that are within the Publish / Expiry window, are 'active' and that the current user is authorised to view.

There are a few ways to handle the authorization issue. You could maintain multiple indexes (one per permission level), filter the results within the query (by storing the required permission in the index), or filter the results before you display them. If there are only a few levels, I think I would maintain separate indexes; it seems safest.

As for 'is active' - can you just rebuild your index with that in mind? Just rebuild your index in the background every so often and only add active content. You may have too much info to make that feasible - but Lucene is VERY fast.
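
The third option above (filtering hits before display) can be sketched roughly as follows. This is plain Java with no Lucene calls; `PageMeta`, `isVisible`, and `page` are hypothetical names made up for illustration, and the sketch assumes the publish date, expiry date, active flag, and allowed roles have already been loaded from the CMS database for each hit.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PostFilter {

    // Hypothetical per-page metadata, assumed to come from the CMS database.
    static final class PageMeta {
        final Instant publish;
        final Instant expiry;
        final boolean active;
        final Set<String> allowedRoles;

        PageMeta(Instant publish, Instant expiry, boolean active, Set<String> allowedRoles) {
            this.publish = publish;
            this.expiry = expiry;
            this.active = active;
            this.allowedRoles = allowedRoles;
        }
    }

    // A page is visible if it is active, "now" falls in [publish, expiry),
    // and the user holds at least one of the page's allowed roles.
    static boolean isVisible(PageMeta p, Instant now, Set<String> userRoles) {
        if (!p.active) return false;
        if (now.isBefore(p.publish) || !now.isBefore(p.expiry)) return false;
        for (String role : userRoles) {
            if (p.allowedRoles.contains(role)) return true;
        }
        return false;
    }

    // Walks the ranked hits in order, drops invisible ones, and returns the
    // requested page of results. Paging has to count visible hits only,
    // which is why post-filtering makes paging more awkward.
    static List<PageMeta> page(List<PageMeta> rankedHits, Instant now,
                               Set<String> userRoles, int pageIndex, int pageSize) {
        List<PageMeta> out = new ArrayList<>();
        int toSkip = pageIndex * pageSize;
        int skipped = 0;
        for (PageMeta p : rankedHits) {
            if (!isVisible(p, now, userRoles)) continue;
            if (skipped < toSkip) { skipped++; continue; }
            out.add(p);
            if (out.size() == pageSize) break;
        }
        return out;
    }
}
```

Note that every request for a later page has to re-walk the hits from the start, so this approach is probably only comfortable for small result sets.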

Shane C. Mason
My preference would be to use a Windows service to periodically rebuild the index (Lucene docs < 10,000), but app requirements dictate that changes made to content, pages, etc. are reflected ASAP in the search results.
Nick
Yeah. Stupid requirements anyway :) Unless you get a chance to update your index periodically - it looks like you are stuck filtering your results after you get them out of the index.
Shane C. Mason
A: 

Query the CMS database first, and build a BitSet with all the matching documents (you'll need a FieldCache to translate between your app's doc IDs and Lucene's internal doc IDs). Then you can run your Lucene query on your index using a Filter (wrapping the BitSet).

You keep all mutable data in your database (where it belongs), and you don't have to worry about updating or rebuilding your index. This will run very fast as well.
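
A minimal sketch of the ID translation step, in plain Java without any Lucene classes. It assumes you already have an array mapping each internal Lucene doc number to your CMS page ID (the kind of array a FieldCache lookup on an ID field would give you) and a set of page IDs that the database query returned as visible. `VisibleDocFilter`, `buildAllowedBits`, and `applyFilter` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

class VisibleDocFilter {

    // appIdByLuceneDoc[i] is the CMS page ID stored in Lucene document i.
    // Sets bit i when that page ID is in the set the database said is visible.
    static BitSet buildAllowedBits(int[] appIdByLuceneDoc, Set<Integer> visiblePageIds) {
        BitSet allowed = new BitSet(appIdByLuceneDoc.length);
        for (int luceneDoc = 0; luceneDoc < appIdByLuceneDoc.length; luceneDoc++) {
            if (visiblePageIds.contains(appIdByLuceneDoc[luceneDoc])) {
                allowed.set(luceneDoc);
            }
        }
        return allowed;
    }

    // Stand-in for what a filter does: keeps only hits whose bit is set,
    // preserving the original rank order.
    static List<Integer> applyFilter(int[] rankedHits, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : rankedHits) {
            if (allowed.get(doc)) kept.add(doc);
        }
        return kept;
    }
}
```

In a real setup the BitSet would be wrapped in a Lucene Filter and applied during the search rather than afterwards, so the hits collection only ever contains valid documents and paging stays simple.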

P.S. I've only used the Java version of Lucene, but this should work fine in Lucene.NET.

bajafresh4life