views:

95

answers:

2

I'm a .NET developer and I need to learn Lucene so we can run a very large scale search service that removes entries the end user doesn't have access to (i.e. a user can search for all documents with clearance level 3 or higher, but not clearance level 2 or 1).

Where do I start learning, which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.

A: 

If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are Java, you can easily port them to .NET. Of course, there are also tons of resources on the web, such as SO and the Lucene mailing lists, which should help you along.

For the project you describe, you should look at Solr, since it abstracts out a lot of the scalability issues and, via SolrNet, integrates easily into your .NET app. To restrict access by level, your index documents should contain a field called "Level" (say), and behind the scenes you append a clause on that field to the user's query (e.g. restricting "Level" to the user's clearance or above), combining the two with a boolean query construct.
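A minimal sketch of that idea, in Java since Lucene and Solr are Java projects. The method name and the "Level" field are illustrative assumptions, not part of Solr's or SolrNet's actual API; the point is just that the clearance restriction is a range clause ANDed onto whatever the user typed:

```java
// Hypothetical helper: wrap a raw user query with a clearance filter
// before handing it to Solr/Lucene. The field name "Level" and the
// method name are assumptions for illustration.
public class ClearanceFilter {

    // Combine the user's query with a range clause on the "Level" field,
    // so only documents at or above the given clearance level match.
    static String buildFilteredQuery(String userQuery, int minLevel) {
        return "(" + userQuery + ") AND Level:[" + minLevel + " TO *]";
    }

    public static void main(String[] args) {
        // A user with clearance 3 searching for "title:report" effectively runs:
        System.out.println(buildFilteredQuery("title:report", 3));
        // (title:report) AND Level:[3 TO *]
    }
}
```

In Solr specifically you would more likely pass the clearance clause as a separate filter query (`fq=Level:[3 TO *]`) rather than splicing it into the main query string, since filter queries are cached independently.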

At this stage, my recommendation would be to stay away from Hadoop (the Apache MapReduce implementation) for your project and stick with Solr. If you are, however, keen to learn about it, it too has a very useful book: you guessed it, "Hadoop in Action" (also from Manning Publications).

Mikos
Thanks! Can you help me understand the difference between Hadoop and Solr? Do they serve the same requirement in different ways?
MakerOfThings7
They are apples and oranges. For most enterprise applications Solr should suffice, and it scales well. Hadoop is a distributed computing platform used by organizations like Yahoo for their search indexes. Hadoop is also used for high-performance machine learning tasks; Apache Mahout is one such project. Bottom line: since you indicated you are a newbie, my recommendation would be to stick with Solr. Unless I'm missing something, I think it should more than suffice for your requirements.
Mikos
Since I have very large amounts of data that have to be indexed in real time, perhaps I need Hadoop to process and index the data, and Solr to allow users to read the data? (Via REST?)
MakerOfThings7
Perhaps you are putting the cart before the horse. Could you define "large amounts of data"? It would also be wise to check whether Solr's scaling approaches (q.v. http://bit.ly/90WhVo) work for you before reaching for solutions to hypothetical problems. In most cases, the approaches in the link above should suffice a-plenty....
Mikos
+1  A: 

You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc.) does. So the first thing to do would be to understand the purpose of each project. Read the docs and blogs about them; if possible, buy and read books about them.

For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing, but Solr is scalable on its own. You might want to use Hadoop to distribute a crawler, though (e.g. Nutch).

Mauricio Scheffer