Hi,

I have a terabyte of data, maybe more, which I'd like to index and search with Lucene. I'd like to be able to split the index across different machines, similar to what Solr does (if I understand Solr correctly).

Are there any existing tools to do this on the Windows platform?

Thanks!

Edit: I'm not very keen on running Java Lucene. I will most likely be making my own tweaks to Lucene, so I have to stick to Lucene.Net since I don't know much about Java.

A: 

As far as I know, there's no port of the MultiPassIndexSplitter class (http://lucene.apache.org/java/3_0_0/api/contrib-misc/org/apache/lucene/index/MultiPassIndexSplitter.html) to Lucene.Net, so this feature is probably not implemented yet.

mamoo
A: 

What you're looking for is Katta. Here's a graph of how it works: Katta

But since you already know Solr, why not just use its sharding capabilities directly?
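To give an idea of what sharded search looks like from the client side, here is a minimal sketch in Python. It assumes the index has already been split across two Solr instances; the host names (`host1`, `host2`) and the helper `distributed_query_url` are made up for illustration. The only Solr-specific part is the `shards` request parameter, which takes a comma-separated list of shard addresses; the instance that receives the request queries each shard and merges the results.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses; replace with the machines holding each index slice.
SHARDS = ["host1:8983/solr", "host2:8983/solr"]

def distributed_query_url(coordinator, query, shards=SHARDS, rows=10):
    """Build a Solr select URL that fans the query out to every listed shard."""
    params = {
        "q": query,
        "rows": rows,
        # Solr's distributed search reads a comma-separated list of shards.
        "shards": ",".join(shards),
    }
    # urlencode percent-encodes ':', '/' and ',' so the URL stays valid.
    return f"http://{coordinator}/select?{urlencode(params)}"

url = distributed_query_url("host1:8983/solr", "title:lucene")
print(url)
```

Any of the instances can act as the coordinator here; the client only needs to know the shard list.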

Mauricio Scheffer
I will most likely be making my own tweaks to Lucene so I have to stick to Lucene.Net since I don't know much about Java.
@user72185 ok, then why not just use Solr?
Mauricio Scheffer
I haven't actually tried Solr, but wouldn't that mean I would have to change Java code if I wanted to make changes to the underlying Lucene?
what kind of things do you intend to change on Lucene?
Mauricio Scheffer
1) Search without scoring. 2) Faster fuzzy search. 3) Adding some parallelism with the Task Parallel Library. 4) Custom analyzer. I'm sure more will come up.
1) You can use boosting, sorting or function queries, among others, to customize the order of your search results. 2) Solr is quite fast as it is; it powers some of the biggest sites on the net. 3) No need to do that on Solr, but you can do that client-side if you want/need. 4) Solr has pluggable analyzers (written in Java, though).
Mauricio Scheffer
1) Completely disabling ordering seems to be an order of magnitude faster than the default when the result set is large (e.g. > 1000000). Unfortunately, it's not enough to change the collector (I tried making a NullCollector); it is the Scorer that spends a lot of time traversing every hit.
2) I don't think they have fuzzy search, at least not using the default implementation.
2) Solr supports the standard Lucene ~ operator for fuzzy searches.
Mauricio Scheffer
2) What I meant is that if these large sites have huge amounts of data, they probably changed the fuzzy implementation. See for instance here: http://www.nearinfinity.com/blogs/aaron_mccurry/what_happens_to_lucene_when.html (strange, the link is down right now but usually works).
A: 

Solr is a Java app so it runs on Windows. You can find details on how to configure it as a Windows service here: http://blog.ianbattersby.com/archive/2010/02/09/apache-solr-as-a-windows-service

Pascal Dimassimo