views: 1399
answers: 2
Currently I do it like this:

IndexSearcher searcher = new IndexSearcher(lucenePath);
Hits hits = searcher.Search(query);
Document doc;
List<string> companyNames = new List<string>();

for (int i = 0; i < hits.Length(); i++)
{
    doc = hits.Doc(i);
    companyNames.Add(doc.Get("companyName"));
}
searcher.Close();

companyNames = companyNames.Distinct().Skip(offSet ?? 0).ToList();
return companyNames.Take(count ?? companyNames.Count).ToList();

As you can see, I first collect ALL the field values (several thousand), then remove duplicates, and only then skip and take the slice I actually need.

I feel like there should be a better way to do this.

A: 

I'm not sure there is, honestly, as Lucene doesn't provide 'distinct' functionality out of the box. I believe Solr can achieve this with a facet search, but in plain Lucene you'd have to write that facet functionality yourself. So as long as you don't run into performance issues, you should be OK this way.

Razzie
Ok, thanks for letting me know.
borisCallens
+2  A: 

Tying this question to an earlier question of yours (re: "Too many clauses"), I think you should definitely be looking at term enumeration from the index reader. Cache the results (I used a sorted dictionary keyed on the field name, with a list of terms as the data, to a max of 100 terms per field) until the index reader becomes invalid and away you go.

Or perhaps I should say: when faced with a similar problem to yours, that's what I did.
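To make that concrete, here is a minimal sketch of term enumeration against the Lucene.NET 2.x API, reusing the `lucenePath`, `offSet`, and `count` variables from your question and assuming `companyName` is an untokenized field (if it's analyzed, you'll get individual tokens rather than whole names):

    // Enumerate the distinct terms of the "companyName" field directly
    // from the index, instead of loading every document.
    IndexReader reader = IndexReader.Open(lucenePath);
    List<string> companyNames = new List<string>();

    // Position the enumerator at the first term of the target field.
    TermEnum terms = reader.Terms(new Term("companyName", ""));
    try
    {
        do
        {
            Term t = terms.Term();
            if (t == null || t.Field() != "companyName")
                break; // we've moved past this field's terms
            companyNames.Add(t.Text()); // terms arrive distinct and sorted
        } while (terms.Next());
    }
    finally
    {
        terms.Close();
        reader.Close();
    }

    return companyNames.Skip(offSet ?? 0).Take(count ?? companyNames.Count).ToList();

Because the index stores each term once, in sorted order, the `Distinct()` pass disappears entirely, and you never touch the documents themselves.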

Hope this helps,

Moleski
Could you elaborate on what you mean with "Term Enumeration"? Do you mean enumerating all my documents and getting those fields so I can use C#'s StartsWith()?
borisCallens
+1 for seeing the question behind the question
borisCallens
Have a look at the Terms member function of the IndexReader class. BTW, I found out a good deal about this kind of thing by having a look at the Luke source code. Very interesting!
Moleski
I'm not a big fan of Luke, actually. I don't know why, but it takes ages to parse each query. Way slower than my own queries.
borisCallens