I am working on a store search API using Lucene.

I need to show store search results grouped by each City, State combination, with its frequency in brackets. For example:

Los Angeles, CA (450)
Atlanta, GA (212)
Boston, MA (78)
.
.
.

As of now, my search results return around 7000 Lucene documents, on average, if the user says "Show me all the stores". In this use case, I end up showing around 800 unique City,State records as shown above.

I am overriding the HitCollector class's Collect method and retrieving vectors as follows:

var vectors = _reader.GetTermFreqVectors(doc);

Then I iterate through this collection and calculate the frequency for each unique City,State combination.

But this turns out to be very slow. Is there a better way to group search results and calculate frequencies in Lucene? A code snippet would be very helpful.

Also, please suggest any other techniques or tips I could use to optimize my Lucene search code.

Thanks for reading!

+2  A: 

I don't believe you can do this out of the box in Lucene currently. Searching for this functionality turns up this open issue:

Jira Lucene Feature Request

Solr, however, does provide this functionality out of the box through its faceting feature. A query such as the following:

http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.field=inStock

would return the following result:

<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="4" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="cat">
        <int name="search">0</int>
        <int name="memory">0</int>
        <int name="graphics">0</int>
        <int name="card">0</int>
        <int name="music">1</int>
        <int name="software">0</int>
        <int name="electronics">3</int>
        <int name="copier">0</int>
        <int name="multifunction">0</int>
        <int name="camera">0</int>
        <int name="connector">2</int>
        <int name="hard">0</int>
        <int name="scanner">0</int>
        <int name="monitor">0</int>
        <int name="drive">0</int>
        <int name="printer">0</int>
  </lst>
  <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
  </lst>
 </lst>
</lst>
</response>
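
To turn a response like this into "value (count)" lines, the facet_fields section can be read with any XML parser. The following is only a sketch in Python using the standard library. The names SAMPLE_RESPONSE, parse_facet_fields and format_facet are made up for illustration, and the sample is an abbreviated copy of the response above with the zero-count "cat" entries left out.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response above (zero-count "cat" entries omitted).
SAMPLE_RESPONSE = """<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="4" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="cat">
        <int name="music">1</int>
        <int name="electronics">3</int>
        <int name="connector">2</int>
  </lst>
  <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
  </lst>
 </lst>
</lst>
</response>"""


def parse_facet_fields(xml_text):
    """Return {field: {value: count}} from a Solr XML facet response."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    facets = {}
    for field in root.findall(path):
        facets[field.get("name")] = {
            entry.get("name"): int(entry.text) for entry in field.findall("int")
        }
    return facets


def format_facet(counts):
    """Render counts as 'value (n)' strings, highest count first, skipping zeros."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["%s (%d)" % (value, n) for value, n in ranked if n > 0]


facets = parse_facet_fields(SAMPLE_RESPONSE)
for line in format_facet(facets["cat"]):
    print(line)
```

With a single field holding the combined City, State value, the same two helpers would produce the list the question asks for.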

More information on faceting can be found on the Solr website:

http://wiki.apache.org/solr/SimpleFacetParameters

EDIT: If you definitely don't want to go down the Solr route for faceting, you may be able to leverage the patch described here for Lucene:

http://sujitpal.blogspot.com/2007/01/faceted-searching-with-lucene.html

which provides an implementation of the faceting feature on top of Lucene 2.0 via a patch.

Jon
Can you please answer this one? http://stackoverflow.com/questions/899542/problem-using-same-instance-of-indexsearcher-for-multiple-requests
Steve Chapman
A: 

I'm not sure that I understood what you mean by "grouping", but if you just want to count the number of docs for each category, you should take a look at this question.

My answer there still stands, though nobody seemed to like it enough to upvote me...

itsadok
A: 

Thanks, all, for your input. itsadok, here is what I mean by "grouping". Let's say my Lucene index has 4 documents, each with a field called "category" holding values like "Category1", "Category2", etc.

document1---Category1
document2---Category1,Category2
document3---Category2
document4---Category1

When I search for all documents, I should see all 4 documents in the search results, with grouping like:

Category1 (3)
Category2 (2)

where the number in brackets is the number of documents containing "CategoryX". Hope this helps.
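
The counting itself can be sketched in a few lines. This is plain Python over in-memory lists, not Lucene code. The names documents and count_categories are hypothetical, and the data is just the four example documents from this post.

```python
from collections import Counter

# The four example documents; "category" may hold several values per document.
documents = {
    "document1": ["Category1"],
    "document2": ["Category1", "Category2"],
    "document3": ["Category2"],
    "document4": ["Category1"],
}


def count_categories(hits, docs):
    """Count, for each category value, how many of the hit documents contain it."""
    counts = Counter()
    for doc_id in hits:
        # set() so a value repeated inside one document is only counted once
        counts.update(set(docs[doc_id]))
    return counts


# "Search for all documents": every document is a hit.
counts = count_categories(documents.keys(), documents)
for category, n in sorted(counts.items()):
    print("%s (%d)" % (category, n))
```

Passing a smaller list of hits gives the counts for just that result set, which is the behaviour wanted for a narrower search.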

Steve Chapman
A: 

Steve, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try Solr, which has faceting as a major and convenient feature.

Yuval F
Can you please answer this one? http://stackoverflow.com/questions/899542/problem-using-same-instance-of-indexsearcher-for-multiple-requests
Steve Chapman