views:

3353

answers:

5

I am trying to use Lucene Java 2.3.2 to implement search on a catalog of products. Apart from the regular fields for a product, there is field called 'Category'. A product can fall in multiple categories. Currently, I use FilteredQuery to search for the same search term with every Category to get the number of results per category.

This results in 20-30 internal search calls per query to display the results. This is slowing down the search considerably. Is there a faster way of achieving the same result using Lucene?

A: 

You may want to consider looking through all the documents that match categories using a TermDocs iterator.

This example code goes through each "Category" term, and then counts the number of documents that match that term.

public static void countDocumentsInCategories(IndexReader reader) throws IOException {
    TermEnum terms = null;
    TermDocs td = null;


    try {
        terms = reader.terms(new Term("Category", ""));
        td = reader.termDocs();
        do {
            Term currentTerm = terms.term();

            if (!currentTerm.field().equals("Category")) {
                break;
            }

            int numDocs = 0;
            td.seek(terms);
            while (td.next()) {
                numDocs++;
            }

            System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
        } while (terms.next());
    } finally {
        if (td != null) td.close();
        if (terms != null) terms.close();
    }
}

This code should run reasonably fast even for large indexes.

Here is some code that tests that method:

public static void main(String[] args) throws Exception {
    RAMDirectory store = new RAMDirectory();

    IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
    addDocument(w, 1, "Apple", "fruit", "computer");
    addDocument(w, 2, "Orange", "fruit", "colour");
    addDocument(w, 3, "Dell", "computer");
    addDocument(w, 4, "Cumquat", "fruit");
    w.close();

    IndexReader r = IndexReader.open(store);
    countDocumentsInCategories(r);
    r.close();
}

private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
    Document d = new Document();
    d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));

    for (String category : categories) {
        d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

    w.addDocument(d);
}
Matt Quail
This just counts the documents tagged by each term in the Category field, which you could do much faster with terms.docFreq(). What's missing is the intersection with the hits from the user's search criteria.
erickson
+3  A: 

I don't have enough reputation to comment (!) but in Matt Quail's answer I'm pretty sure you could replace this:

int numDocs = 0;
td.seek(terms);
while (td.next()) {
    numDocs++;
}

with this:

int numDocs = terms.docFreq()

and then get rid of the td variable altogether. This should make it even faster.

Rowan
you'll be there in no time (commenting)
mattlant
A: 

So let me see if I understand the question correctly: Given a query from the user, you want to show how many matches there are for the query in each category. Correct?

Think of it like this: your query is actually originalQuery AND (category1 OR category2 or ...) except as well an overall score you want to get a number for each of the categories. Unfortunately the interface for collecting hits in Lucene is very narrow, only giving you an overall score for a query. But you could implement a custom Scorer/Collector.

Have a look at the source for org.apache.lucene.search.DisjunctionSumScorer. You could copy some of that to write a custom scorer that iterates through category matches while your main search is going on. And you could keep a Map<String,Long> to keep track of matches in each category.

Rowan
+5  A: 

Here's what I did, though it's a bit heavy on memory:

What you need is to create in advance a bunch of BitSets, one for each category, containing the doc id of all the documents in a category. Now, on search time you use a HitCollector and check the doc ids against the BitSets.

Here's the code to create the bit sets:

public BitSet[] getBitSets(IndexSearcher indexSearcher, 
                           Category[] categories) {
    BitSet[] bitSets = new BitSet[categories.length];
    for(int i=0; i<categories.length; i++)
    {
        Query query = categories[i].getQuery();
        final BitSet bitset = new BitSet()
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bitSet.set(doc);
            }
        });
        bitSets[i] = bitSet;
    }
    return bitSets;
}

This is just one way to do this. You could probably use TermDocs instead of running a full search if your categories are simple enough, but this should only run once when you load the index anyway.

Now, when it's time to count categories of search results you do this:

public int[] getCategroryCount(IndexSearcher indexSearcher, 
                               Query query, 
                               final BitSet[] bitSets) {
    final int[] count = new int[bitSets.length];
    indexSearcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            for(int i=0; i<bitSets.length; i++) {
                if(bitSets[i].get(doc)) count[i]++;
            }
        }
    });
    return count;
}

What you end up with is an array containing the count of every category within the search results. If you also need the search results, you should add a TopDocCollector to your hit collector (yo dawg...). Or, you could just run the search again. 2 searches are better than 30.

itsadok
Other implementation for the getCategoryCount part: You could actually get a BitSet from your search (using a collector) and then intersect that resultsBitSet with whatever categoryBitSet you're interested in. Intersection should be faster than checking each doc, and you can also intersect multiple categories before intersecting with the results BitSet.
deathy
+2  A: 

Sachin, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try using SOLR, that has faceting as a major and convenient feature.

Yuval F