tags:

views:

59

answers:

5

Suppose I have a lucene query 'id1 OR id2 OR id3 ... idN'. How well does that scale as N increases?

The situation I'm looking at would be similar to someone doing a text search on products in their shopping cart, but they may have hundreds or thousands of items their shopping cart. The user wants to do a text search across all products in their shopping cart. Could I do a text query against all available products, then limit the items returned with a OR clause of product IDs in their cart?

A: 

There's a limit on the amount of boolean statements in your query.

Guillaume Lebourgeois
+3  A: 

The maximum number of clauses in a boolean query is 1024 by default. You can increase this limit. There would be performance penalty, though. I suppose, it would be efficient if you use Filters instead.

Shashikant Kore
Thanks. I am learning Lucene and noticed filters might also solve this, I need to look into this. Are you referring to a filter applied during tokenization, or another kind? Could you describe the performance difference (why is that approach more performant)?
Frank Schwieterman
I was referring to TermsFilter that @Kai Chan pointed in the next answer.
Shashikant Kore
+1  A: 

As @Shashikant Kore mentioned the limit is 1024 by default.

If you have a very large collection of text, you might want to look at the MoreLikeThis implementation - it uses some neat heuristics to generate a representative query from the content you have.

Mikos
+1  A: 

Use FilteredQuery during search time. Its constructor takes a query and a filter. Create the query from what the user enters (take a look at QueryParser). Create the filter from the list of product IDs (take a look at TermsFilter).

Kai Chan
+1  A: 

As some people already answered, there are practical limitations. However, if you are interested in the theory, there is really no difference between doing a bunch of OR'd terms versus a single term with a lot of possible results. If p is the number of postings (term/doc pairs) which match your query, and you want to find the k best matches, the query will run in O(p log k). See Doug's paper Space Optimizations for Total Ranking.

If you have q query terms OR'd together and t terms in your index total, it will actually be something like O(q log t + p log k), but for most applications, p log k will dominate that. (This formula came from the fact that it takes log t time to find the posting stream, and you have to do it once per query term.)

Xodarap
That's great, thanks.
Frank Schwieterman