tags:

views:

43

answers:

2

I have a question regarding performing a lucene query involving permutation.

Say I have two fields: "name" and "keyword" and the user searches for "joes pizza restaurant". I want some part of that search to match the full contents of the "name" field and some part to match the full content of the keyword field. It should match all the the supplied terms and should match the entire contents of the fields. For example it could match:

1) name:"joes restaurant" keyword:"pizza"
2) name:"joes pizza" keyword:"restaurant"
3) name:"pizza restaurant" keyword:"joes"
4) name:"pizza" keyword:"joes restaurant"
5) name:"pizza joes" keyword:"restaurant"

but it would not match

6) name:"big joes restaurant" keyword:"pizza" - because it's not a match on the full field
7) name:"joes pizza restaurant" keyword:"nomatch" - because at least one of the terms should match to the keyword field

I've thought about possible ways to implement this by calculating all the permutations of the fields and using boolean queries however this doesn't scale very well as the number of terms increases. Anyone have any clues how to implement this sort of query efficiently?

A: 

Let's divide your query into three parts:

  1. Both the 'name' field and the 'keyword' field should contain part of the query.
  2. Both matches should be to the full field.
  3. The union of the matches should cover the query completely.

I would implement it this way:

  1. Create a boolean query composed of the tokens in the original query. Make it a disjunction of 'MUST' terms. e.g. in the example something like:

    (name:joes OR name:restaurant OR name:pizza) AND (keyword:joes OR keyword:restaurant OR keyword:pizza) Any document matching this query has a part of the original query in each field. (This could be a ConstantScoreQuery to save time).

  2. Take the set of matches from the first query. Extract the field contents as tokens, and store them in String sets. Keep only the matches where the union of the sets equals the string set from your original query, and the sets have an empty intersection. (This handles the covering - item 3 above). For your first example, we will have the sets {"joes", "restaurant"} and {"pizza"} fulfilling both conditions.

  3. Take the set sizes from the matches left, and compare them to the field lengths. For your first example we will have set sizes of 2 and 1 which should correspond to field lengths of 2 and 1 respectively.

Note that my items 2 and 3 are not part of the regular Lucene scoring but rather external Java code.

Yuval F
A: 

Lucene docs recommend using separate field which is concatenation of 'name' and 'keyword' fields for queries spanning multiple fields. Do the search on this field.

Yaroslav
Combined with the answer to go this sounds like reasonable approach.
Glen