ansaurus

Question

Answer 1

+2 A:

SpanNearQuery lets you find terms that are within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters) – you could use the following SpanQuery:

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

[image: sample text showing the distance between Lucene and Doug]

In this sample text, Lucene is within 3 of Doug
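
To make the idea of "within N positions" concrete, here is a minimal plain-Java sketch of a proximity check over a list of tokens. It is only a simplified model, not Lucene code: the class ProximitySketch and the method withinSlop are hypothetical names, and it assumes slop is counted as the number of token positions lying between the two terms. Lucene's own slop accounting may differ in its details.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a proximity check; hypothetical helper, not part of Lucene.
class ProximitySketch {

    // Returns true if some occurrence of `first` and some occurrence of `second`
    // have at most `slop` token positions between them. When `inOrder` is true,
    // `first` must appear before `second`.
    static boolean withinSlop(List<String> tokens, String first, String second,
                              int slop, boolean inOrder) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (i == j || !tokens.get(j).equals(second)) continue;
                if (inOrder && j < i) continue;   // wrong order, skip this pair
                int gap = Math.abs(j - i) - 1;    // positions between the two terms
                if (gap <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "lucene", "is", "a", "search", "library", "by", "doug");
        System.out.println(withinSlop(doc, "lucene", "doug", 5, true));
        System.out.println(withinSlop(doc, "doug", "lucene", 5, true));
    }
}
```

In this model the ordered flag plays the same role as the final boolean argument to SpanNearQuery: reversing the two terms turns a match into a non-match.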

But for your example, the only match I can see is that both your query and the target document have "cd" (and I am making the assumption that all of those terms are in a single field). In that case, you don't need to use any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that they both contain the same term in the same field.

Edit 3 - in response to latest comment, the answer is that you cannot use SpanNearQuery to do anything other than that which it is intended for, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case / expected results are (feel free to post it), but in the last case if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.

Edit 4 - now that you have posted your use case, I understand what it is you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want as well as the SpanNearQuery, and set a boost on the SpanNearQuery.

So, the query in text form would look like:

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

(As an example: this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents where the terms "BAZ" and "EXTRA" occur within 100 places of each other; adjust the distance and boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's OK, because the next section shows how to build the same query using the API.)

Programmatically, you would construct this as follows:

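import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Declared as BooleanQuery (not Query) so that add() is available
BooleanQuery top = new BooleanQuery();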

// Construct the terms since they will be used more than once
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm), 
                                                new SpanTermQuery(extraTerm) }, 
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);
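
To see why this combination gives partial matches plus a proximity bonus, here is a toy plain-Java model of the three SHOULD clauses. It is a sketch under loud assumptions: BoostSketch and score are hypothetical names, each matching clause simply contributes a fixed amount, and the proximity clause contributes 5. Lucene's real scoring also depends on factors such as term frequency, inverse document frequency and field length, so actual scores will differ.

```java
import java.util.Arrays;
import java.util.List;

// Toy scoring model; hypothetical, and NOT Lucene's actual scoring formula.
class BoostSketch {
    static final float PROXIMITY_BOOST = 5f; // mirrors spanQuery.setBoost(5f)
    static final int SLOP = 100;             // mirrors the slop of 100

    // True if some "BAZ" occurs before some "EXTRA" with at most SLOP
    // token positions between them.
    static boolean bazNearExtra(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals("BAZ")) continue;
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals("EXTRA") && (j - i - 1) <= SLOP) {
                    return true;
                }
            }
        }
        return false;
    }

    // Each optional (SHOULD) clause adds to the score only when it matches;
    // a document matching no clause scores 0 and would not be returned.
    static float score(List<String> tokens) {
        float s = 0f;
        if (tokens.contains("BAZ")) s += 1f;
        if (tokens.contains("EXTRA")) s += 1f;
        if (bazNearExtra(tokens)) s += PROXIMITY_BOOST;
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score(Arrays.asList("BAZ", "foo", "EXTRA")));
        System.out.println(score(Arrays.asList("BAZ", "foo")));
    }
}
```

A document containing only one of the terms still gets a non-zero score, while one with both terms in order and close together scores highest. Note that in this model, as in the ordered SpanNearQuery above, "EXTRA" appearing before "BAZ" earns no proximity bonus.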

Hope that helps! In the future, try to start off by posting exactly what results you are expecting - even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.

danben 2010-01-07 16:33:35

The in-line image explaining distance is a nice touch.

Brian Mansell 2010-01-07 20:56:57

That's what I initially assumed as well. However, the document in question does not get returned from my search.

Franz See 2010-01-07 22:11:37

Maybe you could post some code showing how you're searching?

danben 2010-01-07 22:40:13

Kindly see a simplified version of the problem I'm referring to.

Franz See 2010-01-08 08:24:51

I modified my post again and clarified the requested information - the first 3 pass and the fourth fails.

Franz See 2010-01-09 00:57:34

Re edit 2: Yes, exactly :) which brings us back to my original question which is how do I do partial matching using SpanNearQuery (or some proximity-aware query).

Franz See 2010-01-09 08:53:19

Re edit 3: Re SpanNearQuery - thanks. That is why I stated it does not work, and why I am asking how to work around it. Re my specific use case: it is what it is :) Given the terms, I need to give a higher score to matches where the terms are close together (which means it's most likely what the user is searching for). Yet it needs to be lax enough that documents are still returned when not all the terms are found (but still, closer proximity means a higher score).

Franz See 2010-01-10 13:15:27

Thanks! That's what I did as well (except that my boost factor is equal to the number of tokens, to compensate for the higher scores OR-queries usually give out). Sometimes the results make sense, and sometimes they do not. I guess I need to find out what those other factors are. Thanks! Re your tip: Will keep that in mind. Thank you for your patience! :-)

Franz See 2010-01-11 00:07:38

Lucene SpanNearQuery partial matching