I am using Lucene for Java, and need to figure out what the engine does when I execute some obscure queries. Take the following query:

+(foo -bar)

If I use QueryParser to parse the input, I get a BooleanQuery object that looks like this:

org.apache.lucene.search.BooleanQuery:
    org.apache.lucene.search.BooleanClause(required=true, prohibited=false):
        org.apache.lucene.search.BooleanQuery:
            org.apache.lucene.search.BooleanClause(required=false, prohibited=false):
                org.apache.lucene.search.TermQuery: foo
            org.apache.lucene.search.BooleanClause(required=false, prohibited=true):
                org.apache.lucene.search.TermQuery: bar

What does Lucene look for? Is it documents that MUST contain 'foo' but CANNOT contain 'bar'? What if I search for:

-(foo +bar)

Are those documents that CANNOT contain 'foo' and CANNOT contain 'bar'? Or perhaps ones that CANNOT contain 'foo' but MUST contain 'bar'?

If it helps any, here is what I used to peek into the QueryParser results:

QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
Query query = parser.parse(text);
debug(query, 0);

public static void debug(Object o, int depth) {
    for(int i=0; i<depth; i++) System.out.print("\t");
    System.out.print(o.getClass().getName());

    if(o instanceof BooleanQuery) {
        System.out.println(":");
        for(BooleanClause clause : ((BooleanQuery)o).getClauses()) {
            debug(clause, depth + 1);
        }
    } else if(o instanceof BooleanClause) {
        BooleanClause clause = (BooleanClause)o;
        System.out.println("(required=" + clause.isRequired() + ", prohibited=" + clause.isProhibited() + "):");
        debug(clause.getQuery(), depth + 1);
    } else if(o instanceof TermQuery) {
        TermQuery term = (TermQuery)o;
        System.out.println(": " + term.getTerm().text());
    } else {
        throw new IllegalArgumentException("Unknown object type");
    }
}
A: 

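By default, Lucene's QueryParser treats terms with no explicit operator as optional (OR), so the first query is equivalent to

+(foo OR -bar)

A prohibited clause never selects documents by itself; it only removes documents from what the other clauses match. The query therefore matches documents that contain "foo" (in the default field) and do not contain "bar".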

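In the second query, the "+" makes "bar" required, which makes the optional "foo" redundant for matching (it can only affect scoring), so the query can be reduced to "-bar". Logically that means all documents that don't contain "bar", but Lucene returns no hits for a query made up only of prohibited clauses. To get those documents back, combine it with a positive clause such as a MatchAllDocsQuery.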
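To make the clause rules concrete, here is a minimal sketch in plain Java that models how required, optional, and prohibited clauses combine. It is a toy model, not Lucene's implementation: the class BooleanClauseModel and its helpers (term, bool, clause, doc, firstQuery, secondQuery) are hypothetical names, a document is just a set of terms, and scoring is ignored. It assumes that a prohibited clause only excludes documents and that a query with nothing but prohibited clauses matches nothing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Toy model of boolean clause matching. Hypothetical code, NOT Lucene's implementation. */
public class BooleanClauseModel {

    enum Occur { MUST, SHOULD, MUST_NOT }

    /** A sub-query (a predicate over a document's terms) plus its occurrence flag. */
    static final class Clause {
        final Occur occur;
        final Predicate<Set<String>> query;

        Clause(Occur occur, Predicate<Set<String>> query) {
            this.occur = occur;
            this.query = query;
        }
    }

    static Clause clause(Occur occur, Predicate<Set<String>> query) {
        return new Clause(occur, query);
    }

    /** A term query matches when the document contains the term. */
    static Predicate<Set<String>> term(String text) {
        return doc -> doc.contains(text);
    }

    /** Combines clauses under the assumptions stated above. */
    static Predicate<Set<String>> bool(Clause... clauses) {
        return doc -> {
            boolean hasPositive = false;   // saw at least one MUST or SHOULD clause
            boolean hasMust = false;       // saw at least one MUST clause
            boolean shouldMatched = false; // at least one SHOULD clause matched
            for (Clause c : clauses) {
                boolean hit = c.query.test(doc);
                switch (c.occur) {
                    case MUST_NOT:
                        if (hit) return false;   // prohibited clause excludes the document
                        break;
                    case MUST:
                        hasPositive = true;
                        hasMust = true;
                        if (!hit) return false;  // required clause is missing
                        break;
                    case SHOULD:
                        hasPositive = true;
                        shouldMatched |= hit;
                        break;
                }
            }
            if (!hasPositive) return false;      // only prohibited clauses: nothing is selected
            return hasMust || shouldMatched;     // without MUST, some SHOULD has to match
        };
    }

    /** Builds a document as a set of terms. */
    public static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    /** Models +(foo -bar). */
    public static Predicate<Set<String>> firstQuery() {
        return bool(clause(Occur.MUST,
                bool(clause(Occur.SHOULD, term("foo")), clause(Occur.MUST_NOT, term("bar")))));
    }

    /** Models -(foo +bar). */
    public static Predicate<Set<String>> secondQuery() {
        return bool(clause(Occur.MUST_NOT,
                bool(clause(Occur.SHOULD, term("foo")), clause(Occur.MUST, term("bar")))));
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(doc("foo"), doc("foo", "bar"), doc("bar"), doc());
        for (Set<String> d : docs) {
            System.out.println(d + " +(foo -bar): " + firstQuery().test(d)
                    + ", -(foo +bar): " + secondQuery().test(d));
        }
    }
}
```

Under these assumptions, the first query accepts only the document containing "foo" without "bar", and the second query accepts none of the four sample documents, because its single top-level clause is prohibited.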

KenE
Thanks, that makes sense!
A: 

Luke (http://www.getopt.org/luke/) is very useful for understanding what queries do.

raticulin
This tool didn't help me with the question I posted, but it provides a lot of valuable diagnostics information that I know will be useful in the future. Thanks for this, I forwarded it to my team.