Why does the wildcard query "dog#V*" fail to retrieve a document that contains "dog#VVP"?

The following code written in Jython for Lucene 3.0.0 fails to retrieve the indexed document. Am I missing something?

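import java.io  
from org.apache.lucene.analysis import WhitespaceAnalyzer  
from org.apache.lucene.document import Document, Field  
from org.apache.lucene.index import IndexWriter  
from org.apache.lucene.queryParser import QueryParser  
from org.apache.lucene.search import IndexSearcher  
from org.apache.lucene.store import FSDirectory  
from org.apache.lucene.util import Version  

analyzer = WhitespaceAnalyzer()  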
directory = FSDirectory.open(java.io.File("testindex"))  
iwriter = IndexWriter(directory, analyzer, True, IndexWriter.MaxFieldLength(25000))  

doc = Document()  
doc.add(Field("sentence", "dog#VVP", Field.Store.YES, Field.Index.ANALYZED))  
iwriter.addDocument(doc)  
iwriter.close()  
directory.close()  

parser = QueryParser(Version.LUCENE_CURRENT, "sentence", analyzer)  
directory = FSDirectory.open(java.io.File("testindex"))  
isearcher = IndexSearcher(directory, True) # read-only=true  

query_text = "dog#V*"  
query = parser.parse(query_text)  
hits = isearcher.search(query, None, 10).scoreDocs  
print query_text + ":" + ", ".join([str(x) for x in list(hits)])  

Output is:

dog#V*: 

It doesn't return anything. I see the same behaviour for dog#VV* and with separator characters other than "#" (I tried "__" and "aaa"). Interestingly, the following queries work: dog#???, dog#*.

+2  A: 

If you'd looked carefully at the result of

parser.parse("dog#V*")

you'd have seen

sentence:dog#v*

Note the lowercase v! To avoid the automatic lowercasing of terms in a wildcard query, you'll have to do

parser.setLowercaseExpandedTerms(False)

before parsing query strings. I have no idea why the default is to lowercase.
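To make the effect concrete, here is a minimal sketch in plain Python (not Lucene) that imitates the behaviour described above. The helper names `parse_wildcard` and `search` are hypothetical, and `fnmatchcase` is only a stand-in for Lucene's wildcard matching; the assumption is that the index holds the term exactly as written ("dog#VVP") and that matching is case-sensitive. It also suggests why dog#??? and dog#* still work: those patterns contain no uppercase letters, so lowercasing leaves them unchanged.

```python
from fnmatch import fnmatchcase

# Terms as a case-preserving analyzer would index them (assumption for this sketch).
indexed_terms = ["dog#VVP"]

def parse_wildcard(pattern, lowercase_expanded_terms=True):
    """Hypothetical stand-in for the parser's treatment of wildcard terms."""
    return pattern.lower() if lowercase_expanded_terms else pattern

def search(pattern, lowercase_expanded_terms=True):
    """Return the indexed terms that match the effective pattern, case-sensitively."""
    effective = parse_wildcard(pattern, lowercase_expanded_terms)
    return [term for term in indexed_terms if fnmatchcase(term, effective)]

for pattern in ["dog#V*", "dog#VV*", "dog#???", "dog#*"]:
    default_hits = search(pattern)
    exact_hits = search(pattern, lowercase_expanded_terms=False)
    print("%-8s default: %s  without lowercasing: %s" % (pattern, default_hits, exact_hits))
```

With lowercasing on, only the patterns that contain an uppercase letter lose their match; with it off, all four patterns find the term.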

Jonathan Feinberg
Thanks! That solves my current problem. As far as I understand, lowercasing is just another filter in WhitespaceAnalyzer. I will try and see what happens if I use my own custom Analyzer class (which will employ a TurkishLowerCase), but could anyone explain the mechanism and rationale behind this default?
Amaç Herdağdelen
@Amaç - Note that lowercasing is NOT part of WhitespaceAnalyzer, but rather a default behavior of the query parser. Therefore, if you want to change the lowercasing, you should either set the flag as Jonathan suggested or write your own query parser class.
Yuval F
And Yuval's comment corrects my misunderstanding about lowercasing behavior. Everything is clear now, thank you again.
Amaç Herdağdelen