I'm using the Porter stemmer to stem words, and here's the problem I'm running into:

The word "mortgage" is correctly stemmed to "mortgag". The word "mortgagee" is (arguably incorrectly) stemmed to "mortgage".

There are approximately 100 documents containing the word "mortgage". There is 1 document containing the word "mortgagee".

When I build an index without putting "mortgagee" in any documents, everything works fine: searching for "mortgage" or "mortgages" or "mortgag" returns all 100 documents.

When I build an index in which one of the documents contains "mortgagee", searching the index for "mortgage" returns only the single document containing "mortgagee" (which was stemmed down to "mortgage"). However, searching for "mortgag" or "mortgages" still returns all 100 documents.

The only logical conclusion I can draw is that Lucene first searches for the pre-stemmed word and, only if it finds no results, goes on to search for the stemmed word. Thus, when searching for "mortgage", it first finds the "mortgage" that was stemmed from "mortgagee" and stops searching. Is this the correct behavior, or is it a bug?

+1  A: 

This sounds like a bug to me. A guiding principle of Lucene search says: "Search using the same analyzer that you used for indexing, unless you have a real good reason not to". After analysis and stemming, Lucene should return matches for any search term that exists in the index. In your case, "mortgage" was transformed into "mortgag" during indexing. The retrieval process should mirror that: it should also transform the query "mortgage" into "mortgag", and then find the matches for "mortgag" (which represent "mortgage"). It seems that during retrieval you do not stem the query, which leads to the erroneous results.

If this answer is unclear, please edit your question and add a few lines of code showing how you create the index and how you search it.
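To make the analyzer mismatch concrete, here is a minimal sketch in plain Python. It is a toy inverted index, not Lucene: `TOY_STEMS`, `stem`, `build_index`, and `search` are hypothetical names, and the stemmer is just a lookup table holding the stems reported in the question. It only illustrates what could happen if the index is stemmed but the query is not.

```python
from collections import defaultdict

# Hypothetical stand-in for a Porter stemmer: a lookup table containing only
# the stems reported in the question. A real stemmer applies suffix rules.
TOY_STEMS = {
    "mortgage": "mortgag",
    "mortgages": "mortgag",
    "mortgagee": "mortgage",
}


def stem(word):
    """Return the toy stem, or the word unchanged if it is not in the table."""
    return TOY_STEMS.get(word, word)


def build_index(docs):
    """Map each stemmed token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem_query=True):
    """Look up a single-word query, optionally stemming it first."""
    term = query.lower()
    if stem_query:
        term = stem(term)
    return index.get(term, set())


# 100 documents mention "mortgage"; one extra document mentions "mortgagee".
docs = {i: "mortgage rates" for i in range(100)}
docs[100] = "the mortgagee signed"
index = build_index(docs)

# Query analyzed the same way as the index: "mortgage" -> "mortgag".
same_analyzer = search(index, "mortgage", stem_query=True)

# Query left unstemmed: the raw term "mortgage" collides with the stem
# that "mortgagee" produced at indexing time.
raw_query = search(index, "mortgage", stem_query=False)

print(len(same_analyzer))
print(raw_query)
```

With the stemmed query, the lookup hits the "mortgag" posting list and returns the 100 "mortgage" documents. With the raw query, the only indexed term equal to "mortgage" is the stem of "mortgagee", so only that one document comes back. This toy does not reproduce every detail of the report (an unstemmed "mortgages" would match nothing here), so treat it as an illustration of the principle rather than a diagnosis.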

Yuval F