ansaurus

Question

Django+Haystack+Whoosh: how to deal with language inflection

Answer 1

+1 A:

I've had a very similar issue, so I hope I can help.

By default, Whoosh only provides stemming for English.
To implement stemming for another language, first look inside:

/your_path_to_whoosh/whoosh/lang/analysis.py

This is where StemmingAnalyzer (the default analyzer) is defined, and it is an excellent starting point. The stem function, imported from porter.py, is the other important place to look.

So, the three steps are:

1. Implement your own stemming function, using the stem function in porter.py as a reference, along with whatever grammar and language references you need to get the rules right.
2. Implement your own Analyzer, using StemmingAnalyzer inside analysis.py as a reference. The file is heavily documented, so you should have no problem navigating it. You'll see that StemmingAnalyzer is basically a chain of a Tokenizer with a regex to match words, a lowercase filter, and the stemming filter, which simply calls the stemming function above. StemFilter takes the stemming function as a parameter, so you don't have to reimplement the filter.
3. Pass your brand new Analyzer at schema creation time, see here: http://files.whoosh.ca/whoosh/docs/latest/schema.html#creating-a-schema
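
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not a real stemmer: the suffix list, the names simple_stem, build_analyzer and build_schema, and the schema fields path and content are all made up for the example. It assumes a Whoosh version where RegexTokenizer, LowercaseFilter and StemFilter can be imported from whoosh.analysis, and where StemFilter accepts the stemming function as stemfn.

```python
# Step 1: a toy stemming function. The suffix list is purely illustrative;
# a real stemmer needs proper rules for the target language.
SUFFIXES = ("ando", "endo", "are", "ere", "ire", "i", "e", "a", "o")


def simple_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least 3 characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Step 2: an analyzer built the same way the answer describes StemmingAnalyzer:
# regex tokenizer, then lowercase filter, then stem filter.
def build_analyzer():
    # Imported lazily so simple_stem can be tried without Whoosh installed.
    from whoosh.analysis import RegexTokenizer, LowercaseFilter, StemFilter

    return RegexTokenizer() | LowercaseFilter() | StemFilter(stemfn=simple_stem)


# Step 3: pass the analyzer to a text field at schema creation time.
def build_schema():
    from whoosh.fields import Schema, TEXT, ID

    # "path" and "content" are hypothetical field names.
    return Schema(path=ID(stored=True), content=TEXT(analyzer=build_analyzer()))
```

The stemming function is kept free of Whoosh imports so it can be tested on its own against a word list before it is plugged into the analyzer.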

I hope this helps!

Agos 2010-10-10 10:37:56

Thanks! I didn't know Whoosh had stemming at all.

thedk 2010-10-10 15:12:06
