Many European languages are inflectional, which means that a single word can appear in text in many different forms. For example, the Polish word for 'computer', "komputer", has forms such as "komputera", "komputerowi", "komputerem", "komputery", etc.

How should I use django+haystack+whoosh properly to deal with language inflection?

Whenever I search for any form of the word ("komputer", "komputera", "komputerowi"), I mean the same thing: "komputer".

In NLP there are two basic approaches: stemming (cutting off suffixes) or lemmatization, i.e. converting each form to its base form ("komputerowi" => "komputer"). There are libraries that can help with both.

My first thought was to write a special template filter that converts every recognized word in a given variable to its base form. I could then use it in the search index templates in django+haystack. If the search query is converted the same way before it is evaluated by the whoosh engine, this should work. See the example:

haystack search index template:
    {{some_indexed_text|convert_to_base_form_filter}}

text to index: "Nie ma komputera"  => "Nie ma komputer"  <-- this is what actually gets indexed
search query:  "komputery"         => "komputer"         <-- this will match
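To make the idea concrete, here is a minimal sketch of the conversion function that such a filter could wrap. It is only an illustration: the `BASE_FORMS` table and the function name `convert_to_base_form` are hypothetical, and a real version would get its base forms from a morphological dictionary or a lemmatizer library rather than a hand-written dict.

```python
import re

# Hypothetical lookup table (an assumption for this example); in practice
# this would come from a morphological dictionary or a lemmatizer library.
BASE_FORMS = {
    "komputera": "komputer",
    "komputerowi": "komputer",
    "komputerem": "komputer",
    "komputery": "komputer",
}

WORD_RE = re.compile(r"\w+", re.UNICODE)


def convert_to_base_form(text):
    """Replace every recognized inflected word with its base form."""
    def replace(match):
        word = match.group(0)
        # Unknown words are left untouched.
        return BASE_FORMS.get(word.lower(), word)

    return WORD_RE.sub(replace, text)


print(convert_to_base_form("Nie ma komputera"))  # text to index
print(convert_to_base_form("komputery"))         # search query
```

The same function would have to be applied to both the indexed text and the incoming query, otherwise the two sides will not meet on the same base form.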

But I don't think this is an elegant solution to the problem, and some other features won't work with it, such as spelling suggestions.

So, how should I solve this? Should I maybe use a different search engine than whoosh?

+1  A: 

I've had a very similar issue, so I hope I can help.

By default, Whoosh only has stemming for the English language.
To implement stemming for another language, first look inside:

/your_path_to_whoosh/whoosh/lang/analysis.py

This is where StemmingAnalyzer (the default analyzer) is defined, and it is an excellent starting point. The stem function, imported from porter.py, is the other important place to look.

So, the three steps are:

  • Implement your own stemming function, taking as a reference the stem function in porter.py and any grammar and language references you will need to get the rules right.

  • Implement your own Analyzer, taking StemmingAnalyzer inside analysis.py as a reference. The file is heavily documented, so you should have no problem navigating it. You'll see that StemmingAnalyzer is a chain of three parts: a Tokenizer that uses a regex to match words, a lowercase filter, and the stemming filter, which calls the stemming function above. StemFilter takes the stemming function as a parameter, so you don't have to reimplement the filter.

  • Pass your brand new Analyzer at schema creation time, see here: http://files.whoosh.ca/whoosh/docs/latest/schema.html#creating-a-schema
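The three steps above can be sketched roughly as follows. Treat this as an illustration under assumptions, not a finished implementation: `stem_pl`, `POLISH_SUFFIXES` and `build_schema` are made-up names, the suffix list is a toy that only covers the forms from the question, and the Whoosh imports assume a version where `RegexTokenizer`, `LowercaseFilter` and `StemFilter` are importable from `whoosh.analysis`.

```python
# Step 1: a toy stemming function. The suffix list is an assumption made
# for this example and is nowhere near a real set of Polish stemming rules.
POLISH_SUFFIXES = ("owi", "em", "a", "y")  # longer suffixes are tried first


def stem_pl(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in POLISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def build_schema():
    """Steps 2 and 3: chain an analyzer and pass it to the schema."""
    # Imported here so the stemmer above can be tried without Whoosh installed.
    from whoosh.analysis import LowercaseFilter, RegexTokenizer, StemFilter
    from whoosh.fields import ID, TEXT, Schema

    # Same shape as StemmingAnalyzer: tokenizer | lowercase | stem filter,
    # but with our own stemming function passed to StemFilter.
    analyzer = RegexTokenizer() | LowercaseFilter() | StemFilter(stemfn=stem_pl)
    return Schema(id=ID(stored=True, unique=True), text=TEXT(analyzer=analyzer))


for form in ("komputer", "komputera", "komputerowi", "komputerem", "komputery"):
    print(form, "->", stem_pl(form))
```

Because the analyzer is attached to the field, Whoosh should apply it both when indexing and when parsing queries against that field, so indexed text and search terms end up stemmed the same way.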

I hope this helps!

Agos
Thanks! I didn't know whoosh had stemming at all.
thedk