In its enthusiasm to stem tokens into lexemes, the PostgreSQL full-text search engine also reduces proper nouns. For instance:

essais=> select to_tsquery('english', 'bortzmeyer');
 to_tsquery 
------------
 'bortzmey'
(1 row)

essais=> select to_tsquery('english', 'balling');
 to_tsquery 
------------
 'ball'
(1 row)

At least for the first one, I'm sure it is not in the English dictionary! What is the best way to avoid this spurious stemming?

+2  A: 

The point of stemming algorithms is not to reduce every word to its proper stem; the goal is to reduce words that are alike to a common stemmed form. The goal is generally not to produce a word that can be presented to the user: even if 'balling' and 'ball' both produced 'kjebnkkekaa', the algorithm would still be correct, because it still treats 'balling' and 'ball' as generally concerning the same thing.
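That common-form behaviour is what makes matching work. A minimal sketch in psql, assuming a stock PostgreSQL install with the built-in `english` configuration:

```sql
-- 'balling' and 'ball' both stem to the lexeme 'ball', so a query
-- for one matches a document containing the other.
SELECT to_tsvector('english', 'balling') @@ to_tsquery('english', 'ball');
-- returns t
```

The stemmed lexeme is an internal matching key, not a display form.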

Also beware that no stemming algorithm is absolutely perfect; for more information, look up the Porter stemming algorithm.

Jasper Bekkers
+1  A: 

That's due to the Snowball stemmer, as explained here. Basically, you'll want to disable the Snowball stemmer and use just Ispell or one of the other dictionaries, but that would also reduce stemming effectiveness for words not in the dictionaries.
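One way to set this up is a custom text search configuration that maps word tokens to an Ispell dictionary with the `simple` dictionary as a fallback, instead of Snowball. A sketch, assuming an Ispell dictionary has been installed under the (hypothetical) name `english_ispell` with matching dict/affix files:

```sql
-- Assumes english.dict and english.affix are installed in $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,
    AffFile   = english,
    StopWords = english
);

-- Start from the built-in configuration, then replace Snowball for plain words.
CREATE TEXT SEARCH CONFIGURATION public.english_nostem (
    COPY = pg_catalog.english
);

ALTER TEXT SEARCH CONFIGURATION english_nostem
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_ispell, simple;
```

With `simple` as the fallback, words not found in the Ispell dictionary (such as proper nouns like 'bortzmeyer') pass through lowercased but unstemmed, rather than being mangled by Snowball.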

codelogic