views:

3584

answers:

10

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.

My test words are: "cats running ran cactus cactuses community communities", and both get less than half right.

Ideally the class/function would be in PHP, but I can port it if it's in another language.

See also:

+1  A: 

Do a search foR Lucene, im not sure if theres a PHP port but i do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally it and community extras might have something interesting to look at. At the very least you can learn how its done in one language so you can translate the "idea" into PHP

mP
+6  A: 

I tried your list of terms on this snowball demo site and the results look okay....

  • cats -> cat
  • running -> run
  • ran -> ran
  • cactus -> cactus
  • cactuses -> cactus
  • community -> communiti
  • communities -> communiti

A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

StompChicken
+2  A: 

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.

If you're really serious about good stemming though you're going to need to start with something like the Porter Algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with 800 some exceptions to a modified Porter algorithm.

Van Gale
+1  A: 

If I may quote my answer to the question StompChicken mentioned:

The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.

As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".

If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.

Dave Sherohman
+3  A: 

Look into WordNet, a large lexical database for the English language:

http://wordnet.princeton.edu/

There are APIs for accessing it in several languages.

Ricardo J. Méndez
+1  A: 

http://wordnet.princeton.edu/man/morph.3WN

For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

msbmsb
+1  A: 

Just to make a circular reference to the original question posted on Reddit:
How do I programmatically do stemming? (e.g. "eating" to "eat", "cactuses" to "cactus")

Posting it here because the comments include useful information.

Renaud Bompuis
+5  A: 

If you know Python, The Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

It works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

theycallmemorty
A: 

.Net lucene has an inbuilt porter stemmer. You can try that. But note that porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works)

Sriwantha Sri Aravinda Attanayake