Word Base/Stem Dictionary

This is called lemmatization, and what you call the "base of a word" is called a lemma. The morpha tool and its reimplementation in the Stanford POS tagger do this. Both, however, require POS-tagged input to resolve the ambiguity inherent in natural language.

(POS tagging means determining the word categories, e.g. noun, verb. I've been assuming you want a tool that handles English.)

Edit: since you're going to use this for search, here are a few tips:

- Simple stemming for English has a mixed reputation in the search engine world. Sometimes it works, often it doesn't.
- Automatic spelling correction may work better. This is what Google does. It's expensive in terms of computing time, though, if you want to do it right.
- Lemmatization may provide benefits, but probably only if you index and search for both the words and the lemmas. (Same advice goes for stemming.)
- Here's a plugin for Lucene that does lemmatization.
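
To illustrate the "index and search for both the words and the lemmas" tip, here is a minimal Python sketch. It is not taken from any of the tools mentioned above: the `LEMMAS` table and the names `terms_for`, `build_index`, and `search` are made up for the example, and a real system would get its lemmas from a lemmatizer or a curated word list.

```python
from collections import defaultdict

# Hypothetical lemma table; a real one would come from a lemmatizer or word list.
LEMMAS = {"mice": "mouse", "ran": "run", "running": "run"}

def terms_for(word):
    """Return the surface form together with its lemma (when one is known)."""
    word = word.lower()
    return {word, LEMMAS.get(word, word)}

def build_index(docs):
    """Map every term (surface form or lemma) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for term in terms_for(token):
                index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents matching any surface form or lemma in the query."""
    hits = set()
    for token in query.split():
        for term in terms_for(token):
            hits |= index.get(term, set())
    return hits

docs = {1: "the mice ran away", 2: "a mouse is running", 3: "nothing relevant here"}
index = build_index(docs)
print(sorted(search(index, "mouse")))  # documents 1 and 2 both match
print(sorted(search(index, "mice")))   # same result via the shared lemma
```

Because unknown words simply map to themselves, a word missing from the table is left alone rather than being lemmatized incorrectly.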

(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)

I want something that is always accurate (though not necessarily complete), which that doesn't seem like it can provide (nor can I possibly categorize all the potential words). I'd rather have some words not be appropriately lemmatized (?) than have any incorrect ones.

AHungerArtist 2010-10-26 15:40:27

Then you need a simple word list, since these programs represent the state-of-the-art in POS tagging and lemmatization. (Categorizing the words is by the way exactly what the Stanford POS tagger does. It's not exactly plug-and-play, though.)

larsmans 2010-10-26 15:51:50

Right, that is what I'm looking for, a simple word list. I'm using a dictionary now that has what I'm looking for, but it's also full of alternate spellings, abbreviations, and other such things so that it's not as useful as it could be.

AHungerArtist 2010-10-26 16:06:52

In any case, thanks for the input. If I don't find anything else, I will look into this work a little more closely and just see exactly what kind of results I can get from it.

AHungerArtist 2010-10-26 16:12:52

I find that stemming works pretty well for searching so long as you run the data through the stemmer when you index it **and** run the query string through the same stemmer. Have done this with Lucene with excellent results.

Qwerky 2010-10-26 16:19:42
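
A rough sketch of the setup described in the comment above, assuming a toy suffix stripper in place of a real stemmer (`toy_stem` is invented for the example and is neither the Porter algorithm nor what Lucene uses). The point is only that the index keys and the query terms go through the same function:

```python
def toy_stem(word):
    """Very naive suffix stripper, for illustration only."""
    word = word.lower()
    for suffix in ("ing", "ies", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_docs(docs):
    """Build an inverted index keyed by stemmed tokens."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index

def search(index, query):
    """Stem each query term the same way, then intersect the posting sets."""
    postings = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "searching large indexes", 2: "index searches", 3: "unrelated text"}
index = index_docs(docs)

print(sorted(search(index, "searches indexes")))  # both documents 1 and 2
print(index.get("searches"))  # None: an unstemmed query term misses the stemmed key
```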

@Qwerky: yes, it may work, but it doesn't always, depending on document set and query quality. It's something to try, though. (Indexing and searching for both stemmer output and the original terms may work even better.)

larsmans 2010-10-26 16:25:54

I can't really afford to do two searches as speed is of the essence, though that almost certainly would give me optimal results. Currently I am running both the index (in my case, a trie) and the input through the substitution, but it only works well when the full word is given as input. If there's only a partial string, it can end up not returning any results, depending on how a word is substituted (or stemmed, if I went that route).

AHungerArtist 2010-10-26 16:45:24
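
The partial-input problem described above can be reproduced in a few lines. This is only a sketch under assumptions: the `SUBSTITUTIONS` table is hypothetical, and a plain `startswith` scan over the keys stands in for the trie's prefix lookup.

```python
# Hypothetical substitution table mapping inflected forms to base forms.
SUBSTITUTIONS = {"running": "run", "mice": "mouse"}

def normalize(word):
    """Apply the substitution if the full word is known, else leave it as typed."""
    word = word.lower()
    return SUBSTITUTIONS.get(word, word)

def prefix_matches(keys, prefix):
    """Stand-in for a trie prefix lookup: all keys starting with the prefix."""
    return [key for key in keys if key.startswith(prefix)]

# The index stores normalized forms only.
keys = sorted({normalize(w) for w in ["running", "mice", "runner"]})

print(prefix_matches(keys, normalize("running")))  # full word: found via "run"
print(prefix_matches(keys, normalize("runni")))    # partial word: no match
print(prefix_matches(keys, normalize("mic")))      # partial word: no match
```

The partial strings "runni" and "mic" are not in the table, so they are not substituted, and they are not prefixes of the stored keys "run" and "mouse" either.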

One search for double the number of keywords may be faster. Lemmatizing may be slow, though.

larsmans 2010-10-26 18:40:20

The problem with stemmers is that they tend to produce bogus output such as "strawberri".

larsmans 2010-10-26 15:34:48

@larsmans: eh, but seeing that 'strawberri' is not a correct English word, isn't it trivial to run the stemmer's output through a spellchecker that would then return 'strawberry' as a suggestion?

Webinator 2010-10-26 17:59:53
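
A small sketch of the stem-then-spellcheck idea from the comment above, using Python's standard `difflib` as a stand-in for a real spellchecker. The `VOCABULARY` list, the `toy_stem` rules, and the `repair` helper are all assumptions made for the example, not part of any stemmer discussed here.

```python
import difflib

# Hypothetical word list of known-good base forms.
VOCABULARY = ["strawberry", "straw", "berry", "library"]

def toy_stem(word):
    """Toy rules that produce 'strawberri'-style output; not a real stemmer."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-2]          # strawberries -> strawberri
    if word.endswith("y"):
        return word[:-1] + "i"    # strawberry -> strawberri
    return word

def repair(stem, vocabulary=VOCABULARY, cutoff=0.8):
    """Map a stem to the closest vocabulary word, or None if nothing is close enough."""
    if stem in vocabulary:
        return stem
    matches = difflib.get_close_matches(stem, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

stem = toy_stem("strawberries")
print(stem)          # strawberri
print(repair(stem))  # strawberry
```

Returning `None` when no vocabulary word is close enough leaves the word unresolved instead of guessing, which fits the stated preference for missing some words over producing incorrect ones. How well this holds up on worse stemmer output depends on the vocabulary and the cutoff.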

True, but stemmers can give far worse results than that. Might work, though. Might. (Paul's reasoning that stemmers "should include lists of word stems" is not generally true btw., as many stemmers are just simple string algorithms.)

larsmans 2010-10-26 18:28:09
