It seems my Google-fu is failing me.

Does anyone know of a freely available dictionary that contains only the base forms of words? So, for something like "strawberries", it would have "strawberry". It should NOT contain abbreviations, misspellings, or alternate spellings (like UK versus US). Anything quickly usable in Java would be good, but just a text file of mappings, or anything else that could be read in, would be helpful.

+3  A: 

This is called lemmatization, and what you call the "base of a word" is called a lemma. morpha and its reimplementation in the Stanford POS tagger do this. Both, however, require POS tagged input to resolve the inherent ambiguity in natural language.

(POS tagging means determining the word categories, e.g. noun, verb. I've been assuming you want a tool that handles English.)

Edit: since you're going to use this for search, here are a few tips:

  • Simple stemming for English has a mixed reputation in the search engine world. Sometimes it works, often it doesn't.
  • Automatic spelling correction may work better. This is what Google does. It's expensive in terms of computing time, though, if you want to do it right.
  • Lemmatization may provide benefits, but probably only if you index and search for both the words and the lemmas. (Same advice goes for stemming.)
  • Here's a plugin for Lucene that does lemmatization.

(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)
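To make the "index and search for both the words and the lemmas" tip concrete, here is a minimal sketch in plain Java. It is not Lucene and not the plugin mentioned above: `DualIndex` is a hypothetical class, and the one-entry lemma table stands in for whatever lemmatizer or word list you end up using.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: a tiny inverted index that stores every token
// under its surface form AND under its lemma (when the lemma table knows one).
final class DualIndex {
    private final Map<String, String> lemmas;  // assumed input: surface form -> lemma
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    DualIndex(Map<String, String> lemmas) {
        this.lemmas = lemmas;
    }

    // The same normalization is used at index time and at query time.
    private List<String> keys(String token) {
        String word = token.toLowerCase(Locale.ROOT);
        String lemma = lemmas.get(word);
        return (lemma == null || lemma.equals(word)) ? List.of(word) : List.of(word, lemma);
    }

    void add(int docId, String text) {
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) continue;
            for (String key : keys(token)) {
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // One pass over the query; each token simply contributes up to two keys.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : query.split("\\W+")) {
            if (token.isEmpty()) continue;
            for (String key : keys(token)) {
                hits.addAll(postings.getOrDefault(key, Set.of()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        DualIndex index = new DualIndex(Map.of("strawberries", "strawberry"));
        index.add(1, "Fresh strawberries for sale");
        index.add(2, "Strawberry jam recipe");
        System.out.println(index.search("strawberries"));
        System.out.println(index.search("strawberry"));
    }
}
```

Both queries in `main` should reach both documents, because document 1 is filed under "strawberries" and under "strawberry". A word the table does not know is still indexed under its own spelling, so a gap in the table costs recall rather than correctness.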

larsmans
I want something that is always accurate (though not necessarily complete), which that doesn't seem like it can provide (nor can I possibly categorize all the potential words). I'd rather have some words not be appropriately lemmatized (?) than have any incorrect ones.
AHungerArtist
Then you need a simple word list, since these programs represent the state-of-the-art in POS tagging and lemmatization. (Categorizing the words is by the way exactly what the Stanford POS tagger does. It's not exactly plug-and-play, though.)
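For what it's worth, the word-list route needs very little code. The sketch below assumes a file format that is purely my invention (one `inflected<TAB>lemma` pair per line, `#` for comments); `LemmaTable` is a hypothetical class, not part of any library. It answers only for words it knows, which matches the "accurate but not necessarily complete" requirement.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration: an exact-match lemma lookup backed by a word list.
final class LemmaTable {
    private final Map<String, String> table = new HashMap<>();

    // Assumed format: "inflected<TAB>lemma" per line; blank lines and '#' comments are ignored.
    static LemmaTable fromLines(List<String> lines) {
        LemmaTable t = new LemmaTable();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] parts = line.split("\t");
            if (parts.length != 2) continue;  // skip malformed lines rather than guess
            t.table.put(parts[0].trim().toLowerCase(Locale.ROOT), parts[1].trim());
        }
        return t;
    }

    // Empty when the word is not listed: no answer is preferred over a wrong answer.
    Optional<String> lookup(String word) {
        return Optional.ofNullable(table.get(word.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        LemmaTable t = LemmaTable.fromLines(List.of(
                "# sample data",
                "strawberries\tstrawberry",
                "mice\tmouse"));
        System.out.println(t.lookup("Strawberries"));
        System.out.println(t.lookup("colour"));
    }
}
```

In practice the lines could come from `java.nio.file.Files.readAllLines(path)`. Note that this does nothing to weed abbreviations or alternate spellings out of the list itself; that still depends on the quality of the source data.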
larsmans
Right, that is what I'm looking for, a simple word list. I'm using a dictionary now that has what I'm looking for, but it's also full of alternate spellings, abbreviations, and other such things so that it's not as useful as it could be.
AHungerArtist
In any case, thanks for the input. If I don't find anything else, I will look into this work a little more closely and just see exactly what kind of results I can get from it.
AHungerArtist
I find that stemming works pretty well for searching so long as you run the data through the stemmer when you index it **and** run the query string through the same stemmer. Have done this with Lucene with excellent results.
Qwerky
@Qwerky: yes, it may work, but it doesn't always, depending on document set and query quality. It's something to try, though. (Indexing and searching for both stemmer output and the original terms may work even better.)
larsmans
I can't really afford to do two searches as speed is of the essence, though that almost certainly would give me optimal results. Currently I am running both the index (in my case, a trie) and the input through the substitution, but it only works well when the full word is given as input. If there's only a partial string, it can end up not returning any results, depending on how a word is substituted (or stemmed, if I went that route).
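The partial-input problem can be reproduced in a few lines. This is only a sketch: a sorted `TreeMap` stands in for the trie, and `PrefixLookup` is a hypothetical class. With only the substituted form stored, the prefix "strawberri" (typed on the way to "strawberries") finds nothing; storing both spellings as keys that point to the same entry fixes that with a single lookup.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration: prefix completion over a sorted map (a stand-in for a trie).
final class PrefixLookup {
    // key: any indexed spelling; value: the entry that spelling stands for
    private final TreeMap<String, String> keys = new TreeMap<>();

    void add(String key, String entry) {
        keys.put(key, entry);
    }

    // All entries that have at least one key starting with the prefix.
    Set<String> complete(String prefix) {
        Set<String> entries = new TreeSet<>();
        // Keys sharing a prefix are contiguous in sorted order, starting at the first key >= prefix.
        for (Map.Entry<String, String> e : keys.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            entries.add(e.getValue());
        }
        return entries;
    }

    public static void main(String[] args) {
        PrefixLookup lemmaOnly = new PrefixLookup();
        lemmaOnly.add("strawberry", "strawberry");

        PrefixLookup both = new PrefixLookup();
        both.add("strawberry", "strawberry");
        both.add("strawberries", "strawberry");

        System.out.println(lemmaOnly.complete("strawberri"));
        System.out.println(both.complete("strawberri"));
    }
}
```

The cost is a larger index rather than a second search, which may or may not be acceptable for your memory budget.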
AHungerArtist
One search for double the number of keywords may be faster. Lemmatizing may be slow, though.
larsmans
A: 

This isn't exactly what you're asking for, but the Wikipedia article on stemming was enlightening and contains a number of links to free stemming programs, which presumably should include lists of word stems.

Paul
The problem with stemmers is that they tend to produce bogus output such as "strawberri".
larsmans
@larsmans: eh, but seeing that 'strawberri' is not a correct English word, ain't it trivial to run the result of the stemmer through a spellchecker that would then return 'strawberry' as a suggestion?
Webinator
True, but stemmers can give far worse results than that. Might work, though. Might. (Paul's reasoning that stemmers "should include lists of word stems" is not generally true, by the way, as many stemmers are just simple string algorithms.)
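To show what "simple string algorithm" means, here is a toy sketch. It is not the Porter stemmer, just three plural-stripping rules loosely modeled on the kind found in Porter-style stemmers, and `CheckedStemmer` is a hypothetical class. Instead of a spellchecker, it uses a more conservative guard: the stem is accepted only if it appears in a known-word list, otherwise the word is left alone.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration: a crude suffix stripper plus a word-list sanity check.
final class CheckedStemmer {

    // Toy rules only; real stemmers have many more steps and conditions.
    static String crudeStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ies")) return w.substring(0, w.length() - 2);  // "ies" -> "i"
        if (w.endsWith("ss")) return w;                                // keep "glass" intact
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);    // drop plural "s"
        return w;
    }

    // Accept the stem only when it is a known word; otherwise return the input unchanged
    // (lowercased), so bogus stems never reach the index.
    static String baseOrOriginal(String word, Set<String> knownWords) {
        String stem = crudeStem(word);
        return knownWords.contains(stem) ? stem : word.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("cat", "strawberry");
        System.out.println(crudeStem("strawberries"));              // the bogus stem
        System.out.println(baseOrOriginal("strawberries", known));  // rejected, word kept
        System.out.println(baseOrOriginal("cats", known));          // accepted stem
    }
}
```

No word list is involved in `crudeStem` at all, which is why a stemmer download will not necessarily give you a list of stems. The guard in `baseOrOriginal` does need one, so it brings you back to the original question of finding a clean list.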
larsmans