I'm looking for a good open source POS Tagger in Java. Here's what I have come up with so far.

Anybody got any recommendations?

+1  A: 

I have used OpenNLP with good results. You can also check out MorphAdorner.

Shashikant Kore
+2  A: 

I've used both LingPipe and Stanford's POS Tagger. The latter is a state-of-the-art POS tagger but, in my experience, it is too slow (although they do provide less accurate models, which are reasonably fast). Of course, it always depends on what you are trying to achieve, and there will always be a trade-off between speed and accuracy.

I also once used an LBJ-based NER package and, although it was pretty accurate, the source code was a complete mess. The source code of both LingPipe and Stanford's tagger is very clean and well documented.

You can also take a look at LTAG-spinal. I haven't used it yet, but from the algorithm description, and from the listed accuracy, it sure seems better than the alternatives you have so far.

Hope it helps.

JG
Stanford's best model is moderately slow. But, actually, LTAG-spinal is 3 times slower again and insignificantly better. For general purpose use, we recommend the left3words model: tagging with it is of similar or better speed than with Ratnaparkhi's or the OpenNLP tagger but is more accurate than either. Find [more info](http://nlp.stanford.edu/software/pos-tagger-faq.shtml#h) in the Stanford POS tagger FAQ.
Christopher Manning
+3  A: 

Are you looking to tag POS in a specific domain? Most of the general-purpose taggers are trained on newswire text. Typically they don't perform well when you use them in specific domains (such as biomedical text). There are other taggers trained specifically for such domains, such as dTagger (Java) for biomedical text.

For newswire text, Adwait Ratnaparkhi's MXPOST is very good and is the one I would recommend.

Other Java implementations include:

  1. MontyLingua
  2. Berkeley Parser (Not really a POS tagger, but full-blown parsers typically include POS taggers. Google for Java syntactic parsers and you will find many.)
  3. QTag
  4. LBJ

OpenNLP and LingPipe, as mentioned by the other posters, are also pretty decent.

Info on the state-of-the-art on POS tagging can be found here. As you can see LTAG-Spinal (also mentioned by another poster) ranks best as of now, but the variation across the various taggers is not much. I have not used LTAG myself.

Also note that the baseline performance for POS tagging is about 90%. Here, baseline means: (a) tag every known word with its most frequent POS tag from a lexicon, and (b) tag every unknown word as a noun.
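
To make the baseline concrete, here is a minimal sketch in Java. It is not taken from any of the taggers mentioned above; the class name `BaselineTagger`, its method names, and the use of Penn Treebank tag names (`NN` for the noun fallback) are all assumptions made for illustration. The lexicon is built by counting (word, tag) pairs from whatever tagged data you have.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical baseline tagger: most frequent tag per word, noun for unknown words. */
public class BaselineTagger {
    // Assumed fallback tag for unknown words (Penn Treebank singular noun).
    private static final String UNKNOWN_TAG = "NN";

    // Lexicon: word -> (tag -> number of times the word was seen with that tag).
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Record one (word, tag) observation from tagged training data. */
    public void observe(String word, String tag) {
        counts.computeIfAbsent(word, k -> new HashMap<>())
              .merge(tag, 1, Integer::sum);
    }

    /** Rule (a): most frequent tag for a known word. Rule (b): noun for an unknown word. */
    public String tag(String word) {
        Map<String, Integer> tagCounts = counts.get(word);
        if (tagCounts == null) {
            return UNKNOWN_TAG;
        }
        String best = UNKNOWN_TAG;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : tagCounts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Tag each token independently; the baseline uses no context at all. */
    public List<String> tagAll(List<String> tokens) {
        List<String> tags = new ArrayList<>();
        for (String token : tokens) {
            tags.add(tag(token));
        }
        return tags;
    }

    public static void main(String[] args) {
        BaselineTagger tagger = new BaselineTagger();
        // Toy training observations, invented for this example.
        tagger.observe("the", "DT");
        tagger.observe("run", "VB");
        tagger.observe("run", "VB");
        tagger.observe("run", "NN");
        System.out.println(tagger.tagAll(List.of("the", "run", "blorp")));
    }
}
```

Any tagger you evaluate should be compared against this kind of baseline rather than against zero, since the remaining few percentage points are where the taggers actually differ.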

hashable
Your MXPOST link is to an FTP site with a compressed archive. I searched around and couldn't find much about MXPOST other than it being one guy's CS thesis. Am I correct in assuming that there isn't much community support for MXPOST?
Glenn
@Glenn Yes, although OpenNLP seems to be an equivalent implementation of MXPOST. I quote from the OpenNLP site: 1. *If you are familiar with feature selection for Adwait Ratnaparkhi's maxent implementation, you should have no problems since our implementation [of the POS tagger] uses features in the same manner as his.* and 2. *His [Adwait's] introduction to maxent for NLP and dissertation are what really made opennlp.maxent and our Grok maxent components (POS tagger, end of sentence detector, tokenizer, name finder) possible!* OpenNLP appears to have an active SourceForge community.
hashable