Does anyone know of any free (licensed free for commercial use) tagged English corpora that can be used to train a part-of-speech (POS) tagger?

The only ones I have seen online seem to start in the thousands for commercial use. Any help would be appreciated, thanks.

+5  A: 

There's some data freely available in the NLTK Corpora package. http://nltk.org

Dan Bikel offers his parser for free on his website, along with a training file based on the tagged Wall Street Journal Corpus, an NLP testing standard. http://www.cis.upenn.edu/~dbikel/software.html

If you're looking for good enough, I'm sure you can generate lots of data by running these parsers over raw text and then train your own tagger on the output, and it will probably perform fine for many commercial uses. Unfortunately, this workaround reflects a disappointing reality: the great resources available at the Linguistic Data Consortium are expensive to license for commercial use. For a startup that focuses on NLP, though, good data is not really something you can skimp on. This is why, for many undertakings of this kind, you can run a pilot phase on poorer data (see above) and check your success rates before making the capital investment.
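To make the pilot-phase idea concrete, here is a minimal sketch of training a tagger on whatever tagged sentences you have and measuring its success rate on held-out data. It is a most-frequent-tag baseline, not any particular tagger mentioned in this thread, and the `TRAIN` and `HELD_OUT` sentences are made-up toy data standing in for parser output or a hand-tagged sample.

```python
from collections import Counter, defaultdict


def train_baseline_tagger(tagged_sents):
    """Learn each word's most frequent tag, plus a fallback tag for unknown words."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            per_word[word.lower()][tag] += 1
            all_tags[tag] += 1
    # Unknown words get the most common tag seen in training.
    default_tag = all_tags.most_common(1)[0][0]
    lexicon = {w: counts.most_common(1)[0][0] for w, counts in per_word.items()}
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag a list of words using the learned lexicon."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


def accuracy(tagged_sents, lexicon, default_tag):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in tagged_sents:
        predicted = tag([w for w, _ in sent], lexicon, default_tag)
        for (_, gold), (_, guess) in zip(sent, predicted):
            correct += gold == guess
            total += 1
    return correct / total if total else 0.0


# Toy data (hypothetical); replace with your own tagged sentences.
TRAIN = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("A", "DT"), ("dog", "NN"), ("chases", "VBZ"), ("a", "DT"), ("cat", "NN")],
    [("Dog", "NN"), ("food", "NN"), ("smells", "VBZ")],
]
HELD_OUT = [
    [("The", "DT"), ("cat", "NN"), ("barks", "VBZ")],
    [("A", "DT"), ("bird", "NN"), ("sings", "VBZ")],
]

lexicon, default_tag = train_baseline_tagger(TRAIN)
print(f"held-out accuracy: {accuracy(HELD_OUT, lexicon, default_tag):.2f}")
```

A baseline like this is useful mostly as a floor: if a tagger trained on the cheap data can't clearly beat it on held-out sentences, that tells you something before you spend money on a licensed corpus.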

If you're just doing this for research, then by all means, seek out your nearest computational linguistics program and see what kinds of concessions they'll make for you to poke through their licensed corpora.

Good luck!

Robert Elwell
I think I had my sights too high - I have come across Monty taggers, NLTK, Brill taggers and all the usual suspects. I was hoping there was a mystic GPL tagged corpus.
stevedbrown
Don't let that obstacle get in the way of your project. For your personal research, definitely use what's freely available even if it can't be licensed for commercial use. If you've got a compelling enough business model, do what other commercial NLP projects do: make your own corpus. It's a great opportunity to get a cheap proprietary resource that you have greater control over. It's also a great opportunity for interns and cheap part-time workers in computational linguistics who desperately need CV/resume experience.
Robert Elwell
+2  A: 

The CoNLL 2003 Shared Task has about 200k tokens of POS-tagged English and German data. The tags are freely available, but the textual data is taken from the Reuters Corpus Volume 1, which you have to obtain separately. The licence for the Reuters corpus indicates that it is intended for research use, but does not stipulate non-commercial use.
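Once the tags and the text are combined, the data is a plain column format that is easy to read. The sketch below assumes one token per line with whitespace-separated columns (word first, POS tag second), blank lines between sentences, and `-DOCSTART-` lines marking document boundaries; check that against the files you actually receive. `read_conll_pos` and `SAMPLE` are illustrative names, and the sample text is made up rather than taken from the corpus.

```python
def read_conll_pos(lines):
    """Yield sentences as lists of (word, pos) pairs from CoNLL-style lines."""
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        if len(columns) < 2:
            continue  # skip lines that don't have at least word and POS
        sentence.append((columns[0], columns[1]))
    if sentence:
        yield sentence


# Made-up sample in the assumed layout: word, POS, chunk, entity.
SAMPLE = """-DOCSTART- -X- O O

The DT I-NP O
dog NN I-NP O
barks VBZ I-VP O
. . O O

Cats NNS I-NP O
sleep VBP I-VP O
"""

sentences = list(read_conll_pos(SAMPLE.splitlines()))
for sent in sentences:
    print(sent)
```

The resulting lists of (word, tag) pairs are in the shape most tagger training code expects, so they can be fed straight into whatever trainer you settle on.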

StompChicken
Have you sent in the form? Wondering what the process is like.
stevedbrown
I can't remember exactly. The turnaround was about a week, I think.
StompChicken