ansaurus

Question

How do I create my own training corpus for the Stanford tagger?

Answer 1

A:

For the Stanford Parser, the training data must be in Penn Treebank format; see Stanford's FAQ for the exact commands to use. The JavaDocs for the LexicalizedParser class also give appropriate commands, particularly:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \
   -train trainFilesPath fileRange \
   -saveToSerializedFile serializedGrammarFilename
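
To get a feel for what a Penn Treebank-style bracketed tree looks like before feeding files to the command above, here is a minimal Python sketch. It is not part of the Stanford tools: the helper names `is_balanced` and `leaves` and the sample tree are made up for illustration, and the regular expression assumes every leaf is written as `(TAG word)`.

```python
import re

# Matches a preterminal node such as "(NN cat)": a tag followed by one word.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def is_balanced(tree):
    """Return True if every '(' in the tree string has a matching ')'."""
    depth = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def leaves(tree):
    """Return the (word, tag) pairs of a bracketed tree, left to right."""
    return [(word, tag) for tag, word in LEAF.findall(tree)]


# Hypothetical one-sentence example, not taken from any real treebank.
SAMPLE = "(S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .))"
```

Calling `leaves(SAMPLE)` gives the word/tag pairs of the sentence, which is a quick sanity check that a hand-built tree is at least well-formed before training on it.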

Ken Bloom 2010-07-01 13:14:23

Answer 2

+1 A:

To train the PoS tagger, see this mailing list post which is also included in the JavaDocs for the MaxentTagger class.

The JavaDocs for the edu.stanford.nlp.tagger.maxent.Train class specify the training format:

The training file should be in the following format: one word and one tag per line separated by a space or a tab. Each sentence should end in an EOS word-tag pair. (Actually, I'm not entirely sure that is still the case, but it probably won't hurt. -wmorgan)
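
As a rough illustration of the format quoted above, the following Python sketch turns a list of tagged sentences into one word and one tag per line. The function name `format_training_corpus` is hypothetical, and writing the end-of-sentence marker as the pair `EOS EOS` is an assumption: the quoted documentation only says each sentence should end in "an EOS word-tag pair" and is itself unsure whether that is still required.

```python
def format_training_corpus(sentences, sep="\t", eos_pair=("EOS", "EOS")):
    """Render tagged sentences as one 'word<sep>tag' line per token.

    sentences: iterable of sentences, each a list of (word, tag) tuples.
    sep:       a space or a tab, as the quoted format allows.
    eos_pair:  assumed end-of-sentence word/tag pair; pass None to omit it.
    """
    lines = []
    for sentence in sentences:
        for word, tag in sentence:
            lines.append(f"{word}{sep}{tag}")
        if eos_pair is not None:
            lines.append(f"{eos_pair[0]}{sep}{eos_pair[1]}")
    return "\n".join(lines) + "\n"


# Hypothetical two-sentence corpus using Penn Treebank-style tags.
CORPUS = [
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")],
    [("Dogs", "NNS"), ("bark", "VBP"), (".", ".")],
]

if __name__ == "__main__":
    # Write the corpus to a file that can be passed to the tagger's trainer.
    with open("train.txt", "w", encoding="utf-8") as handle:
        handle.write(format_training_corpus(CORPUS))
```

Check the result against the JavaDocs for the tagger version you actually have, since the expected separator and the EOS convention may differ between releases.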

Ken Bloom 2010-07-01 13:20:37

@Ken, I checked everywhere, but it does not specify how to structure the training file. Also, how long should my training model be?

goh 2010-07-02 07:23:37

@goh: I've responded with an edit.

Ken Bloom 2010-07-02 13:22:02

@ken, thanks for the help.

goh 2010-07-06 07:52:54
