views:

63

answers:

2

Hey guys,

I have to analyze informal english text with lots of short hands and local lingo. Hence I was thinking of creating the model for the stanford tagger.

How do i create my own set of labelled corpus for the stanford tagger to train on?

What is the syntax of the corpus and how long should my corpus be in order to achieve a desirable performance?

A: 

For the Stanford Parser, you use Penn treebank format, and see Stanford's FAQ about the exact commands to use. The JavaDocs for the LexicalizedParser class also give appropriate commands, particularly:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \
   -train trainFilesPath fileRange
   -saveToSerializedFile serializedGrammarFilename
Ken Bloom
+1  A: 

To train the PoS tagger, see this mailing list post which is also included in the JavaDocs for the MaxentTagger class.

The javadocs for the edu.stanford.nlp.tagger.maxent.Train class specifies the training format:

The training file should be in the following format: one word and one tag per line separated by a space or a tab. Each sentence should end in an EOS word-tag pair. (Actually, I'm not entirely sure that is still the case, but it probably won't hurt. -wmorgan)

Ken Bloom
@Ken, I checked everywhere but it does not specify how to structure the training file? And how long should my training model be?
goh
@goh: I've responded with an edit.
Ken Bloom
@ken, thanks for the help.
goh