We've been using the NLTK library in a recent project, where we're mainly interested in its named-entity recognition.

In general we're getting good results using the NEChunkParser class. However, we're trying to find a way to provide our own terms to the parser, without success.

For example, we have a test document where my name (Shay) appears in several places. The library tags it as GPE, while I'd like it to be tagged as PERSON.

Is there a way to provide some kind of custom file or code so that the parser interprets the named entity the way I want?

Thanks!

A: 

The easy solution is to compile a list of entities that you know are misclassified, then filter the NEChunkParser output in a postprocessing module and replace these entities' tags with the tags you want them to have.
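A minimal sketch of that postprocessing step, assuming the chunker output has already been flattened into (word, POS tag, IOB tag) triples, which is the shape `nltk.chunk.tree2conlltags` returns. The `OVERRIDES` table, the `relabel_entities` helper and the sample sentence are made up for illustration and are not part of NLTK:

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, IOB chunk tag)

# Hypothetical override table: entity text -> the label we want it to have.
OVERRIDES: Dict[str, str] = {"Shay": "PERSON"}


def relabel_entities(tokens: List[Token], overrides: Dict[str, str]) -> List[Token]:
    """Return a copy of tokens with the labels of listed entities replaced."""
    result: List[Token] = []
    i = 0
    while i < len(tokens):
        iob = tokens[i][2]
        if not iob.startswith("B-"):
            # Not the start of an entity: copy the token unchanged.
            result.append(tokens[i])
            i += 1
            continue
        # Collect the whole entity: one B- token plus any following I- tokens.
        j = i + 1
        while j < len(tokens) and tokens[j][2].startswith("I-"):
            j += 1
        span = tokens[i:j]
        text = " ".join(word for word, _, _ in span)
        # Use the override if the entity text is listed, else keep the label.
        label = overrides.get(text, iob[2:])
        for k, (word, pos, _) in enumerate(span):
            prefix = "B-" if k == 0 else "I-"
            result.append((word, pos, prefix + label))
        i = j
    return result


# Made-up chunker output in which "Shay" was tagged as GPE.
sample: List[Token] = [
    ("Shay", "NNP", "B-GPE"),
    ("lives", "VBZ", "O"),
    ("in", "IN", "O"),
    ("New", "NNP", "B-GPE"),
    ("York", "NNP", "I-GPE"),
]
corrected = relabel_entities(sample, OVERRIDES)
```

If a tree is needed afterwards, the corrected triples can be converted back with `nltk.chunk.conlltags2tree`. Matching on the exact entity text is the simplest choice; it will also relabel a genuine place called "Shay", which is the trade-off of the list-based approach.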

The proper solution is to retrain the NE tagger. If you look at the source code for NLTK, you'll see that the NEChunkParser is based on a MaxEnt classifier, i.e. a machine learning algorithm. You'll have to compile and annotate a corpus (dataset) that is representative of the kind of data you want to work with, then retrain the NE tagger on this corpus. (This is hard, time-consuming and potentially expensive.)

larsmans