I'm doing a project on mining blog content and I need help deciding which tool to use. When do I use a parser, when do I use a tagger, and when do I need an NER tool?

For instance, I want to find the most talked-about topics/subjects across several blogs. Do I use a part-of-speech tagger to grab the nouns and do a frequency count? That would probably be insufficient because very generic terms can pop up, right? Or do I keep a list of categories and their synonyms that I can match against?

BTW, I'm using NLTK, but I'm looking at the Stanford tagger or parser since a couple of dudes said they were good.

A: 

Instead of trying to reinvent the wheel, you might want to read up on topic models, which basically create clusters of words that frequently occur together. Mallet has a readily available toolkit for this task: http://mallet.cs.umass.edu/topics.php .

To answer your original question: POS taggers, parsers, and NER tools are not typically used for topic identification. They are more heavily used for tasks like information extraction, where the goal is to identify the specific actors, events, locations, times, etc. within a document. For example, if you had a simple sentence like "John gave the apple to Mary." you might use a dependency parser to figure out that John is the subject, the apple is the direct object, and Mary is the object of the preposition "to"; thus you know John is the giver and Mary is the receiver, and not vice versa.
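To make the "John gave the apple to Mary" example concrete, here is a minimal Python sketch of how parser output could be turned into giver/receiver roles. It does not call a real parser: the arcs are written by hand in the shape a dependency parser might return, and the relation labels (`nsubj`, `dobj`, `prep_to`), the role names, and the `extract_roles` helper are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (head, relation, dependent) arcs, written by hand for illustration.
# A real dependency parser would produce these; the label names are assumed.
Arc = Tuple[str, str, str]

ARCS: List[Arc] = [
    ("gave", "nsubj", "John"),    # John is the subject of "gave"
    ("gave", "dobj", "apple"),    # the apple is the direct object
    ("gave", "prep_to", "Mary"),  # Mary is the object of the preposition "to"
]

# Hypothetical mapping from grammatical relation to semantic role for "give".
ROLE_FOR_RELATION: Dict[str, str] = {
    "nsubj": "giver",
    "dobj": "thing_given",
    "prep_to": "receiver",
}


def extract_roles(arcs: List[Arc], verb: str) -> Dict[str, str]:
    """Collect the roles filled by the dependents of the given verb."""
    roles: Dict[str, str] = {}
    for head, relation, dependent in arcs:
        if head == verb and relation in ROLE_FOR_RELATION:
            roles[ROLE_FOR_RELATION[relation]] = dependent
    return roles


if __name__ == "__main__":
    print(extract_roles(ARCS, "gave"))
```

With a real parser you would replace the hand-written `ARCS` with its output and adjust the relation labels to whatever scheme that parser uses.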

hapagolucky
@hapagolucky, thanks! finally some dude answered my question.
goh