Hi folks,

I've built a content aggregator and would like to add a tag cloud representing the current trends.

Unfortunately this is quite complex, as I have to look for keywords that represent the context of each article.

For example, words such as "I", "was", "the", "amazing", and "nice" have no relation to the context.


Help would be much appreciated! :)

+2  A: 

NLTK can help you analyze the content in order to pick out relevant terms.

Ignacio Vazquez-Abrams
+6  A: 

Use NLTK, and in particular its Stopwords corpus:

Besides regular content words, there is another class of words called stop words that perform important grammatical functions, but are unlikely to be interesting by themselves. These include prepositions, complementizers, and determiners. NLTK comes bundled with the Stopwords corpus, a list of 2400 stop words across 11 different languages (including English).
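
As a rough sketch of how stop word filtering could feed a tag cloud: count the words in an article, skipping anything in a stop word set. The function name `top_terms` and the tiny `SAMPLE_STOPWORDS` set are made up for illustration; in practice you would probably load the fuller list with `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption for this sketch).
# A fuller list is available from NLTK, e.g.:
#   from nltk.corpus import stopwords
#   english_stopwords = set(stopwords.words("english"))
SAMPLE_STOPWORDS = {"i", "was", "the", "a", "an", "and", "of", "to", "in", "is", "it"}


def top_terms(text, stopwords=SAMPLE_STOPWORDS, n=10):
    """Return the n most frequent non-stop-words in text as (word, count) pairs."""
    # Lowercase and split into simple word tokens (letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    # Count only the words that are not in the stop word set.
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_terms("The wall was great and the wall was long", n=3))
```

The counts could then be mapped to font sizes in the tag cloud. Note that a plain stop word list will not drop words like "amazing" or "nice", since those are ordinary adjectives.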

Alex Martelli
@Alex: thanks for the awesome answer! But can this deal with adjectives such as **good**, **great**, etc.?
RadiantHex
@Radiant, adjectives aren't stopwords, as they do convey meaning -- e.g., "The Great Wall" is a very specific and long wall in China, while "The Wall" is a Pink Floyd album -- etc. If you want to skip adjectives (a dubious decision), use NLTK to do part-of-speech tagging, per http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/ (also read parts 2 and 3 of course).
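
A minimal sketch of the adjective-skipping idea, assuming the tokens have already been tagged with Penn Treebank tags (the tag set that `nltk.pos_tag` returns by default, where adjectives are `JJ`, `JJR`, and `JJS`). The helper name `drop_adjectives` is hypothetical, and the input here is tagged by hand so the example runs without NLTK's tagger data.

```python
def drop_adjectives(tagged_tokens):
    """Keep the words whose Penn Treebank tag is not an adjective tag (JJ, JJR, JJS)."""
    return [word for word, tag in tagged_tokens if not tag.startswith("JJ")]


if __name__ == "__main__":
    # Hand-tagged input for illustration. With NLTK installed and its tokenizer
    # and tagger data downloaded, tagged pairs like these could come from:
    #   import nltk
    #   tagged = nltk.pos_tag(nltk.word_tokenize("The great wall is long"))
    tagged = [("The", "DT"), ("great", "JJ"), ("wall", "NN"), ("is", "VBZ"), ("long", "JJ")]
    print(drop_adjectives(tagged))
```

As the comment above points out, this also throws away the "Great" in "The Great Wall", so it is worth checking the results before relying on it.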
Alex Martelli