I'm building a website in django that needs to extract key words from short (twitter-like) messages.

I've looked at packages like topia.textextract and nltk, but both seem to be overkill for what I need. All I need to do is filter out words like "and", "or", "not" (conjunctions and similar function words) while keeping the nouns and verbs. Are there any "simpler" packages out there that can do this?

EDIT: This needs to be done in near real-time on a production website, so using a keyword extraction service seems out of the question, based on their response times and request throttling.

+2  A: 

You can make a set sw of the "stop words" you want to eliminate (maybe copy it once and for all from the stop words corpus of NLTK, depending how familiar you are with the various natural languages you need to support), then apply it very simply.

E.g., if you have a list of words sent that make up the sentence (shorn of punctuation and lowercased, for simplicity), [word for word in sent if word not in sw] is all you need to make a list of non-stopwords -- could hardly be easier, right?

To get the sent list in the first place, re.findall(r'\w+', sentstring) from the standard library's re module might suffice, where sentstring is the string holding the sentence you're dealing with. It doesn't lowercase, but you can change the list comprehension I suggest above to [word for word in sent if word.lower() not in sw] to compensate. That also (btw) keeps each word's original case, which may be useful.
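
Putting the two steps together, a minimal sketch might look like the following. The function name extract_keywords and the tiny STOP_WORDS set are made up for illustration; a real stop-word set would be much larger.

```python
import re

# Hypothetical, deliberately tiny stop-word set, for illustration only.
STOP_WORDS = {"and", "or", "not", "the", "a", "to", "is"}


def extract_keywords(message, stop_words=STOP_WORDS):
    """Return the words of message that are not stop words, in original case."""
    # \w+ grabs runs of letters, digits and underscores, dropping punctuation.
    words = re.findall(r"\w+", message)
    # Compare in lowercase, but keep each word as it was written.
    return [word for word in words if word.lower() not in stop_words]
```

For example, extract_keywords("Django is not Flask") gives ['Django', 'Flask']. Note that \w+ splits contractions such as "don't" into "don" and "t", so a different pattern may be needed if that matters.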

Alex Martelli
Thanks for your answer. Is there a way I can extract the stopwords corpus from nltk without having to *use* nltk itself?
oliland
Sure, you just download it, e.g. http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/stopwords.zip . It's just a zip file of text files named english, russian, german, etc. -- each has one word per line. Couldn't be easier to get.
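
Given that one-word-per-line layout, loading a list once at startup could be sketched as below. The helper names and the UTF-8 encoding are assumptions for illustration, not something the corpus documents here; the path is whatever file you unzipped.

```python
def parse_stop_words(lines):
    """Build a lowercase set from an iterable of lines, one word per line."""
    # Strip whitespace and newlines, and skip any blank lines.
    return {line.strip().lower() for line in lines if line.strip()}


def load_stop_words(path, encoding="utf-8"):
    """Read a one-word-per-line file (encoding is an assumption) into a set."""
    with open(path, encoding=encoding) as handle:
        return parse_stop_words(handle)
```

Building the set once and reusing it keeps the per-message cost to a handful of set lookups, which suits the near real-time requirement.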
Alex Martelli
A: 

Abbreviations like NO for navigation officer or OR for operations room need a little care lest you cause a SNAFU ;-) One suspects that better results could be obtained from "Find the NO and send her to the OR" by tagging the words with parts of speech using the context ... hint 1: "the OR" should result in "the [noun]" not "the [conjunction]". Hint 2: if in doubt about a word, keep it as a keyword.
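
Real part-of-speech tagging needs a tagger, but the "if in doubt, keep it" hint can be roughly approximated with a much cruder heuristic: treat any all-uppercase token of two or more characters as a possible abbreviation and keep it. This is only a sketch of that stand-in heuristic, not context-based tagging; the function name and the small STOP_WORDS set are made up for illustration.

```python
import re

# Hypothetical stop-word set, just large enough for the example sentence.
STOP_WORDS = {"the", "no", "and", "her", "to", "or"}


def extract_keywords_keep_caps(message, stop_words=STOP_WORDS):
    """Filter stop words, but keep all-uppercase tokens such as NO or OR."""
    keywords = []
    for word in re.findall(r"\w+", message):
        # Heuristic, not tagging: uppercase and longer than one character.
        looks_like_abbreviation = len(word) > 1 and word.isupper()
        if looks_like_abbreviation or word.lower() not in stop_words:
            keywords.append(word)
    return keywords
```

On "Find the NO and send her to the OR" this yields ['Find', 'NO', 'send', 'OR']. A message typed entirely in capitals defeats the heuristic, since every word then looks like an abbreviation.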

John Machin