How can I pick tags from an article or a user's post using Python?

Is the following method ok?

  1. Build a list of word frequencies from the text and sort it.

  2. Remove common words and pick the top 10 words remaining in the list as the tags.

If the above method is ok, what library can detect which words are common (like "the", "if", "you", etc.) and which are descriptive?

+4  A: 

Here's an article on removing stop words. The link to the stop word list in the article is broken but here's another one.
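
As a rough illustration of the frequency-plus-stop-words approach from the question, here is a minimal sketch using only the standard library. The STOP_WORDS set and the suggest_tags function are made up for this example; the set is deliberately tiny, and a real stop word list would have a few hundred entries.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption for this sketch);
# swap in a full list for real use.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "be", "can", "etc", "from", "how",
    "i", "if", "in", "is", "it", "like", "of", "ok", "or", "some",
    "the", "them", "to", "what", "which", "you",
}


def suggest_tags(text, limit=10, stop_words=STOP_WORDS):
    """Return up to `limit` of the most frequent non-stop words in `text`."""
    # Lowercase the text and split it into runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Count every word that is not a stop word and is longer than 2 chars.
    counts = Counter(
        w for w in words if w not in stop_words and len(w) > 2
    )
    # most_common() sorts by descending frequency.
    return [word for word, _ in counts.most_common(limit)]
```

Calling suggest_tags(post_text) on a post gives you candidate words, not finished tags; you would still want to review or filter them.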

ʞɔıu
+3  A: 

The Natural Language Toolkit offers a broad variety of methods for this kind of thing. I can't give you hands-on advice, as I'm not familiar with the subject, but I think it's worth reading a few articles on the topic before you start. Just picking words directly from the text won't get you very far, I think; you should probably try to find words similar to the ones for which tags already exist. And of course you need to filter out the common words of the language, like "the". Again, this Python library can help you with that, at least for a few common languages.

paprika
+2  A: 

I'd suggest you download the Stack Overflow data dump. It gives you a lot of real-world posts, with appropriate tags, on which to test different tag selection algorithms.

But generally I doubt it will work too well. For your own question, "words" is the clear winner in word count, followed by a list of words with two appearances each, like "common", "list", "method", "pick" and "tags". Which of those would you automatically choose as tags? Also, the tags you chose manually include "python" and "context", neither of which shows up with high word frequency.

sth
A: 

Instead of blacklisting words that shouldn't be tags, why don't you build a whitelist of words that would make good tags?

Start with a handful of tags that you would like to have, like Python, off-topic, football, rickroll or whatnot (depends on the kind of site you are building!) and have the system only suggest from among those, then let users handpick appropriate tags and also let them type in their own tags.

When enough users suggest a tag, it gets into the pool of "known good" tags for auto suggestion -- maybe after some sort of moderation, so that you can still blacklist stupid tags like the, lolol, or typoed tags like objectoriented when you have object-oriented.

Only show a few suggestions. Offer autocompletion. Limit the number of tags per item. If this will be about coding, maybe some sort of language detection system (the Linux file command is not too shabby at this) will help your suggestion system.
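
To make the whitelist idea concrete, here is a small sketch under my own assumptions: the TagPool class, its method names, and the promote_after threshold are all invented for illustration, and the "moderation" step is reduced to a plain blacklist set.

```python
import re
from collections import Counter


class TagPool:
    """Hypothetical pool of known-good tags with user-driven promotion."""

    def __init__(self, known_tags, promote_after=3):
        self.known = {t.lower() for t in known_tags}
        self.pending = Counter()   # user-typed tag -> times proposed
        self.blacklist = set()     # tags a moderator has rejected
        self.promote_after = promote_after

    def suggest(self, text, limit=5):
        """Suggest only known tags that literally appear in the text."""
        words = set(re.findall(r"[a-z0-9+#-]+", text.lower()))
        return sorted(self.known & words)[:limit]

    def propose(self, tag):
        """Record a user-typed tag; promote it once enough users agree."""
        tag = tag.lower()
        if tag in self.known or tag in self.blacklist:
            return
        self.pending[tag] += 1
        if self.pending[tag] >= self.promote_after:
            self.known.add(tag)
            del self.pending[tag]
```

For example, with pool = TagPool(["python", "football"]), a call to pool.suggest(post_text) only ever returns tags from the pool, and pool.propose("nltk") has to be called promote_after times before "nltk" starts being suggested. Adding "the" to pool.blacklist keeps it out for good.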

badp
+1  A: 

Train a Bayes or Fisher filter with already tagged data (e.g. the Stack Overflow data dump suggested by sth) and use it to classify new posts. I'd recommend reading the excellent book Programming Collective Intelligence by Toby Segaran for more information and Python examples on this topic.
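
To give a feel for the Bayes variant, here is a minimal naive Bayes sketch with add-one smoothing. It is not the code from the book; the NaiveBayesTagger class and the tokenize helper are names I made up, and a real system would need far more training data plus some stop word filtering.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Hypothetical tagger: scores each tag by log P(tag) + sum log P(word|tag)."""

    def __init__(self):
        self.tag_counts = Counter()              # tag -> number of training posts
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.vocabulary = set()

    def train(self, text, tags):
        words = tokenize(text)
        self.vocabulary.update(words)
        for tag in tags:
            self.tag_counts[tag] += 1
            self.word_counts[tag].update(words)

    def classify(self, text, limit=3):
        words = tokenize(text)
        # Total number of (post, tag) assignments seen in training.
        total = sum(self.tag_counts.values())
        # +1 reserves a slot for words never seen in training.
        vocab_size = len(self.vocabulary) + 1
        scores = {}
        for tag, n_posts in self.tag_counts.items():
            total_words = sum(self.word_counts[tag].values())
            score = math.log(n_posts / total)
            for word in words:
                # Add-one (Laplace) smoothing avoids log(0) for unseen words.
                count = self.word_counts[tag][word] + 1
                score += math.log(count / (total_words + vocab_size))
            scores[tag] = score
        # Highest-scoring tags first.
        return sorted(scores, key=scores.get, reverse=True)[:limit]
```

You would call train(post_text, post_tags) for every post in the dump and then classify(new_post_text) to get the most likely tags for a new post.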

Denis Otkidach