Hello,

I am planning an application that clusters short messages/tweets by topic. The number of topics will be limited, for example Sports [NBA, NFL, Cricket, Soccer], Entertainment [movies, music], and so on.

I can think of two approaches to this:

  • Ask users to tag messages, like Stack Overflow does for questions. Users select tags from a predefined list, and on the server side I cluster the messages by tag. Pros: simple design, less complexity in code. Cons: users' choices are restricted and the clusters are not dynamic. If a new event occurs, the predefined tags will miss it.
  • Take the message, delete the stopwords [predefined in a dictionary], stem the remaining words, and apply a clustering algorithm to the stemmed message to form clusters. A cluster is displayed depending on its popularity, and only for as long as it remains popular [many messages/minute]. New messages are scanned and assigned to the corresponding clusters. Pros: dynamic clustering based on the popularity of the event/accident. Cons: increased complexity, more server resources required.
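To make the second approach concrete, here is a minimal sketch of the pipeline described above: stopword removal, stemming, then assigning each incoming message to the closest existing cluster. Everything in it is an assumption made for illustration: the tiny `STOPWORDS` set, the `crude_stem` suffix stripper (a stand-in for a real stemmer), the Jaccard similarity measure, the greedy assignment rule, and the `threshold` value. Popularity tracking (messages/minute) is left out.

```python
import re

# Hypothetical, tiny stopword list; a real system would load a fuller dictionary.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "in", "on", "at",
             "to", "of", "and", "for", "this", "that"}

def crude_stem(word):
    # Very rough suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokens(message):
    # Lowercase, split into words, drop stopwords, stem what is left.
    words = re.findall(r"[a-z0-9]+", message.lower())
    return {crude_stem(w) for w in words if w not in STOPWORDS}

def jaccard(a, b):
    # Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical).
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def assign(message, clusters, threshold=0.2):
    """Add the message to the most similar cluster, or start a new one."""
    toks = tokens(message)
    best, best_score = None, 0.0
    for cluster in clusters:
        score = jaccard(toks, cluster["terms"])
        if score > best_score:
            best, best_score = cluster, score
    if best is None or best_score < threshold:
        best = {"terms": set(), "messages": []}
        clusters.append(best)
    best["terms"] |= toks
    best["messages"].append(message)
    return best

clusters = []
for msg in [
    "Lakers win the NBA finals",
    "NBA finals: Lakers winning again",
    "New movie trailer released today",
    "The movie trailer looks great",
]:
    assign(msg, clusters)
```

With these four sample messages, the two NBA messages end up in one cluster and the two movie messages in another. Storing a timestamp with each message would be one way to compute messages/minute per cluster and hide clusters that are no longer popular.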

I would like to know whether there are any other approaches to this problem. Or are there any ways of improving the above mentioned methods?

Please also suggest some good clustering algorithms. I think the "K-Nearest Clustering" algorithm is apt for this situation.

+1  A: 

Use Bayesian classification. Train the filter with some predefined corpus, and (optionally) provide a way for users to further refine it by flagging things that were incorrectly categorized.

Here are some examples of using the Bayesian classifier in NLTK.
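As a rough illustration of the idea, the sketch below is a hand-rolled multinomial naive Bayes classifier using only the standard library. It is not NLTK's API; the `NaiveBayes` class, its method names, the add-one smoothing, and the four-message training corpus are all assumptions made up for this example.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayes:
    """Hypothetical minimal classifier; not the NLTK implementation."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> training messages seen
        self.vocab = set()

    @staticmethod
    def words(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, topic):
        self.doc_counts[topic] += 1
        for w in self.words(text):
            self.word_counts[topic][w] += 1
            self.vocab.add(w)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, -math.inf
        for topic, n_docs in self.doc_counts.items():
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[topic].values())
            for w in self.words(text):
                count = self.word_counts[topic][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic

# Made-up predefined corpus: (message, topic) pairs.
nb = NaiveBayes()
nb.train("Lakers beat Celtics in NBA game", "sports")
nb.train("Great goal in the soccer match", "sports")
nb.train("New movie opens this weekend", "entertainment")
nb.train("The band released a new music album", "entertainment")

topic_a = nb.classify("NBA game tonight")
topic_b = nb.classify("new music album")
```

Here `topic_a` comes out as `"sports"` and `topic_b` as `"entertainment"`. The optional refinement step maps onto the same `train` call: when a user flags a message as miscategorized, feed it back with the corrected topic.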

Hank Gay
@Hank thanks for the reply... Actually, I want to keep it as simple as possible for the users. I think it would be nice if users could just enter a message and the server would figure out where to put it, though putting that much intelligence in the server will be tough.
Jagira
You don't have to provide a way to do ongoing training of the filter; that just makes the filter better. If you have a good corpus, the classification should be acceptable without ongoing tuning.
Hank Gay