views:

45

answers:

1

I'm doing a little research on how to cluster articles into 'news stories' ala Google News.

Looking at previous questions here on the subject, I often see it recommended to simply pull out a vector of words from an article, weight some of the words more if they're in certain parts of the article (e.g. the headline), and then to use something like a k-means algorithm to cluster the articles.

But this leads to a couple of questions:

  • With k-means, how do you know in advance how much k should be? In a dynamic news environment you may have a very variable number of stories, and you won't know in advance how many stories a collection of articles represents.

  • With hierarchal clustering algorithms, how do you decide which clusters to use as your stories? You'll have clusters at the bottom of the tree that are just single articles, which you obviously won't want to use, and a cluster at the root of the tree which has all of the articles, which again you won't want...but how do you know which clusters in between should be used to represent stories?

  • Finally, with either k-means or hierarchal algorithms, most literature I have read seems to assume you have a preset collection of documents you want to cluster, and it clusters them all at once. But what of a situation where you have new articles coming in every so often. What happens? Do you have to cluster all the articles from scratch, now that there's an additional one? This is why I'm wondering if there are approaches that let you 'add' articles as you go without re-clustering from scratch. I can't imagine that's very efficient.

Thanks for any tips and help :)

+1  A: 

I would do a search for adaptive K-means clustering algorithms. There is a good section of research devoted to the problems you describe. Here is one such paper (pdf)

Eric LaForce
Thanks Eric!That's a helpful paper :)It addresses the issue of pre-determining the number of clusters, and I guess the choice of threshold is pretty critical in terms of the quality of clusters...but it's something that can be experimented with.I'm wondering though...do you know if this algorithm would work well in an incremental context? I mean, if a new article comes along, and I assign it to a cluster based on least distance to existing clusters, will this lead to the same result as recomputing the clusters from scratch, or a result that is for all intents and purposes 'as good'?
Peter
Based on his conclusion paragraph I believe the answer is yes it will perform "as good" as if you had recomputed the clusters from scratch assuming your distance calculation is done correctly. I don't think it would take you too long to implement a prototype in a scripting language (easy to parse many data formats quickly, and provides good libraries for cluster visualization). Then you could have a strategy pattern, one strategy using the adaptive k-means and one strategy using the normal k-means that recomputes every time.
Eric LaForce