In an app that I'm creating, I want to add functionality that groups news stories together. Stories about the same topic from different sources should end up in the same group. For example, an article on XYZ from CNN and one from MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.

Thanks in advance for the help!

+1  A: 

One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.

You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.

This approach depends heavily on human beings assigning appropriate tags, so that a search returns the right articles without returning too many. It isn't easy to do really well.

DOK
Hmmm, good solution, but I don't think that would work for us. Our solution will automatically pull articles from the web without any human interaction, so we can't tag them.
Randy
+1  A: 

This problem breaks down into a few subproblems from a machine learning standpoint.

First, you need to decide which properties of the news stories to group on. A common technique is to use 'word bags': just a list of the words that appear in the body of the story or in the title. You can do some additional processing, such as removing common English "stop words" that carry little meaning, such as "the" or "because". You can even do Porter stemming to remove redundancies from plurals and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may have to do some preprocessing first to remove HTML markup.
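To make the word-bag idea concrete, here is a minimal sketch in Python using only the standard library. The stop-word list is a tiny illustrative sample, `crude_stem` is a hypothetical stand-in that strips a few suffixes and is not the real Porter algorithm, and the regex-based tag stripping is only good enough for a demo. In practice you would probably reach for an NLP library for all three.

```python
import re
from collections import Counter
from html import unescape

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "because", "a", "an", "and", "of", "in", "on", "to", "is"}


def strip_html(text):
    """Crudely drop HTML tags and decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def crude_stem(word):
    """Rough stand-in for Porter stemming: strip a few common suffixes."""
    for suffix in ("ions", "ion", "ing", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Turn a story into a word -> count mapping (its feature vector)."""
    words = re.findall(r"[a-z]+", strip_html(text).lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)


story = "<p>The elections in the city: election officials are counting</p>"
print(bag_of_words(story))
```

Note how "elections" and "election" collapse into the same entry, which is the redundancy that stemming is meant to remove.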

Second, you have to define a similarity metric: similar stories score high in similarity. Going along with the bag of words approach, two stories are similar if they have similar words in them (I'm being vague here, because there are tons of things you can try, and you'll have to see which works best).
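One common choice for the metric (my pick for illustration, there are many others) is cosine similarity between the word-count vectors: 1.0 means the same mix of words, 0.0 means no words in common. A minimal sketch, assuming each story has already been reduced to a `Counter` of words; the three example headlines are made up:

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    # Only words present in both stories contribute to the dot product.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty story is similar to nothing
    return dot / (norm_a * norm_b)


# Made-up headlines standing in for two sources on one topic plus an outlier.
cnn = Counter("senate passes budget bill".split())
msnbc = Counter("budget bill clears senate".split())
sports = Counter("team wins final match".split())

print(cosine_similarity(cnn, msnbc))   # three of four words shared
print(cosine_similarity(cnn, sports))  # no words shared
```

Raw counts tend to let very common words dominate, so weighting schemes such as TF-IDF are often tried next. Which one works best is something you have to test on your own data.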

Finally, you can use a classic clustering algorithm, such as k-means clustering, which groups the stories together, based on the similarity metric.
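For a feel of what k-means does, here is a bare-bones sketch in plain Python. It assumes each story has already been turned into a fixed-length list of word counts over a shared vocabulary, and it uses squared Euclidean distance, which is what textbook k-means minimizes. The function names, the `iterations` and `seed` parameters, and the toy vectors are all illustrative choices of mine. Keep in mind that k-means needs the number of groups `k` up front, which you usually won't know for a live news feed.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def mean(points):
    """Component-wise average of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(vectors, k, iterations=20, seed=0):
    """Return one cluster label (0..k-1) per input vector."""
    rng = random.Random(seed)           # fixed seed keeps runs repeatable
    centroids = rng.sample(vectors, k)  # start from k distinct stories
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each story joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # leave a centroid in place if it lost all members
                centroids[c] = mean(members)
    return labels


# Toy data: columns are counts for four hypothetical vocabulary words.
vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels = k_means(vectors, k=2)
print(labels)  # the first two stories share one label, the last two the other
```

A production system would normally use a library implementation and run it several times with different starting points, since k-means can settle into a poor grouping from an unlucky start.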

In summary: convert news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.

Check out Google Scholar; there have probably been some papers on this specific topic in the recent literature. Many of the techniques I just described are already implemented in natural language processing and machine learning libraries for most major languages.

orangeoctopus
Great answer! This is exactly what I was looking for. Quick follow-up question: if I were looking for a developer with these skill sets, what kind of things should I ask for? I don't even know what this field of study is called.
Randy
Look for a computer science student who has either taken a class in, or has had experience with, 'natural language processing' or 'machine learning'. Your question was very straightforward to answer in a machine learning context, so just ask them how they would implement something that groups news stories. Also, projects like this don't always work out, because there are tons of things that can go wrong in ML and NLP, but when they do work, it is pretty awesome.
orangeoctopus