Dear developers,

Let me describe the issue I am currently facing.

I am unable to manage RSS feeds easily because of the overwhelming number of new stories and near-duplicate news items posted on various news sites. For subjects such as world news and business news, many of the stories are redundant, which burdens readers with sorting out which stories they have already read. To deal with the twin problems of flooding and redundancy, I need to develop code that reduces the number of items to read and uses the overlapping information to identify interesting topics.

It would be easier if I could group similar news items together, as Google News / Stack Overflow do, and present the groups to the users.

Thanks in advance.

A: 

I don't see any question here, but I would start by developing some sort of fingerprint algorithm, built from the words, names, titles, dates, etc. in the articles. Then I would check the similarity of the fingerprints to find identical articles, maybe with some sort of MapReduce job to easily spread the work across different servers in a cluster.
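
To make the fingerprint idea a bit more concrete, here is a minimal single-machine sketch in Python. It is only one possible reading of "fingerprint": it assumes the fingerprint is the set of overlapping 3-word sequences (shingles) in the text and that similarity is the Jaccard overlap of two such sets. The function names `fingerprint` and `jaccard` and the shingle size are my own illustrative choices, and the MapReduce part is not covered.

```python
import re


def fingerprint(text: str, k: int = 3) -> frozenset:
    """Return the set of k-word shingles in the text (assumed fingerprint)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Texts shorter than k words get a single shingle, or none if empty.
        return frozenset([tuple(words)]) if words else frozenset()
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 0.0  # two empty fingerprints give no evidence of similarity
    return len(a & b) / len(a | b)


fp_a = fingerprint("Oil prices rise as OPEC cuts output")
fp_b = fingerprint("Oil prices rise as OPEC cuts production")
fp_c = fingerprint("Local team wins championship game")

print(jaccard(fp_a, fp_b))  # high overlap: the headlines differ in one word
print(jaccard(fp_a, fp_c))  # no overlap
```

Pairs scoring above some cut-off you choose would then be treated as the same story; the right cut-off has to be found by testing on real feeds.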

If you want some inspiration, check out the source code for Google Living Stories: http://code.google.com/p/living-stories/

Emil Vikström
+1  A: 

This is definitely a not-so-easy-to-solve problem that can be solved by:

  • smart text-parsing functions
  • raw hardware power
  • both of them
  • testing, testing, testing
  • fine-tuning at the end

First of all, I'd assign each news source to a relatively broad category. You can usually assume a tech news source won't publish news under an economics category. (Or it will, and that's the problem.)

In most cases the news title isn't touched; it usually stays in its original form. So category, title, and publish date are a good starting point for grouping news items together.
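
A rough Python sketch of that first pass could look like the following. It assumes each feed item has already been parsed into a dict with `category`, `title`, `published`, and `source` fields (these field names are made up for the example) and simply groups items whose category, normalized title, and publish date all match.

```python
import re
from collections import defaultdict
from datetime import date


def normalize_title(title: str) -> str:
    """Lower-case the title and drop punctuation so trivial edits don't matter."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def group_items(items):
    """Group items that share category, normalized title, and publish date."""
    groups = defaultdict(list)
    for item in items:
        key = (item["category"], normalize_title(item["title"]), item["published"])
        groups[key].append(item)
    return list(groups.values())


# Hypothetical, hand-written items standing in for parsed RSS entries.
items = [
    {"category": "business", "title": "Oil Prices Rise!",
     "published": date(2010, 9, 1), "source": "site-a"},
    {"category": "business", "title": "Oil prices rise",
     "published": date(2010, 9, 1), "source": "site-b"},
    {"category": "tech", "title": "New phone released",
     "published": date(2010, 9, 1), "source": "site-c"},
]

for group in group_items(items):
    print([item["source"] for item in group])
```

Exact matching on the key is deliberately strict; it only catches stories whose titles were copied more or less verbatim, which is why the finer-grained text comparison may still be needed.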

If you run into problems with the methods above, you'll need some fine-tuning under the hood.

Maybe you need to read the whole article and compare two (thousands of) articles word-by-word.

  • There are a lot of stopwords that can distort the comparison, so you'll need to ignore these.
  • You may want to define synonyms (J Lo = Jennifer Lopez).

If the raw texts of news are similar (you can define a threshold value) you can compare the other factors again (described above).
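
As a sketch of how the stopword, synonym, and threshold ideas could fit together, here is a small Python example. The stopword list, the synonym table, the use of cosine similarity over word counts, and the threshold of 0.6 are all placeholder assumptions for illustration; real values would come out of the testing and fine-tuning.

```python
import math
import re
from collections import Counter

# Tiny illustrative lists; a real system would need far larger ones.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "is", "for", "with"}
SYNONYMS = {"j lo": "jennifer lopez"}


def tokens(text: str) -> list:
    """Lower-case, map synonyms to a canonical form, and drop stopwords."""
    text = text.lower()
    for alias, canonical in SYNONYMS.items():
        text = re.sub(r"\b" + re.escape(alias) + r"\b", canonical, text)
    return [w for w in re.findall(r"[a-z0-9]+", text) if w not in STOPWORDS]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def is_similar(x: str, y: str, threshold: float = 0.6) -> bool:
    """True when the two texts score at or above the chosen threshold."""
    return cosine(Counter(tokens(x)), Counter(tokens(y))) >= threshold


print(is_similar("J Lo announces a new album",
                 "Jennifer Lopez announces new album"))
print(is_similar("J Lo announces a new album", "Stocks fall in Asia"))
```

Only pairs that pass this text check would then go on to the comparison of category, title, and publish date.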

Some news sources provide good tagging in their RSS feeds; you can use this too, but don't rely on it.

And remember, you'll need a lot of fine-tuning at the start (about 1 year); after that you'll be fine.

fabrik
Dear Fabrik, thanks for your reply... is there any algorithm or code available for this?
Gourav
The bad news: you'll have to write your own. The good news? I've provided a lot of useful info ;)
fabrik