I'm working on a free web application that will analyze top news stories throughout the day and provide stats. Most news websites offer RSS feeds, which work fine for knowing which stories to retrieve. The problems arise when I try to get the full news story from the news website itself. At the moment, I have a separate NewsSource class for each source (CNN, NY Times, etc.) that reads the appropriate RSS feed(s), follows each link, and strips out the article body. This is tedious, and it becomes unmanageable whenever a news website changes the HTML structure of its articles.
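
To make the setup concrete, here is a minimal sketch of what one such per-source class does. It is not my actual code: `ExampleNewsSource`, `BodyExtractor`, `links_from_rss`, and the `article-body` class name are placeholders for illustration, not any real site's markup, and fetching the pages over HTTP is left out. It also assumes reasonably well-formed HTML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def links_from_rss(rss_xml):
    """Return the article URLs listed in an RSS 2.0 feed document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link", "").strip() for item in root.iter("item")]


class BodyExtractor(HTMLParser):
    """Collects text found inside elements whose class contains body_class."""

    # Tags that never have a closing tag, so they must not affect nesting depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, body_class):
        super().__init__()
        self.body_class = body_class
        self.depth = 0  # greater than 0 while inside the article body element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif self.body_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


class ExampleNewsSource:
    """Hypothetical source; the selector below is the part that breaks
    whenever the site changes its article markup."""

    BODY_CLASS = "article-body"  # assumed class name, specific to one site

    def extract_body(self, article_html):
        parser = BodyExtractor(self.BODY_CLASS)
        parser.feed(article_html)
        parser.close()
        return " ".join(parser.chunks)
```

Every source needs its own version of `BODY_CLASS` (or something more elaborate), which is exactly the part I would like to stop maintaining by hand.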

Is there a service (preferably free) that already aggregates multiple news sources with the full article content (not just a summary)? If not, do you have any suggestions for handling multiple sources with different HTML structures that may change without notice?

A: 

I know this isn't a great answer, but there is a startup here in Colorado, whose name I forget, that takes unstructured or semi-structured data and parses it into a structured format. If you search the coloradostartups blog for 'data', you might find it.

ybakos
A: 

Spinn3r is a paid service that does what you want. If you are from academia, they might give you free access to their data.

MrAnonymous