I'm working on a free web application that analyzes top news stories throughout the day and provides statistics. Most news websites offer RSS feeds, which work fine for finding out which stories to retrieve. The problems arise when I try to get the full story from the news website itself. At the moment, I have a separate NewsSource class for each source (CNN, NY Times, etc.) that reads the appropriate RSS feed(s), follows each link, and strips out the article body. This is tedious, and it becomes unmanageable whenever a news website changes the HTML structure of its articles.
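To make the setup concrete, here is a minimal Python sketch of the kind of per-source design described above. It is illustrative only, not the actual code: `ExampleSource`, the feed URL, the `article-body` class name, and the injectable `fetcher` parameter are all hypothetical, and the extractor assumes reasonably well-formed HTML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def fetch(url):
    """Download a URL and return its text (assumes UTF-8 for simplicity)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def rss_links(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link", "").strip() for item in root.iter("item")]


class BodyExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting depth inside the matched element
        self.done = False   # set once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


class NewsSource:
    """Base class: one subclass per site, each with its own feed and markup."""

    feed_url = ""
    body_class = ""  # the site-specific part that breaks when the HTML changes

    def __init__(self, fetcher=fetch):
        self.fetcher = fetcher

    def extract_body(self, html):
        parser = BodyExtractor(self.body_class)
        parser.feed(html)
        parser.close()
        # Collapse runs of whitespace left over from the markup.
        return " ".join("".join(parser.chunks).split())

    def articles(self):
        """Yield (link, body_text) for every item in the feed."""
        for link in rss_links(self.fetcher(self.feed_url)):
            yield link, self.extract_body(self.fetcher(link))


class ExampleSource(NewsSource):
    # Hypothetical values; a real source needs its own feed URL and class name.
    feed_url = "https://news.example.com/rss.xml"
    body_class = "article-body"
```

In this sketch the only site-specific knowledge is `feed_url` and `body_class`, which is exactly the part that silently stops working when a site renames or restructures its article markup.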
Is there a service (preferably free) that already aggregates multiple news sources with the full article content (not just a summary)? If not, do you have any suggestions for handling multiple sources with different HTML structures that may change without notice?