I'm looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore everything else on the page. In a nutshell, I'm reading an RSS feed programmatically from Google News, and I'm interested in scraping the actual content of the underlying articles. On my first attempt I take the URLs from the RSS feed, follow them, and scrape the HTML from each page. This clearly results in a lot of "noise": HTML tags, headers, navigation, and so on, basically everything that is unrelated to the actual content of the article.
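For reference, here's roughly what that first attempt looks like (a minimal Python sketch, assuming the feedparser, requests, and BeautifulSoup libraries; the feed URL is just a placeholder):

```python
import feedparser
import requests
from bs4 import BeautifulSoup

FEED_URL = "https://news.google.com/rss"  # placeholder Google News feed

def fetch_articles(feed_url):
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries:
        resp = requests.get(entry.link, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        # Drop scripts/styles, then dump all remaining text.
        # This is where the "noise" (nav, headers, footers) comes in.
        for tag in soup(["script", "style"]):
            tag.decompose()
        articles.append((entry.title, soup.get_text(separator=" ", strip=True)))
    return articles
```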
Now, I understand this is an extremely difficult problem to solve in general; doing it perfectly would theoretically involve writing a parser for every website out there. What I'm interested in is an algorithm (I'd even settle for an idea) for maximizing the actual article content I get when I download a page and minimizing the amount of noise.
A couple of additional notes:
- Scraping the HTML was simply my first attempt; I'm not sold that it's the best approach.
- I don't want to write a parser for every website I come across; I have to accept the unpredictability of whatever Google provides through the RSS feed.
- I know whatever algorithm I end up with won't be perfect, but I'm interested in the best possible solution.
Any ideas?