This is not really a programming question, more of an algorithmic one.
The problem: Finding the "content" section of an HTML page.
By "content" I mean the part of the DOM that holds the page content as seen by humans, without the noise: simply the page's actual content. I know the problem is not well defined, but let's continue...

For example, on blog sites this is usually easy to see: when you browse to a specific post there are usually some toolbars at the top of the page, maybe some navigation elements on the left-hand side, and then the div that contains the content. Trying to figure this out from the HTML can be tricky. Luckily, however, most blogs have RSS feeds, and in the feed entry for this specific post you'd find a <description> section (or <content:encoded>), which is exactly what you want.

So, to refine the definition: the content is the element on the page that contains the interesting part, with all the ads, navigation elements, etc. removed. Finding content on blogs is therefore relatively easy, assuming they have RSS. The same goes for other sites that provide RSS.
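To make the RSS idea concrete, here is a minimal sketch, assuming an RSS 2.0 feed where each post is an <item> whose <link> matches the post URL. The function name feed_item_content is hypothetical, and preferring <content:encoded> over <description> is my assumption (the former usually carries the full body, the latter is often only a summary).

```python
import xml.etree.ElementTree as ET

# Namespace used by the RSS "content" module for <content:encoded>.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"


def feed_item_content(rss_xml, post_url):
    """Return the feed's body for the post at post_url, or None if not found.

    Hypothetical helper: assumes an RSS 2.0 layout (<item> with <link>).
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link != post_url:
            continue
        # Prefer the full body if the feed provides it (assumption),
        # otherwise fall back to <description>.
        body = item.findtext(CONTENT_ENCODED)
        if body is None:
            body = item.findtext("description")
        return body
    return None
```

The returned string is typically an HTML fragment, so it could then be matched against the page's DOM to locate the containing element.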
What about news sites? In many cases news sites have RSS, but not always. How does one find the content on news sites then? What about more general sites? Many web pages (of course not all of them) have a content section and other sections. Can you think of a good algorithm to find the sections that are "interesting" versus the less interesting ones? Perhaps by telling apart the sections that change from those that do not?
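As a rough illustration of the "changing versus non-changing" idea, here is a sketch under one particular reading of it: take two pages from the same site and treat any text block that appears in both as navigation or other boilerplate. The names text_blocks and changing_blocks are hypothetical, and exact string matching is a deliberately naive assumption; it is only meant to show the shape of the approach, not a robust solution.

```python
from html.parser import HTMLParser


class _TextBlocks(HTMLParser):
    """Collect whitespace-normalized text runs, ignoring script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the page's text runs in document order."""
    parser = _TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def changing_blocks(page_html, sibling_html):
    """Text blocks of page_html that do not also appear in sibling_html.

    Assumption: sibling_html is another page from the same site, so
    anything shared between the two is template, not content.
    """
    shared = set(text_blocks(sibling_html))
    return [block for block in text_blocks(page_html) if block not in shared]
```

With more than one sibling page the shared set could be built from all of them, which should make the template detection less sensitive to a single unusual page.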
Hope I've made myself clear... Thanks!