views:

95

answers:

2

This is not really a programming question, more of an algorithmic one.

The problem: Finding the "content" section of an HTML page.

By "content" I mean the dom that contains the page content as seen by humans, without the noise, simply the "page actual content". I know the problem is not well defined, but let's continue... For example in blog sites, this is is usually easy, when browsing to a specific post you usually have some toolbars at the top of the page, maybe some navigation elements on the LHS and then you have the div that contains the content. Trying to figure this out from the HTML can be tricky. Luckily, however, most blogs have RSS feeds and in the feed for this specific post you'd find a <description> section (or <content:encoded>) and this is exactly what you want. So, to refine the definition of content, this is the actual thing on the page that contains the interesting part, removing all the ads, navigation elements etc. So finding content from blogs is relatively easy, assuming they have RSS. Same goes for other RSS supportive sites.

What about news sites? In many cases news sites have RSS, but not always. How does one find content on news sites then? What about more general sites? Many web pages (though of course not all of them) have a content section and other sections. Can you think of a good algorithm to find the sections that are "interesting" vs. the less interesting ones? Perhaps by distinguishing the sections that change from those that do not?

Hope I've made myself clear... Thanks!

+1  A: 

I haven't done this, but this would be my general approach.

As you indicate, the lack of structure in the visible content parts of HTML (i.e. there are no tags such as header, navigation or ads) makes it harder to home in on the key part of the page. My approach would be to first remove distinct elements which you have definitely decided are not interesting. A possible list of exclusions could be:

  • meta elements such as !doctype, head (take the title as a separate piece of data)
  • dynamic elements such as object, embed, applet, script
  • images (depending on whether you want to retain them or not), i.e. img
  • form elements, i.e. form, input, textarea, label, legend, select, option
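
A rough sketch of this first pass, assuming Python with BeautifulSoup (any DOM library would do); the tag list simply mirrors the exclusions above, and the title is captured before the head element is dropped:

    from bs4 import BeautifulSoup

    NOISE_TAGS = [
        "head", "object", "embed", "applet", "script",
        "img", "form", "input", "textarea", "label", "legend", "select", "option",
    ]

    def strip_noise_tags(html):
        soup = BeautifulSoup(html, "html.parser")
        # Keep the title as a separate piece of data before removing head.
        title = soup.title.get_text(strip=True) if soup.title else ""
        for tag_name in NOISE_TAGS:
            for tag in soup.find_all(tag_name):
                tag.decompose()  # remove the element and everything inside it
        return title, soup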

A second pass could then start to exclude commonly occurring div or ul id/class names, and all tags within them, such as:

  • header, footer, meta
  • nav, navigation, topnav, sidebar
  • ad, ads, adu (and other names commonly used for ads)
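
Continuing the sketch above, the second pass might drop any div or ul whose id or class matches a blocklist of decoration names (the list is illustrative, not exhaustive):

    NOISE_NAMES = {"header", "footer", "meta", "nav", "navigation", "topnav",
                   "sidebar", "ad", "ads", "adu"}

    def strip_noise_containers(soup):
        def is_noise(tag):
            names = {c.lower() for c in (tag.get("class") or [])}
            if tag.get("id"):
                names.add(tag["id"].lower())
            return bool(names & NOISE_NAMES)

        noisy = [t for t in soup.find_all(["div", "ul"]) if is_noise(t)]
        noisy_ids = {id(t) for t in noisy}
        for tag in noisy:
            # Removing the outermost matching container is enough; skip nested ones.
            if not any(id(p) in noisy_ids for p in tag.parents):
                tag.decompose()
        return soup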

This will hopefully remove a significant amount of decoration from the page. The next challenge is to try to identify the main content from what's left, and I would suggest initially assuming that the site author is using semantic HTML properly, and so is principally using the h1 and h2 heading tags and the p paragraph tag.

To identify content, I would look for any heading tag which is then followed by one or more paragraph tags. (This may be h2 for your main content; the h1 tag is often (and arguably incorrectly) used to display the site name or logo, but this will hopefully have been eliminated by excluding the header parts of the page.) Each subsequent paragraph should be added to the current content until you reach a break, which could either be the end of the enclosing div or td element, or a heading element of the same level you started from.
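
In code, that gathering step might look like this (again a sketch, working on the cleaned-up soup from the earlier passes):

    def collect_candidates(soup):
        candidates = []
        for heading in soup.find_all(["h1", "h2"]):
            paragraphs = []
            for sibling in heading.find_next_siblings():
                if sibling.name in ("h1", "h2"):  # next section starts, stop here
                    break
                if sibling.name == "p":
                    paragraphs.append(sibling.get_text(" ", strip=True))
            if paragraphs:
                candidates.append({"heading": heading.get_text(" ", strip=True),
                                   "paragraphs": paragraphs})
        return candidates

Staying among siblings means the run naturally ends at the close of the enclosing div or td.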

As there may still be several sets of content that you've gathered on the page (maybe the main content plus the blurb about the author), you need to test and refine a decision-making step here which chooses the most likely candidate. This will often simply be the largest, both in terms of length and number of paragraph elements used.
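
That decision step could start out as nothing more than a ranking by size, to be refined as you see more pages:

    def pick_main_content(candidates):
        if not candidates:
            return None
        # Prefer the block with the most paragraphs, then the most text overall.
        return max(candidates,
                   key=lambda c: (len(c["paragraphs"]),
                                  sum(len(p) for p in c["paragraphs"])))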

As you gather more examples of content, you can add supporting measures to your algorithm; this might be that you notice many of the pages use div id="content" or id="maincontent". It may also be useful to retain the secondary items of content that you detected, so that if certain sites have a curious way of structuring their content, then once you've added a catch for it to your algorithm, it can be re-run against just that site's content.
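
For instance, a cheap supporting measure is to check those well-known container ids before falling back on the heuristics above (the id list here is just a starting point):

    COMMON_CONTENT_IDS = ("content", "maincontent")

    def find_labelled_content(soup):
        for container_id in COMMON_CONTENT_IDS:
            container = soup.find(id=container_id)
            if container is not None:
                return container
        return None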

Alistair Knock
Thanks for the lengthy reply, I think I'll use at least some of the ideas you've given. I've been trying to avoid heuristics as much as I could, but I'm starting to realize there's no escape...
Ran
That's part of both the problem and the success of the HTML spec: the fact that it is generic means it isn't sufficiently semantically rich to cover the complex provision of information (compared to simple, academic texts) demanded by today's websites. I still think a nav tag would've helped enormously in being able to isolate a large part of the page...
Alistair Knock
A: 

A well-structured site will have its common areas reusing the same code, e.g. navigation, header, etc.

When you have a target page that you would like to analyze, try browsing through a few other pages under the same domain/subdomain and find the elements which are common to all of them. Those are the noise you want to get rid of.

Then take a look at what remains to see if any noise slipped through. Once you have collected a reasonable amount of such data, try to find patterns in it. Refine your logic and repeat.
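
A rough sketch of that idea, assuming Python with BeautifulSoup and that pages is a list of raw HTML strings fetched from the same site, with the target page first. Container elements whose tag/id/class signature shows up on every page are treated as shared chrome (navigation, header, footer) and removed from the target:

    from bs4 import BeautifulSoup

    def element_signature(tag):
        # Identify a container by its tag name, id and class list.
        return (tag.name, tag.get("id", ""), tuple(tag.get("class") or []))

    def strip_shared_chrome(pages):
        soups = [BeautifulSoup(html, "html.parser") for html in pages]
        signatures = [{element_signature(t) for t in s.find_all(["div", "ul", "table"])}
                      for s in soups]
        shared = set.intersection(*signatures)  # containers present on every page
        target = soups[0]
        to_remove = [t for t in target.find_all(["div", "ul", "table"])
                     if element_signature(t) in shared]
        removed_ids = {id(t) for t in to_remove}
        for tag in to_remove:
            # Skip elements nested inside another element already being removed.
            if not any(id(p) in removed_ids for p in tag.parents):
                tag.decompose()
        return target

What is left in the returned soup is the part of the target page that does not repeat across the site, which is a reasonable first approximation of its unique content.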

Bill Yang