views: 636
answers: 11

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

Note: Ideally, the method would work with both well-formed and terrible markup, whether somebody uses paragraph tags to make paragraphs or just a series of break tags.

+1  A: 

I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"

Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.

The content area is almost always the area with the greatest width on the page.
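This "most text, fewest links" heuristic can be sketched in a few lines of stdlib Python. The class names and the link-text penalty weight below are my own choices, not from the answer: each div is scored by its total text length, with text inside anchors counted against it.

```python
from html.parser import HTMLParser

class BlockScorer(HTMLParser):
    """Tally, per div, how much text it contains and how much of that
    text sits inside <a> tags (link-heavy blocks are likely nav or ads)."""
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.div_count = 0
        self.open_divs = []   # stack of ids of currently open divs
        self.totals = {}      # div id -> [text_chars, link_text_chars]

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        elif tag == "div":
            self.div_count += 1
            self.open_divs.append(self.div_count)
            self.totals[self.div_count] = [0, 0]

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        n = len(data.strip())
        for div in self.open_divs:  # text counts toward every enclosing div
            self.totals[div][0] += n
            if self.in_link:
                self.totals[div][1] += n

def best_div(html, link_penalty=2):
    """Return the id (in document order) of the highest-scoring div."""
    scorer = BlockScorer()
    scorer.feed(html)
    return max(scorer.totals,
               key=lambda d: scorer.totals[d][0]
                             - link_penalty * scorer.totals[d][1])
```

On a page where one div is nothing but links (navigation) and another is prose with a single link, the prose div wins; tuning `link_penalty` is where the real work would be.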

Robert Harvey
A: 

I would probably start with the title and anything else in a head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume the page title has an ID or a unique class.

orthod0ks
I am not looking for the title. I'm looking for the "body," such as an article, or a blog post.
Jonathan Sampson
+10  A: 

Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.

http://www.w3.org/TR/CSS2/media.html

I would try to read this style, and then scrape whatever is left visible.
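A minimal sketch of the first step, finding the print stylesheet(s) a page declares, using only the standard library (the sample markup is hypothetical):

```python
from html.parser import HTMLParser

class PrintStyleFinder(HTMLParser):
    """Collect hrefs of stylesheets declared for the 'print' medium."""
    def __init__(self):
        super().__init__()
        self.print_sheets = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if (tag == "link"
                and "stylesheet" in a.get("rel", "").lower()
                and "print" in a.get("media", "").lower()):
            self.print_sheets.append(a.get("href"))

head = '''<head>
  <link rel="stylesheet" media="screen" href="/screen.css">
  <link rel="stylesheet" media="print" href="/print.css">
</head>'''
finder = PrintStyleFinder()
finder.feed(head)
# finder.print_sheets is now ['/print.css']; the next step would be to
# fetch that sheet and keep only the elements it leaves visible.
```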

Ian Jacobs
+1 Excellent idea.
Cal Jacobson
That would work great for, say, a website owned by a publication company, but not so much on smaller blogs or those which are poorly marked up.
Shadow
Very true; perhaps in coordination with some kind of Bayesian filter, then?
Cal Jacobson
I doubt there's going to be one grand unifying approach to consistently solve this problem. There are too many 'iffy' (to put it politely) webpages out there that don't follow any kind of sane standard.
Ian Jacobs
Sorry, this doesn't address the many pages that don't use the print style. And frankly, I'm interested in viewing the markup, not the effect that the styles would have on the markup. Thank you for the suggestion, though. It is good, but not exactly what I'm looking for.
Jonathan Sampson
+2  A: 

I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.

You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.

You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
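The "several checks combined into a metric" idea can be sketched as a simple weighted sum. The feature names and weights below are made up for illustration:

```python
def combined_score(measurements, weights=None):
    """Fold several heuristic measurements into one score.
    Positive weights reward a signal, negative weights penalize it."""
    weights = weights or {
        "text_chars": 1.0,        # raw amount of text in the block
        "link_text_chars": -2.0,  # link-heavy text suggests nav/ads
        "quirks_mode": -100.0,    # distrust DOM checks on broken pages
    }
    return sum(w * measurements.get(k, 0) for k, w in weights.items())

# Hypothetical measurements for two candidate blocks:
article = {"text_chars": 900, "link_text_chars": 40, "quirks_mode": 0}
sidebar = {"text_chars": 150, "link_text_chars": 120, "quirks_mode": 0}
```

The candidate with the highest combined score is taken as the content block; the weights themselves would need tuning against real pages.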

Shadow
+2  A: 

You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element, such as a div, as a document) and gather some of its properties and convert them to a vector. (As other people suggested, these could be the number of words, number of links, number of images; the more the better.)

First, start with a large set of documents (100-1000) for which you have already chosen which part is the main part. Then use this set to train your SVM.

Then, for each new document, you just need to convert it to a vector and pass it to the SVM.

This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.

And if you are interested, you can find more details in Introduction to Information Retrieval (freely available online).

Szere Dyeri
One of these properties could be the font size. You can retrieve the font size by looking it up in the CSS. That way you will often get the visually most distinctive fragments.
sebastiangeiger
You can set up several criteria and assign them different weights (e.g. a heavy weight for the "printable" content in Ian Jacobs's solution). Arc90's Readability uses the number of P tags, commas, etc. to calculate an overall score. http://lab.arc90.com/experiments/readability/ and the script is at http://lab.arc90.com/experiments/readability/js/readability.js
streetpc
A downside of this solution is still that you have to have a training set containing sites where the important areas are annotated. It will work automatically later, but you have to code it and train it upfront. A big advantage of this solution, if you want a public project, is that you could "crowdsource" the classification, so your users could say what's important to them and your system could adapt to unforeseen circumstances.
sebastiangeiger
+11  A: 

think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

I would probably try something like this:

  • open URL
  • read in all links to same website from that page
  • follow all links and build a DOM tree for each URL (HTML file)
  • this should help you come up with redundant contents (included templates and such)
  • compare DOM trees for all documents on same site (tree walking)
  • strip all redundant nodes (i.e. repeated navigational markup, ads and such things)
  • try to identify similar nodes and strip if possible
  • find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
  • add as candidate for further processing

This approach seems pretty promising because it would be fairly simple to do, yet still have good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.

This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
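The redundancy-stripping step can be sketched with plain set arithmetic over extracted text blocks, treating any block that appears on more than one page of the site as template chrome (a simplification of the tree-walking comparison described above):

```python
from collections import Counter

def unique_blocks(pages):
    """Given a list of pages, each a list of extracted text blocks (e.g.
    one per DOM node), return the blocks unique to each page. Blocks
    shared across pages are assumed to be nav, footers, or ads."""
    counts = Counter(block for page in pages for block in set(page))
    return [[b for b in page if counts[b] == 1] for page in pages]

site = [
    ["Home | About", "Article one body text", "(c) 2009 Example"],
    ["Home | About", "Article two body text", "(c) 2009 Example"],
]
```

Here only the article bodies survive; the shared navigation and footer are stripped from every page.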

none
A: 

I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.

You could look for the first and last elements containing sentences with punctuation and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
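A rough sketch of this, operating on a list of already-extracted text blocks. The sentence test is deliberately crude (word count plus terminal punctuation):

```python
import re

def looks_like_sentence(text):
    """True if a fragment reads like running prose: several words
    and terminal sentence punctuation."""
    return len(text.split()) >= 4 and bool(re.search(r"[.!?]$", text.strip()))

def content_span(blocks):
    """Everything between the first and last sentence-like block.
    Short headlines caught in the middle survive, as suggested above."""
    hits = [i for i, b in enumerate(blocks) if looks_like_sentence(b)]
    return blocks[hits[0]:hits[-1] + 1] if hits else []

page = ["Home", "About", "Contact",
        "This is the opening paragraph of the article.",
        "A Mid-Article Headline",
        "And this sentence closes the story with a final thought.",
        "(c) 2009", "Privacy policy"]
```

The menu and footer fall outside the span, while the headline between the two paragraphs is kept.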

JacquesB
This is a good idea, but do you think it would qualify comments on the remote page as well?
Jonathan Sampson
A: 

Today most news/blog websites are built on a blogging platform, so I would create a set of rules by which to search for content. For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.

WordPress posts are marked by:

<div class="entry">
    ...
</div>

Blogspot posts are marked by:

<div class="post-body">
    ...
</div>

If the search by CSS classes fails, you could turn to the other solutions, identifying the biggest chunk of text and so on.
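A sketch of the rule-table idea. The regex is a stand-in (a real implementation should use an HTML parser), and the only class names assumed are the two from the answer:

```python
import re

# Platform -> CSS class that wraps the post body.
PLATFORM_RULES = {
    "wordpress": "entry",
    "blogspot": "post-body",
}

def extract_by_class(html, cls):
    """Grab the contents of the first div with the given class,
    or None if no rule matches."""
    pattern = rf'<div class="{re.escape(cls)}">(.*?)</div>'
    match = re.search(pattern, html, re.S)
    return match.group(1).strip() if match else None
```

You would try each platform's rule in turn and fall back to the generic heuristics only when all rules return None.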

Fiur
A: 

While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a giveaway in the markup, too.

A diff between articles/posts/threads would be a good filter for finding out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day" boxes, or banners). The structure of the content may be very similar across multiple pages, so don't rely on structural differences too much.

Alan
+10  A: 

Readability does a decent job of exactly this.

It's open source and posted on Google Code.


UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.

Colin Pickard
This is very cool. Exactly what I am wanting to duplicate. I'll have to look into it a bit.
Jonathan Sampson
This is amazing and you are amazing
Chris McCall
I didn't write it :) But it is a very good idea and execution
Colin Pickard
my first bounty :)
Colin Pickard
You deserved it, Colin :)
Jonathan Sampson
A: 

Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.

Kristopher Johnson