views: 133
answers: 2

For a while, I've been trying to find a way to intelligently extract the "relevant" text from a URL, eliminating the text related to ads and all the other clutter. After several months of research, I gave up on it as a problem that can't be solved accurately. (I tried several approaches, but none were reliable.)

A week back, I stumbled across Readability - a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm that's smart enough to extract the relevant text.

Does anyone know how they do it? Or how I could do it reliably?

+2  A: 

Readability is a JavaScript bookmarklet, meaning it's client-side code that manipulates the DOM. Look at the JavaScript and you should be able to see what's going on.

Readability's workflow and code:

/*
 *  1. Prep the document by removing script tags, css, etc.
 *  2. Build readability's DOM tree.
 *  3. Grab the article content from the current dom tree.
 *  4. Replace the current DOM tree with the new one.
 *  5. Read peacefully.
 */

javascript: (function () {
    // Display settings read by readability.js once it loads.
    readConvertLinksToFootnotes = false;
    readStyle = 'style-newspaper';
    readSize = 'size-medium';
    readMargin = 'margin-wide';
    // Inject the main Readability script; the random query string
    // busts the browser cache so the latest version is fetched.
    _readability_script = document.createElement('script');
    _readability_script.type = 'text/javascript';
    _readability_script.src = 'http://lab.arc90.com/experiments/readability/js/readability.js?x=' + (Math.random());
    document.documentElement.appendChild(_readability_script);
    // Inject the screen stylesheet...
    _readability_css = document.createElement('link');
    _readability_css.rel = 'stylesheet';
    _readability_css.href = 'http://lab.arc90.com/experiments/readability/css/readability.css';
    _readability_css.type = 'text/css';
    _readability_css.media = 'all';
    document.documentElement.appendChild(_readability_css);
    // ...and a separate stylesheet for printing.
    _readability_print_css = document.createElement('link');
    _readability_print_css.rel = 'stylesheet';
    _readability_print_css.href = 'http://lab.arc90.com/experiments/readability/css/readability-print.css';
    _readability_print_css.media = 'print';
    _readability_print_css.type = 'text/css';
    document.getElementsByTagName('head')[0].appendChild(_readability_print_css);
})();

And if you follow the JS and CSS files that the above code pulls in, you'll get the whole picture:

http://lab.arc90.com/experiments/readability/js/readability.js (this is pretty well commented, interesting reading)

http://lab.arc90.com/experiments/readability/css/readability.css

Moin Zaman
+2  A: 

There's no 100% reliable way to do this, of course. You can have a look at the Readability source code here.

Basically, what they're doing is trying to identify positive and negative blocks of text. Positive identifiers (e.g. in div IDs and class names) would be something like:

  • article
  • body
  • content
  • blog
  • story

Negative identifiers would be:

  • comment
  • discuss

They also keep lists of "unlikely" and "maybe" candidates. What they do is determine which block is most likely to be the main content of the page; see line 678 in the Readability source. This is done mostly by analyzing the length of paragraphs, their identifiers (see above), and their position in the DOM tree (e.g. whether the paragraph is a last child node), and then stripping out everything unnecessary, removing formatting, etc. A simplified sketch of this scoring idea is below.
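
For illustration, here's a minimal, Readability-inspired scoring sketch in plain JavaScript. The regexes, weights, and function names are my own simplified assumptions, not Arc90's actual values; the real readability.js uses much longer identifier lists and a more elaborate scoring pass.

function scoreBlock(el) {
    var score = 0;
    var hints = (el.id || '') + ' ' + (el.className || '');

    // Reward or penalize a block based on its id/class hints
    // (see the positive and negative identifier lists above).
    if (/article|body|content|blog|story/i.test(hints)) score += 25;
    if (/comment|discuss/i.test(hints)) score -= 25;

    // Long, comma-rich text is usually prose, not navigation.
    var text = el.textContent || '';
    score += Math.min(Math.floor(text.length / 100), 3);
    score += text.split(',').length - 1;

    return score;
}

function findMainContent(doc) {
    // Pick the highest-scoring candidate container.
    var candidates = doc.getElementsByTagName('div');
    var best = null, bestScore = -Infinity;
    for (var i = 0; i < candidates.length; i++) {
        var s = scoreBlock(candidates[i]);
        if (s > bestScore) { bestScore = s; best = candidates[i]; }
    }
    return best;
}

Calling findMainContent(document) on an article page would typically return the div holding the body text, which you could then clean up and re-render, roughly as Readability does.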

The code has 1792 lines. It does seem like a non-trivial problem, so maybe you can get your inspiration from there.

slhck
@slhck - Do you happen to know if their code is open source and if it can be used in commercial products?
It says that the source code is released under the Apache License 2.0, which means you can use it, distribute it, and modify and distribute modified versions of it. I'm not too clear on the details, though.
slhck
@bobsmith Apple used it in the latest version of Safari. They credited Arc90 in the release notes.
Sidnicious