views: 133
answers: 2

For a while, I've been trying to find a way to intelligently extract the "relevant" text from a URL, eliminating the text related to ads and all the other clutter. After several months of research, I gave up on it as a problem that can't be solved accurately. (I tried several approaches, but none were reliable.)

A week back, I stumbled across Readability - a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm that's smart enough to extract the relevant text.

Does anyone know how they do it? Or how I could do it reliably?

+2  A: 

Readability is a JavaScript bookmarklet, meaning it's client-side code that manipulates the DOM. Look at the JavaScript and you should be able to see what's going on.

Readability's workflow and code:

/*
 *  1. Prep the document by removing script tags, css, etc.
 *  2. Build readability's DOM tree.
 *  3. Grab the article content from the current dom tree.
 *  4. Replace the current DOM tree with the new one.
 *  5. Read peacefully.
 */

javascript: (function () {
    // Display settings read by readability.js once it loads.
    readConvertLinksToFootnotes = false;
    readStyle = 'style-newspaper';
    readSize = 'size-medium';
    readMargin = 'margin-wide';
    // Inject the main Readability script; the random query string
    // busts the browser cache so the latest version is fetched.
    _readability_script = document.createElement('script');
    _readability_script.type = 'text/javascript';
    _readability_script.src = 'http://lab.arc90.com/experiments/readability/js/readability.js?x=' + (Math.random());
    document.documentElement.appendChild(_readability_script);
    // Inject the screen stylesheet...
    _readability_css = document.createElement('link');
    _readability_css.rel = 'stylesheet';
    _readability_css.href = 'http://lab.arc90.com/experiments/readability/css/readability.css';
    _readability_css.type = 'text/css';
    _readability_css.media = 'all';
    document.documentElement.appendChild(_readability_css);
    // ...and a separate stylesheet for printing.
    _readability_print_css = document.createElement('link');
    _readability_print_css.rel = 'stylesheet';
    _readability_print_css.href = 'http://lab.arc90.com/experiments/readability/css/readability-print.css';
    _readability_print_css.media = 'print';
    _readability_print_css.type = 'text/css';
    document.getElementsByTagName('head')[0].appendChild(_readability_print_css);
})();

And if you follow the JS and CSS files that the above code pulls in, you'll get the whole picture:

http://lab.arc90.com/experiments/readability/js/readability.js (this is pretty well commented, interesting reading)

http://lab.arc90.com/experiments/readability/css/readability.css

Moin Zaman
+2  A: 

There's no 100% reliable way to do this, of course. You can have a look at the Readability source code here.

Basically, what they're doing is trying to identify positive and negative blocks of text. Positive identifiers (e.g. in div IDs and class names) would be something like:

  • article
  • body
  • content
  • blog
  • story

Negative identifiers would be:

  • comment
  • discuss

They also keep lists of "unlikely" and "maybe" candidates. What they do is determine which block is most likely to be the main content of the page; see line 678 in the Readability source. This is done mostly by analyzing the length of paragraphs, their identifiers (see above), and their position in the DOM tree (e.g. whether the paragraph is a last child node), and then stripping out everything unnecessary, removing formatting, etc. A simplified sketch of this scoring idea is below.
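
For illustration, here's a minimal, Readability-inspired scoring sketch in plain JavaScript. The regexes, weights, and function names are my own simplified assumptions, not Arc90's actual values; the real readability.js uses much longer identifier lists and a more elaborate scoring pass.

function scoreBlock(el) {
    var score = 0;
    var hints = (el.id || '') + ' ' + (el.className || '');

    // Reward or penalize a block based on its id/class hints
    // (see the positive and negative identifier lists above).
    if (/article|body|content|blog|story/i.test(hints)) score += 25;
    if (/comment|discuss/i.test(hints)) score -= 25;

    // Long, comma-rich text is usually prose, not navigation.
    var text = el.textContent || '';
    score += Math.min(Math.floor(text.length / 100), 3);
    score += text.split(',').length - 1;

    return score;
}

function findMainContent(doc) {
    // Pick the highest-scoring candidate container.
    var candidates = doc.getElementsByTagName('div');
    var best = null, bestScore = -Infinity;
    for (var i = 0; i < candidates.length; i++) {
        var s = scoreBlock(candidates[i]);
        if (s > bestScore) { bestScore = s; best = candidates[i]; }
    }
    return best;
}

Calling findMainContent(document) on an article page would typically return the div holding the body text, which you could then clean up and re-render, roughly as Readability does.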

The code has 1792 lines. It does seem like a non-trivial problem, so maybe you can get your inspiration from there.

slhck
@slhck - Do you happen to know if their code is open source and if it can be used in commercial products?
It says that the source code is released under the Apache License 2.0, which means you can use it, distribute it, and modify and distribute modified versions of it. I'm not too clear on the details, though.
slhck
@bobsmith Apple used it in the latest version of Safari. They credited Arc90 in the release notes.
Sidnicious