views: 546

answers: 6

Hi,

I'm looking for a package / module / function etc. that is approximately the Python equivalent of Arc90's readability.js

http://lab.arc90.com/experiments/readability

http://lab.arc90.com/experiments/readability/js/readability.js

so that I can give it some input.html and get back a cleaned-up version of that HTML page's "main text". I want this so that I can use it on the server side (unlike the JS version, which runs only in the browser).

Any ideas?

PS: I have tried Rhino + env.js and that combination works, but the performance is unacceptable: it takes minutes to clean up most HTML content :( (I still couldn't find out why there is such a big performance difference).

A: 

Why not try using Google V8/Node.js instead of Rhino? It should be acceptably fast.

Vinay Sajip
Does env.js run on V8/Node.js so that I have a browser-like environment?
Emre Sevinç
That's exactly what http://flockfeeds.com/ does.
Sridhar Ratnakumar
A: 

I think BeautifulSoup is the best HTML parser for Python. But you still need to figure out what the "main" part of the site is.

If you're only parsing a single domain, it's fairly straightforward, but finding a pattern that works for any site is not so easy.

Maybe you can port the readability.js approach to Python? A rough starting point is sketched below.
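
Not part of the original answer, just a rough sketch (using the bs4 API) of that starting point: it only strips the obviously non-content tags, so the hard part, deciding which block is the "main text", is still left to you:

from bs4 import BeautifulSoup

def crude_main_text(html):
    """Very rough first pass: strip non-content tags, return the remaining text."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove elements that never contain article text.
    for tag in soup(['script', 'style', 'head', 'iframe', 'noscript']):
        tag.decompose()
    # Return all remaining visible text; a real solution would score blocks
    # (e.g. by text density, as readability.js does) and keep only the winner.
    return soup.get_text(separator='\n', strip=True)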

eikes
+1  A: 

I have done some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup prior to applying the algorithm, like removing head/script/iframe elements, hidden elements, etc., but this was the core of it.

Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a high link-to-text ratio (i.e. navigation bars, menus, ads, etc.):

def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link-to-text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml element tree (``drop_tree`` is lxml-specific),
                 e.g. from ``lxml.html.fromstring``.
    :param min_links: Minimum number of links inside an element
                      before considering the block for deletion.
    :param ratio: Ratio of link text to all text above which an element
                  is considered for deletion.
    """
    def collapse(strings):
        # Join the non-empty, stripped text fragments into one string.
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        # Note: this counts text fragments inside anchors, not the anchors themselves.
        anchor_text = el.xpath('.//a//text()')
        anchor_count = len(anchor_text)
        anchor_text = collapse(anchor_text)
        text = collapse(el.xpath('.//text()'))
        anchor_chars = float(len(anchor_text))
        total_chars = float(len(text))
        if anchor_count > min_links and total_chars and anchor_chars / total_chars > ratio:
            el.drop_tree()

On the test corpus I used it actually worked quite well, but achieving high reliability will require a lot of tweaking.
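
For reference (this is not from the original answer), a minimal usage sketch, assuming lxml is installed and the function above is in scope; input.html stands in for whatever page you want to clean:

import lxml.html

raw_html = open('input.html').read()           # any saved page will do
tree = lxml.html.fromstring(raw_html)
link_list_discriminator(tree, min_links=2, ratio=0.5)
# Whatever survives the pruning is the candidate "main text" markup.
print(lxml.html.tostring(tree))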

Alec Thomas
+3  A: 

We've just launched a new natural language processing API over at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. And it's implemented in Python. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.
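
To illustrate the round-trip being described (POST the page, get plain text back), here is a hypothetical sketch; the endpoint URL and parameter names below are placeholders I made up, not the documented API, so check the site's docs for the real ones:

import requests

API_KEY = 'your-api-key'                          # placeholder
html = open('input.html').read()
resp = requests.post(
    'http://api.repustate.com/clean-html',        # hypothetical endpoint, not the documented one
    data={'api_key': API_KEY, 'html': html},
)
print(resp.text)                                  # the extracted text, per the answer above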

Martin
Hmm, looks promising! ;-) I'll give it a try. Are there any hard limits? How many pages can I process per day, etc.?
Emre Sevinç
Wow, I just used your site to input some URLs, and it extracted the articles perfectly.
PythonUser
A: 

@Emre, no limits on the API. Give it a shot and let me know if we can improve things.

Martin
+2  A: 

There's hn.py, via Readability's blog. Readable Feeds, an App Engine app, makes use of it.

I have bundled it as a pip-installable module here: http://github.com/srid/readability

Sridhar Ratnakumar