ansaurus

Question

Is there anything for Python that is like readability.js?

Answer 1

A:

Why not try using Google V8/Node.js instead of Rhino? It should be acceptably fast.

Vinay Sajip 2010-05-27 13:51:31

Does env.js run on V8/Node.js so that I have a browser-like environment?

Emre Sevinç 2010-05-27 14:12:08

That's exactly what http://flockfeeds.com/ does.

Sridhar Ratnakumar 2010-09-07 05:37:32

Answer 2

A:

I think BeautifulSoup is the best HTML parser for python. But you still need to figure out what the "main" part of the site is.

If you're only parsing a single domain, it's fairly straight forward, but finding a pattern that works for any site is not so easy.

Maybe you can port the readability.js approach to python?

eikes 2010-05-28 14:00:07

Answer 3

+1 A:

I have done some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup prior to applying the algorithm, like removing head/script/iframe elements, hidden elements, etc., but this was the core of it.

Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a heavy link to text ratio (ie. navigation bars, menus, ads, etc.):

def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link to text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: ElementTree object.
    :param min_links: Minimum number of links inside an element
                      before considering a block for deletion.
    :param ratio: Ratio of link text to all text before an element is considered
                  for deletion.
    """
    def collapse(strings):
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        anchor_text = el.xpath('.//a//text()')
        anchor_count = len(anchor_text)
        anchor_text = collapse(anchor_text)
        text = collapse(el.xpath('.//text()'))
        anchors = float(len(anchor_text))
        all = float(len(text))
        if anchor_count > min_links and all and anchors / all > ratio:
            el.drop_tree()

On the test corpus I used it actually worked quite well, but achieving high reliability will require a lot of tweaking.

Alec Thomas 2010-05-29 07:20:03

Answer 4

+3 A:

We've just launched a new natural language processing API over at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free so feel free to use to your heart's content. And it's implemented in python. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.

Martin 2010-05-31 19:47:57

Hmm, looks promising! ;-) I'll give it a try. Are there any hard limits? How many pages can I process per day, etc.?

Emre Sevinç 2010-06-01 07:47:50

Wow, I just used your site to input some url's, and it extracted the articles perfectly.

PythonUser 2010-08-03 17:37:43

Answer 5

A:

@Emre, no limits on the API. Give it a shot and let me know if we can improve things.

Martin 2010-06-01 13:50:57

Answer 6

+2 A:

hn.py via Readability's blog. Readable Feeds, an App Engine app, makes use of it.

I have bundled it as a pip-installable module here: http://github.com/srid/readability

Sridhar Ratnakumar 2010-09-07 01:11:09

ansaurus

tags:

views:

answers:

Is there anything for Python that is like readability.js?

related questions