views:

917

answers:

3

Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.

I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.

+1  A: 

What is meaningful and what is not depends on the semantics of the page. If the semantics are crappy, your code won't "guess" what is meaningful. I use Readability, which you linked in the comment, and I see that on many pages I try to read it doesn't provide any result, let alone a decent one.

If someone puts the content in a table, you're doomed. Try Readability on a phpBB forum and you'll see what I mean.

If you want to do it, go with a regexp on <p></p>, or parse the DOM.
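The regexp route can be sketched in a few lines. A minimal illustration (assuming well-formed, non-nested <p> tags; real pages often break this assumption, which is exactly why parsing the DOM is the more robust option):

```python
import re

def extract_paragraphs(html):
    """Pull the text of every <p>...</p> pair with a regexp.

    Assumes non-nested, well-formed <p> tags; attributes on the
    opening tag are tolerated. Deliberately fragile -- a real DOM
    parser handles the cases this misses.
    """
    pattern = re.compile(r'<p\b[^>]*>(.*?)</p>', re.IGNORECASE | re.DOTALL)
    strip_tags = re.compile(r'<[^>]+>')  # drop any tags left inside a paragraph
    return [strip_tags.sub('', body).strip() for body in pattern.findall(html)]

html = ('<div id="nav">links</div>'
        '<p class="body">First paragraph.</p>'
        '<p>Second <b>bold</b> one.</p>')
print(extract_paragraphs(html))
# ['First paragraph.', 'Second bold one.']
```

Note how the sidebar div is skipped entirely: everything outside a <p> pair is simply ignored, which is both the appeal and the limitation of this approach.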

zalew
If you look at the source, you'll see even StackOverflow uses tables for layout in some places!!
Jon Cage
But it has text in paragraphs, not in a td alone like crappy forums; and no need to shout!
zalew
Very true, I was just surprised that SO used tables for layout at all. Sure, tables are often more reliably rendered, but CSS with more divs and p's would be a better solution for readability (screen readers have more trouble with tables, for example).
Jon Cage
+4  A: 

Try the Beautiful Soup library for Python. It has very simple methods for extracting information from an HTML file.

Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to lay out a page that looks identical, let alone all the combinations you can use to convey the same information.

Was there a particular type of information you were trying to extract or some other end goal?

You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).

Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text) you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information..?
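That tree idea can be sketched with just the standard library. A rough sketch (html.parser stands in for a real DOM here; the 80%-of-text threshold and the use of the id attribute as a block label are arbitrary choices for illustration):

```python
from html.parser import HTMLParser

class BlockTextCounter(HTMLParser):
    """Track <div>/<p> nesting and total the text beneath each block."""
    def __init__(self):
        super().__init__()
        self.stack = []    # open [label, char_count] frames, outermost first
        self.blocks = []   # closed (label, char_count) pairs
        self.total = 0     # total text characters in the whole page

    def handle_starttag(self, tag, attrs):
        if tag in ('div', 'p'):
            # Use the id attribute as a label when present, else the tag name.
            self.stack.append([dict(attrs).get('id', tag), 0])

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        for frame in self.stack:   # text counts toward every enclosing block
            frame[1] += n

    def handle_endtag(self, tag):
        # Assumes properly nested markup: each </div> or </p> closes the
        # innermost open block.
        if tag in ('div', 'p') and self.stack:
            label, count = self.stack.pop()
            self.blocks.append((label, count))

def main_block(html, threshold=0.8):
    """Smallest div/p holding at least `threshold` of the page's text."""
    parser = BlockTextCounter()
    parser.feed(html)
    candidates = [b for b in parser.blocks if b[1] >= threshold * parser.total]
    return min(candidates, key=lambda b: b[1]) if candidates else None

page = ('<div id="nav">Home About Contact</div>'
        '<div id="content"><p>Lots and lots of meaningful article text here, '
        'easily the bulk of the page.</p></div>')
print(main_block(page))
# ('p', 75)
```

The inner 'p' wins over its parent 'content' div because, at equal text counts, the smaller (more specific) block is preferred; the navigation div never comes close to the threshold.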

[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a similar points system to spam assassin. Define some rules that attempt to classify the information. Some examples:

+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
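The rules above translate almost directly into code. A toy scorer, assuming each section arrives as a name, its own text, and a list of child text strings (the weights are the illustrative ones from the list, not tuned values):

```python
def score_section(name, text, children):
    """Apply the spam-assassin-style rules to one section.

    `children` is a list of child-element text strings; the weights
    mirror the example rules above and are not tuned.
    """
    score = 0
    score += len(text.split()) // 100                      # +1 per 100 words
    score += sum(1 for c in children if len(c.split()) > 100)  # +1 per big child
    if 'nav' in name.lower():
        score -= 1                                          # likely navigation
    if 'advert' in name.lower():
        score -= 2                                          # likely an advert
    return score

print(score_section('sidebar-nav', 'short link list', []))             # -1
print(score_section('main-content', 'word ' * 250, ['child ' * 150]))  # 3
```

The highest-scoring section would then be treated as the main content; more rules just mean more terms added to the running score.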

If you have a lot of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.

[EDIT2] Looking at Readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to understand tables better?

Jon Cage
This is what I had in mind, but I'm still surprised that there isn't a simple library or BeautifulSoup plugin that does this work for you, since I'd imagine that content extraction from an HTML page could use these kinds of rules without variation >90% of the time...
jamtoday
It really depends what you're after; just about every scraper I've written has been looking for lots of small snippets of information rather than larger blurbs of text (which are frequently generic information about the site).
Jon Cage
An additional interesting side note: the JavaScript-based "readability" script does content extraction (or, rather, selection) as well. It can be mined for ideas/algorithms too, although it's not always successful.
+3  A: 

Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/

It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).

Here's an example from the google code page:


# Import the Template class.
>>> from templatemaker import Template

# Create a Template instance.
>>> t = Template()

# Learn a Sample String.
>>> t.learn('<b>this and that</b>')

# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'

# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True

# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'

John Montgomery