I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.

+1  A: 

First, if you need to parse a web page, I would use HTMLAgilityPack to transform it into an XML document. That will speed everything up and lets you use a simple XPath expression to go directly to the BODY.

After that, iterate over all the divs (the Agility Pack can give you all the DIV elements as a list) and extract whatever you want.
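HTMLAgilityPack is a .NET library, but the same two steps (parse, jump to BODY, list the DIVs) can be sketched in Python. A rough analogue using only the standard library is shown below; real-world pages are rarely well-formed XML, so in practice you would parse with lxml or BeautifulSoup instead of ElementTree, which is used here only to keep the sketch dependency-free.

```python
# Parse the page, go straight to BODY, then collect every DIV under it.
# NOTE: ElementTree requires well-formed markup; this is an illustrative
# stand-in for HTMLAgilityPack, not an equivalent HTML parser.
import xml.etree.ElementTree as ET

page = """<html><body>
  <div id="menu">Home | About</div>
  <div id="content">A long continuous block of article text.</div>
</body></html>"""

body = ET.fromstring(page).find("body")   # XPath-style jump to BODY
divs = body.findall(".//div")             # every DIV element beneath it
for div in divs:
    text = "".join(div.itertext()).strip()
    print(div.get("id"), len(text))
```

Comparing the text length per DIV, as in the loop above, is one crude way to start ranking candidate content blocks.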

Faruz
I'm interested more in the kinds of criteria I could possibly use to judge the individual candidate nodes.
VoY
+1  A: 

This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm

Gideon
+1  A: 

There's a simple technique for this, based on analysing how "noisy" the HTML is, i.e., the ratio of markup to displayed text throughout the page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this technique and gives some Python code to illustrate it.
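The core idea can be sketched in a few lines: chunk the raw HTML (by line, here), and for each chunk measure how much of it is visible text versus markup. Runs of high-density chunks are likely real content; low-density runs are navigation and other chrome. This toy scoring is my own illustration of the ratio idea, not the exact method from the linked article.

```python
import re

def text_density(line: str) -> float:
    """Fraction of the line that is visible text after stripping tags."""
    visible = re.sub(r"<[^>]*>", "", line)   # crude tag removal for the sketch
    return len(visible.strip()) / len(line) if line.strip() else 0.0

page = [
    '<div id="menu"><a href="/">Home</a> <a href="/about">About</a></div>',
    '<div id="content">This paragraph is mostly plain prose with very little markup around it.</div>',
]
scores = [text_density(line) for line in page]
# The prose-heavy content line scores far higher than the markup-heavy menu line.
```

Thresholding or smoothing these per-chunk scores is what turns this into a usable content extractor.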

Cf. also the HTML::ContentExtractor Perl module, which implements this idea. If you want to use this approach, it would make sense to clean the HTML first, e.g. with BeautifulSoup.

Charles Stewart
A: 

Once you've parsed the page...

You could also learn to use XPath, since the query "//div" will get you all the divs as well. Then you can narrow the query down as necessary to reach the HTML element you're interested in.

e.g. //div[@id!='menu']
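A small Python sketch of the same narrowing idea, using the standard library's ElementTree. Its XPath support is a limited subset (no `!=` predicates), so the "skip the menu" filter is applied in Python; with lxml you could write `//div[@id!='menu']` directly.

```python
import xml.etree.ElementTree as ET

page = """<html><body>
  <div id="menu">navigation links</div>
  <div id="content">the article body</div>
</body></html>"""

tree = ET.fromstring(page)
all_divs = tree.findall(".//div")    # the equivalent of //div
# Narrow down: drop DIVs whose id marks them as chrome rather than content.
candidates = [d for d in all_divs if d.get("id") != "menu"]
```

From here you would rank the surviving candidates, e.g. by the amount of text each contains.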

kurtnelle
+1  A: 

I would recommend Vit Baisa's thesis on web content cleaning; I think he has some code too, but I can't find a link to it. There is also a discussion of this very problem on the LingPipe natural language processing blog.

Jeff Kubina