I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.

+1  A: 

First, if you need to parse a web page, I would use HTMLAgilityPack to transform it into an XML document. That will speed everything up and lets you use a simple XPath expression to go directly to the BODY.

After that, iterate over all the divs (the Agility Pack can give you all the DIV elements as a list) and extract whatever you want.
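HTMLAgilityPack is a .NET library, but the same two steps (parse, jump to BODY, list the DIVs) can be sketched in Python. A rough analogue using only the standard library is shown below; real-world pages are rarely well-formed XML, so in practice you would parse with lxml or BeautifulSoup instead of ElementTree, which is used here only to keep the sketch dependency-free.

```python
# Parse the page, go straight to BODY, then collect every DIV under it.
# NOTE: ElementTree requires well-formed markup; this is an illustrative
# stand-in for HTMLAgilityPack, not an equivalent HTML parser.
import xml.etree.ElementTree as ET

page = """<html><body>
  <div id="menu">Home | About</div>
  <div id="content">A long continuous block of article text.</div>
</body></html>"""

body = ET.fromstring(page).find("body")   # XPath-style jump to BODY
divs = body.findall(".//div")             # every DIV element beneath it
for div in divs:
    text = "".join(div.itertext()).strip()
    print(div.get("id"), len(text))
```

Comparing the text length per DIV, as in the loop above, is one crude way to start ranking candidate content blocks.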

Faruz
I'm interested more in the kinds of criteria I could possibly use to judge the individual candidate nodes.
VoY
+1  A: 

This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm

Gideon
+1  A: 

There's a simple technique for this, based on analysing how "noisy" the HTML is, i.e., the ratio of markup to displayed text throughout the page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this technique and gives some Python code to illustrate it.
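The core idea can be sketched in a few lines: chunk the raw HTML (by line, here), and for each chunk measure how much of it is visible text versus markup. Runs of high-density chunks are likely real content; low-density runs are navigation and other chrome. This toy scoring is my own illustration of the ratio idea, not the exact method from the linked article.

```python
import re

def text_density(line: str) -> float:
    """Fraction of the line that is visible text after stripping tags."""
    visible = re.sub(r"<[^>]*>", "", line)   # crude tag removal for the sketch
    return len(visible.strip()) / len(line) if line.strip() else 0.0

page = [
    '<div id="menu"><a href="/">Home</a> <a href="/about">About</a></div>',
    '<div id="content">This paragraph is mostly plain prose with very little markup around it.</div>',
]
scores = [text_density(line) for line in page]
# The prose-heavy content line scores far higher than the markup-heavy menu line.
```

Thresholding or smoothing these per-chunk scores is what turns this into a usable content extractor.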

Cf. also the HTML::ContentExtractor Perl module, which implements this idea. If you want to use this approach, it would make sense to clean the HTML first, e.g. with BeautifulSoup.

Charles Stewart
A: 

Once you've parsed the page...

You could also learn to use XPath, since the query "//div" will get you all the divs as well. Then you can narrow the query down as necessary to reach the HTML element you're interested in.

e.g. //div[@id!='menu']
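A small Python sketch of the same narrowing idea, using the standard library's ElementTree. Its XPath support is a limited subset (no `!=` predicates), so the "skip the menu" filter is applied in Python; with lxml you could write `//div[@id!='menu']` directly.

```python
import xml.etree.ElementTree as ET

page = """<html><body>
  <div id="menu">navigation links</div>
  <div id="content">the article body</div>
</body></html>"""

tree = ET.fromstring(page)
all_divs = tree.findall(".//div")    # the equivalent of //div
# Narrow down: drop DIVs whose id marks them as chrome rather than content.
candidates = [d for d in all_divs if d.get("id") != "menu"]
```

From here you would rank the surviving candidates, e.g. by the amount of text each contains.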

kurtnelle
+1  A: 

I would recommend Vit Baisa's thesis on web content cleaning; I think he has some code too, but I can't find a link to it. There is also a discussion of this very problem on the LingPipe natural language processing blog.

Jeff Kubina