Given an HTML page that is a text-heavy article, I would like to identify and parse out the primary content.

Using http://www.fivethirtyeight.com/2009/08/chavismo-obama-and-monroe-doctrine.html as an example, I want to identify div#post-4438372351887392855, which contains the title and article.

I know nothing can be perfect or work 100% of the time, but is there an approach that can give me the desired result in a reasonable number of circumstances?

My current thought is to iterate through each div, strip out the markup, and then find the innermost div that contains the most text.
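To make that idea concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It rests on one assumption about what "innermost div with the most text" means: each run of text is credited only to its nearest enclosing div, so a wrapper div does not win just because it contains everything. The names `DivTextScorer` and `guess_main_div` are made up for illustration, and the sample markup is invented rather than taken from the page above.

```python
from html.parser import HTMLParser


class DivTextScorer(HTMLParser):
    """Credit each run of text to its nearest enclosing <div>."""

    SKIP = {"script", "style"}  # their contents are not article text

    def __init__(self):
        super().__init__()
        self.stack = []      # indices (into self.divs) of currently open divs
        self.divs = []       # one record per div: {"id": ..., "chars": ...}
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "div":
            self.divs.append({"id": dict(attrs).get("id"), "chars": 0})
            self.stack.append(len(self.divs) - 1)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Markup is already gone at this point; count non-whitespace-padded text.
        if self.stack and not self.skip_depth:
            self.divs[self.stack[-1]]["chars"] += len(data.strip())


def guess_main_div(html):
    """Return the record of the div owning the most text, or None if no divs."""
    parser = DivTextScorer()
    parser.feed(html)
    parser.close()
    return max(parser.divs, key=lambda d: d["chars"], default=None)


sample = """
<div id="page">
  <div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
  <div id="post-1"><h1>Title</h1>
    <p>A long paragraph of article text that dominates the page.</p></div>
  <div id="footer">Copyright</div>
</div>
"""

best = guess_main_div(sample)
print(best["id"])  # post-1
```

This will misfire when the article's paragraphs are themselves split across many nested divs, or when a comment section is longer than the article, so it is only a starting point.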

I'm just getting started, so I'm looking for input I can put towards a conceptual approach. If something already exists, an open source library would be nice.

Thanks in advance for the insights.

A: 

I just turned up this API solution, but I'd still welcome other ideas.

chipotle_warrior
+1  A: 

Some folks at arc90 have done impressive work on this with their Readability bookmarklet. It seems to do a good job of finding the 'main' content, and it works perfectly on the page you list.
You can look through their well-commented JavaScript (linked to in the bookmarklet), but you might want to contact the developers for their ideas and for permission to use them.

Peter McMahan