Given an HTML page that is a text-heavy article, I would like to identify and extract the primary content.
Using http://www.fivethirtyeight.com/2009/08/chavismo-obama-and-monroe-doctrine.html as an example, I want to identify div#post-4438372351887392855, which contains the title and article.
I know nothing can be perfect or work 100% of the time, but is there an approach that can give me the desired result in a reasonable number of circumstances?
My present thought is to iterate through each div, strip out the markup, and then find the innermost div that contains the most text.
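To make that idea concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It rests on one assumption about what "innermost div with the most text" means: each run of text is credited only to its nearest enclosing div, so an outer wrapper does not win just because it contains everything. The names `DivTextScorer` and `find_main_div` are made up for illustration, and text inside `<script>` and `<style>` is ignored.

```python
from html.parser import HTMLParser


class DivTextScorer(HTMLParser):
    """Credits each run of text to the nearest enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # indexes into self.divs for the currently open divs
        self.divs = []       # one record per div seen: {"id": ..., "chars": ...}
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "div":
            self.divs.append({"id": dict(attrs).get("id"), "chars": 0})
            self.stack.append(len(self.divs) - 1)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only count visible text, and only for the innermost open div.
        if self.stack and not self.skip_depth:
            self.divs[self.stack[-1]]["chars"] += len(data.strip())


def find_main_div(html):
    """Return the record of the div owning the most text, or None if no div exists."""
    parser = DivTextScorer()
    parser.feed(html)
    parser.close()
    return max(parser.divs, key=lambda d: d["chars"], default=None)
```

Calling `find_main_div(page_source)` returns a dict holding the winning div's `id` attribute and its character count. This is only a heuristic: it will likely pick the wrong element on pages that wrap every paragraph in its own div, or where a comment section is longer than the article.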
At this point I'm just getting started, so I'm looking for input on a conceptual approach. If an open source library already does this, that would be even better.
Thanks in advance for the insights.