I am doing some analysis by mining web content with my crawlers. Web pages often contain clutter (such as ads, unnecessary images, and extraneous links) around the body of an article, which distracts the reader from the actual content.

Extracting the meaningful content is a difficult problem, as I understand it, given that there is no standard defining where the actual news story/blog post/forum comment/article sits within a web page.

I did find some open-source solutions, such as HTML::ContentExtractor: http://search.cpan.org/~jzhang/HTML-ContentExtractor-0.03/lib/HTML/ContentExtractor.pm

But I am curious whether anyone has dealt with this and achieved a reasonable success rate. It seems a fairly common problem, and I would like to believe many experts are out there. I would prefer a Java-based solution, but that is not a hard rule. Please share your input; I will deeply appreciate it.

+1  A: 

Ideally, you would look for an RSS feed to get the raw content.
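
Most blogs and news sites advertise their feeds with <link rel="alternate"> tags in the page head, so you can autodiscover them. A minimal sketch using the jsoup HTML parser (one Java option among several; the URL is a placeholder):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FeedDiscovery {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; any article or blog page will do.
            Document doc = Jsoup.connect("http://example.com/some-article").get();

            // Feeds are conventionally advertised in the <head> as
            // <link rel="alternate" type="application/rss+xml" href="...">.
            for (Element link : doc.select("link[rel=alternate]")) {
                String type = link.attr("type");
                if (type.contains("rss") || type.contains("atom")) {
                    System.out.println("Feed: " + link.attr("abs:href"));
                }
            }
        }
    }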

There is no standard for overall structure and meaning in HTML: authors lay out the elements of their pages however they like. Search engines have invested heavily in this area, and they have their own secret sauce for indexing content and extracting some kind of meaning and structure from it for ranking.

Until we have the long-foretold "semantic web", we can only make educated guesses about the structure and meaning of arbitrary HTML pages.

But, in theory, a few heuristics help (a rough sketch combining them follows below):

Look for heading tags. These should give you a clue about where to start reading, and hopefully an outline of the content in order of importance.

Look for common element ids and classes. A well-structured site might have things like <div id="content"> and <div class="article">, which is about as semantic as it gets these days. Also get to know the standard class names used by common CMS platforms like WordPress ("post") or Drupal ("node"); these are often used to mark up the main content.

Last but not least, look for microformats: hAtom, for example, marks up blog entries with classes like "hentry" and "entry-content".
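
Putting these heuristics together, here is a rough Java sketch, again using jsoup; the selector list and the length threshold are assumptions to tune per site, not anything standard:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ContentGuesser {

        // Candidate selectors, tried from most to least specific. The names
        // are assumptions drawn from common conventions (hAtom, WordPress,
        // Drupal), not any standard.
        private static final String[] CANDIDATES = {
            ".hentry .entry-content",                  // hAtom microformat
            "#content", ".article", ".post", ".node",  // common CMS markup
        };

        public static String guessContent(Document doc) {
            for (String selector : CANDIDATES) {
                Element el = doc.select(selector).first();
                // The 100-character minimum is an arbitrary sanity check
                // against empty or decorative matches.
                if (el != null && el.text().length() > 100) {
                    return el.text();
                }
            }
            // Fall back to the first heading's parent as a rough anchor.
            Element heading = doc.select("h1, h2").first();
            return heading != null ? heading.parent().text() : doc.body().text();
        }

        public static void main(String[] args) throws Exception {
            // Placeholder URL for illustration.
            Document doc = Jsoup.connect("http://example.com/article").get();
            System.out.println(guessContent(doc));
        }
    }

The order matters: selectors that carry real semantics (microformats, known CMS classes) are tried before generic fallbacks like headings.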

Andrew Vit