I am doing some analysis by mining web content with my own crawlers. Web pages often contain clutter (such as ads, unnecessary images, and extraneous links) around the body of an article, which distracts the reader from the actual content.
Extracting the meaningful content is a difficult problem, as I understand it, because no standard defines where a news story, blog post, forum comment, or article sits within a web page.
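To make the problem concrete, here is a minimal sketch of the kind of naive baseline I have in mind. It is not taken from any existing library: the class name `NaiveContentExtractor`, the list of block-level tags, and the scoring formula (text length weighted down by link density) are all my own assumptions. It splits the page at block-level tags, scores each chunk, and returns only the single best-scoring chunk, so it will miss articles spread over many paragraphs and is easily fooled by real-world markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical baseline: pick the text block with the most non-link text.
class NaiveContentExtractor {
    // Block-level tags used as split points (an assumed, incomplete list).
    private static final Pattern BLOCK_SPLIT = Pattern.compile(
        "(?i)</?(?:div|p|td|tr|table|ul|ol|li|section|article|header|footer|nav|aside|h[1-6]|br)\\b[^>]*>");
    // Script and style elements carry no readable content, so drop them whole.
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
        "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    // Remove any remaining tags and collapse runs of whitespace.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Returns the text of the highest-scoring block, or "" if none is found.
    static String extractMainText(String html) {
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        String best = "";
        double bestScore = 0.0;
        for (String block : BLOCK_SPLIT.split(cleaned)) {
            String text = stripTags(block);
            if (text.isEmpty()) {
                continue;
            }
            // Count the characters that sit inside <a>...</a> in this block.
            int linkChars = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkChars += stripTags(m.group(1)).length();
            }
            double linkDensity = Math.min(1.0, (double) linkChars / text.length());
            // Assumed score: long blocks with few links look like article text.
            double score = text.length() * (1.0 - linkDensity);
            if (score > bestScore) {
                bestScore = score;
                best = text;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String html = "<div><a href='/a'>Home</a> <a href='/b'>News</a></div>"
            + "<p>The actual story text goes here and is much longer than the menu.</p>";
        System.out.println(extractMainText(html));
    }
}
```

Something like this works on trivially simple pages, but I expect it to break down quickly on real sites, which is why I am looking for something more robust.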
I did find some open source solutions, such as this one: http://search.cpan.org/~jzhang/HTML-ContentExtractor-0.03/lib/HTML/ContentExtractor.pm
But I am curious whether anyone has dealt with this and achieved a reasonable success rate. It seems like a fairly common problem, and I would like to believe there are many experts out there. I would prefer a Java-based solution, but that is not a hard requirement. Any input would be deeply appreciated.