Hi, does anybody know of any open source tools that can parse HTML pages and filter out ads, JavaScript, etc. to extract the title and text? The front end of my application is based on LAMP, so I need to parse the HTML pages and store them in MySQL, then populate the front-end pages with that data.
I know of some tools, such as Heritrix and Nutch, but they seem to be crawlers.
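To make the requirement concrete, here is a rough sketch of the kind of extraction I mean, using only Python's standard `html.parser` module. The names `TitleTextExtractor` and `extract_title_and_text` are made up for illustration. It only skips `script`, `style`, and `noscript` content; real ad filtering would need extra heuristics, and storing the result in MySQL is not shown.

```python
from html.parser import HTMLParser


class TitleTextExtractor(HTMLParser):
    """Collects the <title> and the visible text, skipping script/style content.

    Hypothetical helper for illustration; it does not detect ads.
    """

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0      # > 0 while inside a skipped element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return                # ignore JavaScript, CSS, noscript fallbacks
        chunk = data.strip()
        if not chunk:
            return
        if self._in_title:
            self.title_parts.append(chunk)
        else:
            self.text_parts.append(chunk)


def extract_title_and_text(html):
    """Return (title, text) for one HTML document given as a string."""
    parser = TitleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.text_parts)
```

Each fetched page would be passed through `extract_title_and_text`, and the returned title and text would then be written to the database.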
Thanks. Joseph