Hi, does anybody know of open source tools that can parse HTML pages and filter out the ads, JavaScript, etc. to extract the title and text? The front end of my application is based on LAMP, so I need to parse the HTML pages, store them in MySQL, and populate the front pages with that data.

I know of some tools, such as Heritrix and Nutch, but they seem to be crawlers.

Thanks. Joseph

A: 

It depends on what you mean by "text" from the web page. I did something similar by fetching a web page with the Apache HttpClient libraries and then using dom4j to find a particular tag and extract its text. In effect, though, you need the same kind of crawler that search engines like Google use, because you are emulating the basic steps they perform when crawling a website and extracting its information. It would help if you gave a little more detail about what kind of information you want to retrieve from the pages.
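As a rough illustration of the "parse the page and pull out the text" step, here is a minimal sketch in Python using only the standard library's `html.parser`, not the HttpClient/dom4j combination mentioned above. The class and function names (`TitleAndTextExtractor`, `extract_title_and_text`) are made up for this example, and it assumes the HTML has already been downloaded into a string. It only drops `<script>`, `<style>`, and `<noscript>` content; it does not try to detect ads or find the main article body.

```python
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Hypothetical helper: collects the <title> and the visible text."""

    # Tags whose contents should never end up in the extracted text.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script>/<style>/<noscript>: ignore
        chunk = data.strip()
        if not chunk:
            return  # whitespace between tags
        if self._in_title:
            self.title_parts.append(chunk)
        else:
            self.text_parts.append(chunk)


def extract_title_and_text(html):
    """Return (title, text) for an HTML string."""
    parser = TitleAndTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.text_parts)


sample = """<html><head><title>Sample News</title>
<script>var ad = "buy now";</script>
<style>p { color: red; }</style></head>
<body><h1>Headline</h1><p>Main story text.</p></body></html>"""

title, text = extract_title_and_text(sample)
print(title)
print(text)
```

The resulting title and text strings could then be written to MySQL by whatever layer the application already uses; fetching the pages and storing the results are deliberately left out of this sketch.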

controlfreak123
Useful info. For example, for a news page, I want to get the main news content from the HTML page.
Joseph