Parse HTML pages and store the contents (title, text, etc.) in a database

Q:

Hi, does anybody know of open source tools that parse HTML pages and filter out ads, JavaScript, etc. to extract the title and text? The front end of my application is based on LAMP, so I need to parse the HTML pages, store them in MySQL, and populate the front pages with that data.

I know of some tools, such as Heritrix and Nutch, but they seem to be crawlers.

Thanks. Joseph

A:

It depends on what you mean by "text" from the webpage. I did something similar by fetching a web page with the Apache HttpClient library and then using dom4j to find a particular tag and extract its text. In effect, though, you need the same kind of crawler that search engines like Google use: you are emulating the basic steps they perform when they crawl a website and extract information from it. It would help if you went into a little more detail about what kind of information you want to retrieve from the pages.
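As a rough illustration of the "extract the title and text" step, here is a minimal sketch. It is not the code from the project mentioned above: to stay dependency-free it uses plain `java.util.regex` instead of HttpClient and dom4j, and it assumes the page has already been downloaded into a string. The class name `HtmlTextExtractor` and its methods are made up for this example. Regular expressions are fragile on real-world HTML, so a proper parser would likely be the better choice in production.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: crude title/text extraction from an HTML string.
class HtmlTextExtractor {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Contents of the first <title> element.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);

    // Blocks that never hold readable text: scripts, styles, noscript, comments.
    private static final Pattern NON_CONTENT = Pattern.compile(
            "<(script|style|noscript)\\b[^>]*>.*?</\\1\\s*>|<!--.*?-->", FLAGS);

    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns the page title with whitespace collapsed, or "" if there is none.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? WHITESPACE.matcher(m.group(1)).replaceAll(" ").trim() : "";
    }

    // Returns the visible text with scripts, styles, comments and tags removed.
    static String extractText(String html) {
        String s = NON_CONTENT.matcher(html).replaceAll(" ");
        s = TITLE.matcher(s).replaceAll(" ");   // the title is reported separately
        s = TAG.matcher(s).replaceAll(" ");     // drop the remaining markup
        // Decode only a handful of common entities; "&amp;" goes last so that
        // text such as "&amp;lt;" is not unescaped twice.
        s = s.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
             .replace("&quot;", "\"").replace("&amp;", "&");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample News</title>"
                + "<script>var ad = 1;</script></head>"
                + "<body><h1>Headline</h1><p>Body text &amp; more.</p></body></html>";
        System.out.println(extractTitle(html));
        System.out.println(extractText(html));
    }
}
```

The two returned strings could then be written to MySQL, for example through a JDBC `PreparedStatement`; that step is left out because it depends on your schema.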

controlfreak123 2010-09-16 17:26:48

Useful info. For example, for a news page, I want to extract the main news content from the HTML page.

Joseph 2010-09-17 02:12:12
