Hi, I'm familiar with the Java programming language. I'd like to extract data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, there are a number of schools listed on a website; how can I extract that data and store it in my database using Java?
What you're referring to is commonly called 'screen scraping'. There are a variety of ways to do this in Java; personally, I prefer HtmlUnit. While it was designed as a way to test web functionality, you can use it to request a remote web page and parse it.
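A minimal sketch of that approach, assuming a hypothetical page at `http://example.com/schools` whose school names sit in `<li>` elements inside a `<ul id="schools">` (substitute the real URL and an XPath that matches the actual markup):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SchoolScraper {
    public static void main(String[] args) throws Exception {
        // WebClient is HtmlUnit's headless browser.
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(false); // plain HTML is enough here

            // Hypothetical URL -- replace with the page you actually want to scrape.
            HtmlPage page = client.getPage("http://example.com/schools");

            // XPath for the list items holding the school names (assumed markup).
            for (Object o : page.getByXPath("//ul[@id='schools']/li")) {
                HtmlElement item = (HtmlElement) o;
                System.out.println(item.getTextContent().trim());
            }
        }
    }
}
```

From there you can insert each extracted name into your local database with plain JDBC.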
I would recommend using a good error-tolerant HTML parser like TagSoup to extract exactly what you're looking for from the HTML.
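As an illustration of what that looks like, here is a hedged sketch using TagSoup's SAX parser to collect the text of every `<li>` from deliberately messy HTML (the sample markup and school names are made up for the example):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupExample {
    // SAX handler that collects the text content of each <li> element.
    static class LiHandler extends DefaultHandler {
        private final StringBuilder current = new StringBuilder();
        private boolean inLi = false;
        final List<String> items = new ArrayList<>();

        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("li".equalsIgnoreCase(local)) { inLi = true; current.setLength(0); }
        }
        @Override public void characters(char[] ch, int start, int len) {
            if (inLi) current.append(ch, start, len);
        }
        @Override public void endElement(String uri, String local, String qName) {
            if ("li".equalsIgnoreCase(local)) { inLi = false; items.add(current.toString().trim()); }
        }
    }

    public static void main(String[] args) throws Exception {
        // Note the unclosed <li> tags -- TagSoup repairs them while parsing.
        String html = "<ul><li>Springfield High<li>Shelbyville Elementary</ul>";

        XMLReader reader = new Parser(); // TagSoup's SAX-compatible parser
        LiHandler handler = new LiHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(html)));

        for (String school : handler.items) System.out.println(school);
    }
}
```

The point of TagSoup is exactly this: it turns broken real-world HTML into a well-formed SAX event stream, so the standard `org.xml.sax` machinery works on it.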
Depending on what you are really trying to do, you can use many different solutions.
If you just want to fetch the HTML code of a web page, then URL.getContent() may be your solution. Here is a little tutorial:
http://www.javacoffeebreak.com/books/extracts/javanotesv3/c10/s4.html
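To make that concrete, here is a small JDK-only sketch: a helper that reads the content behind any URL, plus a quick regex pass over a hardcoded HTML sample standing in for what the site might return (the URL and markup are assumptions for illustration):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchExample {
    // Reads the raw content behind a URL into a String.
    static String fetch(URL url) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the fetched page; in practice you would call
        // fetch(new URL("http://example.com/schools")) instead.
        String html = "<ul><li>Springfield High</li><li>Shelbyville Elementary</li></ul>";

        // A quick-and-dirty regex works for trivial, well-formed pages; for
        // anything non-trivial, use a real HTML parser as suggested above.
        Matcher m = Pattern.compile("<li>(.*?)</li>").matcher(html);
        while (m.find()) System.out.println(m.group(1));
        // prints:
        // Springfield High
        // Shelbyville Elementary
    }
}
```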
EDIT: I didn't realize he was looking for a way to parse the HTML code. Some tools have been suggested above. Sorry about that.
You definitely need a good parser like NekoHTML.
Here's an example of using NekoHTML, albeit using Groovy (a Java-based scripting language) rather than Java itself:
http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy
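For those who want to stay in plain Java rather than Groovy, here is a hedged sketch of the same idea: NekoHTML's `DOMParser` repairs malformed markup and hands back a standard W3C DOM (the sample HTML is invented for the example):

```java
import java.io.StringReader;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoExample {
    public static void main(String[] args) throws Exception {
        // Malformed on purpose -- NekoHTML balances the tags for us.
        String html = "<ul><li>Springfield High<li>Shelbyville Elementary";

        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();

        // By default NekoHTML upper-cases element names in the DOM it builds.
        NodeList items = doc.getElementsByTagName("LI");
        for (int i = 0; i < items.getLength(); i++) {
            System.out.println(items.item(i).getTextContent().trim());
        }
    }
}
```

Once you have a DOM, the usual `org.w3c.dom` traversal and JDBC inserts get the data into your database.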
You can use VietSpider XML from
http://sourceforge.net/projects/binhgiang/files/
Download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip
VietSpider Web Data Extractor: the software crawls data from websites (a data scraper), formats it to standard XML (text, CDATA), and then stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, and Postgres. The VietSpider crawler supports sessions (login, query by form input), multi-part downloading, JavaScript handling, and proxies (including multi-proxy by auto-scanning proxy lists from websites).