tags:

views:

146

answers:

5

I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide any API either; I want to iterate over all its web pages using some pageID and extract the HTML titles / other stuff from their DOM trees.

Are there ways other than web scraping?

Thanks

+3  A: 

Extracting the title is not difficult, and you have many options (search here on SO for HTML parsers in Java).

One of them is jsoup: http://jsoup.org/

You can navigate the page using the DOM if you know the page structure: http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library and I've used it in my recent projects.
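A minimal sketch of the jsoup approach (the HTML string and the `/next` link here are made-up examples; for a live page you would use `Jsoup.connect(url).get()` instead of `Jsoup.parse(...)`):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleExtract {
    public static void main(String[] args) {
        // Parse a fixed HTML string for illustration; for a live page,
        // replace this with: Document doc = Jsoup.connect(url).get();
        String html = "<html><head><title>Page 42</title></head>"
                    + "<body><a href='/next'>next</a></body></html>";
        Document doc = Jsoup.parse(html);

        // The page title, as the asker wants
        System.out.println(doc.title()); // Page 42

        // DOM navigation via CSS selectors: all anchors with an href
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.attr("href") + " " + a.text()); // /next next
        }
    }
}
```

To iterate over pages by ID, you'd wrap the `connect(...).get()` call in a loop over your pageID values.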

Wajdy Essam
A: 

I would suggest a combination of Groovy and HtmlUnit. For lower-level handling, you can use HttpBuilder.

Riduidel
A: 

Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.

Mikos
+1  A: 

Thanks! JSoup was what I was looking for.

NoneType
FYI: I presume this was in response to Wajdy's answer. You can comment on someone's answer by clicking the "add comment" link beneath it. As well, if his answer did solve your problem, you can accept it by clicking the check box next to his answer. :)
Adam Paynter
A: 

Your best bet is to use Selenium WebDriver, since it:

  1. Provides visual feedback to the coder (you see your scraping in action and where it stops).
  2. Is accurate and consistent, because it directly controls the browser you use.
  3. Is slow. It doesn't hit web pages as fast as HtmlUnit does, but sometimes you don't want to hit them too fast.

    HtmlUnit is fast but is horrible at handling JavaScript and AJAX.
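A rough sketch of the WebDriver approach (the pageID URL scheme is hypothetical, and this assumes a local Chrome install plus a matching chromedriver binary on your PATH):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumSketch {
    public static void main(String[] args) {
        // Launches a real browser, so you can watch the scrape as it runs
        WebDriver driver = new ChromeDriver();
        try {
            // Hypothetical pageID URL pattern, per the question
            driver.get("http://example.com/page?id=1");

            // The page title
            System.out.println(driver.getTitle());

            // All links in the rendered (post-JavaScript) DOM
            for (WebElement link : driver.findElements(By.tagName("a"))) {
                System.out.println(link.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}
```

Because the browser executes JavaScript before you read the DOM, this handles AJAX-heavy pages that trip up HtmlUnit, at the cost of speed.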

Kim Jong Woo