tags:

views:

146

answers:

5

I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide any API either; I want to iterate over all its web pages using some pageID and extract the HTML titles / other stuff from their DOM trees.

Are there ways other than web scraping?

Thanks

+3  A: 

Extracting the title is not difficult, and you have many options (search here on SO for HTML parsers in Java).

One of them is jsoup: http://jsoup.org/

You can navigate the page using the DOM if you know the page structure: http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library and I've used it in my recent projects.
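A minimal sketch of the jsoup approach (the HTML string and the `/next` link here are made-up examples; for a live page you would use `Jsoup.connect(url).get()` instead of `Jsoup.parse(...)`):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleExtract {
    public static void main(String[] args) {
        // Parse a fixed HTML string for illustration; for a live page,
        // replace this with: Document doc = Jsoup.connect(url).get();
        String html = "<html><head><title>Page 42</title></head>"
                    + "<body><a href='/next'>next</a></body></html>";
        Document doc = Jsoup.parse(html);

        // The page title, as the asker wants
        System.out.println(doc.title()); // Page 42

        // DOM navigation via CSS selectors: all anchors with an href
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.attr("href") + " " + a.text()); // /next next
        }
    }
}
```

To iterate over pages by ID, you'd wrap the `connect(...).get()` call in a loop over your pageID values.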

Wajdy Essam
A: 

I would suggest a combination of Groovy and HtmlUnit. For lower-level handling, you can use HttpBuilder.

Riduidel
A: 

Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.

Mikos
+1  A: 

Thanks! JSoup was what I was looking for.

NoneType
FYI: I presume this was in response to Wajdy's answer. You can comment on someone's answer by clicking the "add comment" link beneath it. As well, if his answer did solve your problem, you can accept it by clicking the check box next to his answer. :)
Adam Paynter
A: 

Your best bet is to use Selenium WebDriver, since it:

  1. Provides visual feedback to the coder (you see your scraping in action and where it stops).
  2. Is accurate and consistent, because it directly controls the browser you use.
  3. Is slow. It doesn't hit web pages as fast as HtmlUnit does, but sometimes you don't want to hit them too fast.

    HtmlUnit is fast but is horrible at handling JavaScript and AJAX.
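A rough sketch of the WebDriver approach (the pageID URL scheme is hypothetical, and this assumes a local Chrome install plus a matching chromedriver binary on your PATH):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumSketch {
    public static void main(String[] args) {
        // Launches a real browser, so you can watch the scrape as it runs
        WebDriver driver = new ChromeDriver();
        try {
            // Hypothetical pageID URL pattern, per the question
            driver.get("http://example.com/page?id=1");

            // The page title
            System.out.println(driver.getTitle());

            // All links in the rendered (post-JavaScript) DOM
            for (WebElement link : driver.findElements(By.tagName("a"))) {
                System.out.println(link.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}
```

Because the browser executes JavaScript before you read the DOM, this handles AJAX-heavy pages that trip up HtmlUnit, at the cost of speed.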

Kim Jong Woo