Hi,

I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself, using Apache's HTTP client library to make the relevant HTTP calls to download the data. I worked out which calls I needed to make by clicking through the screens in a browser while using the Charles web proxy to log the corresponding HTTP traffic.

As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code corresponding to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure whether it supports this exact use case.

Thanks, Don

+1  A: 

You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
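As a minimal sketch of what that looks like (the URL and link text here are placeholders, and this assumes HtmlUnit is on the classpath):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class ScrapeExample {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();

            // Load the start page; cookies set by the response are
            // kept automatically for the rest of the session
            HtmlPage page = webClient.getPage("http://example.com/login");

            // Follow a link just as a user would; the session carries over
            HtmlAnchor link = page.getAnchorByText("Reports");
            HtmlPage reports = link.click();

            System.out.println(reports.asXml());
        }
    }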

Marc Novakowski
+2  A: 

The manageability.org blog has an entry listing a whole bunch of web page scraping tools for Java. The site seems to be unreachable at the moment, but I did find a text-only version in Google's cache here.

Nicholas
A: 

A tool called The Grinder lets you script a session against a site by recording your browsing through its proxy. The output is a Python script (runnable on the JVM via Jython).

laz
If the output were Java code, that would be exactly what I'm looking for. I know you can run Python on the JVM via Jython, but I need Java code.
Don
+1  A: 

I would also add a +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour "as though a real browser were scraping and using the page", it's definitely the best option available. HtmlUnit executes the JavaScript in the page (if you want it to).

It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
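A rough sketch of that (the URL and the script snippet are illustrative only):

    import com.gargoylesoftware.htmlunit.ScriptResult;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class JsExample {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            webClient.setJavaScriptEnabled(true); // on by default, shown for clarity

            HtmlPage page = webClient.getPage("http://example.com/app");

            // Run a snippet in the page's JavaScript context and
            // get a handle on the result from Java
            ScriptResult result = page.executeJavaScript("document.title");
            Object title = result.getJavaScriptResult();
            System.out.println(title);
        }
    }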

If, however, the scope of what you're trying to do is smaller, more along the lines of reading some of the HTML elements, and you don't much care about JavaScript, then NekoHTML should suffice. It's similar to JDOM in giving programmatic, rather than XPath, access to the tree. You would probably need to use Apache's HttpClient to retrieve the pages.
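A minimal sketch of that combination, assuming commons-httpclient 3.x and NekoHTML on the classpath (the URL is a placeholder; note that NekoHTML reports tag names in upper case by default):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class NekoExample {
        public static void main(String[] args) throws Exception {
            // Fetch the raw HTML with Commons HttpClient
            HttpClient client = new HttpClient();
            GetMethod get = new GetMethod("http://example.com/data.html");
            client.executeMethod(get);

            // Let NekoHTML turn the (possibly messy) HTML into a DOM tree
            DOMParser parser = new DOMParser();
            parser.parse(new InputSource(get.getResponseBodyAsStream()));
            get.releaseConnection();

            // Walk the tree programmatically, e.g. print every table cell
            Document doc = parser.getDocument();
            NodeList cells = doc.getElementsByTagName("TD");
            for (int i = 0; i < cells.getLength(); i++) {
                System.out.println(cells.item(i).getTextContent());
            }
        }
    }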

j pimmel