I often find myself needing to do some simple screen scraping for internal purposes (e.g., a third-party service I use only publishes reports via HTML). I have at least two or three cases of this now. I could use Apache HttpClient and write all the necessary screen-scraping code, but it takes a while. Here is my usual process:
- Open up Charles Proxy, browse the site, and see what's going on.
- Start writing Java code using Apache HttpClient, dealing with cookies and multiple requests.
- Use Jericho HTML to parse the resulting HTML (a rough sketch of this kind of code follows below).
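For context, the hand-rolled code ends up looking roughly like this. This is just a minimal sketch assuming HttpClient 4.x and the Jericho parser; the URLs, form fields, and the fact that the report lives in table cells are all made up for illustration:

```java
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

import java.util.Arrays;

public class ReportScraper {

    public static void main(String[] args) throws Exception {
        // Shared cookie store so the session cookie from the login
        // request carries over to the later report request.
        BasicCookieStore cookieStore = new BasicCookieStore();
        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build()) {

            // Request 1: log in (hypothetical URL and form fields).
            HttpPost login = new HttpPost("https://example.com/login");
            login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
                    new BasicNameValuePair("user", "me"),
                    new BasicNameValuePair("password", "secret"))));
            EntityUtils.consume(client.execute(login).getEntity());

            // Request 2: fetch the report page observed in Charles.
            HttpGet report = new HttpGet("https://example.com/reports?month=2024-01");
            String html = EntityUtils.toString(client.execute(report).getEntity());

            // Parse the HTML with Jericho and pull out the table cells.
            Source source = new Source(html);
            for (Element cell : source.getAllElements(HTMLElementName.TD)) {
                System.out.println(cell.getTextExtractor().toString());
            }
        }
    }
}
```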
I wish I could just "record my session" quickly and then parametrize the things that vary from session to session. Imagine using Charles to capture all the raw HTTP requests and then parametrizing the relevant query string or POST params; voila, I'd have a reusable HTTP script (something like the template sketched below).
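The closest I've gotten to this by hand is keeping the recorded request around as a template and substituting the parts that vary. A tiny sketch of the idea, with a made-up URL and parameter names (a real version would also URL-encode the values):

```java
import java.util.HashMap;
import java.util.Map;

public class RecordedRequest {

    // A request captured in Charles, with the varying pieces
    // marked as {placeholders}.
    private static final String TEMPLATE =
            "https://example.com/reports?month={month}&format={format}";

    // Replace each {name} placeholder with the value supplied for this run.
    static String fill(String template, Map<String, String> params) {
        String url = template;
        for (Map.Entry<String, String> e : params.entrySet()) {
            url = url.replace("{" + e.getKey() + "}", e.getValue());
        }
        return url;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("month", "2024-02");
        params.put("format", "html");
        System.out.println(fill(TEMPLATE, params));
        // -> https://example.com/reports?month=2024-02&format=html
    }
}
```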
Is there anything that does this already? When I worked at a big company, we used a tool called LoadRunner by Mercury Interactive that had a nice way to record an HTTP session and make it reusable (for testing purposes). Unfortunately, that tool is very expensive.