views: 2990
answers: 15

I often find myself needing to do some simple screen scraping for internal purposes (e.g. a third-party service I use only publishes reports via HTML). I have at least two or three cases of this now. I could use Apache HttpClient and write all the necessary screen-scraping code, but it takes a while. Here is my usual process:

  1. Open up Charles Proxy on the web site and see what's going on.
  2. Start writing some Java code using Apache HttpClient, dealing with cookies and multiple requests.
  3. Use Jericho HTML to deal with parsing the HTML.

I wish I could just "record my session" quickly and then parametrize the things that vary from session to session. Imagine just using Charles to grab all the request HTTP and then parametrizing the relevant query-string or POST params. Voilà, I have a reusable HTTP script.
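
As a rough illustration of what such a reusable script could look like, here is a minimal Python sketch; the URL, cookie name, parameter names, and the use of the requests library are my own assumptions, not anything from the original question:

import requests

def fetch_report(session_id, report_date):
    # Replay a captured request, with the values that vary per session
    # passed in as parameters (URL and field names are hypothetical).
    session = requests.Session()
    # Cookie captured from the recorded session
    session.cookies.set("JSESSIONID", session_id)
    # Query-string parameters that change from run to run
    params = {"date": report_date, "format": "html"}
    response = session.get("https://example.com/reports", params=params)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = fetch_report("abc123", "2009-06-01")
    print(len(html), "bytes of HTML fetched")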

Is there anything that does this already? I remember when I used to work at a big company there used to be a tool we used called Load Runner by Mercury Interactive that essentially had a nice way to record an http session and make it reusable (for testing purposes). That tool, unfortunately, is very expensive.

+3  A: 

You don't mention what you want to use this for. One solution is simply to "script" your web browser using a tool like Selenium, if having a web browser repeat your actions is acceptable. You can use the Selenium IDE to record what you do and then alter the parameters.
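
For example, a recorded session exported to code might end up looking roughly like this (a sketch using Selenium's Python bindings; the site, element names, and values are made up):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a browser and replay the recorded steps, with the varying
# values pulled out as ordinary variables.
username = "me@example.com"   # hypothetical parameter
driver = webdriver.Firefox()
driver.get("https://example.com/login")
driver.find_element(By.NAME, "user").send_keys(username)
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.NAME, "submit").click()
print(driver.page_source)
driver.quit()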

Mark Fowler
+7  A: 

HtmlUnit is a scriptable, headless browser written in Java. We use it for some extremely fault-heavy, complex web pages and it usually does a very good job.

To simplify things even more, you can drive it from Jython. The resulting program reads more like a transcript of how one might use a browser than like hard work.
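
Run from Jython, a minimal HtmlUnit session might look something like this sketch; the URL is a placeholder, and the method names follow the classic com.gargoylesoftware HtmlUnit API, which may differ slightly between releases:

# Jython: drive the HtmlUnit Java API with Python syntax
from com.gargoylesoftware.htmlunit import WebClient

client = WebClient()
page = client.getPage("http://example.com/report")   # returns an HtmlPage
print page.getTitleText()
print page.asText()          # the page rendered as plain text
client.closeAllWindows()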

toothygoose
+3  A: 

I wish I could just "record my session" quickly and then parametrize the things that vary from session to session.

If you have the Visual Studio Test Edition, its web test feature does exactly that. If you aren't using VS or want a stand-alone tool, I have had great success with OpenSpan. It does more than just web; it handles Windows apps and Java too!

Robert MacLean
+3  A: 

Selenium would be my first pick, as the IDE lets you do a lot of things the easy way by "recording" a session for you. But if you're not happy with what it provides, you can also use the Python module Beautiful Soup to programmatically walk through a website.
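
A Beautiful Soup scrape of an already-downloaded page can be as short as this sketch; the file name and the assumption that the report is a plain HTML table are mine:

from bs4 import BeautifulSoup

html = open("report.html").read()          # page saved earlier
soup = BeautifulSoup(html, "html.parser")

# Pull every row out of the (hypothetical) results table
for row in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)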

ojrac
+2  A: 

Coscripter

http://coscripter.research.ibm.com/coscripter

Simplifying web-based processes.

CoScripter is a system for recording, automating, and sharing processes performed in a web browser such as printing photos online, requesting a vacation hold for postal mail, or checking flight arrival times. Instructions for processes are recorded and stored in easy-to-read text here on the CoScripter web site, so anyone can make use of them. If you are having trouble with a web-based process, check to see if someone has written a CoScript for it!

Wget

To quickly pull down content, use wget:

wget -r -np -k -w 2 foo.com
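
(-r recurses, -np keeps wget from ascending to the parent directory, -k converts links for local viewing, -w 2 waits two seconds between requests.)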

Then parse the HTML locally.

Twill

In addition to Selenium, you might also check out Twill, the command line companion:

http://twill.idyll.org/

ramanujan
+2  A: 

I'd also look at Selenium and/or BeautifulSoup if you're willing to use Python. There's also a nice tool, Twill, for automated website testing that might do what you want. It's also written in Python, and it has a Python API, but there is also a simplified command language you can use with it. Here is an example from the Twill documentation:

setlocal username <your username>
setlocal password <your password>

go http://www.slashdot.org/
formvalue 1 unickname $username
formvalue 1 upasswd $password
submit

code 200     # make sure form submission is correct!
Rick Copeland
Thank you for introducing me to Twill. I think it addresses what I need to some extent. Although there is no recording capability, I think the simplicity of its scripting language will allow me to build screen-scraping code very quickly. Therefore I have selected it as the answer to the bounty.
Ish
I also chose your answer because of the quick code sample, which showed me how easy it is to use.
Ish
+1  A: 

I used DomInspector to manually inspect the site of interest and parametrize its structure, then plain Apache HttpClient and a hand-made parser driven by that parametrized structure. Basically I could extract any info from any site automatically with a little tweaking of the parameters. It's similar to how a SAX parser works: all you need to tell it is at what sequence of tags you want to start grabbing the data. For example, Google has a pretty standard format for search results, so you just run to the third occurrence of 'tab' and start getting text from the first 'div' up until the closing '/div'.
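
A sketch of that idea in Python, using the standard library's HTMLParser: count occurrences of a start tag, then collect the text of the div that follows. The tag names, occurrence count, and file name are placeholders for whatever the target site actually uses:

from html.parser import HTMLParser

class NthTagTextExtractor(HTMLParser):
    # After the Nth occurrence of trigger_tag, collect the text of the next div.

    def __init__(self, trigger_tag="table", occurrence=3):
        super().__init__()
        self.trigger_tag = trigger_tag
        self.occurrence = occurrence
        self.seen = 0          # how many trigger tags we've passed
        self.depth = 0         # nesting depth inside the captured div
        self.done = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == self.trigger_tag:
            self.seen += 1
        if self.done:
            return
        if self.depth:                      # already inside the captured div
            if tag == "div":
                self.depth += 1
        elif self.seen >= self.occurrence and tag == "div":
            self.depth = 1                  # start capturing at this div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True            # captured div has closed

    def handle_data(self, data):
        if self.depth and not self.done:
            self.text.append(data)

parser = NthTagTextExtractor()
parser.feed(open("results.html").read())   # hypothetical saved results page
print("".join(parser.text).strip())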

Dima
+1  A: 

iMacros is scriptable, but only for Firefox. I assume its performance isn't great, but it can handle most complex situations and can record things easily.

dr. evil
I have used iMacros. It works great at bringing the page down, and it remembers username, password, etc. as if you were doing it manually. However, it requires Firefox, so if you are thinking of running your scraper on a headless server without a window manager (no GNOME or KDE), you are out of luck.
VN44CA
+1  A: 

Internet Explorer supports Browser Helper Objects (BHOs). They can access IE's HWND (window handle), and it's easy to scrape the pixels from there. The IWebBrowser2 COM interface also gives you access to the HTTP requests, and you can get back the parsed HTML document via IWebBrowser2::Document (IHTMLDocument / IHTMLDocument2 / IHTMLDocument3).

MSalters
+1  A: 

Using Firefox, it should be possible to implement much of this through its powerful support for add-ons and extensions. That wouldn't really run "headless", though; it would be an actual scripted browser. Also, I seem to recall reading that Google's Chrome browser uses a similar technique for automated regression testing.

none
+1  A: 

I can't personally vouch for it, but there is a free Firefox plugin, DejaClick. I installed it the other day and did some basic recording, playback, and script editing with it, and it pulled them off without much of a learning curve. If your end goal is to show something in a web browser, then it should suffice.

They offer web transaction monitoring services, implying that you can export the scripts for other uses, but they may be too proprietary to use outside of your web browser / their paid service.

http://www.dejaclick.com/

scottwed
+1  A: 

I'd check out Badboy. It runs an IE browser, but you can literally click record and it records all of your activity.

You can then automate the processing of that script and populate values from a data source (ODBC, Excel, etc.).

Badboy Software

Doug Hays
+1  A: 

Try iOpus iMacros: http://www.iopus.com/imacros/. I am using this for screen scraping and it works very well; the speed is also very good. It's not that costly either.

It records a script while you browse. You can then parametrize the script and execute it from Java, .NET, etc.

Bhushan
+1  A: 

I would look at Fiddler; judging by your requirements, it will do everything you need.

Alex
+1  A: 

Python and Perl both have a module called Mechanize (WWW::Mechanize for Perl) that makes it easy to reproduce browser behavior programmatically (filling out forms, handling cookies, etc.).

So, Python + BeautifulSoup (great HTML/XML parser) + mechanize (browser functions) = a super easy/fast scraper.
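
Something like this minimal mechanize sketch, where the URLs and form field names are placeholders: fill a form, submit it, then hand the result to BeautifulSoup.

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)          # some report pages disallow robots
br.open("http://example.com/login")  # hypothetical login page

br.select_form(nr=0)                 # first form on the page
br["username"] = "me"                # hypothetical field names
br["password"] = "secret"
br.submit()

html = br.open("http://example.com/report").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)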

Alex
Quick question: can Mechanize handle AJAX?
VN44CA