Hello

I have been asked to write an app which screen-scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or has been receipted. As you can imagine, users find this irritating to say the least, and it would be nice to have an app anyone can use that lists the state of their orders on a single screen.

Yes, I know a better solution would be to rewrite the web app, but that would involve calling in the vendor and would cost us a small fortune.

Anyway, while looking into this I discovered that the web page I want to scrape is mostly JavaScript (although it doesn't use any AJAX techniques). Does anyone know of a library or program which I could feed the JavaScript and which would then spit out the DOM for my app to parse?

I can pretty much write the app in any language, but my preference would be JavaFX, just so I could have a play with it.

Thanks for your time.

Ian

+1  A: 

I'd go with Perl's Win32::IE::Mechanize, which lets you automate Internet Explorer. You should be able to click on icons and extract text while letting MSIE do the annoying work of processing all the JS.

David Dorward
I like Perl, but this web app isn't compatible with IE! From what I'm told, it's Firefox and Safari only.
IanW
+3  A: 

You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process the JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support isn't perfect right now, but if your pages don't use many hacks, things should work fine, and the performance should be way better than controlling a browser. Furthermore, you don't have to worry about cookies persisting after your scraping is over, or about all the other nasty things that come with controlling a browser (history, autocomplete, temp files, etc.).
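
A minimal sketch of the idea, assuming HtmlUnit's usual WebClient API (the intranet URL is invented, and package and method names have shifted between HtmlUnit versions):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class OrderStatusScraper {
        public static void main(String[] args) throws Exception {
            WebClient client = new WebClient();
            // JavaScript is on by default, so the page's scripts run and
            // build the DOM just as they would in a real browser.
            HtmlPage page = client.getPage("http://intranet.example/orders"); // hypothetical URL
            // Dump the post-JavaScript DOM; a real app would query it with
            // page.getElementById(...) or an XPath lookup instead.
            System.out.println(page.asXml());
        }
    }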

emaster70
Thanks, I have been reading about HtmlUnit and it looks like it is just what I'm after.
IanW
+4  A: 

Since you say that no AJAX is used, all the info is present in the HTML source; the JavaScript just renders it based on user clicks. So you need to reverse-engineer the way the application works, parse the HTML and the JavaScript code, and extract the useful information. It is strictly a matter of text parsing; you shouldn't have to deal with running the JavaScript and producing a new DOM, which would be much more difficult.
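
As a hedged sketch of that text-parsing approach, suppose (purely for illustration) that the page's inline JS assigns the order state to a variable; then a plain-Java fetch plus a regular expression is enough (the URL, variable name, and pattern are all made up):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OrderStateParser {
        public static void main(String[] args) throws Exception {
            // Fetch the raw page source, inline scripts and all (URL is invented).
            URL url = new URL("http://intranet.example/orders");
            StringBuilder html = new StringBuilder();
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
            in.close();
            // Hypothetical pattern: the page contains something like
            //   var orderState = "receipted";
            Matcher m = Pattern.compile("orderState\\s*=\\s*\"([^\"]+)\"").matcher(html);
            if (m.find()) {
                System.out.println("Order state: " + m.group(1));
            }
        }
    }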

If AJAX were used, your job would be even easier: you could find out how the AJAX services work (they probably return JSON or XML) and extract the information directly.
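
To illustrate that second route, here is a sketch that calls a made-up AJAX endpoint directly and prints the raw response. The URL, query string, and response format are assumptions; in practice you would discover the real endpoint by watching the browser's network traffic, then hand the response to a proper JSON or XML parser.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class OrderFeed {
        public static void main(String[] args) throws Exception {
            // Made-up endpoint; find the real one with a network sniffer.
            URL endpoint = new URL("http://intranet.example/orderStatus?user=ian");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(endpoint.openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON/XML; feed this to a real parser
                }
            }
        }
    }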

kgiannakakis
Thanks for your reply. I have just been looking at the JS and HTML source from the web app and there are some AJAX calls which I hadn't noticed before.
IanW
+1  A: 

I agree with kgiannakakis' answer. I'd be surprised if you couldn't reverse-engineer the JavaScript to identify where the information comes from and then write some simple Python scripts using urllib2 and the Beautiful Soup library to scrape the same information.

If Python and scraping are a new idea, there are some excellent tutorials available on how to get going.

[Edit] Looks like there's a Python version of mechanize too. Time to rewrite some scrapers I developed a while back! :-)

Jon Cage
Thanks Jon. I wrote scraping apps in Perl many years ago, well before JavaScript was a problem. I keep looking for reasons to learn Python, so I shall look into what you suggest later.
IanW
+3  A: 

You could consider using a Greasemonkey script. Greasemonkey is a very powerful Firefox add-on that allows you to run your own scripts alongside those of specific web sites. This allows you to modify how a web site is displayed and to add or remove content. You can even use it to do AJAX-style lookups and add dynamic content.

If your tool is for in-house use and your users are all happy to use Firefox, then this could be a winner.

Regards

Howard May
Thanks, Greasemonkey looks good; I hadn't heard of it before. Sadly, some of my users aren't able to install add-ons in their Firefox installation, so I don't think I will be able to use it.
IanW
+1  A: 

I suggest the IRobotSoft web scraper. It is free software dedicated to screen scraping, with strong JavaScript support. You can create and test a robot with its visual interface, and you can also embed it into your own application using its ActiveX control and hide the browser window.

seagulf