views: 364

answers: 4

Hi,

I am trying to automate data extraction from a website and I really don't know where to start. One of our suppliers is giving us access to some equipment logging data through a "Business Objects 11" online application. If you are not familiar with this online app, think of it as a web-based report generator. The problem is that I am trying to monitor a lot of equipment, and this supplier has only created a request to extract one log at a time. This request takes the equipment number, the start date and the end date... To make matters worse, we can only export to the binary Excel format since the CSV export is broken and they refuse to fix it... hence we are limited by Excel's 65,536-row limit (that amounts to 3-4 days of data recording in my case). I can't create a new request as only the supplier has the necessary admin rights.

What do you think would be the most elegant way of running a lot of requests (around 800) through a web GUI? I guess I could hardcode mouse positions, click events, and keystrokes with delays and everything... but there has to be a better way.

I read about AutoHotKey and AutoIt scripting, but they seem to be limited in what they can do on the web. Also... I am stuck with IE6... but if you know a way that involves another browser, I am still very interested in your answer.

(once I have the log files locally, extracting the data is not a problem)

Thank you very much for your time!

A: 

Normally, I would suggest not using IE (or any browser) at all. Remember, a web browser is just a program for making HTTP requests and displaying the results in a meaningful way. There are other ways to make the same HTTP requests and process the responses, and almost every modern language has this built into its API somewhere. This is called screen scraping or web scraping.

But to complete this suggestion I need to know more about your programming environment: that is, in what programming language do you envision writing this script?

A typical example in C#, where you just get the HTML result as a string, would look like this:

string html = new System.Net.WebClient().DownloadString("http://example.com");

You then parse the string to find the fields you need and send another request. The WebClient class also has a DownloadFile() method that you might find useful for retrieving the Excel files.

Joel Coehoorn
The language itself is not a problem. I'm more of a C/C++ developer, but I have worked a lot with VB/VBS, C#, Java, Bash scripting, etc. I worked with PHP a bit, but that's about it when it comes to "web languages". You are right about HTTP requests, but I have the impression that parsing the raw responses from such a web app would be very complex... or maybe not...
Decapsuleur
@Decapsuleur: Parsing an HTML response with regexps looks crappy, but it works surprisingly well for automatically generated pages.
wuub
I wouldn't use regexes; it gets really ugly matching nested tags and the like. Manual string functions end up simpler to implement and maintain.
Joel Coehoorn
+1  A: 

There are some things you might try. If the site is plain HTML and reports can be requested with a simple POST or GET, then the urllib/urllib2 and cookielib Python modules should be enough to fetch an Excel document.
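
A minimal sketch of that approach; the URL, parameter names and file name below are made up and would have to be copied from the real report form:

    import cookielib
    import urllib
    import urllib2

    # Keep whatever session cookies the app sets (e.g. after login).
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    # Hypothetical report parameters: equipment number, start and end dates.
    params = urllib.urlencode({
        'equipment': '1234',
        'start': '2009-01-01',
        'end': '2009-01-04',
    })

    # POST the request and save the returned Excel file locally.
    response = opener.open('http://example.com/report', params)
    f = open('log_1234.xls', 'wb')
    f.write(response.read())
    f.close()

Loop that over your ~800 equipment numbers and date windows and you never have to touch the GUI at all.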

Then you can use xlrd to extract the data from the Excel files.
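
For the extraction step, a short xlrd sketch (assuming the file saved above, with the data on the first sheet):

    import xlrd

    book = xlrd.open_workbook('log_1234.xls')
    sheet = book.sheet_by_index(0)

    # row_values() returns one list per row, with cells already typed
    # (strings, floats, dates-as-numbers, ...).
    for row in range(sheet.nrows):
        print sheet.row_values(row)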

Also, take a look at http://pamie.sourceforge.net/. I have never tried it myself, but it looks promising and easy to use.

wuub
Thanks, PAMIE works great! The only problem I have now is getting it to work with some of the app's custom widgets :( (some kind of custom textbox in a header frame...). For now, using AutoIt for certain tricky parts seems like a viable solution. Maybe someone knows a way around this limitation.
Decapsuleur
A: 

Try Automation Anywhere. You can automate keystrokes, mouse clicks, positions, etc., and it is a great tool for extracting data from the web. It can also save the results to Excel at the end. I personally use it on IE 7 and it works fine; I am not too sure about IE 6, but you can still try it out. There's a demo of its web data extraction, and a free trial is available too.

Lewis
A: 

Since you can use .NET, you should consider using the Windows Forms WebBrowser control. You can automate it to navigate to the site, press buttons, etc. Once the report page is loaded, you can use code to navigate the HTML DOM to find the data you want - no regular expressions involved.
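
A rough sketch of the same idea from Python, driving the IE COM object that the WebBrowser control wraps (this needs pywin32; the URL and element id are made up):

    import time
    import win32com.client

    # Start an IE instance over COM and load the report page.
    ie = win32com.client.Dispatch('InternetExplorer.Application')
    ie.Visible = True
    ie.Navigate('http://example.com/report')

    # Wait until the page has finished loading (READYSTATE_COMPLETE == 4).
    while ie.Busy or ie.ReadyState != 4:
        time.sleep(0.5)

    # Walk the HTML DOM instead of parsing raw text.
    doc = ie.Document
    field = doc.getElementById('equipmentNumber')  # hypothetical element id
    field.value = '1234'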

I did something like this years ago, to extract auction data from eBay.

John Saunders