views:

407

answers:

4

I'm trying to log in to a website and save an HTML page automatically (I want to be able to do this on a regular time interval). From the surface, this is a typical modern website where, if the user navigates directly to a "locked" URL, a log-in form pops up, and after logging in, the user is redirected to the intended page.

I gave mechanize a shot (http://wwwsearch.sourceforge.net/mechanize/) but it wasn't finding some form elements which were needed for login (hidden elements that have some values put in by a javascript function that runs when the user clicks the "log in" button).

I played a bit with the "web browser" control in .NET but quickly lost interest because I couldn't even get it to submit a query on the Google page.

I don't care what the language is; I'll learn it to solve this problem. At a minimum it has to work in Windows.

A simple example, say, typing in a query into the Google search box would be a great bonus.

A: 

Neoload can handle the form filling with authentication, assuming you don't want to collect data, just perform actions. It's a web stress tool, so it's not really meant to be used as a time-based service, but you COULD just leave it running.

Stefan Kendall
If in "collect data" you are including "saving a resulting HTML page" then this doesn't do what I want.
pelesl
+1  A: 

Its being already discussed here.

Basically its gist is you can use selenium, an open source web automation tool, which has api library available in various languages like java, ruby, etc.

Selva
+1  A: 

If I understand you right, you want to log in to only one webpage, and that form always stays the same. You could either reverse engineer the java script, or debug it via a javascript debugger in the browser (e.g. firebug for firefox). Or you can fill in the form in your browser and look at the http request via a network packet sniffer. Once you have all required form data to submit, you can do the same with your program (thats what I did the last time I had a pretty similar task to do). dont forget to store all cookie data you requested back from the webserver and send it with the next request, to 'stay logged in'.

Philip Daubmeier
Sounds like a viable solution, but packet sniffing is well beyond my abilities.
pelesl
+1  A: 

In my experience, the most reliable way is to use javascript. It works well in .Net. To test, browse to the following addresses one after another in Firefox or Internet Explorer:

http://www.google.com
javascript:function f(){document.forms[0]['q'].value='stackoverflow';}f();
javascript:document.forms[0].submit()

That performs a search for "stackoverflow" on Google. To do it in VB .Net using the webbrowser control, do this:

WebBrowser1.Navigate("http://www.google.com")
Do While WebBrowser1.IsBusy OrElse WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
    Threading.Thread.Sleep(1000)
    Application.DoEvents()
Loop
WebBrowser1.Navigate("javascript:function%20f(){document.forms[0]['q'].value='stackoverflow';}f();")
Threading.Thread.Sleep(2000) 'wait for javascript to run
WebBrowser1.Navigate("javascript:document.forms[0].submit()")
Threading.Thread.Sleep(2000) 'wait for javascript to run

Notice how the space in the URL is converted to %20. I'm not certain if this is necessary but it can't hurt. It is important that the first javascript be in a function. The calls to Sleep() are to wait for Google to load and also for the javascript stuff. The Do While Loop might run forever if the page fails to load so for automation purposes have a counter that will timeout after, say, 60 seconds.

Of course, for Google you can just navigate directly to www.google.com?q=stackoverflow but if your site has hidden input fields, etc, then this is the way to go. Only works for HTML sites - flash is a whole other matter.

Eyal
This looks promising. Thanks.
pelesl
As written, this doesn't work---probably because the WebBrowser control runs off the same thread. But, if the Navigate calls are separated, say, into three button click events, it's easy to test. Thanks again.
pelesl
You're probably right. I'm using a different thread in my project.
Eyal