I need to automate a process involving a website that is using a login form. I need to capture some data in the pages following the login page.

I know how to screen-scrape normal pages, but not those behind a secure site.

  1. Can this be done with the .NET WebClient class?
  2. How would I log in automatically?
  3. How would I stay logged in for the other pages?
+1  A: 

You can easily simulate user input. Your program can submit the form on the web page by sending a POST or GET request to the website.
A typical login form looks like:

<form name="loginForm" method="post" Action="target_page.html">
   <input type="Text" name="Username">
   <input type="Password" name="Password">
</form>

You can send a POST request to the website, providing values for the Username and Password fields. What happens after you send the request depends largely on the website; usually you will be redirected to some page. Your authorization info will be stored in the session or a cookie, so if your scraping client can maintain a web session (that is, it understands cookies), you will be able to access the protected pages.
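
As a rough illustration of that flow, here is a minimal sketch in Python (not .NET, chosen only because its standard library keeps the example short). It assumes the form above, so the field names Username and Password come from that form; the function names and any URLs you pass in are hypothetical. A real site may also expect hidden fields or specific headers, which this sketch ignores.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session():
    """Return an opener that stores cookies and sends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request a browser would send for the form above."""
    # Keys must match the "name" attributes of the form's inputs (assumed here).
    form = {"Username": username, "Password": password}
    body = urllib.parse.urlencode(form).encode("ascii")
    # Supplying a body makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=body)


def fetch_protected_page(login_url, page_url, username, password):
    """Log in, then fetch a page using the cookies the login response set."""
    opener, _jar = build_session()
    with opener.open(build_login_request(login_url, username, password)) as resp:
        resp.read()  # the login response itself is not needed here
    # The same opener replays the session cookie on this second request.
    with opener.open(page_url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The important part is that the same opener (and so the same cookie jar) is reused for the login and for every later page. The equivalent in another framework is whatever object holds the cookies between requests.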

It's not clear from your question what language or framework you're going to use. For example, there is a Perl framework for screen scraping (including login functionality): WWW::Mechanize.

Note that you may run into problems if the site you're trying to log in to relies on JavaScript or uses some kind of CAPTCHA.

aku
A: 

Can you please clarify? Is the WebClient class you speak of the one in HTTPUnit/Java?

If so, your session should be saved automatically.

Nick Stinemates
+4  A: 
Hafthor
A: 

It isn't clear from your question which WebClient class (or language) you are referring to.

If you have a Java runtime, you can use the Apache HttpClient class; here's an example I wrote in Groovy that accesses the delicious API over SSL:

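   // Apache Commons HttpClient 3.x classes
   import org.apache.commons.httpclient.HttpClient
   import org.apache.commons.httpclient.UsernamePasswordCredentials
   import org.apache.commons.httpclient.auth.AuthScope
   import org.apache.commons.httpclient.methods.PostMethod

   def client = new HttpClient()

   // HTTP authentication credentials, scoped to the API host on the SSL port
   def credentials = new UsernamePasswordCredentials( "username", "password" )
   def authScope = new AuthScope("api.del.icio.us", 443, AuthScope.ANY_REALM)
   client.getState().setCredentials( authScope, credentials )

   def url = "https://api.del.icio.us/v1/posts/get"
   def tag = "groovy"  // placeholder value; the original snippet left tag undefined

   def method = new PostMethod( url )
   method.addParameter( "tag", tag )
   client.executeMethod( method )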
Andrew Whitehouse