I need to download a bunch of HTML pages programmatically, but they are behind a login. So what I need... I think... is to do the following.

  1. Use an HTTP POST to upload some form data including the username/password.
  2. Capture the session somehow. Cookies?
  3. Send a series of HTTP GETs to download the pages I need.

#3 is easy; I do it all the time. I don't have a clue how to do #1 and #2.

P.S. I will also gladly accept "Hey dummy, just use program blah to crawl the site."

+2  A: 

You need to use a CookieContainer. Set it on an HttpWebRequest and it will collect any cookies received in the response. If you then set the same CookieContainer instance on subsequent requests, those cookies are sent back to the server.
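A rough sketch of that flow, in case it helps. The URLs, the form field names (`username`, `password`), the credentials, and the `FormEncode` helper are all made up for illustration; the real names have to come from the site's actual login form.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class LoginDownloader
{
    // Hypothetical helper: builds an application/x-www-form-urlencoded body.
    public static string FormEncode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        return string.Join("&", parts);
    }

    public static void Main()
    {
        // One container shared by every request; this is what carries the session.
        var cookies = new CookieContainer();

        // Step 1: POST the login form (assumed URL and field names).
        var login = (HttpWebRequest)WebRequest.Create("https://example.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes(FormEncode(new[]
        {
            new KeyValuePair<string, string>("username", "me"),
            new KeyValuePair<string, string>("password", "secret"),
        }));
        login.ContentLength = body.Length;
        using (Stream stream = login.GetRequestStream())
            stream.Write(body, 0, body.Length);

        // Step 2: reading the response stores any cookies it sets in the container.
        using (var response = (HttpWebResponse)login.GetResponse())
            Console.WriteLine("Login status: " + response.StatusCode);

        // Step 3: GET a protected page, sending the stored cookies back.
        var page = (HttpWebRequest)WebRequest.Create("https://example.com/members/page1.html");
        page.CookieContainer = cookies;
        using (var response = (HttpWebResponse)page.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            File.WriteAllText("page1.html", reader.ReadToEnd());
    }
}
```

A successful status code on the login POST does not by itself prove the login worked, since many sites return a normal page containing an error message, so it is worth checking the downloaded HTML.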

You can also use WebClient, which is much simpler than HttpWebRequest, but in order to set a CookieContainer you'll need to derive from WebClient and override the protected GetWebRequest method.
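Something along these lines should do for the WebClient route. The class name `CookieAwareWebClient`, the URLs, and the form field names are placeholders of mine, not anything the site or the framework defines.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;

// A WebClient that attaches one shared CookieContainer to every request it makes.
public class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        // Only HTTP(S) requests carry cookies; leave other schemes untouched.
        var http = request as HttpWebRequest;
        if (http != null)
            http.CookieContainer = Cookies;
        return request;
    }
}

public static class Program
{
    public static void Main()
    {
        using (var client = new CookieAwareWebClient())
        {
            // UploadValues sends the collection as a form-encoded POST.
            var form = new NameValueCollection
            {
                { "username", "me" },      // assumed field names and values
                { "password", "secret" },
            };
            client.UploadValues("https://example.com/login", form);

            // Later requests on the same client reuse the session cookies.
            client.DownloadFile("https://example.com/members/page1.html", "page1.html");
        }
    }
}
```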

As for posting data such as form fields, I suggest doing it in a browser while running Fiddler and seeing what the browser is posting. Then you'll know what to include in your POST data.

Josh Einstein
Unfortunately it didn't work; the site did other stuff to block me. I paid for the bloody eBook, I just want to be able to download and print it. Oh well, what you taught me is still going to be useful in the future.
Jonathan Allen