Would it be possible to write a screen scraper for a website protected by a form login? I have access to the site, of course, but I have no idea how to log in to the site and save my credentials in C#.

Also, any good examples of screenscrapers in C# would be hugely appreciated.

Has this already been done?

+3  A: 

Sure, this has been done. I have done it a couple of times. This is (generically) called screen scraping or web scraping.

You should take a look at this question, and also browse the questions under the tag "screen-scraping". Note that scraping is not only about extracting data from a web resource. It also involves submitting data to online forms, so as to mimic the actions of a user submitting input such as a login form.
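As a minimal sketch of the "submitting data to online forms" part: a login form is usually sent as an `application/x-www-form-urlencoded` body, which is just URL-escaped `name=value` pairs joined by `&`. The class name `FormEncoder` and the field names `username` and `password` are hypothetical; the real field names come from the `name` attributes of the inputs in the target site's login form.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormEncoder
{
    // Builds a form-urlencoded body from field name/value pairs.
    // Uri.EscapeDataString encodes a space as %20 rather than '+'.
    public static string BuildPostData(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

For example, the pairs `username=alice` and `password=p@ss word` produce the string `username=alice&password=p%40ss%20word`, which can then be sent as the body of the login POST.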

Cerebrus
+3  A: 

It's pretty simple. You need your own login (HttpPost) method.

You can come up with something like this (this way you get all the needed cookies after login, and you just need to pass them to the next HttpWebRequest):

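using System;
using System.IO;
using System.Net;
using System.Text;

public static HttpWebResponse HttpPost(String url, String referer, String userAgent, ref CookieCollection cookies, String postData, out WebHeaderCollection headers, WebProxy proxy)
{
    try
    {
        HttpWebRequest http = (HttpWebRequest)WebRequest.Create(url);
        http.Proxy = proxy;
        http.AllowAutoRedirect = true;
        http.Method = "POST";
        http.ContentType = "application/x-www-form-urlencoded";
        http.UserAgent = userAgent;
        // Send the cookies we already have along with the request.
        http.CookieContainer = new CookieContainer();
        http.CookieContainer.Add(cookies);
        http.Referer = referer;

        // Write the form-encoded body.
        byte[] dataBytes = Encoding.UTF8.GetBytes(postData);
        http.ContentLength = dataBytes.Length;
        using (Stream postStream = http.GetRequestStream())
        {
            postStream.Write(dataBytes, 0, dataBytes.Length);
        }

        HttpWebResponse httpResponse = (HttpWebResponse)http.GetResponse();
        // Note: these are the request headers as sent; use httpResponse.Headers
        // if you want the response headers instead.
        headers = http.Headers;
        cookies.Add(httpResponse.Cookies);
        // With AllowAutoRedirect, cookies set on intermediate redirect responses
        // end up in the container rather than in httpResponse.Cookies.
        cookies.Add(http.CookieContainer.GetCookies(httpResponse.ResponseUri));

        // The caller is responsible for disposing the returned response.
        return httpResponse;
    }
    catch (WebException)
    {
        // The request failed (network error or HTTP error status);
        // fall through and signal failure by returning null.
    }
    headers = null;

    return null;
}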
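To illustrate "pass them to the next HttpWebRequest", here is a minimal sketch of a follow-up GET that reuses the cookies collected by HttpPost. The class and method names (`ScraperSketch`, `CreateGet`, `ReadBody`) are made up for this example, and the variables in the usage comment are placeholders for your own URLs and form data.

```csharp
using System;
using System.IO;
using System.Net;

public static class ScraperSketch
{
    // Builds a GET request that carries the cookies collected at login.
    public static HttpWebRequest CreateGet(string url, CookieCollection cookies)
    {
        HttpWebRequest http = (HttpWebRequest)WebRequest.Create(url);
        http.Method = "GET";
        http.CookieContainer = new CookieContainer();
        http.CookieContainer.Add(cookies);
        return http;
    }

    // Sends the request and returns the response body as text.
    // Disposing the response releases the underlying connection.
    public static string ReadBody(HttpWebRequest http)
    {
        using (HttpWebResponse response = (HttpWebResponse)http.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}

// Usage (loginUrl, referer, userAgent, postData, membersUrl are placeholders):
//   CookieCollection cookies = new CookieCollection();
//   WebHeaderCollection headers;
//   using (HttpPost(loginUrl, referer, userAgent, ref cookies, postData, out headers, null)) { }
//   string html = ScraperSketch.ReadBody(ScraperSketch.CreateGet(membersUrl, cookies));
```

Wrapping the login call in `using` disposes the login response even though its body is not needed, so the connection is released before the next request.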
Lukas Šalkauskas
Tip: Even if you do not need the response from a POST, it is important to consume and close it so the data transfer is flushed and the connection closes cleanly, e.g., using (http.GetResponse()) { }
Kurt