I've managed to scrape the secure page of a site served over HTTP with the code below:

    string cookiedata = "fsfsfsdfsfsfsfsfsdf";
    NetworkCredential credential = new NetworkCredential("xxx", "xxx");

    HttpWebRequest request = HttpWebRequest.Create("https://ysats.com") as HttpWebRequest;

    //set the user agent so it looks like IE to not raise suspicion 
    request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
    request.Method = "POST";
    //set the cookie in the request header
    request.Headers.Add("Cookie", cookiedata);
    request.Credentials = credential;

    //get the response from the server; the using blocks close the response and stream
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        string pagedata = reader.ReadToEnd();
        //now we can scrape the contents of the secure page as needed,
        //since the page contents are now stored in our pagedata string
        Response.Write(pagedata);
    }

But when I try to scrape a site served over https:// with this code, I always get the login page instead of the secure page I requested.

Please advise what I should do to scrape a secure page of a site on HTTPS.

A: 

You need to send a POST request containing the website's login details, then request the page you want once the login has succeeded. You also have to make sure your WebClient keeps the cookies from the login response and sends them with later requests.

This will inevitably vary from site to site (what the fields are called, what information is required, etc.), so you won't be able to develop a blanket solution. You'd also have to check whether the login failed, or you'd end up scraping the login page again.
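As a rough sketch of the steps above, something like the following might work. It assumes the site uses a plain HTML form login; the class and method names (`SecureScraper`, `BuildFormBody`, `LooksLikeLoginPage`, `FetchAfterLogin`) are made up for illustration, and the password-field check is only a heuristic, not a reliable test for every site. The shared `CookieContainer` is what lets the second request reuse whatever cookies the server set during login.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class SecureScraper
{
    // Builds a URL-encoded form body from alternating name/value strings.
    // The field names are site-specific; read them from the login form's HTML.
    public static string BuildFormBody(params string[] nameValuePairs)
    {
        if (nameValuePairs.Length % 2 != 0)
            throw new ArgumentException("Expected name/value pairs.");

        var sb = new StringBuilder();
        for (int i = 0; i < nameValuePairs.Length; i += 2)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(nameValuePairs[i]))
              .Append('=')
              .Append(Uri.EscapeDataString(nameValuePairs[i + 1]));
        }
        return sb.ToString();
    }

    // Heuristic (an assumption): a page that still contains a password input
    // is probably the login page, meaning the login did not succeed.
    public static bool LooksLikeLoginPage(string html)
    {
        return html.IndexOf("type=\"password\"", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Posts the login form, then fetches the secure page with the same cookies.
    public static string FetchAfterLogin(string loginUrl, string securePageUrl, string formBody)
    {
        // One container shared by both requests, so cookies set by the
        // login response are sent automatically with the second request.
        var cookies = new CookieContainer();

        var login = (HttpWebRequest)WebRequest.Create(loginUrl);
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes(formBody);
        login.ContentLength = body.Length;
        using (Stream requestStream = login.GetRequestStream())
        {
            requestStream.Write(body, 0, body.Length);
        }
        // Reading the response is what stores the server's cookies in the container.
        using (var loginResponse = (HttpWebResponse)login.GetResponse()) { }

        var page = (HttpWebRequest)WebRequest.Create(securePageUrl);
        page.CookieContainer = cookies;
        using (var pageResponse = (HttpWebResponse)page.GetResponse())
        using (var reader = new StreamReader(pageResponse.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            if (LooksLikeLoginPage(html))
                throw new InvalidOperationException("Still on the login page; the login probably failed.");
            return html;
        }
    }
}
```

You would call it along the lines of `SecureScraper.FetchAfterLogin("https://example.com/login", "https://example.com/account", SecureScraper.BuildFormBody("username", "xxx", "password", "xxx"))`, where both URLs and both field names are placeholders you would replace with the real ones for your site. Some sites also require hidden form fields (for example an anti-forgery token), which this sketch does not handle.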

See also this duplicate question.

Andy Shellam
I've used the HttpWebResponse class and the code provided at the link below. http://ryanfarley.com/blog/archive/2008/08/25/scraping-or-programatically-accessing-a-secure-webpage.aspx With it I've accessed the secure page of a site on HTTP, but I am not able to access the secure page of a site on HTTPS. The code and links provided in the other discussion do not give enough information to solve my problem.
Ajit
OK, not a bad way of doing it actually. I'd hesitate to say for certain, but it may be something to do with the cookie not being marked as secure (when cookies are created in newer versions of PHP, you can mark them as HTTP-only or as HTTPS-only). I wonder if that's happening with your cookie.
Andy Shellam
Could you please tell me how I can use the System.Net.WebClient class to access the DOM of a secure page on an HTTPS website?
Ajit
I can't help beyond what's in my answer: try submitting a POST request containing the login fields and let the WebClient and the site negotiate the cookie, instead of forcing one on it to try and trick it. You might also get more help if you edit your question and provide a) the exact code you're using, and b) the website you're trying to access. Sorry I cannot be of more help. (Remember not to include the actual cookie value you're using!)
Andy Shellam