Hi everyone,

For the past few days I have been trying to scrape a website but so far with no luck.

The situation is as follows: the website I am trying to scrape requires data from a previously submitted form. I have identified the variables that the web app requires and inspected the HTTP headers that the original web app sends.

Since I have pretty much zero knowledge of ASP.NET, I thought I'd ask whether I am missing something here.

I have tried different methods (cURL, file_get_contents and the Snoopy class); here's my code for the cURL method:

<?php
$url = 'http://www.urltowebsite.com/Default.aspx';
$fields = array('__VIEWSTATE' => 'averylongvar',
                '__EVENTVALIDATION' => 'anotherverylongvar',
                'A few' => 'other variables');

$fields_string = http_build_query($fields);

$curl = curl_init($url);

curl_setopt_array
(
    $curl,
    array
    (
        CURLOPT_RETURNTRANSFER  =>    true,
        CURLOPT_SSL_VERIFYPEER  =>    0,  //    Skips TLS certificate checks;
        CURLOPT_SSL_VERIFYHOST  =>    0,  //    unused for this http:// URL.
        CURLOPT_HTTPHEADER      =>
            array
            (
                'Content-type: application/x-www-form-urlencoded; charset=utf-8',
                'Set-Cookie: ASP.NET_SessionId='.uniqid().'; path: /; HttpOnly'
            ),
        CURLOPT_POST            =>    true,
        CURLOPT_POSTFIELDS      =>    $fields_string,
        CURLOPT_FOLLOWLOCATION => 1
    )
);

$response = curl_exec($curl);
curl_close($curl);

echo $response;
?>

The original request in the browser shows the following headers:

Request Headers

  • Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
  • Content-Type:application/x-www-form-urlencoded
  • User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5

Form Data

  • A lot of form fields

Response Headers

  • Cache-Control:private
  • Content-Length:30168
  • Content-Type:text/html; charset=utf-8
  • Date:Thu, 09 Sep 2010 17:22:29 GMT
  • Server:Microsoft-IIS/6.0
  • X-Aspnet-Version:2.0.50727
  • X-Powered-By:ASP.NET

When I inspect the headers sent by the cURL script I wrote, they somehow do not include the form data, and the request method is not set to POST either. This seems to be where things go wrong, but I'm not sure.

Any help is appreciated!!!

EDIT: I forgot to mention that the result of the scraping is a custom session expired page of the remote website.

A: 

Since VIEWSTATE contains the state of the page at a particular moment (all of this state is encoded into one big, apparently messy string), you cannot be sure that the value you scraped is still valid for your "mock" request (I'm quite sure it is not ;) ).

If you really have to deal with the VIEWSTATE and EVENTVALIDATION params, my advice is to take another approach: scrape the content via Selenium or an HtmlUnit-like library (unfortunately I don't know whether something similar exists for PHP).
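
If you would rather stay with plain cURL, here is a minimal sketch of the usual alternative, not a tested solution for this particular site: GET the form page first, let cURL keep whatever session cookie the server issues, read the current hidden fields out of the returned HTML, and POST them back on the same handle. The function names extractHiddenFields and submitAspNetForm are made up for illustration, and the sketch assumes PHP 7+ with the curl and dom extensions. Note that Set-Cookie is a header the server sends; a client sends Cookie, and with the cookie engine enabled cURL does that for you.

```php
<?php
// Hypothetical helper: collect every <input type="hidden"> name/value pair.
// On a WebForms page this includes __VIEWSTATE and __EVENTVALIDATION.
function extractHiddenFields(string $html): array
{
    if ($html === '') {
        return array();
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Hypothetical helper: GET the form, then POST it back in the same session.
// $formFields holds the visible fields you want to submit.
function submitAspNetForm(string $url, array $formFields)
{
    $curl = curl_init($url);
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // empty string: enable the in-memory cookie engine
    ));

    // Step 1: fetch the form so the server can issue a session cookie
    // and a fresh set of hidden fields.
    $formPage = curl_exec($curl);
    if ($formPage === false || $formPage === '') {
        curl_close($curl);
        return false;
    }

    // Step 2: post the fresh hidden fields plus your own values on the same
    // handle, so the cookie from step 1 is sent back automatically.
    $postFields = array_merge(extractHiddenFields($formPage), $formFields);
    curl_setopt_array($curl, array(
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($postFields),
    ));
    $response = curl_exec($curl);
    curl_close($curl);

    return $response;
}
```

You would call it as submitAspNetForm('http://www.urltowebsite.com/Default.aspx', array('A few' => 'other variables')). If the site builds its form with JavaScript, this will not be enough and a browser-driving tool is still the safer bet.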

mamoo
Mamoo, thanks for replying. In my post I forgot to mention that the result of the scraping is a custom session expired page of the remote website. As for the VIEWSTATE and EVENTVALIDATION values, I refreshed the page countless times and these variables do not seem to change, hence I used those same values in my POST vars. In fact, when I changed a single character in either of these two variables, the website returned an error.
dandoen
Clearer now. In that case ASP.NET makes little difference. Probably there's still something missing in your headers or params...
mamoo
Thanks mamoo, it is indeed very strange. I have tried pretty much everything: I recreated a separate HTML form that submits to the original URL, and this went fine. I also added the same cookie header, but that did not result in success. The only thing I am unsure of is that I don't see any Form Data in the headers at all. I don't know if this is normal when using cURL.
dandoen
A: 

I guess the website requires rendering in a browser. Did you try using a tool like iMacros or Watir?

SamMeiers