Hi everyone,
For the past few days I have been trying to scrape a website but so far with no luck.
The situation is as follows: the website I am trying to scrape requires data from a previously submitted form. I have identified the variables that the web app requires and have looked at the HTTP headers that the original web app sends.
Since I have pretty much zero knowledge of ASP.NET, I thought I'd just ask whether I am missing something here.
I have tried different methods (cURL, file_get_contents and the Snoopy class); here is my code for the cURL method:
<?php
$url = 'http://www.urltowebsite.com/Default.aspx';
$fields = array(
    '__VIEWSTATE' => 'averylongvar',
    '__EVENTVALIDATION' => 'anotherverylongvar',
    'A few' => 'other variables'
);
$fields_string = http_build_query($fields);
$curl = curl_init($url);
curl_setopt_array
(
    $curl,
    array
    (
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => 0, // Not supported in PHP
        CURLOPT_SSL_VERIFYHOST => 0, // at this time.
        CURLOPT_HTTPHEADER =>
            array
            (
                'Content-type: application/x-www-form-urlencoded; charset=utf-8',
                'Set-Cookie: ASP.NET_SessionId='.uniqid().'; path: /; HttpOnly'
            ),
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $fields_string,
        CURLOPT_FOLLOWLOCATION => 1
    )
);
$response = curl_exec($curl);
curl_close($curl);
echo $response;
?>
This is the request the browser sends when the original form is submitted:
- Request URL: http://www.urltowebsite.com/default.aspx
- Request Method:POST
- Status Code: 200 OK
Request Headers
- Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
- Content-Type:application/x-www-form-urlencoded
- User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5
Form Data
- A lot of form fields
Response Headers
- Cache-Control:private
- Content-Length:30168
- Content-Type:text/html; charset=utf-8
- Date:Thu, 09 Sep 2010 17:22:29 GMT
- Server:Microsoft-IIS/6.0
- X-Aspnet-Version:2.0.50727
- X-Powered-By:ASP.NET
When I inspect the headers sent by the cURL script I wrote, the request somehow contains no form data, and the request method is not set to POST either. This seems to me to be where things go wrong, but I'm not sure.
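To check what cURL actually puts on the wire, the outgoing request headers can be recorded with CURLINFO_HEADER_OUT and read back with curl_getinfo(). The following is only a minimal sketch: the helper names buildDebugOptions and sendAndInspect are made up for illustration, and CURLOPT_FOLLOWLOCATION is deliberately left off, on the assumption that a redirect (which cURL may follow with a GET) could otherwise hide the original POST.

```php
<?php
// Hypothetical helper: cURL options for a form POST that also keeps
// a copy of the request headers that were sent.
function buildDebugOptions(array $fields): array
{
    return array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLINFO_HEADER_OUT    => true, // record the outgoing request headers
    );
}

// Hypothetical helper: sends the request and returns
// array(request headers as sent, response body).
function sendAndInspect(string $url, array $fields): array
{
    $curl = curl_init($url);
    curl_setopt_array($curl, buildDebugOptions($fields));
    $body = curl_exec($curl);
    // Only populated because CURLINFO_HEADER_OUT was enabled above.
    $sent = curl_getinfo($curl, CURLINFO_HEADER_OUT);
    return array($sent, $body);
}
```

Calling sendAndInspect($url, $fields) and printing the first element should show the request line, which makes it easy to see whether the method is POST and whether a Content-Length for the form body is present.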
Any help is appreciated!!!
EDIT: I forgot to mention that the result of the scraping is a custom session expired page of the remote website.
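One possible explanation for the session-expired page, offered as a guess rather than a diagnosis: Set-Cookie is a response header, so the server would normally issue the ASP.NET_SessionId itself, and a hard-coded __VIEWSTATE may no longer match any live session. A minimal sketch of the usual alternative follows: first GET the form with a cookie jar, read the hidden fields from the returned HTML, then POST them back on the same handle. The helper names extractHiddenFields and fetchThenPost are hypothetical, and whether this particular site accepts such a request is an assumption.

```php
<?php
// Hypothetical helper: collects every <input type="hidden"> name/value pair
// (for example __VIEWSTATE and __EVENTVALIDATION) from a page's HTML.
function extractHiddenFields(string $html): array
{
    $fields = array();
    if ($html === '') {
        return $fields;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Hypothetical helper: GETs the form, then POSTs it back in the same session.
// $extraFields holds the visible form fields to submit.
function fetchThenPost(string $url, array $extraFields): string
{
    $cookieJar = tempnam(sys_get_temp_dir(), 'cookies');
    $curl = curl_init($url);
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => $cookieJar, // save cookies the server sets
        CURLOPT_COOKIEFILE     => $cookieJar, // enable cookie handling on this handle
    ));
    $formHtml = (string) curl_exec($curl);

    // Same handle, so cookies received on the GET are sent with the POST.
    $post = array_merge(extractHiddenFields($formHtml), $extraFields);
    curl_setopt_array($curl, array(
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($post),
    ));
    return (string) curl_exec($curl);
}
```

With this approach the manual Set-Cookie line in CURLOPT_HTTPHEADER would not be needed, since cURL sends back whatever session cookie the server handed out on the first request.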