Hi, I'm trying to scrape a website, but the page I tried to scrape contains a redirect to another page. I set the FOLLOWLOCATION option on curl, but I end up at a URL like http://localhost/....pageredirected.php and so on.

The problem is that the redirect works, but the domain is wrong: it is mine, not the domain of the scraped page. Here is the code:

<?php
// create a new CURL resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://voli.govolo.it/etape1.cfm?ref=2008052701&destination=484&Provenance=320&Date_Depart=11/9/2010&Date_Retour=18/9/2010&AllerRetour=1&Adultes=1&ENFANTS=0&BEBES=0&dated=110910&dater=180910&TypeClasse=0&langue=it");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    

// grab URL and pass it to the browser
$esito = curl_exec($ch);
print_r(curl_getinfo($ch));
echo $esito;
// close CURL resource, and free up system resources
curl_close($ch);
?>

The page should redirect from etape1.cfm to etape2.cfm, but I get a 404 error because I end up at http://localhost/scraping/etape2.cfm?... and not http://voli.govolo.it/etape2.cfm?...

Why doesn't FOLLOWLOCATION follow the right domain (http://voli.govolo.it)?

Thanks

A: 

The problem isn't curl. Part of what that first URL sends back is this:

<script language="JavaScript" type="text/javascript">
<!--

    function historyDeleteAndRedirect()
    {

        window.location.replace('etape2.cfm?ref=2008052701&destination=484&Provenance=320&Date_Depart=11/9/2010&Date_Retour=18/9/2010&AllerRetour=1&Adultes=1&ENFANTS=0&BEBES=0&dated=110910&dater=180910&TypeClasse=0&langue=it');


    //alert(window.location.href);
    //alert(document.referrer);
    }

//-->
</script>

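Since you're not accessing the site in a normal manner, this JavaScript breaks, as your browser is really hitting "localhost" rather than "WhateverSiteThisIs.com". Remember, curl works on the server, and it doesn't execute JavaScript: FOLLOWLOCATION only follows HTTP Location headers. Your script fetches the HTML and echoes it to the browser, so as far as the browser is concerned you're hitting "http://localhost/etape1.cfm?......". Since the URL passed to .replace() is relative rather than absolute, your browser is doing the correct thing and re-using localhost.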
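
One possible workaround, as a minimal sketch rather than a definitive fix: do the JavaScript redirect yourself on the server. The helpers below (extractJsRedirect, resolveUrl and fetchPage are hypothetical names, not part of curl) pull the target out of window.location.replace(...), resolve it against the URL it came from, and fetch it with a second curl request. It assumes the redirect always looks like the snippet above and that the site needs no cookies or session between the two requests, neither of which has been verified.

```php
<?php
// Hypothetical helper: extract the target of window.location.replace('...').
// Returns null when the page contains no such call.
function extractJsRedirect(string $html): ?string
{
    $pattern = '/window\.location\.replace\(\s*[\'"]([^\'"]+)[\'"]\s*\)/';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

// Hypothetical helper: resolve a redirect target against the page it came from.
// Simplified: handles absolute URLs, root-relative paths and same-directory
// paths only (no "../" segments).
function resolveUrl(string $base, string $relative): string
{
    if (preg_match('#^https?://#i', $relative)) {
        return $relative; // already absolute
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strncmp($relative, '/', 1) === 0) {
        return $origin . $relative; // root-relative path
    }
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1); // directory of the base page
    return $origin . $dir . $relative;
}

// Hypothetical helper: fetch a page and return its body as a string.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // still follow real HTTP redirects
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? '' : $body;
}
```

Usage would look roughly like this: call $html = fetchPage($start), then $target = extractJsRedirect($html), and if $target is not null, echo fetchPage(resolveUrl($start, $target)). Relative links, images and forms in the page you finally echo will still resolve against localhost in the browser, so they need the same treatment if you display the page rather than parse it.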

Marc B