How can I send headers to a website as if PHP/Apache were a browser? I'm trying to scrape a site, but it looks like it returns a 404 error if the request comes from another server.

Alternatively, do you know any other good ways to scrape content from a site?

Also, here is my current code:

<?php
    $curl_handle=curl_init();
    curl_setopt($curl_handle,CURLOPT_URL,$_GET['url']);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)");
    curl_setopt($curl_handle, CURLOPT_REFERER, "http://google.com");
    curl_setopt($curl_handle,CURLOPT_CONNECTTIMEOUT,2);
    curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1);
    $buffer = curl_exec($curl_handle);
    curl_close($curl_handle);
    echo $buffer;
?>

So, I'll be making an AJAX request like:

/spider.php?url=http://target.com

This returns an empty string. I know it is set up correctly, though, because if I replace target.com with twitter.com it works. What am I missing to make this look like a full browser?

+2  A: 

If you're using cURL, you can use the CURLOPT_HTTPHEADER option, which takes an array of headers you wish to send with the request.

If you're using file_get_contents(), you can pass it a stream context created with stream_context_create().
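
Both approaches can be sketched roughly as follows. This is only a minimal illustration: the function names (`browser_like_headers`, `fetch_with_curl`, `browser_like_context`, `fetch_with_stream`) are hypothetical, and the header values are examples rather than what any particular site requires.

```php
<?php
// Hypothetical helper: header lines a browser might plausibly send.
// The values are illustrative assumptions, not requirements of any site.
function browser_like_headers()
{
    return [
        'User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
        'Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
    ];
}

// Variant 1: cURL, passing the headers as an array via CURLOPT_HTTPHEADER.
function fetch_with_curl($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, browser_like_headers());
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Returns the body as a string, or false on failure.
    return curl_exec($ch);
}

// Variant 2: build a stream context carrying the same headers.
function browser_like_context()
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            // Header lines are joined with CRLF into a single string.
            'header' => implode("\r\n", browser_like_headers()),
        ],
    ]);
}

// Pass the context as the third argument of file_get_contents().
function fetch_with_stream($url)
{
    return file_get_contents($url, false, browser_like_context());
}
```

Either `fetch_with_curl($url)` or `fetch_with_stream($url)` could then stand in for the fetch in the question's script.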

Daniel Egeberg
Do you know how to change the browser with that?
Oscar Godson
That would be the `User-Agent` header. The `User-Agent` header my browser sends is `Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.8pre) Gecko/20100718 Ubuntu/10.04 (lucid) Namoroka/3.6.8pre` for instance.
Daniel Egeberg
Thanks, I added the code from Daniel as well, but it's still returning an empty string for target.com, while twitter.com works. Any idea why?
Oscar Godson
I think this is a case where looking at the low-level response may help you understand what is going wrong :-D
gnucom
+3  A: 

For cURL, there is the CURLOPT_USERAGENT option for that:

curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)");

However, the site may also check the Referer header, which you can set via:

curl_setopt($ch, CURLOPT_REFERER, "http://<somesite>");
Daniel Kluev
Check my updated post... target.com doesn't work, returns an empty string, but twitter works. Any ideas?
Oscar Godson
Regarding your code:
1. You should try increasing the connect timeout to at least 10 seconds.
2. You should capture the response headers too. Do it with curl_setopt($curl_handle, CURLOPT_HEADER, true);
3. Before closing your handle, you should retrieve errors from it with curl_error($curl_handle); it will give you further hints about what exactly went wrong.
Daniel Kluev
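
The three suggestions in the comment above (a longer timeout, capturing the response headers, and reading `curl_error()`) might be combined along these lines. This is a sketch only: the function name `fetch_with_diagnostics` and the shape of the returned array are assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetch a URL and report what happened
// instead of silently echoing an empty string.
function fetch_with_diagnostics($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);  // seconds; the question used 2
    curl_setopt($ch, CURLOPT_HEADER, true);        // prepend response headers to the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $response = curl_exec($ch);                    // string on success, false on failure

    // Collect diagnostics while the handle is still available.
    return [
        'response' => $response,
        'error'    => curl_error($ch),             // empty string if no error occurred
        'status'   => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
}
```

With this in place, a 301 response would show up both in `status` and in a `Location:` header at the top of `response`, which is the kind of hint that an empty body alone does not give.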
You rock! It was a 301 redirect to www, and if I request http://www.target.com it works. So, how do I follow all 301s until I get a 200?
Oscar Godson
@Oscar Godson: You should really read the manual (http://php.net/curl_setopt). `CURLOPT_FOLLOWLOCATION` will enable you to follow redirects.
Daniel Egeberg
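
A minimal sketch of following redirects with `CURLOPT_FOLLOWLOCATION`, assuming a cap on the number of hops is wanted. The function names (`redirect_following_options`, `fetch_following_redirects`) and the default limit of 5 are illustrative choices, not part of the original code.

```php
<?php
// Hypothetical helper: cURL options that follow redirects up to a limit.
function redirect_following_options($maxRedirects = 5)
{
    return [
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
        CURLOPT_FOLLOWLOCATION => true,           // follow Location: headers (301, 302, ...)
        CURLOPT_MAXREDIRS      => $maxRedirects,  // stop after this many hops to avoid loops
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_RETURNTRANSFER => true,
    ];
}

// Fetch a URL, following redirects, and report where the request ended up.
function fetch_following_redirects($url, $maxRedirects = 5)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, redirect_following_options($maxRedirects));

    $body = curl_exec($ch);  // body of the final response, or false on failure

    return [
        'body'      => $body,
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
}
```

Calling `fetch_following_redirects('http://target.com')` would then be expected to land on the www host and return its body, with `final_url` showing where the chain ended.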
Awesome thanks man!
Oscar Godson