I am using PHP to build a web crawler that will crawl millions of URLs. Which is better for me in terms of performance: file_get_contents() or cURL?
Thanks.
Norse's quick benchmark indicates that cURL is going to be the better option. cURL also has many more options for handling server headers, redirects, authentication, cookies and such, so it will be much more flexible if you need to expand the functionality of your code in the future.
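To give an idea of what that flexibility looks like, here is a minimal sketch of a handle configured for redirects, cookies, custom headers and optional basic authentication. The helper name makeCrawlerHandle(), the user agent string, the cookie file path and the timeout values are all made up for illustration, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from this thread): returns a configured cURL
// handle, or false if curl_init() fails. Nothing is fetched here.
function makeCrawlerHandle(string $url, string $cookieFile, ?string $userPwd = null)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,              // return the body instead of printing it
        CURLOPT_HEADER         => true,              // include response headers in the output
        CURLOPT_FOLLOWLOCATION => true,              // follow redirects...
        CURLOPT_MAXREDIRS      => 5,                 // ...but not forever (assumed limit)
        CURLOPT_COOKIEJAR      => $cookieFile,       // write cookies here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile,       // read cookies from here
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // placeholder user agent
        CURLOPT_HTTPHEADER     => ['Accept: text/html'],
        CURLOPT_CONNECTTIMEOUT => 5,                 // seconds, assumed values
        CURLOPT_TIMEOUT        => 15,
    ]);

    if ($userPwd !== null) {
        // HTTP basic authentication, credentials given as "user:password"
        curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
        curl_setopt($ch, CURLOPT_USERPWD, $userPwd);
    }

    return $ch;
}
```

Calling curl_exec() on the returned handle performs the request. The stream context used by file_get_contents() covers some of this as well, but cookie handling in particular has to be done by hand there.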
But, really, don't use PHP. Try Nutch - it probably already does everything you need.
I just did some quick benchmarking on this.
Fetching google.com using file_get_contents took (in seconds):
2.31319094
2.30374217
2.21512604
3.30553889
2.30124092
CURL took:
0.68719101
0.64675593
0.64326
0.81983113
0.63956594
This was using the benchmark class from http://davidwalsh.name/php-timer-benchmark
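To compare the two sets of timings directly, the averages can be worked out in a few lines of PHP. This is only a sketch over the five samples listed above; the variable names are illustrative, and five requests to a single host is a very small sample.

```php
<?php
// Timings copied from the lists above, in seconds.
$fgcTimes  = [2.31319094, 2.30374217, 2.21512604, 3.30553889, 2.30124092];
$curlTimes = [0.68719101, 0.64675593, 0.64326, 0.81983113, 0.63956594];

// Arithmetic mean of each series.
$fgcMean  = array_sum($fgcTimes) / count($fgcTimes);
$curlMean = array_sum($curlTimes) / count($curlTimes);

// How many times longer file_get_contents took on average.
$ratio = $fgcMean / $curlMean;

printf("file_get_contents mean: %.4f s\n", $fgcMean);
printf("cURL mean:              %.4f s\n", $curlMean);
printf("ratio:                  %.2f\n", $ratio);
```

On these samples file_get_contents averages about 2.49 seconds against about 0.69 seconds for cURL, a ratio of roughly 3.6.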
Look into the curl_multi_* functions in PHP, which can fetch several URLs in parallel.
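A rough sketch of how a batch of URLs could be fetched in parallel with those functions follows. The function name fetchAll(), the timeout and the choice to return false for failed transfers are assumptions made for this example, not anything prescribed by the extension, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: fetches all $urls concurrently and returns an array
// with the same keys, holding the response body or false on failure.
function fetchAll(array $urls, int $timeout = 10): array
{
    $multi   = curl_multi_init();
    $handles = [];

    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            // select can fail immediately on some systems; avoid a busy loop
            usleep(1000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Assume failure, then fill in the bodies of transfers that succeeded.
    $results = array_fill_keys(array_keys($handles), false);
    while (($info = curl_multi_info_read($multi)) !== false) {
        $i = array_search($info['handle'], $handles, true);
        if ($i !== false && $info['result'] === CURLE_OK) {
            $results[$i] = curl_multi_getcontent($handles[$i]);
        }
    }

    foreach ($handles as $ch) {
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

For millions of URLs you would feed this in batches (for example a few dozen URLs at a time) rather than adding every handle at once, and spread each batch across different hosts to stay polite.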
I found @Norse's benchmark extremely hard to believe, since in my experience file_get_contents() is not 3.5 times slower (on average) than cURL, so I ran my own benchmark. Here are the results:
[1] => Array // 1 request to google.com
    (
        [FGC] => 0.4955058 // 38.88% slower
        [CURL] => 0.3582108
    )
[5] => Array // 5 requests to google.com
    (
        [FGC] => 2.2415568 // 24.44% slower
        [CURL] => 1.7973249
    )
[10] => Array // 10 requests to google.com
    (
        [FGC] => 4.7877922 // 29.46% slower
        [CURL] => 3.6951289
    )
[25] => Array // 25 requests to google.com
    (
        [FGC] => 10.932404 // 10.18% slower
        [CURL] => 9.9168329
    )
[50] => Array // 50 requests to google.com
    (
        [FGC] => 22.535982 // 24.74% slower
        [CURL] => 18.068931
    )
[100] => Array // 100 requests to google.com
    (
        [FGC] => 44.685283 // 18.57% slower
        [CURL] => 37.688820
    )
I tried to implement both methods so that they behave as similarly as possible. I even used the slow error suppression operator (@) on file_get_contents() to avoid warnings being thrown, since cURL also suppresses errors. As you can see, file_get_contents() is at most 39% slower than cURL, not 250%+ slower as @Norse's benchmark suggests. The code I used for the benchmark is here:
function Benchmark($function, $arguments = null, $iterations = 10000)
{
    set_time_limit(0);
    if (is_callable($function) === true)
    {
        // start time; reused below to compute the elapsed seconds
        $result = microtime(true);
        for ($i = 1; $i <= $iterations; ++$i)
        {
            call_user_func_array($function, (array) $arguments);
        }
        return round(microtime(true) - $result, 8);
    }
    return false;
}
function FGC($url, $post = null, $retries = 3)
{
    $http = array
    (
        'method' => 'GET',
    );
    if (isset($post) === true)
    {
        $http['method'] = 'POST';
        $http['header'] = 'Content-Type: application/x-www-form-urlencoded';
        $http['content'] = (is_array($post) === true) ? http_build_query($post, '', '&') : $post;
    }
    $result = false;
    while (($result === false) && (--$retries > 0))
    {
        $result = @file_get_contents($url, false, stream_context_create(array('http' => $http)));
    }
    return $result;
}
function CURL($url, $post = null, $retries = 3)
{
    // initialised up front so the return value is defined even if curl_init() fails
    $result = false;
    $curl = curl_init($url);
    // curl_init() returns a resource before PHP 8 and a CurlHandle object from PHP 8 on,
    // so compare against false instead of calling is_resource()
    if ($curl !== false)
    {
        curl_setopt($curl, CURLOPT_FAILONERROR, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
        if (isset($post) === true)
        {
            curl_setopt($curl, CURLOPT_POST, true);
            curl_setopt($curl, CURLOPT_POSTFIELDS, (is_array($post) === true) ? http_build_query($post, '', '&') : $post);
        }
        while (($result === false) && (--$retries > 0))
        {
            $result = curl_exec($curl);
        }
        curl_close($curl);
    }
    return $result;
}
$result = array();
$result[1]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 1);
$result[1]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 1);
sleep(1); // we don't want to get blacklisted by Google
$result[5]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 5);
$result[5]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 5);
sleep(2); // we don't want to get blacklisted by Google
$result[10]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 10);
$result[10]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 10);
sleep(4); // we don't want to get blacklisted by Google
$result[25]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 25);
$result[25]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 25);
sleep(8); // we don't want to get blacklisted by Google
$result[50]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 50);
$result[50]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 50);
sleep(16); // we don't want to get blacklisted by Google
$result[100]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 100);
$result[100]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 100);
echo '<pre>';
print_r($result);
echo '</pre>';
Tested under Windows 7 / Apache 2 / PHP 5.3.1.