I am using PHP to build a web crawler that will crawl millions of URLs. Which is better in terms of performance: file_get_contents or cURL?

Thanks.

+8  A: 

Norse's quick benchmark indicates cURL is going to be the better option. cURL also has many more options for handling server headers, redirects, authentication, cookies and the like, so it will be far more flexible if you need to extend your code's functionality later.
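Purely as an illustration of that flexibility, here is a minimal sketch; the URL, cookie path and credentials are placeholders, not part of any real setup:

$curl = curl_init('http://example.com/');

curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);               // return the body instead of printing it
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);               // follow redirects automatically
curl_setopt($curl, CURLOPT_MAXREDIRS, 5);                       // ...but give up after 5 hops
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/crawler.cookies'); // read cookies saved by earlier requests
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/crawler.cookies');  // persist cookies on curl_close()
curl_setopt($curl, CURLOPT_USERPWD, 'user:pass');               // HTTP basic authentication
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Accept-Language: en'));

$body = curl_exec($curl);
$code = curl_getinfo($curl, CURLINFO_HTTP_CODE);                // inspect the response status

curl_close($curl);

None of that is available to file_get_contents() without building stream contexts by hand.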

But, really, don't use PHP. Try Nutch - it probably already does everything you need.

Joe Mahoney
I'd also recommend [Pavuk](http://www.pavuk.org/) for crawling. It's fast and has just about every config setting you could want.
Marco
+25  A: 

I just did some quick benchmarking on this.

Fetching google.com using file_get_contents took (in seconds):

2.31319094
2.30374217
2.21512604
3.30553889
2.30124092

cURL took:

0.68719101
0.64675593
0.64326
0.81983113
0.63956594

This was measured with the benchmark class from http://davidwalsh.name/php-timer-benchmark

Norse
So just opening the pages (no parsing, storage, etc.) for 1 million URLs, fetched one at a time, would take about 11.5 days (1,000,000 / (60 * 60 * 24)), assuming an average of ~1 sec/page.
Mike B
@Norse: file_get_contents() seems to be way too slow in your benchmark. Did it actually take 2+ seconds to request google.com ONE time?
Alix Axel
@Mike B: I highly doubt that; check my benchmark.
Alix Axel
@Alix Axel: Thanks for clearing that up :)
Mike B
+6  A: 

Look into the curl_multi_* functions in PHP, which can fetch several URLs in parallel.

http://se2.php.net/manual/en/ref.curl.php
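
To illustrate the pattern, a minimal sketch; the URLs are placeholders, and real code would also check per-handle errors:

$urls = array('http://example.com/a', 'http://example.com/b', 'http://example.com/c');

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $curl);
    $handles[$url] = $curl;
}

$running = null;

do
{
    curl_multi_exec($mh, $running); // push all transfers forward
    curl_multi_select($mh);         // wait for activity instead of busy-looping
} while ($running > 0);

$results = array();

foreach ($handles as $url => $curl)
{
    $results[$url] = curl_multi_getcontent($curl);
    curl_multi_remove_handle($mh, $curl);
    curl_close($curl);
}

curl_multi_close($mh);

For a crawler, keeping a batch of transfers in flight like this typically does more for throughput than the per-request difference between file_get_contents() and cURL measured in the other answers.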

jakber
+8  A: 

I found @Norse's benchmark extremely hard to believe, since in my experience file_get_contents() is not 3.5 times slower (on average) than cURL, so I ran my own benchmark. Here are the results:

[1] => Array   // 1 request to google.com
(
    [FGC] =>  0.4955058 // 38.88% slower
    [CURL] => 0.3582108
)
[5] => Array   // 5 requests to google.com
(
    [FGC] =>  2.2415568 // 24.44% slower
    [CURL] => 1.7973249
)    
[10] => Array  // 10 requests to google.com
(
    [FGC] =>  4.7877922 // 29.46% slower
    [CURL] => 3.6951289
)    
[25] => Array  // 25 requests to google.com
(
    [FGC] =>  10.932404 // 10.18% slower
    [CURL] => 9.9168329
)    
[50] => Array  // 50 requests to google.com
(
    [FGC] =>  22.535982 // 24.74% slower
    [CURL] => 18.068931
)    
[100] => Array // 100 requests to google.com
(
    [FGC] =>  44.685283 // 18.57% slower
    [CURL] => 37.688820
)

I've tried to implement both methods so that they perform in the most similar way possible; I've even used the (slow) error-suppression operator (@) on file_get_contents() to avoid warnings being thrown, since cURL also suppresses errors. As you can see, file_get_contents() is at most 39% slower than cURL, not 250%+ slower as @Norse's benchmark suggests. The code I've used to do the benchmark is here:

// Times $iterations calls to $function and returns the total elapsed seconds
// (rounded to 8 decimals), or false if $function is not callable.
function Benchmark($function, $arguments = null, $iterations = 10000)
{
    set_time_limit(0);

    if (is_callable($function) === true)
    {
        $result = microtime(true);

        for ($i = 1; $i <= $iterations; ++$i)
        {
            call_user_func_array($function, (array) $arguments);
        }

        return round(microtime(true) - $result, 8);
    }

    return false;
}

// Fetches $url with file_get_contents() via an HTTP stream context,
// mirroring what CURL() below does as closely as possible.
function FGC($url, $post = null, $retries = 3)
{
    $http = array
    (
        'method' => 'GET',
    );

    if (isset($post) === true)
    {
        $http['method'] = 'POST';
        $http['header'] = 'Content-Type: application/x-www-form-urlencoded';
        $http['content'] = (is_array($post) === true) ? http_build_query($post, '', '&') : $post;
    }

    $result = false;

    while (($result === false) && (--$retries > 0)) // pre-decrement: $retries = 3 allows at most 2 attempts
    {
        $result = @file_get_contents($url, false, stream_context_create(array('http' => $http)));
    }

    return $result;
}

// cURL equivalent of FGC() above, configured to behave as similarly as possible.
function CURL($url, $post = null, $retries = 3)
{
    $result = false; // initialized here so the function returns false if curl_init() fails

    $curl = curl_init($url);

    if (is_resource($curl) === true)
    {
        curl_setopt($curl, CURLOPT_FAILONERROR, true);    // treat HTTP codes >= 400 as failures
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects, as the http wrapper does
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

        if (isset($post) === true)
        {
            curl_setopt($curl, CURLOPT_POST, true);
            curl_setopt($curl, CURLOPT_POSTFIELDS, (is_array($post) === true) ? http_build_query($post, '', '&') : $post);
        }

        while (($result === false) && (--$retries > 0)) // pre-decrement: $retries = 3 allows at most 2 attempts
        {
            $result = curl_exec($curl);
        }

        curl_close($curl);
    }

    return $result;
}

$result = array();

$result[1]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 1);
$result[1]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 1);

sleep(1); // we don't want to get blacklisted by Google

$result[5]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 5);
$result[5]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 5);

sleep(2); // we don't want to get blacklisted by Google

$result[10]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 10);
$result[10]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 10);

sleep(4); // we don't want to get blacklisted by Google

$result[25]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 25);
$result[25]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 25);

sleep(8); // we don't want to get blacklisted by Google

$result[50]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 50);
$result[50]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 50);

sleep(16); // we don't want to get blacklisted by Google

$result[100]['FGC'] = Benchmark('FGC', 'http://www.google.com/', 100);
$result[100]['CURL'] = Benchmark('CURL', 'http://www.google.com/', 100);

echo '<pre>';
print_r($result);
echo '</pre>';

Tested under Windows 7 / Apache 2 / PHP 5.3.1.

Alix Axel