I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable by the user.

I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper basically runs 24/7).

Running a multi-curl request in a simple while loop is fairly slow. I sped it up by running individual curl requests as background processes, which works faster, but sooner or later the slower requests start piling up, which ends up crashing the server.

Are there more efficient ways of scraping data? Perhaps command-line curl?

+1  A: 

With a large number of pages, you'll need some sort of multithreaded approach, because you will be spending most of your time waiting on network I/O.

Last time I played with PHP, threads weren't all that great an option, but perhaps that's changed. If you need to stick with PHP, that means you'll be forced to go with a multi-process approach: split your workload into N work units and run N instances of your script, each of which receives one work unit.
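As a rough sketch of that multi-process approach (assuming a hypothetical worker.php script that reads a chunk file of URLs and scrapes them), the parent could split the URL list and launch one instance per chunk:

// minimal multi-process sketch: split the URL list into N chunks and run
// one worker.php instance per chunk (worker.php is a hypothetical script
// that reads its chunk file and scrapes the URLs listed in it)
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$workers = 8; // N work units / N processes

$chunks = array_chunk($urls, max(1, (int) ceil(count($urls) / $workers)));
$handles = array();

foreach ($chunks as $i => $chunk) {
    $chunk_file = "chunk_$i.txt";
    file_put_contents($chunk_file, implode("\n", $chunk));

    // launch one background worker per chunk; output is discarded so the
    // parent only has to wait for the process to exit
    $handles[] = popen('php worker.php ' . escapeshellarg($chunk_file) . ' > /dev/null 2>&1', 'r');
}

// pclose() blocks until the corresponding worker has finished
foreach ($handles as $handle) {
    pclose($handle);
}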

Languages that provide robust thread implementations are another option. I've had good experiences with threads in Ruby and C, and it seems like Java threads are also very mature and reliable.

Who knows - maybe PHP threads have improved since the last time I played with them (~4 years ago) and are worth a look.

Daniel Papasian
In my experience, Ruby is far better than PHP for multithreaded applications like this one.
Josh
@Josh: And I bet Python beats Ruby to the punch. :P
Alix Axel
I don't have any knowledge of those languages. Are there any examples of this in action?
Yegor
@Alix: I'm sure you're right. I just know Ruby better than Python :-)
Josh
A: 

In my experience, running a curl_multi request with a fixed number of parallel handles is the fastest way. Could you share the code you're using so we can suggest some improvements? This answer has a fairly decent implementation of curl_multi run in fixed-size batches; here is the reproduced code:

define('BLOCK_SIZE', 8); // how many requests to run in parallel per batch

// -- create all the individual cURL handles and set their options
$curl_handles = array();
foreach ($urls as $url) {
    $curl_handles[$url] = curl_init();
    curl_setopt($curl_handles[$url], CURLOPT_URL, $url);
    curl_setopt($curl_handles[$url], CURLOPT_RETURNTRANSFER, true); // so the response can later be read as a string with curl_multi_getcontent()
    // set other curl options here
}

// -- start going through the cURL handles and running them
$curl_multi_handle = curl_multi_init();

$i = 0; // count where we are in the list so we can break up the runs into smaller blocks
$block = array(); // to accumulate the curl_handles for each group we'll run simultaneously

foreach ($curl_handles as $a_curl_handle) {
    $i++; // increment the position-counter

    // add the handle to the curl_multi_handle and to our tracking "block"
    curl_multi_add_handle($curl_multi_handle, $a_curl_handle);
    $block[] = $a_curl_handle;

    // -- check to see if we've got a "full block" to run or if we're at the end of our list of handles
    if (($i % BLOCK_SIZE == 0) or ($i == count($curl_handles))) {
        // -- run the block

        $running = NULL;
        do {
            // track the previous loop's number of handles still running so we can tell if it changes
            $running_before = $running;

            // run the block or check on the running block and get the number of sites still running in $running
            curl_multi_exec($curl_multi_handle, $running);

            // if the number of sites still running changed, print out a message with the number of sites that are still running.
            if ($running != $running_before) {
                echo("Waiting for $running sites to finish...\n");
            }

            // wait for activity on one of the handles instead of busy-looping
            if ($running > 0) {
                curl_multi_select($curl_multi_handle);
            }
        } while ($running > 0);

        // -- once the number still running is 0, curl_multi_ is done, so check the results
        foreach ($block as $handle) {
            // HTTP response code
            $code = curl_getinfo($handle,  CURLINFO_HTTP_CODE);

            // cURL error number
            $curl_errno = curl_errno($handle);

            // cURL error message
            $curl_error = curl_error($handle);

            // output if there was an error
            if ($curl_error) {
                echo("    *** cURL error: ($curl_errno) $curl_error\n");
            }

            // remove the (used) handle from the curl_multi_handle
            curl_multi_remove_handle($curl_multi_handle, $handle);
        }

        // reset the block to empty, since we've run its curl_handles
        $block = array();
    }
}

// close the curl_multi_handle once we're done
curl_multi_close($curl_multi_handle);

The trick is not to load too many URLs at once; if you do, the whole process will hang until the slowest requests are complete. I suggest using a BLOCK_SIZE of 8 or greater if you have the bandwidth.

Alix Axel
I don't have the code available right now, but it's nothing crazy. It loops through a crawl list of URLs and uses exec() to send a process into the background to crawl each individual URL. Once done, the process dies. What does block size mean?
Yegor
@Yegor: Say you have 1000 URLs to check. If you try to request them all at once with `curl_multi_exec()` it will take quite some time, because some of those requests will take longer to complete. If you set `BLOCK_SIZE` to 8, `curl_multi_exec()` will run 125 times (1000 / 8 = 125), but it will only process 8 URLs at a time, which will almost certainly be faster.
Alix Axel
So it's pretty much identical to putting 8 curl requests inside a loop and running it 125 times? Or would adding all 1000 URLs at once and setting a block size be faster?
Yegor
@Yegor: If those 8 CURL requests are executed in parallel (via `curl_multi_exec`) then yes, it's pretty much the same thing.
Alix Axel
so how would you get the page into a string in that code?
bluedaniel
A: 

If you want to run single curl requests, you can start background processes under Linux from PHP like this:

proc_close(proc_open("php -q yourscript.php parameter1 parameter2 > /dev/null 2>&1 &", array(), $dummy));

You can use parameters to give your PHP script some information about which URLs to use, like LIMIT in SQL.

You can keep track of the running processes by saving their PIDs somewhere, which lets you cap the number of processes running at the same time or kill processes that have not finished in time.
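
A minimal sketch of that pattern, assuming a hypothetical crawl_url.php worker that takes one URL as its argument and a cap of 8 concurrent processes, might look like this:

// minimal sketch: cap the number of concurrent background workers by
// tracking their proc_open() resources (and PIDs) until they exit
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$max_workers = 8;
$procs = array(); // proc_open() resources keyed by PID

foreach ($urls as $url) {
    // if we're at the cap, wait until at least one worker has exited
    while (count($procs) >= $max_workers) {
        foreach ($procs as $pid => $proc) {
            $status = proc_get_status($proc);
            if (!$status['running']) {
                proc_close($proc);
                unset($procs[$pid]);
            }
        }
        usleep(100000); // don't busy-wait while checking
    }

    // start a new background worker for this URL
    $cmd = 'php -q crawl_url.php ' . escapeshellarg($url) . ' > /dev/null 2>&1';
    $proc = proc_open($cmd, array(), $pipes);
    $status = proc_get_status($proc);
    $procs[$status['pid']] = $proc;
}

// wait for the remaining workers to finish
foreach ($procs as $proc) {
    proc_close($proc);
}

Note that the command is not backgrounded with & inside the shell here; that way proc_get_status() and proc_close() track the actual worker process rather than a shell that has already exited.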

favo