views: 3246

answers: 4

I'm using a simple PHP library to add documents to a Solr index via HTTP.

There are 3 servers involved, currently:

  1. The PHP box running the indexing job
  2. A database box holding the data being indexed
  3. The Solr box.

At 80 documents/sec (out of 1 million docs), I'm noticing an unusually high interrupt rate on the network interfaces of the PHP and Solr boxes (2000/sec), but much less so on the database box (300/sec). What's more, the PHP and Solr graphs are nearly identical -- when the interrupt rate on the PHP box spikes, it also spikes on the Solr box. I imagine this is simply because I open and reuse a single connection to the database server, whereas every single Solr request currently opens a new HTTP connection via cURL, thanks to the way the Solr client library is written.

So, my question is:

  1. Can cURL be made to open a keepalive session?
  2. What does it take to reuse a connection? -- is it as simple as reusing the cURL handle resource?
  3. Do I need to set any special cURL options? (e.g. force HTTP 1.1?)
  4. Are there any gotchas with cURL keepalive connections? This script runs for hours at a time; will I be able to use a single connection, or will I need to periodically reconnect?
A: 

I don't think cURL can be made to use keep-alive, but I may be wrong. In any case, if you do use keep-alive, the client needs to be able to reconnect periodically in case of a disconnection, especially because the server will probably not let you keep a single connection open for hours, or even minutes.
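A rough sketch of that kind of defensive reconnect, assuming a single re-used handle, a hypothetical Solr update URL, and $docs standing in for the document payloads:

    // Hypothetical Solr update URL; adjust for your setup.
    $solrUrl = 'http://solr.example.com:8983/solr/update';

    function newSolrHandle($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
        return $ch;
    }

    $ch = newSolrHandle($solrUrl);
    foreach ($docs as $xml) {
        curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
        if (curl_exec($ch) === false) {     // dropped connection, timeout, etc.
            curl_close($ch);                // discard the stale handle...
            $ch = newSolrHandle($solrUrl);  // ...reconnect, and retry once
            curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
            curl_exec($ch);
        }
    }
    curl_close($ch);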

cloudhead
A: 

If you don't care about the responses from the requests, you can do them asynchronously, but you run the risk of overloading your SOLR index. I doubt it though, SOLR is pretty damn quick.

http://stackoverflow.com/questions/124462/asynchronous-php-calls

UltimateBrent
That's certainly interesting, but it doesn't address connection re-use at all. In fact, it would only make my connection overhead issues worse.
Frank Farmer
+4  A: 

cURL PHP documentation (curl_setopt) says:

CURLOPT_FORBID_REUSE - TRUE to force the connection to explicitly close when it has finished processing, and not be pooled for reuse.

So:

  1. Yes, it actually should re-use connections by default, as long as
  2. you re-use the cURL handle (see the sketch below this list).
  3. By default, cURL handles persistent connections by itself; should you need some special headers, check CURLOPT_HTTPHEADER.
  4. The server may send a keep-alive timeout (with a default Apache install, it is 15 seconds or 100 requests, whichever comes first) -- but cURL will just open another connection when that happens.
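A minimal sketch of the handle re-use this describes; the URL and payloads are placeholders, and the key point is that curl_init() is called once, outside the loop:

    // One handle for the whole indexing run: libcurl keeps the TCP connection
    // in its pool and re-uses it for subsequent requests by default.
    $ch = curl_init('http://solr.example.com:8983/solr/update');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));

    foreach ($docs as $xml) {         // $docs: your <add>...</add> payloads
        curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
        $response = curl_exec($ch);   // same connection, new request
    }
    curl_close($ch);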
Piskvor
+1  A: 
  1. On the server you are accessing, keep-alive must be enabled and the maximum number of keep-alive requests should be reasonable. In the case of Apache, refer to the Apache docs.

  2. You have to be re-using the same cURL context.

  3. When configuring the cURL context, enable keep-alive with a timeout in the headers:

    // Ask the server to hold the connection open (suggest a 300-second keep-alive timeout)
    curl_setopt($curlHandle, CURLOPT_HTTPHEADER, array(
        'Connection: Keep-Alive',
        'Keep-Alive: 300'
    ));
    
Oleg Barshay
I wonder if CURL sends a Keep-Alive header by default...
Frank Farmer
Frank, I just re-tested my code and it looks to be on by default. Couldn't hurt to set it explicitly though.
Oleg Barshay
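One way to check whether connections are really being re-used is to enable cURL's verbose output on the re-used handle; libcurl's log notes, per request, whether it re-used a pooled connection or opened a new one. A small sketch with a placeholder URL:

    // Route libcurl's verbose log to a file and run two requests on one handle.
    $log = fopen('/tmp/curl-verbose.log', 'w');
    $ch  = curl_init('http://solr.example.com:8983/solr/admin/ping');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_STDERR, $log);

    curl_exec($ch);   // first request: log shows a fresh connection being opened
    curl_exec($ch);   // second request: log should note the connection was re-used
    curl_close($ch);
    fclose($log);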