I made a simple web crawler using PHP (and cURL). It parses roughly 60,000 HTML pages and retrieves product information (it's a tool on an intranet).

My main concern is concurrent connections. I would like to limit the number of connections so that, whatever happens, the crawler never uses more than 15 concurrent connections.

The server blocks an IP whenever it reaches the limit of 25 concurrent connections, and for some reason I can't change that on the server side, so I have to find a way to make my script never use more than X concurrent connections.

Is this possible?

Or maybe I should rewrite the whole thing in another language?

Thank you, any help is appreciated!

+2  A: 

Well, you can use curl_setopt($ch, CURLOPT_MAXCONNECTS, 15); to limit the number of connections. But you might also want to write a simple connection manager if that doesn't do it for you.
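
For reference, the option is set per handle via curl_setopt(). A minimal sketch (the URL is made up); note that, per the cURL docs, CURLOPT_MAXCONNECTS caps the handle's cache of persistent connections rather than the number of simultaneous transfers, so on its own it may not guarantee the 15-connection ceiling:

    <?php
    // Minimal sketch: set the option on the handle itself.
    // CURLOPT_MAXCONNECTS limits this handle's cache of persistent
    // connections; it does not throttle parallel transfers by itself.
    $ch = curl_init('http://intranet.example/products/1'); // hypothetical URL
    curl_setopt($ch, CURLOPT_MAXCONNECTS, 15);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);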

prodigitalson
I don't know if I should feel relieved or downright dumb! I was not aware of this option, and yet I swear I read all the cURL docs... more than once! Never mind, I will post my result. Thanks a lot, my friend!
Mademoiselle Vagin Cul
A: 

Maybe maintain a simple connection table:

target_IP   | active_connections
1.2.3.4     | 10
4.5.6.7     | 5

Each cURL call would increase the number of connections; each close would decrease it.

You could store the table in a MySQL table, or in Memcache for speed.

When you encounter an IP that already has its maximum number of connections, you would have to implement a "try later" queue.
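
A rough sketch of that table with PDO/MySQL (table, column, and credential names are all invented for illustration); the conditional UPDATE is atomic, so two crawler workers can never both grab the last free slot:

    <?php
    // Hypothetical schema:
    //   CREATE TABLE connections (
    //     target_ip VARCHAR(45) PRIMARY KEY,
    //     active_connections INT NOT NULL DEFAULT 0
    //   );
    // Assumes one row per target IP has been seeded beforehand.
    $db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

    // Try to claim a slot; returns false when the IP is at its limit,
    // in which case the URL goes into the "try later" queue.
    function acquireSlot(PDO $db, $ip, $max = 15) {
        $stmt = $db->prepare(
            'UPDATE connections
                SET active_connections = active_connections + 1
              WHERE target_ip = ? AND active_connections < ?');
        $stmt->execute(array($ip, $max));
        return $stmt->rowCount() === 1; // atomic: no race on the last slot
    }

    // Release the slot once the cURL handle is closed.
    function releaseSlot(PDO $db, $ip) {
        $stmt = $db->prepare(
            'UPDATE connections
                SET active_connections = active_connections - 1
              WHERE target_ip = ? AND active_connections > 0');
        $stmt->execute(array($ip));
    }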

Pekka
A: 

My answer to another question has some code for doing this with curl_multi_*.
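
The general shape of that technique, for readers without the link: keep at most N easy handles inside a curl_multi handle and top the pool up from the URL queue as transfers finish. A rough sketch of my own (not the code from the linked answer):

    <?php
    // Rolling curl_multi pool: never more than $max transfers in flight.
    function addHandle($mh, $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
    }

    function crawl(array $urls, $max = 15) {
        $mh = curl_multi_init();
        $active = 0;

        // Prime the pool with the first batch.
        while ($active < $max && $urls) {
            addHandle($mh, array_shift($urls));
            $active++;
        }

        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh); // block until something happens

            // Harvest finished transfers and refill from the queue.
            while ($info = curl_multi_info_read($mh)) {
                $ch = $info['handle'];
                $html = curl_multi_getcontent($ch); // parse product data here
                curl_multi_remove_handle($mh, $ch);
                curl_close($ch);
                $active--;
                if ($urls) {
                    addHandle($mh, array_shift($urls));
                    $active++;
                }
            }
        } while ($active > 0);

        curl_multi_close($mh);
    }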

GZipp