views: 680

answers: 12

I'm the developer of twittertrend.net, and I was wondering if there is a faster way to get the headers of a URL, besides doing curl_multi? I process over 250 URLs a minute, and I need a really fast way to do this from a PHP standpoint. Either a bash script could be used to output the headers, or a C application, anything that could be faster? I have primarily only programmed in PHP, but I can learn. Currently, curl_multi (with 6 URLs provided at once) does an OK job, but I would prefer something faster. Ultimately I would like to stick with PHP for any MySQL storing and processing.

Thanks, James Hartig

A: 

Threading?

coulix
A: 

That is essentially what curl_multi is, and PHP doesn't support threading and never will :( I was wondering if a bash script could run cURL any faster, or if a bash script could be made to thread cURL?

James Hartig
+1  A: 

The easiest way to get the headers of a URL is with get_headers(). Performance-wise, I don't think you can beat curl_multi, but try benchmarking both and see. It's hard to tell.
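
For instance, a minimal sketch (hypothetical URL; passing 1 as the second argument returns an associative array keyed by header name):

$headers = get_headers('http://example.com', 1);
print_r($headers); // e.g. $headers['Content-Type']; $headers[0] holds the status line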

Eran Galperin
A: 

Would that be faster than cURL? Also, I would not be able to multithread that at all, so it might be slower in the long run?

James Hartig
Updated my answer. Your best bet is to try both under stress and see what performs better
Eran Galperin
A: 

If you don't mind going into really low level stuff, you could send pipelined raw HTTP 1.1 requests using the socket functions.

It'd help to know where the bottleneck is in what you're currently using - network, CPU, etc...
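
For illustration, a bare-bones HEAD request over a socket might look like the sketch below (hypothetical host; true pipelining would keep the connection open and write several requests before reading the responses):

$fp = fsockopen('example.com', 80, $errno, $errstr, 5);
if ($fp) {
    fwrite($fp, "HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
    while (!feof($fp)) {
        echo fgets($fp, 128); // response headers arrive line by line
    }
    fclose($fp);
}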

Ant P.
+1  A: 

re: threading-via-bash-script, it's possible, but unlikely to help: the process creation overhead for such a script would probably kill the speed.

If it's that important to you, start up a daemon that does nothing but this kind of resolution, then connect to the daemon locally. Then you can work on making that daemon as fast as possible, in C or C++ or whatever.
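
The PHP side of that could stay tiny, along the lines of this sketch (the port and the one-line request format are made up; the real protocol would be whatever the daemon defines):

$fp = fsockopen('127.0.0.1', 9000, $errno, $errstr, 1); // hypothetical local daemon
if ($fp) {
    fwrite($fp, "HEAD http://example.com/\n"); // hypothetical request line
    $headers = stream_get_contents($fp);       // daemon replies with the raw headers
    fclose($fp);
}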

pjz
A: 

@pjz gotcha! thanks! any recommendations as to where to start?

James Hartig
A: 

curl_multi + these options are probably your best bet:

curl_setopt($ch, CURLOPT_HEADER, 1);         // include the headers in the output
curl_setopt($ch, CURLOPT_NOBODY, 1);         // send a HEAD request and skip the body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects to the final URL

The only other option may be to use wget with --server-response, and then multi-thread it using C/C++, Java, etc. I'm not convinced that this would be a faster option in the end.
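
For reference, shelling out to wget from PHP could look like this sketch (assumes wget is installed; --spider skips downloading the body, and the headers are printed on stderr, hence the redirect):

$out = shell_exec('wget --server-response --spider http://example.com/ 2>&1');
echo $out;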

Pras
A: 

@pras Ok, I will look into the daemon idea first to see if I can get anywhere faster. I might end up having to stay with curl_multi. Another problem is that I cannot get the connect time or anything else out of curl_multi, only the response.

James Hartig
A: 

Alright, I figured out the following:

get_headers() = 0.0606 sec per URL
cURL = 0.01235 sec per URL
gethostbynamel() = 0.001025 sec per URL

What I'm going to do is run gethostbynamel() first and then cURL; this should decrease the total time, because the host will already be resolved, so cURL never gets stuck waiting on DNS while loading a URL.

Any objections?
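
Concretely, the pre-resolve step might look like this sketch (hypothetical URL; the explicit Host header keeps virtual hosting working when requesting by IP):

$host = 'example.com';
$ips = gethostbynamel($host); // array of IPv4 addresses, or false on failure
if ($ips !== false) {
    $ch = curl_init('http://' . $ips[0] . '/');
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: ' . $host));
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $headers = curl_exec($ch);
    curl_close($ch);
}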

James Hartig
+1  A: 

I think you need a multi-process batch URL fetching daemon. PHP does not support multithreading, but there's nothing stopping you from spawning multiple PHP daemon processes.

Having said that, PHP's lack of a proper garbage collector means that long-running processes can leak memory.

Run a daemon which spawns lots of instances (a configurable, but controlled number) of the PHP program, which will of course have to be capable of reading a work queue, fetching the URLs and writing the results away in a manner which is multi-process safe, so that multiple procs don't end up trying to do the same work.

You'll want all of this to run autonomously as a daemon rather than from a web server. Really.
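
The spawning part can be sketched as below (assumes the pcntl extension on a Unix-like system; fetch_loop() is a hypothetical worker that pulls URLs from a shared queue and stores the results):

$workers = 4; // configurable, but controlled
for ($i = 0; $i < $workers; $i++) {
    if (pcntl_fork() === 0) { // child process
        fetch_loop();         // hypothetical: read queue, fetch headers, write results
        exit(0);
    }
}
while (pcntl_wait($status) > 0) { /* parent reaps workers as they exit */ }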

MarkR
alright I will work on this! :) thanks so much
James Hartig
+1  A: 

I recently wrote a blog post on how to speed up curl_multi. Basically I process each request as soon as it finishes and use a queue to keep a large number of requests going at once. I've had good success with this technique and am using it to process ~6000 RSS feeds a minute. I hope this helps!

http://onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/
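
In that spirit, here is a hedged sketch of a rolling curl_multi queue for header fetching (the function names and the $window size are illustrative, not taken from the post):

function add_url($mh, $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, 1);         // include headers in the output
    curl_setopt($ch, CURLOPT_NOBODY, 1);         // HEAD request, skip the body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // capture output instead of printing it
    curl_multi_add_handle($mh, $ch);
}

function fetch_all_headers(array $urls, $window = 10) {
    $mh = curl_multi_init();
    foreach (array_splice($urls, 0, $window) as $url) { // prime the window
        add_url($mh, $url);
    }
    do {
        while (curl_multi_exec($mh, $running) === CURLM_CALL_MULTI_PERFORM);
        while ($done = curl_multi_info_read($mh)) { // handle each request as it finishes
            $ch = $done['handle'];
            $headers = curl_multi_getcontent($ch);  // process/store $headers here
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            if ($urls) {                            // top the window back up
                add_url($mh, array_shift($urls));
                $running = 1;
            }
        }
        if ($running) {
            curl_multi_select($mh, 1); // wait for activity instead of busy-looping
        }
    } while ($running);
    curl_multi_close($mh);
}

Keeping the window small (say, 10 to 20) bounds memory use while still keeping connections saturated as the queue drains.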