Greetings All!

I am having some trouble working out how to execute thousands upon thousands of requests to a web service (eBay). I have a limit of 5 million calls per day, so there is no problem on that end.

However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.

Basically the flow is:

1. Get the list of items from the database (1,000 to 10,000 items)
2. Make an API POST request for each item
3. Accept the return data, process it, and update the database

Obviously a single PHP instance running this in a loop would be impossible.
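For reference, the naive single-process loop would look roughly like this (just a sketch; the table, columns and endpoint URL are made up):

    <?php
    // Naive single-process version (a sketch; table, column and endpoint names are invented).
    // Each iteration blocks on the HTTP round trip, so 10,000 items at ~1 second each
    // adds up to hours of wall-clock time.
    $db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $items = $db->query('SELECT id, payload FROM items_to_sync')->fetchAll(PDO::FETCH_ASSOC);

    foreach ($items as $item) {
        $ch = curl_init('https://api.example.com/endpoint');
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $item['payload']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $response = curl_exec($ch);   // blocks until the web service responds
        curl_close($ch);

        $stmt = $db->prepare('UPDATE items_to_sync SET result = ?, synced_at = NOW() WHERE id = ?');
        $stmt->execute(array($response, $item['id']));
    }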

I am aware that PHP is not a multithreaded language.

I tried the cURL solution, basically:

1. Get the list of items from the database
2. Initialize a curl_multi session
3. For each item, add a curl handle for the request
4. Execute the curl_multi session

So you can imagine 1,000-10,000 GET requests occurring...
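A stripped-down sketch of that approach (the endpoint URL and the $items array are placeholders):

    <?php
    // Rough shape of the curl_multi approach described above (a sketch).
    $mh = curl_multi_init();
    $handles = array();

    foreach ($items as $item) {
        $ch = curl_init('https://api.example.com/endpoint');
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $item['payload']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$item['id']] = $ch;
    }

    // Drive all transfers until they are finished.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $id => $ch) {
        $response = curl_multi_getcontent($ch);
        // ... process $response and update the row for item $id ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);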

This approach was OK: around 100-200 requests were completing in about a minute or two. However, only 100-200 of the 1,000 items actually processed, so I'm thinking I'm hitting some sort of Apache or MySQL limit?

But this does add latency; it's almost like performing a DoS attack on myself.

I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the data those calls return, and it all had to be done within 5 minutes?

I am using PHP and MySQL with the Zend Framework.

Thanks!

A: 

To understand your requirements better: do you have to implement your solution only in PHP, or can you interface a PHP part with another part written in another language?

Ass3mbler
PHP does not have to be the entire solution. PHP just serves as a means of serving web pages to my users. The stuff above is really a 'background' process, so it can be written in another language.
cappuccino
In that case, use Python / Ruby or, if performance is really an issue, Java. And if performance is so much of an issue that you are willing to sell your soul for every microsecond you can grab, then use C++.
e-satis
OK, I've modified the original post with a solution.
Ass3mbler
A: 

If you cannot switch to another language, try performing this update as a PHP script that runs in the background rather than through Apache.

Hippo
I imagine you're referring to using PHP's exec() command? Would this start a new process, or would it be the same as a single loop? The process should be asynchronous; if the next request has to wait for the previous one to complete, it won't solve the issue.
cappuccino
I meant running the script via cron or something similar as a command-line script, independent of Apache and the actual website.
Hippo
+1  A: 

My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.

MySQLi allows multiple-statement queries, so you could definitely batch those database updates.
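For example, a rough sketch of batching the updates through mysqli::multi_query (the table and column names are placeholders, and $results is assumed to be an id => data array already collected from the API):

    <?php
    // Batch thousands of UPDATEs into one round trip instead of one query per row.
    $mysqli = new mysqli('localhost', 'user', 'pass', 'app');

    $sql = '';
    foreach ($results as $id => $data) {
        $sql .= sprintf(
            "UPDATE items SET api_result = '%s' WHERE id = %d;",
            $mysqli->real_escape_string($data),
            (int) $id
        );
    }

    if ($mysqli->multi_query($sql)) {
        // Flush all result sets so the connection is usable afterwards.
        do {
            if ($res = $mysqli->store_result()) {
                $res->free();
            }
        } while ($mysqli->more_results() && $mysqli->next_result());
    }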

The HTTP requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, you might want a single master script to shell out to a bunch of individual processes, each of which makes an API call and stores the results in a file or memcached. The master can periodically read the results and update the DB. (Be careful to rotate the data store for safe reading and writing by multiple processes.)

grossvogel
The benchmarking is showing that if the script didn't need to make a web service request, the code runs like lightning; I've tested my server and it can handle 100,000 inserts in a minute, and those inserts are being done through PHP's mysql extension. Your logic is sound: rather than making 10,000 web service requests, make one request and get one response covering the 10,000 commands... however, the eBay API does not have that feature :( I guess the real question is, what's the best way to execute many web service requests, using any technology?
cappuccino
"To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an api call and stores the results in a file or memcached."This was the method I used with PHP's multi_curl_session. I had a single script whos function was to get the list of items and fire off a CURL request for each of the items. This worked very well... however, alot of the requests where not being executed, only 200 of the 1,000... it's hitting some sort of limit.
cappuccino
+1  A: 

I've had to do something similar, but with Facebook, updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up, because the script spends most of its time waiting for a response. You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.

exec('nohup /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId); // $processId[0] holds the PID echoed by "echo $!"

You can pass parameters (getopt) to the PHP script on the command line to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see whether the scripts are still running by checking their process IDs. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
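A rough sketch of that master loop (assuming a hypothetical worker.php that reads its batch number via getopt; the paths and batch count are made up):

    <?php
    // Master script: spawn one background worker per batch, then wait for them all.
    $batches = range(0, 9);       // e.g. 10 batches of 1,000 items each
    $pids    = array();

    foreach ($batches as $batch) {
        $cmd = sprintf(
            'nohup php /path/to/worker.php --batch=%d >> /tmp/worker.log 2>&1 & echo $!',
            $batch
        );
        $output = array();
        exec($cmd, $output);
        $pids[$batch] = (int) $output[0];   // PID echoed by "echo $!"
    }

    // Sleep/check cycle: wait until every worker has exited.
    while (!empty($pids)) {
        sleep(5);
        foreach ($pids as $batch => $pid) {
            // posix_kill with signal 0 only tests whether the process still exists.
            if (!posix_kill($pid, 0)) {
                unset($pids[$batch]);
            }
        }
    }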

Combine multiple processes with multi-curl, and you should easily be able to do what you need.

Brent Baisley
A: 

You can follow Brent Baisley's advice for a simple use case.

If you want to build a robust solution, then you need to:

  • set up a table in the database representing the actions; this will be your process queue;
  • set up a script that pops items off this queue and processes the actions;
  • set up a cron job that runs this script every X minutes.

This way you can have 1,000 PHP scripts running, using your OS's parallelism capabilities, and nothing hangs when eBay is taking too long to respond.
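A minimal sketch of one such queue-popping script (the job_queue table and its columns are assumptions, and the eBay call itself is left out):

    <?php
    // Worker started by cron, e.g. "* * * * * php /path/to/worker.php".
    // Several of these can run side by side.
    $db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Claim a slice of the queue so concurrent workers don't grab the same rows.
    $workerId = getmypid();
    $db->exec("UPDATE job_queue SET claimed_by = $workerId
               WHERE claimed_by IS NULL AND status = 'pending' LIMIT 100");

    $jobs = $db->query("SELECT id, item_id FROM job_queue WHERE claimed_by = $workerId")
               ->fetchAll(PDO::FETCH_ASSOC);

    foreach ($jobs as $job) {
        // ... call the eBay API for $job['item_id'] and update the item row ...
        $db->exec("UPDATE job_queue SET status = 'done' WHERE id = {$job['id']}");
    }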

The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting:

  • the number of requests one PHP script makes;
  • the order / number / type / priority of the actions in the queue;
  • the number of scripts the cron job runs.
e-satis
this can also be done purely within PHP
Peter Lindqvist
Yes, you can do that with any Turing-complete language. You could code an operating system too.
e-satis
A: 

Thanks everyone for the awesome and quick answers!

The advice from Brent Baisley and e-satis works nicely. Rather than executing the sub-processes using cURL like I did before, forking takes a massive load off, and it also neatly gets around the issue of maxing out my Apache connection limit.

Thanks again!

cappuccino
A: 

It is true that PHP is not multithreaded, but it can certainly be set up to run multiple processes.

I have created a system that resembles the one you are describing. It runs in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.

It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with its full context taken from the database.

Here is a basic description of the system:

1. Start the control process.
2. Check the database for new jobs.
3. Spawn a child process with the job data as a parameter.
4. Keep a table of the child processes to be able to control the number of simultaneous processes.
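A minimal sketch of that control loop (fetch_new_jobs() and process_job() are hypothetical helpers standing in for the database lookup and the actual work):

    <?php
    // Control process: keep up to $maxChildren workers busy, tracking them by PID.
    $maxChildren = 8;
    $children    = array();

    while (true) {
        // Reap any finished children without blocking.
        while (($pid = pcntl_waitpid(-1, $status, WNOHANG)) > 0) {
            unset($children[$pid]);
        }

        if (count($children) < $maxChildren && ($job = fetch_new_jobs())) {
            $pid = pcntl_fork();
            if ($pid === 0) {
                process_job($job);       // child: do the work, then exit
                exit(0);
            } elseif ($pid > 0) {
                $children[$pid] = $job;  // parent: track the child
            }
        } else {
            sleep(1);                    // nothing to do, or at capacity
        }
    }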

Unfortunately, it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.

The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.

PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.

The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.

Peter Lindqvist