views:

294

answers:

6

Which design patterns exist to realize the execution of several PHP processes and the collection of their results in one PHP process?

Background:
I have many large trees (> 10000 entries) in PHP and have to run recursive checks on them. I want to reduce the absolute execution time.

+8  A: 

From your PHP script, you could launch another script (using exec) to do the processing. Save status updates in a text file, which the parent process can then read periodically.

Note: to avoid PHP waiting for the exec'd script to complete, redirect its output to a file and background the command:

exec('/path/to/file.php > output.log 2>&1 &');

Alternatively, you can fork a script using the PCNTL functions. This uses one php script, which when forked can detect whether it is the parent or the child and operate accordingly. There are functions to send/receive signals for the purpose of communicating between parent/child, or you have the child log to a file and the parent read from that file.

From the pcntl_fork manual page:

$pid = pcntl_fork();
if ($pid == -1) {
     die('could not fork');
} else if ($pid) {
     // we are the parent
     pcntl_wait($status); //Protect against Zombie children
} else {
     // we are the child
}
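Building on the manual snippet above, here is a minimal, self-contained sketch of a fork-per-chunk pattern (the pcntl extension is assumed to be available; `check_chunk()` and the chunk contents are toy stand-ins for the real recursive tree check):

```php
<?php
// Toy stand-ins: in the real application each chunk would be a subtree
// and check_chunk() the recursive check (both names are hypothetical).
function check_chunk(array $chunk): int {
    return array_sum($chunk);
}
$chunks = [[1, 2], [3, 4], [5, 6]];

$pids = [];
foreach ($chunks as $i => $chunk) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die('could not fork');
    } elseif ($pid) {
        $pids[] = $pid;                      // parent: remember the child
    } else {
        // child: do the work and hand the result back through a file
        file_put_contents("/tmp/result.$i", serialize(check_chunk($chunk)));
        exit(0);                             // child must not fall through the loop
    }
}

// parent: reap every child (protects against zombies), then collect
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}
$results = [];
foreach (array_keys($chunks) as $i) {
    $results[$i] = unserialize(file_get_contents("/tmp/result.$i"));
}
print_r($results); // $results == [3, 7, 11]
```

Handing results back through files is the simplest channel; signals or a message queue would work too.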
adam
I "heard" that pcntl isnt a good solution. Any experiences?
powtac
Sorry, not really any practical experience of it. Searching through SO, I'm surprised how many people categorically say there's no way of forking PHP.
adam
I have written a Perl wrapper before that uses fork (in Perl) to execute a PHP script, with great results.
Phill Pafford
@powtac Can I ask what you're looking for that isn't in the given answers?
adam
There isn't much control over the processes, and I need a more generic solution for different trees, objects, and parts of the code. The message queue solution fits better.
powtac
+2  A: 

This might be a good time to consider using a message queue, even if you run it all on one machine.
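On a single machine, one concrete option is a System V message queue. A minimal sketch, assuming the sysvmsg extension is compiled in (the message type 1 and the payload shape are arbitrary choices for illustration):

```php
<?php
// A System V message queue shared between processes on one machine.
// Requires the sysvmsg extension; ftok() derives a stable queue key.
$queue = msg_get_queue(ftok(__FILE__, 'q'));

// A worker process would push its partial result like this
// (msg_send serializes the payload by default):
msg_send($queue, 1, ['chunk' => 0, 'result' => 42]);

// The collecting process pops (and unserializes) it again:
msg_receive($queue, 1, $msgtype, 4096, $payload);
echo $payload['result'], "\n"; // 42

msg_remove_queue($queue); // clean up the queue when done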

squeeks
The problem with a message queue is that we need the same global namespace/scope for the different processes.
powtac
Not sure what you need the shared scope and namespace for, but a message queue coupled with shared memory (e.g. memcached) could be a possibility.
sfrench
The message queue solution fits best for this large-scale application. With memcache we can control very well what happens when. Because everything is very OO, the namespace is not a big problem; I just have to load the objects into and out of memcache.
powtac
With memcache I can also share the load across different machines, which means I can scale well in the future.
powtac
Also, the memcache extension serializes and unserializes PHP objects transparently, so there is no need to do it by hand.
powtac
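A minimal sketch of sharing an object through memcached (assuming the pecl Memcached extension and a server on localhost; the key name is made up):

```php
<?php
// Requires the Memcached extension and a memcached server on 127.0.0.1:11211.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// The extension serializes/unserializes objects transparently on set/get.
$tree = new ArrayObject([1, 2, 3]);
$mc->set('tree:42', $tree);

// Any other process (or machine) can pull the same object back by key.
$copy = $mc->get('tree:42');
```

Keys act as the shared namespace between the worker processes, which is what the shared-scope discussion above is about.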
+1  A: 

Using web or CLI?

If you use the web, you could integrate that part into Quercus. Then you could use the advantages of Java multithreading.

I don't actually know how reliable Quercus is, though. I'd also suggest using a kind of message queue and refactoring the code so it doesn't need the shared scope.

Maybe you could rebuild the code to a Map/Reduce pattern. You could then run the PHP code in Hadoop and cluster the processing across a couple of machines.
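For orientation, Hadoop Streaming runs any executable as a mapper: it feeds lines on STDIN and expects tab-separated key/value lines on STDOUT. A word-count-flavoured sketch in PHP (a real mapper would emit per-subtree check results instead):

```php
<?php
// Hypothetical Hadoop Streaming mapper in PHP. Streaming mappers read
// input lines on STDIN and print "key<TAB>value" lines on STDOUT.
function map_line(string $line): array {
    $pairs = [];
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $pairs[] = "$word\t1";   // emit (word, 1) for the reducer to sum
    }
    return $pairs;
}

while (($line = fgets(STDIN)) !== false) {
    foreach (map_line($line) as $pair) {
        echo $pair, "\n";
    }
}
```

The matching reducer would read the sorted pairs back from STDIN and aggregate the values per key.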

I don't know if it's useful, but I came across another project, called Gearman. It's also used to cluster PHP processes. I guess you can combine that with a reduce script as well, if Hadoop is not the way you want to go.

Michiel
I don't want to use a Java implementation of PHP, it seems a little bit "bloated".
powtac
I tested Quercus; it's not so bad, but it is not 100% compatible with existing code. Probably using a Hadoop cluster is the fastest solution.
rtacconi
The Map/Reduce is a very good hint!!!
powtac
+2  A: 

If your goal is minimal time, the solution is simple to describe, but not that simple to implement.

You need to find a pattern to divide the work (You don't provide much information in the question in this regard).

Then use one master process that forks children to do the work. As a rule the total number of processes you use should be between n and 2n, where n is the number of cores the machine has.

Assuming this data is stored in files, you might consider using non-blocking IO to maximize throughput. Not doing so will make most of your processes spend their time waiting for the disk. PHP has stream_select(), which might help you. Note that using it is not trivial.
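A self-contained sketch of the stream_select() idea, multiplexing reads over several streams (two temp files stand in here for the worker output files; with pipes or sockets the select call would actually block until data arrives):

```php
<?php
// Two temp files stand in for the data sources being multiplexed.
$a = tmpfile(); fwrite($a, "alpha\n"); rewind($a);
$b = tmpfile(); fwrite($b, "beta\n");  rewind($b);
$streams = [$a, $b];

$data = '';
while ($streams) {
    $read = $streams;              // stream_select() modifies $read in place
    $write = $except = null;
    // block for at most 1 second waiting for a readable stream
    if (stream_select($read, $write, $except, 1) > 0) {
        foreach ($read as $stream) {
            $chunk = fread($stream, 8192);
            if ($chunk === '' || $chunk === false) {   // EOF: drop the stream
                fclose($stream);
                $streams = array_filter($streams, fn ($s) => $s !== $stream);
            } else {
                $data .= $chunk;
            }
        }
    }
}
// $data now holds "alpha\n" and "beta\n" in whatever order they arrived
```

The loop shrinks the watched set as streams hit EOF, so one process can drain many sources without ever blocking on a single slow one.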

If you decide not to use select, increasing the number of processes might help.


In regards to the pcntl functions: I've written a daemon with them (a proper one with forking, changing the session id, the running user, etc.) and it's one of the most reliable pieces of software I've written. Because it spawns workers for every task, even a bug in one of the tasks does not affect the others.

Emil Ivanov
Good hint, the stream_select() information!
powtac
+1  A: 

The question seems to be a bit confused.

I want to reduce the absolute execution time.

Do you mean elapsed time? Certainly use of the right data structure will improve throughput, but for a given data structure, the minimum order of the algorithm is absolute and has nothing to do with how you implement the algorithm.

Which design patterns exist to realize....?

Design patterns are something which code is, not a template for writing programs, and a useful tool for curriculum design. To start with a pattern and make your code fit it is in itself an anti-pattern.

Nobody can answer this question without knowing a lot more about your data and how it's structured. However, the key driver for efficiency will be the data structure you use to implement your tree. If elapsed time is important then certainly look at parallel execution, but it may also be worth considering performing the operation in a different tool: databases are highly optimized for dealing with large sets of data. Note, however, that the obvious method for describing a tree in a relational database is very inefficient when it comes to isolating sub-trees and walking the tree.

In response to Adam's suggestion of forking, you replied:

I "heard" that pcntl isnt a good solution. Any experiences?

Where did you hear that? Certainly forking from a CGI or mod_php invoked script is a bad idea, but there is nothing wrong with doing it from the command line. Do have a google for long-running PHP processes (be warned: there is a lot of bad information out there). What code you write will vary depending on the underlying OS, which you've not stated.

I suspect you could solve a large part of your performance issues by identifying which parts of the tree need to be checked and only checking those parts, and by triggering the checks when the tree is updated, or at least marking the nodes as 'dirty'.
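A minimal sketch of the 'dirty'-marking idea (the Node class and its fields are invented for illustration; the actual check is elided):

```php
<?php
// Illustrative only: mark nodes dirty on update so a later check can
// skip whole clean subtrees instead of walking all 10000+ entries.
class Node {
    public bool $dirty = false;
    public array $children = [];
    public ?Node $parent = null;

    public function update(): void {
        // mark this node and every ancestor; stop early once we hit
        // a node that is already dirty (its ancestors are marked too)
        for ($n = $this; $n !== null && !$n->dirty; $n = $n->parent) {
            $n->dirty = true;
        }
    }
}

function check(Node $node): void {
    if (!$node->dirty) {
        return;                   // clean subtree: nothing to do
    }
    // ... run the actual consistency check on $node here ...
    $node->dirty = false;
    foreach ($node->children as $child) {
        check($child);
    }
}
```

With this scheme a check started at the root costs O(size of the dirty region) rather than O(size of the tree).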

You might find these helpful:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
http://en.wikipedia.org/wiki/Threaded_binary_tree

C.

symcbean
+2  A: 

You could use a more efficient data structure, such as a B-tree. I used one once in Java but not in PHP. You can try this script: http://www.phpclasses.org/browse/file/708.html, it is an implementation of a B-tree.

If that is not enough, you can use Hadoop to implement a Map/Reduce pattern, as Michiel said. I would not fork PHP processes; it does not seem to help performance.

Personally, I would use PHP as the client and put everything in Hadoop. This tutorial might help: http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html.

Another solution would be to use a Java implementation of a B-tree: http://jdbm.sourceforge.net/. JDBM is an object database using B+tree data structures. You can then search from PHP by exposing the data with a web service or by accessing it directly with Quercus.

rtacconi
Hadoop is a good recommendation!
powtac