Explanation: Part of the app that I'm creating requires checking thousands of records and acting on them in a timely manner, so for every record I want to fork a new process. However, I need a DB connection to do some more checking of that record. As I understand it, the child inherits the DB connection, so subsequent forks run into DB errors.
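
Aside: a common workaround for the inherited-connection problem, whichever exec order I end up using, is to have each child open its own connection right after the fork rather than reusing the parent's handle. A minimal sketch, assuming mysqli and placeholder credentials:

<?php
// Sketch: every child opens a fresh DB connection after the fork,
// so the parent's inherited handle is never used concurrently.
// Hostname and credentials below are placeholders.
foreach (range(1, 10) as $id)       // stand-in for the records
{
    $pid = pcntl_fork();
    if ($pid == -1)
    {
        die("fork failed\n");
    }
    if ($pid == 0)
    {
        $db = new mysqli('localhost', 'user', 'pass', 'mydb');
        // ... check record $id and act on it ...
        $db->close();
        exit(0);                    // never fall back into the loop
    }
}
while (pcntl_wait($status) != -1);  // parent reaps all children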

I thought I could pcntl_exec('php /path/script.php') and then pcntl_fork() so that the calling process is not held up.

Or I can pcntl_fork() and then pcntl_exec() in the child process. Or maybe I should be using exec() instead of pcntl_exec().

My question: Are there any drawbacks or advantages to either order?

Notes: Maybe I'm imagining this issue, as I thought that the calling PHP process would wait for pcntl_exec() to return. But that's not what the docs state:

Returns FALSE on error and does not return on success.

How can a function return a value sometimes and not at other times? That sounds like poorly written docs.
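
As it turns out, on success pcntl_exec() replaces the current process image with the new program, so there is no caller left to return to. A minimal illustration of that behaviour:

<?php
// pcntl_exec() only "returns" when it fails: on success the current
// process image is replaced, so the echo below is never reached.
pcntl_exec('/bin/ls', array('-l'));
echo "this line only runs if the exec failed", PHP_EOL;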

fahadsadah's comments state:

Once the executed process ends, control returns to the webserver process.

If that is the case, then I need to fork.

Edit: code for the confused - including me ;)

<?php

class Process
{
    // message for the exception thrown when pcntl_fork() fails
    const COULD_NOT_FORK = 'could not fork';

    public function __construct($arg = false)
    {
        if ($arg == "child")
        {
            $this->act();
        }
        else
        {
            $this->run();
        }
    }

    public function run()
    {
        echo "parent before fork:", getmypid(), PHP_EOL;
        $pid = @pcntl_fork();
        echo $pid, PHP_EOL;

        if ($pid == -1)
        {
            throw new Exception(self::COULD_NOT_FORK);
        }
        if ($pid)
        {
        // parent
            echo "parent after fork:", getmypid(), PHP_EOL;
        }
        elseif ($pid == 0)
        {
        // child
            echo "child after fork:", getmypid(), PHP_EOL;
            //echo exec('php Process.php child');
            echo pcntl_exec('/usr/bin/php', array('Process.php', 'child'));
            exit(1); // only reached if pcntl_exec() fails
        }
        return 0;
    }

    private function act()
    {
        sleep(1);
        echo "forked child new process:", getmypid(), PHP_EOL;
        return 0;
    }
}

$proc = new Process(isset($argv[1]) ? $argv[1] : false); // avoid a notice when no argument is given

If you uncomment the exec() call and comment out the pcntl_exec() call, you will see that pcntl_exec() replaces the process, which I'm guessing saves some resources.

A: 

This doesn't make sense. Once you exec() you're running different code, so you can't fork() afterwards. "Does not return on success."

EJP
Please explain why it does not make sense.
sims
I thought I had already done so. What didn't you understand?
EJP
You surely can fork. You fork the newly executed process. Which, perhaps, is what I'm trying to do. But it might be a bad idea.
sims
The executable that executed the exec() is no longer there to call fork(). That's why the question makes no sense. The *newly executed executable* would have to contain the fork() call. Does it? Why would it? What's it going to do in the parent of the fork()?
EJP
I need all children to be able to access the DB. I've posted some code as proof of concept, and it does work. That's not the question. The question is: which order is better? I think I've figured that out. But perhaps, as symcbean is trying to convince me, it's still a bad idea. The only other way may be Gearman, but that also might not be possible. It doesn't even have a package in Debian 5.
sims
A: 

This is really confused - you're trying to apply very sophisticated techniques - but you are applying them in completely the wrong way.

fork() creates a new running copy of the current process; exec() replaces the current process image with a new program. You would not use them both to start a single process.

But before I get into an explanation of how to use fork and exec correctly, I should point out that they are not the right tools for addressing this problem.

Batch processing should be avoided wherever possible. Data typically arrives at a finite rate (albeit that the rate may be stochastic) - usually the right approach to avoid batching is to deal with requests synchronously or via queueing. Where batch processing is unavoidable, parallelizing and/or pipelining the processing usually improves throughput. While there are many sophisticated methods for achieving this (e.g. map-reduce), simply sharding the data is usually adequate. While your basic idea amounts to sharding into single-record pieces, this:

1) will be less efficient than dealing with small batches

2) makes it very difficult to limit resource consumption by the system (what if you spawn 500 processes and your DBMS only supports 200 concurrent connections?) - one way to cap this is sketched below
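
If you do stick with fork-per-record, that cap can be enforced by never letting more than a fixed number of children run at once. A sketch, assuming the pcntl extension is available:

<?php
// Sketch: fork one child per record, but keep at most $max children
// alive at any time so connection limits are respected.
$max     = 8;
$running = 0;

foreach (range(1, 1000) as $id)     // stand-in for the records
{
    if ($running >= $max)
    {
        pcntl_wait($status);        // block until some child exits
        $running--;
    }
    $pid = pcntl_fork();
    if ($pid == -1)
    {
        die("fork failed\n");
    }
    if ($pid == 0)
    {
        // child: open its own DB connection, handle record $id, exit
        exit(0);
    }
    $running++;
}
while (pcntl_wait($status) != -1);  // reap the remaining children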

Assuming that you can't deal with the processing synchronously and running a queue with multiple subscribers is not practical, I'd suggest just splitting the data into (a limited number of) smaller batches and spawning processes to deal with those (hint: use the modulus operator). Note that popen(), proc_open() and pcntl_fork() do not block for the duration of execution of the spawned process.
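
A sketch of that suggestion, with the modulus operator picking each record's shard; worker.php is a hypothetical script that processes the IDs passed on its command line:

<?php
// Sketch: split record IDs into a fixed number of shards using the
// modulus operator, then launch one worker per shard. proc_open()
// returns immediately, so the workers run in parallel.
$ids    = range(1, 1000);           // stand-in for the record IDs
$shards = 4;                        // also the cap on concurrency

$batches = array_fill(0, $shards, array());
foreach ($ids as $id)
{
    $batches[$id % $shards][] = $id;
}

$procs = array();
foreach ($batches as $batch)
{
    $cmd     = '/usr/bin/php worker.php ' . implode(' ', $batch);
    $procs[] = proc_open($cmd, array(), $pipes);
}
foreach ($procs as $p)
{
    proc_close($p);                 // wait for each worker to finish
}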

If you want to launch the processing from an HTTP request (or have another reason for running it in separate session groups) then have a google for 'PHP long running processes setsid'.
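
The heart of that trick is to fork, let the parent exit so the request can complete, and have the child start its own session. A minimal sketch, assuming the pcntl and posix extensions:

<?php
// Sketch of the setsid trick: the parent exits immediately so the
// webserver request can finish; the child detaches into its own
// session and carries on with the long-running work.
$pid = pcntl_fork();
if ($pid == -1)
{
    die("fork failed\n");
}
if ($pid > 0)
{
    exit(0);        // parent: hand control back right away
}
posix_setsid();     // child: become session leader, detached
// ... long-running processing here ...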

symcbean
It's a command line script. It's not dealing with requests. A user changes the DB, which might trigger certain actions - depending on how the system is configured. So a log is kept of the changes that might possibly trigger an action. A cron job checks this log for actions that need to be performed. So it hasn't got anything directly to do with HTTP requests.
sims
Yes - you are trying to solve the problem using a command line script - but did you really type in the data by hand? How did it get there? Why didn't you deal with it when the "user changes the DB"?
symcbean
Because the user can configure the system to react after a lapse in time. For example, "remind all attendees the meeting is in 15 minutes" is just one simple action the system can/should be able to do.
sims