views:

212

answers:

3

I was having a discussion about the relative cost of fork() Vs thread() for parallelization of a task.

We understand the basic differences between processes Vs Thread

Thread:

  • Easy to communicate between threads
  • Fast context switching.

Processes:

  • Fault talarance.
  • Communicating with parent not a real problem (open a pipe)
  • Communication with other child processes hard

But we disagreed on the start-up cost of processes Vs threads.
So to test the theories I wrote the following code. My question: Is this a valid test of measuring the start-up cost or I am missing something. Also I would be interested in how each test performs on different platforms.

fork.cpp

#include <boost/lexical_cast.hpp>
#include <vector>
#include <unistd.h>
#include <iostream>
#include <stdlib.h>
#include <time.h>

extern "C" int threadStart(void* threadData)
{
    return 0;
}

int main(int argc,char* argv[])
{
    int threadCount =  boost::lexical_cast<int>(argv[1]);

    std::vector<pid_t>   data(threadCount);
    clock_t                 start   = clock();
    for(int loop=0;loop < threadCount;++loop)
    {
        data[loop]  = fork();
        if (data[looo] == -1)
        {
            std::cout << "Abort\n";
            exit(1);
        }
        if (data[loop] == 0)
        {
            exit(threadStart(NULL));
        }
    }
    clock_t                 middle   = clock();
    for(int loop=0;loop < threadCount;++loop)
    {
        int   result;
        waitpid(data[loop], &result, 0);
    }
    clock_t                 end   = clock();

   std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t"<< end - start << "\n";
}

Thread.cpp

#include <boost/lexical_cast.hpp>
#include <vector>
#include <iostream>
#include <pthread.h>
#include <time.h>


extern "C" void* threadStart(void* threadData)
{
    return NULL;
}   

int main(int argc,char* argv[])
{
    int threadCount =  boost::lexical_cast<int>(argv[1]);

    std::vector<pthread_t>   data(threadCount);

    clock_t                 start   = clock();
    for(int loop=0;loop < threadCount;++loop)
    {
        if (pthread_create(&data[loop], NULL, threadStart, NULL) != 0)
        {
            std::cout << "Abort\n";
            exit(1);
        }
    }   
    clock_t                 middle   = clock();
    for(int loop=0;loop < threadCount;++loop)
    {   
        void*   result;
        pthread_join(data[loop], &result);
    }
    clock_t                 end   = clock();

   std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t"<< end - start << "\n";

}

I expect Windows to do worse in processes creation.
But I would expect modern Unix like systems to have a fairly light fork cost and be at least comparable to thread. On older Unix style systems (before fork() was implemented as using copy on write pages) that it would be worse.

Anyway My timing results are:

> uname -a
Darwin Alpha.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
> gcc --version | grep GCC
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5659)
> g++ thread.cpp -o thread -I~/include
> g++ fork.cpp -o fork -I~/include
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./thread ${a} >> A
foreach? end
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./fork ${a}  >> A
foreach? end
vi A 

Thread:                             Fork:
C   Start   Wait    Total           C   Start   Wait    Total
==============================================================
 1    26     145     171             1   160     37      197
 2    44     198     242             2   290     37      327
 3    62     234     296             3   413     41      454
 4    77     275     352             4   499     59      558
 5    91     107   10808             5   599     57      656
 6    99     332     431             6   665     52      717
 7   130     388     518             7   741     69      810
 8   204     468     672             8   833     56      889
 9   164     469     633             9  1067     76     1143
10   165     450     615            10  1147     64     1211
12   343     585     928            12  1213     71     1284
15   232     647     879            15  1360    203     1563
20   319     921    1240            20  2161     96     2257
30   461    1243    1704            30  3005    129     3134
40   559    1487    2046            40  4466    166     4632
50   686    1912    2598            50  4591    292     4883
60   827    2208    3035            60  5234    317     5551
70   973    2885    3858            70  7003    416     7419
80  3545    2738    6283            80  7735    293     8028
90  1392    3497    4889            90  7869    463     8332
100 3917    4180    8097            100 8974    436     9410

Edit:

Doing a 1000 children caused the fork version to fail.
So I have reduced the children count. But doing a single test also seems unfair so here is a range of values.

A: 

Under Linux fork is a special call to sys_clone, either within the library or within the kernel. Clone has lots of switches to flip on and off, and each of them effects how expensive it is to start.

The actual library function clone is probably more expensive than fork though because it does more, though most of that is on the child side (stack swapping and calling a function by pointer).

nategoose
fork() also has a lot of work to do, e.g. calling atfork functions and cleaning up the stacks/TLS of other threads...
psmears
@psmears: You're right. Since glibc's fork is not exactly the same as `sys_fork` there is a good bit of extra stuff to do.
nategoose
+2  A: 

mumble ... I do not like your solution for many reasons:

  1. You are not taking in account the execution time of child processes/thread.

  2. You should compare cpu-usage not the bare elapsed time. This way your statistics will not depend from, e.g., disk access congestion.

  3. Let your child process do something. Remember that "modern" fork uses copy-on-write mechanisms to avoid to allocate memory to the child process until needed. It is too easy to exit immediately. This way you avoid quite all the disadvantages of fork.

  4. CPU time is not the only cost you have to account. Memory consumption and slowness of IPC are both disadvantages of fork solution.

You could use "rusage" instead of "clock" to measure real resource usage.

P.S. I do not think you can really measure the process/thread overhead writing a simple test program. There are too many factors and, usually, the choice between threads and processes is driven by other reasons than mere cpu-usage.

andcoz
In the end this is my argument. That the difference in cost between creating threads or processes is driven by many factors on usage. But the cost of creating the children should not be one of those factors as the difference is so slight.
Martin York
Whether you want CPU time or wall time depends on whether you're trying to use fewer CPU cycles or just get something done sooner. When parallelizing, you can't expect to use fewer cycles, so you probably only care about getting something done sooner, meaning that wall time is the correct measurement.
Gabe
A: 

What that micro-benchmark shows is that thread creation and joining (there are no fork results when I'm writing this) takes tens or hundreds of microseconds (assuming your system has CLOCKS_PER_SEC=1000000, which it probably has, since it's an XSI requirement).

Since you said that fork() takes 3 times the cost of threads, we are still talking tenths of a millisecond at worst. If that is noticeable on an application, you could use pools of processes/threads, like Apache 1.3 did. In any case, I'd say that startup time is a moot point.

The important difference of threads vs processes (on Linux and most Unix-likes) is that on processes you choose explicitly what to share, using IPC, shared memory (SYSV or mmap-style), pipes, sockets (you can send file descriptors over AF_UNIX sockets, meaning you get to choose which fd's to share), ... While on threads almost everything is shared by default, whether there's a need to share it or not. In fact, that is the reason Plan 9 had rfork() and Linux has clone() (and recently unshare()), so you can choose what to share.

ninjalj