views:

304

answers:

9

A colleague of mine asked me to write a homework for him. Although this wasn’t too ethical I did it, I plead guilty. This is how the problem goes: Write a program in C where the sequence 12 + 22 + ... + n2 is calculated. Assume that n is multiple of p and p is the number of threads. This is what I wrote:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SQR(X) ((X) * (X))

int n, p = 10, total_sum = 0;

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
/* Function prototype */
void *do_calc(void *arg);

int main(int argc, char** argv)
{
    int i;
    pthread_t *thread_array;
    printf("Type number n: ");
    fscanf(stdin, "%d", &n);

    if (n % p != 0 ) {
        fprintf(stderr, "Number must be multiple of 10 (number of threads)\n");
        exit(-1);
    }


    thread_array = (pthread_t *) malloc(p * sizeof(pthread_t));
    for (i = 0; i < p; i++)
        pthread_create(&thread_array[i], NULL, do_calc, (void *) i);
    for (i = 0; i < p; i++)
        pthread_join(thread_array[i], NULL);

    printf("Total sum: %d\n", total_sum);
    pthread_exit(NULL);
}

void *do_calc(void *arg)
{
    int i, local_sum = 0;
    int thr = (int) arg;
    pthread_mutex_lock(&mtx);
    for (i = thr * (n / p); i < ((thr + 1) * (n / p)); i++)
    local_sum += SQR(i + 1);
    total_sum += local_sum;
    pthread_mutex_unlock(&mtx);
    pthread_exit(NULL);
}

Aside from the logical/syntactic point of view, I was wondering:

  1. how the respective non-multithreaded program would perform
  2. how could I test/see their performance
  3. what would be the program without using threads

Thanks in advance and I’m looking forward to reading your thoughts

+6  A: 

This would depend on how many CPUs you have. With a single CPU core, a computation-bound program will never run faster with multiple threads.

Moreover, since you're doing all the work with the lock held, you'll end up with only a single thread running at any time, so it's effectively single threaded anyway.

bdonlan
you printing faster than me=(I agree.
Imaskar
55% of the accepted answer is this one and the other 45% is ebo's. :)
A: 

i would try to see how much do those calculations take. In case it's a very small fraction of time then i would probably gone for a single process model since spawning a thread for each calculation involves some overhead by it self.

Konstantinos
+1  A: 

As your code is serialised by a mutex in the actual calculation, it will be slower than a non-threaded version. Of course, you could easily have tested this for yourself.

anon
How? This is one of my questions.
Create a version without all the threading code and compare performance.
anon
+9  A: 

You are acquiring the Mutex before the calculations. You should do that immediately before summing to local values.

pthread_mutex_lock(&mtx);
total_sum += local_sum;
pthread_mutex_unlock(&mtx);
ebo
Nice catch, thanks!
If you want bonus points, there are atomic instructions you can use to avoid the lock (they use less processing time) - see http://stackoverflow.com/questions/680097/ive-heard-i-isnt-thread-safe-is-i-thread-safe/680114
Tom Leys
Tom Leys
A: 

to compare performance just remember system time at program start, call it from n=1000 and see system time at the end. compare to non-threaded program result. as bdonlan said, non-threaded will run faster

Imaskar
1000 will be far to low to get any reasonable results, even if the program would be right.
ebo
oh no, n=1000 will overflow integer... n=50
Imaskar
That may be another problem. I was considering the performance viewpoint.
ebo
yeah, low n brings poor statistics, but high n ovrflows integer. it's better to use at least long type and find biggest working 'n' and run test ~100 times. I'm not sure about caching, it may pollute results =(
Imaskar
A: 

1) Single threaded would probably perform a bit better than this, because all calculations are done within a lock and the overhead of locking will add to the total time. You are better off only locking when adding the local sums to the total sum, or storing the local sums in an array and calculating the total sum in the main thread.

2) Use timing statements in your code to measure elapsed time during the algoritm. In the multithreaded case, only measure elapsed time on the main thread.

3) Derived from your code:

int i, total_sum = 0;
for (i = 0; i < n; i++)
  total_sum += SQR(i + 1);
Renze de Waal
3 is a bad idea because then all threads are constantly writing to a shared value. Much better to only write once at the end of the thread once Local_sum is calculated.
Tom Leys
To elaborate slightly, every time a processor writes to a variable, all other processors have to remove that variable from their local cache and re-read it later. This is very very expensive
Tom Leys
@Tom: 3 answers the third question: what would be the program without using threads. Multithreading is not an issue in 3.
Renze de Waal
A: 

A much larger consideration comes to scheduling. The easiest way for kernel-side threading to be implemented is for each thread to get equal time regardless. Processes are just threads with their own memory space. IF all threads get equal time, adding a thread takes you from 1/n of the time to 2/(n + 1) of the time, which is obviously better given > 0 other threads that aren't you.

Actual implementations may and do vary wildly though.

Alex Gartrell
+3  A: 

Don't bother with threading etc. In fact, don't do any additions in a loop at all. Just use this formula:

∑(r = 1; n) r^2 = 1/6 * n (n + 1)(2 n + 1) [1]

[1]http://thesaurus.maths.org/mmkb/entry.html?action=entryById&amp;id=1539

Ben Schwehn
A: 

Off-topic a bit, but maybe avoid the mutex by having each thread write it's result into an array element (so assign "results = calloc(sizeof(int), p)" (btw "p" is an awful name for the variable holding the number of threads) and results[thr] = local_sum), and have the joining thread (well, main()) do the summing of the results. So each thread is responsible for just calculating its total: only main(), which orchestrates the threads, joins their data together. Separation of concerns.

For extra credit (:p), use the arg passed to do_calc() as a way to pass the thread ID and the location to write the result to rather than relying on a global array.

araqnid