





I have the following piece of code, which I want to parallelize in a particular way. I am making a mistake somewhere, so not all threads run the loop as I expected. It would be great if somebody could help me identify the mistake.

This code calculates histograms.

#pragma omp parallel default(shared) private(iIndex2, iIndex1, fDist) shared(iSize, dense) reduction(+:iCount)
{
    chunk = (unsigned int)(iSize / omp_get_num_threads());
    threadID = omp_get_thread_num();
    svtout << "Number of threads available " << omp_get_num_threads() << endl;
    svtout << "The threadID is " << threadID << endl;

    // want each thread to execute the loops over its own chunk
    for (iIndex1 = 0; iIndex1 < chunk; iIndex1++)
    {
        for (iIndex2 = iIndex1 + 1; iIndex2 < chunk; iIndex2++)
        {
            fDist = (*this)[iIndex1 + threadID*chunk].distance( (*this)[iIndex2 + threadID*chunk] );
            idx = (int)(fDist/fWidth);

            if ((int)fDist % (int)fWidth >= 0)
            {
                #pragma omp atomic
                dense[idx] += 1;
            }
        }
    }
}

The iCount variable keeps track of the number of iterations, and I noticed a marked difference between the serial and the parallel versions. I guess not all threads are running, and hence the histogram values I obtain from the parallel program are much lower than the actual values (the dense array stores the histogram values).



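You are looping over chunk rather than iSize when there is more than one thread. Each thread then only compares elements inside its own chunk, so pairs that span two chunks are never counted. Try replacing the loop bounds with iSize.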
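A minimal sketch of that suggestion, under stated assumptions: the `Point` type, the `pairHistogram` function and the `nBins` parameter are hypothetical stand-ins for the class behind `(*this)[i]` in the question, and `schedule(dynamic)` is only one possible way to balance the triangular inner loop. It uses only OpenMP pragmas, so it also compiles (serially) without OpenMP enabled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the element type behind (*this)[i] in the question.
struct Point {
    double x, y;
    double distance(const Point& other) const {
        return std::hypot(x - other.x, y - other.y);
    }
};

// Histogram of pairwise distances over the full index range.
// Distances that fall beyond the last bin are skipped (an assumption).
std::vector<long> pairHistogram(const std::vector<Point>& pts,
                                double fWidth, std::size_t nBins,
                                long& iCount) {
    assert(fWidth > 0.0);
    std::vector<long> dense(nBins, 0);
    long* bins = dense.data();
    const long iSize = static_cast<long>(pts.size());
    long count = 0;

    // The outer loop runs over iSize, not chunk; OpenMP divides the
    // iterations among the threads, so no manual threadID*chunk offset.
    #pragma omp parallel for schedule(dynamic) reduction(+:count)
    for (long i = 0; i < iSize; ++i) {
        for (long j = i + 1; j < iSize; ++j) {
            // Declared inside the loop, so each thread has its own copy.
            const double fDist = pts[i].distance(pts[j]);
            const std::size_t idx = static_cast<std::size_t>(fDist / fWidth);
            ++count;
            if (idx < nBins) {
                #pragma omp atomic
                bins[idx] += 1;
            }
        }
    }
    iCount = count;  // total number of pairs visited
    return dense;
}
```

Declaring `fDist` and `idx` inside the loop body avoids having to list them in a `private` clause, and the reduction is done on a local `count` that is copied to `iCount` afterwards.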

Yes, it would work then, but I was thinking that if each thread ran the for loops over its own chunk in parallel, the whole range of iSize would be covered (hence subscripts like [i + threadNum*chunk]), and that this would probably be faster than just putting a parallel for on the outer loop and a pragma atomic before dense. But I guess I should remove the pragma for completely.
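Covering every index is not the same as covering every pair, which may explain the lower counts. The sketch below counts the pairs each scheme visits; the function names `chunkedPairs` and `allPairs` are illustrative, not from the original code.

```cpp
#include <cassert>

// Pairs (i, j) with i < j that are visited when each of nThreads threads
// only loops inside its own block of chunk = iSize / nThreads indices.
long long chunkedPairs(long long iSize, long long nThreads) {
    assert(nThreads > 0);
    const long long chunk = iSize / nThreads;  // integer division, as in the question
    return nThreads * (chunk * (chunk - 1) / 2);
}

// Pairs visited by the serial double loop over the whole range.
long long allPairs(long long iSize) {
    return iSize * (iSize - 1) / 2;
}
```

For example, with iSize = 1000 and 4 threads, `chunkedPairs(1000, 4)` gives 124500 while `allPairs(1000)` gives 499500, so roughly three quarters of the pairs would never reach the histogram.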