ansaurus

Question

Cummulative array summation using OpenCL

Answer 1

+1 A:

Your problem lies in these lines of code:

C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];

All threads in your workgroup (well, actually all your threads, but lets come to to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:

The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be called after other threads have already executed the next line
Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that writeback is (I think one of the threads succeeds in writing back, while the others fail, but I'm not completely sure), but its wrong either way.

Now lets fix this: While we might be able to get this to work on global memory using atomics, it won't be fast, so lets accumulate in local memory:

local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
{
    if ((featA % (2*i)) == 0)
        accum[featA] += accum[featA + i];
    barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
    C[loop] = accum[0];

Of course you can reuse other local buffers for this, but I think the point is clear (btw: Are you sure that dif_acum will be created in local memory, because I think I read somewhere that this wouldn't be put in local memory, which would make preloading A into local memory kind of pointless).

Some other points about this code:

Your code is seems to be geared to using only on workgroup (you aren't using either groupid nor global id to see which items to work on), for optimal performance you might want to use more then that.
Might be personal preferance, but I to me it seems better to use get_local_size(0) for the workgroupsize than to use a Define (since you might change it in the host code without realizing you should have changed your opencl code to)
The barriers in your code are all unnecessary, since no thread accesses an element in local memory which is written by another thread. Therefore you don't need to use local memory for this.

Considering the last bullet you could simply do:

float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];

This would make the code (not considering the first two bullets):

__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
   int gpidA = get_global_id(0);
   int featA = get_local_id(0);
   int loop = 0;
   for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
       DescriptorList tmpA = GetDescriptor(A, i);
       float As = GetElement(tmpA, 0, featA);
       for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
           float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];

           accum[featA] = dif_acum[featA]*dif_acum[featA];
           barrier(CLK_LOCAL_MEM_FENCE);
           for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
           {
              if ((featA % (2*i)) == 0)
                 accum[featA] += accum[featA + i];
              barrier(CLK_LOCAL_MEM_FENCE);
           }
           if(featA == 0)
              C[loop] = accum[0];
           barrier(CLK_LOCAL_MEM_FENCE);

           loop += 1;
        }
    }
}

Grizzly 2010-09-23 02:43:31

I have to say, first of all, that I'm very thankful with Grizzly's answer. I'm rather new to OpenCL and, although I needed to tweak the example code he gave a little bit, it led me straight into the right direction. Important things that I noticed (by trial and error): threads that not address the array positions needed to be discarded; the SCAN loop needed a little tweaking, namely using an auxiliary buffer to accumulate partial results and checking for boundary conditions to find the terms to be added. Thank you very much again! I'm posting the code that worked for me.

SebastianP 2010-09-24 14:24:01

Answer 2

A:

Thanks to Grizzly, I have now a working kernel. Some things I needed to modify based in the answer of Grizzly:

I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.

if(featA > BLOCK_SIZE){return;}

When copying the first descriptor to local (shared) memory (i.g. to Bs), the index has to be specified since the function GetElement returns just one element per call (I skipped that on my question).

Bs[featA] = GetElement(tmpA, 0, featA);

Then, the SCAN loop needed a little tweaking because the buffer is being overwritten after each iteration and one cannot control which thread access the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and that way, prevent inconsistencies throughout that loop.

dif_acum[featA] = accum[featA];

There are also some boundary control in the SCAN loop to reliably determine the terms to be added together.

if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){

Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.

if(featA == 0){
    C[loop] = accum[BLOCK_SIZE-1];
    loop += 1;
}

That's it. I still wonder how can I make use of group_size to eliminate that BLOCK_SIZE definition and if there are better policies I can adopt regarding thread usage.

So the code looks finally like this:

__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{

    int gpidA = get_global_id(0);
    int featA = get_local_id(0);

    //global counter to store final differences
    int loop = 0;

    //auxiliary buffer to store temporary data
    local float dif_acum[BLOCK_SIZE];

    //discard the threads that are not going to be used.
    if(featA > BLOCK_SIZE){
        return;
    }

    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){

        //take the gpidA-th descriptor
        DescriptorList tmpA = GetDescriptor(A, i);

        //copy the current descriptor to local memory
        Bs[featA] = GetElement(tmpA, 0, featA);

        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current descriptors
            dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];

            //square the values in dif_acum
            accum[featA] = dif_acum[featA]*dif_acum[featA];
            barrier(CLK_LOCAL_MEM_FENCE);

            //copy the values of accum to keep consistency once the scan procedure starts. Mostly important for the first element. Two buffers are necesarry because the scan procedure would override values that are then further read if one buffer is being used instead.
            dif_acum[featA] = accum[featA];

            //Compute the accumulated sum (a.k.a. scan)
            for(int j = 1; j < BLOCK_SIZE; j *= 2){
                int next_addend = featA-(j/2);
                if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
                    dif_acum[featA] = accum[featA] + accum[next_addend];
                }
                barrier(CLK_LOCAL_MEM_FENCE);

                //copy As to accum
                accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA); 
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            //tell one of the threads to write the result of the scan in the array containing the results.
            if(featA == 0){
                C[loop] = accum[BLOCK_SIZE-1];
                loop += 1;
            }
            barrier(CLK_LOCAL_MEM_FENCE);

        }
    }
}

SebastianP 2010-09-24 14:52:54

I still think you don't need those local buffers (accept for accum of course), since both dif_acum and Bs are only accessed at position featA, which is the local id of the thread and therefore each element of the arrays is accessed by only one thread. Furthermore using two buffers for the scan loop shouldn't really be necessary, since the the consistency is guaranteed by the barriers (the iterations are seperated by barriers and (at least for the loop I posted) in each iteration each element is only accessed by one thread

Grizzly 2010-09-24 15:25:54

ansaurus

tags:

views:

answers:

Cummulative array summation using OpenCL

related questions