views:

50

answers:

2

I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.

My approach is to do the regular doble loop (for every point in Table1{ for every point in Table2{...} } and then do the calculation for every pair of points in paralell.

The euclidean distance is then split in 3 parts: 1. take the difference between each dimension in the points 2. square that difference (still for every dimension) 3. sum all the values obtained in 2. 4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)

Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3. of the procedure described above, line 49 of the code below).

As test data I'm using DescriptorLists with 2 points each: DescriptorList1: 001,002,003,...,127,128; (p1) 129,130,131,...,255,256; (p2)

DescriptorList2: 000,001,002,...,126,127; (p1) 128,129,130,...,254,255; (p2)

So the resulting vector should have the values: 128, 2064512, 2130048, 128 Right now I'm getting random numbers that vary with every run.

I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.

#define BLOCK_SIZE 128

typedef struct
{
    //How large each point is
    int length;
    //How many points in every list
    int num_elements;
    //Pointer to the elements of the descriptor (stored as a raw array)
    __global float *elements;
} DescriptorList;

__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{

    int gpidA = get_global_id(0);

    int featA = get_local_id(0);

    //temporary array  to store the difference between each dimension of 2 points
    float dif_acum[BLOCK_SIZE];

    //counter to track the iterations of the inner loop
    int loop = 0;

    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){

        //take the i-th descriptor. Returns a DescriptorList with just the i-th
        //descriptor in DescriptorList A
        DescriptorList tmpA = GetDescriptor(A, i);

        //copy the current descriptor to local memory.
        //returns one element of the only descriptor in DescriptorList tmpA
        //and index featA
        As[featA] = GetElement(tmpA, 0, featA);
        //wait for all the threads to finish copying before continuing
        barrier(CLK_LOCAL_MEM_FENCE);

        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current points
            dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
            //wait again
            barrier(CLK_LOCAL_MEM_FENCE);
            //square value of the difference in dif_acum and store in C
            //which is where the results should be stored at the end.
            C[loop] = 0;
            C[loop] += dif_acum[featA]*dif_acum[featA];
            loop += 1;
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}
+1  A: 

Your problem lies in these lines of code:

C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];

All threads in your workgroup (well, actually all your threads, but lets come to to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:

  1. The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be called after other threads have already executed the next line
  2. Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that writeback is (I think one of the threads succeeds in writing back, while the others fail, but I'm not completely sure), but its wrong either way.

Now lets fix this: While we might be able to get this to work on global memory using atomics, it won't be fast, so lets accumulate in local memory:

local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
{
    if ((featA % (2*i)) == 0)
        accum[featA] += accum[featA + i];
    barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
    C[loop] = accum[0];

Of course you can reuse other local buffers for this, but I think the point is clear (btw: Are you sure that dif_acum will be created in local memory, because I think I read somewhere that this wouldn't be put in local memory, which would make preloading A into local memory kind of pointless).

Some other points about this code:

  1. Your code is seems to be geared to using only on workgroup (you aren't using either groupid nor global id to see which items to work on), for optimal performance you might want to use more then that.
  2. Might be personal preferance, but I to me it seems better to use get_local_size(0) for the workgroupsize than to use a Define (since you might change it in the host code without realizing you should have changed your opencl code to)
  3. The barriers in your code are all unnecessary, since no thread accesses an element in local memory which is written by another thread. Therefore you don't need to use local memory for this.

Considering the last bullet you could simply do:

float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];

This would make the code (not considering the first two bullets):

__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
   int gpidA = get_global_id(0);
   int featA = get_local_id(0);
   int loop = 0;
   for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
       DescriptorList tmpA = GetDescriptor(A, i);
       float As = GetElement(tmpA, 0, featA);
       for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
           float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];

           accum[featA] = dif_acum[featA]*dif_acum[featA];
           barrier(CLK_LOCAL_MEM_FENCE);
           for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
           {
              if ((featA % (2*i)) == 0)
                 accum[featA] += accum[featA + i];
              barrier(CLK_LOCAL_MEM_FENCE);
           }
           if(featA == 0)
              C[loop] = accum[0];
           barrier(CLK_LOCAL_MEM_FENCE);

           loop += 1;
        }
    }
}
Grizzly
I have to say, first of all, that I'm very thankful with Grizzly's answer. I'm rather new to OpenCL and, although I needed to tweak the example code he gave a little bit, it led me straight into the right direction. Important things that I noticed (by trial and error): threads that not address the array positions needed to be discarded; the SCAN loop needed a little tweaking, namely using an auxiliary buffer to accumulate partial results and checking for boundary conditions to find the terms to be added. Thank you very much again! I'm posting the code that worked for me.
SebastianP
A: 

Thanks to Grizzly, I have now a working kernel. Some things I needed to modify based in the answer of Grizzly:

I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.

if(featA > BLOCK_SIZE){return;}

When copying the first descriptor to local (shared) memory (i.g. to Bs), the index has to be specified since the function GetElement returns just one element per call (I skipped that on my question).

Bs[featA] = GetElement(tmpA, 0, featA);

Then, the SCAN loop needed a little tweaking because the buffer is being overwritten after each iteration and one cannot control which thread access the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and that way, prevent inconsistencies throughout that loop.

dif_acum[featA] = accum[featA];

There are also some boundary control in the SCAN loop to reliably determine the terms to be added together.

if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){

Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.

if(featA == 0){
    C[loop] = accum[BLOCK_SIZE-1];
    loop += 1;
}

That's it. I still wonder how can I make use of group_size to eliminate that BLOCK_SIZE definition and if there are better policies I can adopt regarding thread usage.

So the code looks finally like this:

__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{

    int gpidA = get_global_id(0);
    int featA = get_local_id(0);

    //global counter to store final differences
    int loop = 0;

    //auxiliary buffer to store temporary data
    local float dif_acum[BLOCK_SIZE];

    //discard the threads that are not going to be used.
    if(featA > BLOCK_SIZE){
        return;
    }

    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){

        //take the gpidA-th descriptor
        DescriptorList tmpA = GetDescriptor(A, i);

        //copy the current descriptor to local memory
        Bs[featA] = GetElement(tmpA, 0, featA);

        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current descriptors
            dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];

            //square the values in dif_acum
            accum[featA] = dif_acum[featA]*dif_acum[featA];
            barrier(CLK_LOCAL_MEM_FENCE);

            //copy the values of accum to keep consistency once the scan procedure starts. Mostly important for the first element. Two buffers are necesarry because the scan procedure would override values that are then further read if one buffer is being used instead.
            dif_acum[featA] = accum[featA];

            //Compute the accumulated sum (a.k.a. scan)
            for(int j = 1; j < BLOCK_SIZE; j *= 2){
                int next_addend = featA-(j/2);
                if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
                    dif_acum[featA] = accum[featA] + accum[next_addend];
                }
                barrier(CLK_LOCAL_MEM_FENCE);

                //copy As to accum
                accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA); 
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            //tell one of the threads to write the result of the scan in the array containing the results.
            if(featA == 0){
                C[loop] = accum[BLOCK_SIZE-1];
                loop += 1;
            }
            barrier(CLK_LOCAL_MEM_FENCE);

        }
    }
}
SebastianP
I still think you don't need those local buffers (accept for accum of course), since both dif_acum and Bs are only accessed at position featA, which is the local id of the thread and therefore each element of the arrays is accessed by only one thread. Furthermore using two buffers for the scan loop shouldn't really be necessary, since the the consistency is guaranteed by the barriers (the iterations are seperated by barriers and (at least for the loop I posted) in each iteration each element is only accessed by one thread
Grizzly