ansaurus

Question

Generate all combinations of a char array inside of a CUDA __device__ kernel

Answer 1

A:

Let's see:

When filling your charset array, __syncthreads() is sufficient, because you are not concerned with writes to global memory at that point (more on this later).
Your if statements do not reset your loop iterators correctly:
- In the z < m branch, both x == m and y == m, so both must be set to 0.
- Similarly, in the w branch x, y and z have all reached m and must all be set to 0.
Each thread is responsible for writing one set of 4 characters into charset, but every thread writes the same 4 values, so no thread does any independent work.
You are writing each thread's results to global memory without atomics, which is unsafe: there is no guarantee that another thread won't clobber the results before they are read back.
You are reading the results of the computation back from global memory immediately after writing them. It's unclear why you are doing this, and it is very unsafe.
Finally, there is no reliable way in CUDA to do a synchronization between all blocks, which seems to be what you are hoping for. __threadfence() is a memory fence rather than a barrier, and it only applies to blocks currently executing on the device, which can be a subset of all the blocks that should run for a kernel call. Thus it doesn't work as a synchronization primitive.

It's probably easier to calculate initial values of x, y, z and w for each thread. Then each thread can start looping from its initial values until it has performed tasksPerThread iterations. Writing the values out can probably proceed more or less as you have it now.
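As a rough sketch of how those initial values could be computed, the linear index of a thread's first combination can be decoded into base-m digits. The names Digits, digits_from_index and thread_start are made up for illustration, and the sketch assumes each thread handles tasksPerThread consecutive combinations with x as the fastest-changing counter. It is plain C; in a kernel the helpers could be marked __device__.

```c
#include <assert.h>

/* Hypothetical helper type: the four loop counters of one combination. */
typedef struct { int x, y, z, w; } Digits;

/* Decode a linear combination index into base-m digits,
   with x as the least significant (fastest-changing) digit. */
static Digits digits_from_index(long long index, int m)
{
    assert(m > 0 && index >= 0);
    Digits d;
    d.x = (int)(index % m); index /= m;
    d.y = (int)(index % m); index /= m;
    d.z = (int)(index % m); index /= m;
    d.w = (int)(index % m);
    return d;
}

/* First combination handled by a thread, assuming every thread
   processes tasksPerThread consecutive combinations. */
static Digits thread_start(int threadId, int tasksPerThread, int m)
{
    assert(threadId >= 0 && tasksPerThread > 0);
    return digits_from_index((long long)threadId * tasksPerThread, m);
}
```

With m = 2 and tasksPerThread = 4, thread 1 would start at index 4, that is x = 0, y = 0, z = 1, w = 0.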

EDIT: Here is a simple test program to demonstrate the logic errors in your loop iteration:

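#include <stdio.h>

int main(void)
{
    int m = 2;
    int x = 0, y = 0, z = 0, w = 0;

    /* Iteration logic under test; the missing resets are deliberate. */
    for (int i = 0; i < m * m * m * m; i++)
    {
        printf("x: %d y: %d z: %d w: %d\n", x, y, z, w);
        if (x < m) {
            x++;
        } else if (y < m) {
            x = 0;
            y++;
        } else if (z < m) {
            y = 0;  /* x is not reset here */
            z++;
        } else if (w < m) {
            z = 0;  /* x and y are not reset here */
            w++;
        }
    }
    return 0;
}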

The output of which is this:

x: 0 y: 0 z: 0 w: 0
x: 1 y: 0 z: 0 w: 0
x: 2 y: 0 z: 0 w: 0
x: 0 y: 1 z: 0 w: 0
x: 1 y: 1 z: 0 w: 0
x: 2 y: 1 z: 0 w: 0
x: 0 y: 2 z: 0 w: 0
x: 1 y: 2 z: 0 w: 0
x: 2 y: 2 z: 0 w: 0
x: 2 y: 0 z: 1 w: 0
x: 0 y: 1 z: 1 w: 0
x: 1 y: 1 z: 1 w: 0
x: 2 y: 1 z: 1 w: 0
x: 0 y: 2 z: 1 w: 0
x: 1 y: 2 z: 1 w: 0
x: 2 y: 2 z: 1 w: 0
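For comparison, one possible way to write the increment is as an odometer that resets every overflowing counter and carries into the next one. This is only a sketch: next_combination is a hypothetical name, and it assumes each counter is meant to range over 0..m-1, which is what the m * m * m * m loop bound suggests.

```c
#include <assert.h>

/* Advance (x, y, z, w) to the next combination, treating the counters
   as base-m digits with x least significant.
   Returns 1 on success, 0 once the counters wrap back to 0 0 0 0. */
static int next_combination(int *x, int *y, int *z, int *w, int m)
{
    assert(m > 0);
    int *digit[4] = { x, y, z, w };
    for (int i = 0; i < 4; i++) {
        if (++*digit[i] < m)
            return 1;       /* no carry needed, done */
        *digit[i] = 0;      /* overflow: reset this digit, carry into the next */
    }
    return 0;               /* every digit wrapped around */
}
```

With m = 2 and all counters starting at 0, successive calls yield 1 0 0 0, then 0 1 0 0, then 1 1 0 0, then 0 0 1 0, and so on through all 16 combinations.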

Eric 2009-11-24 11:49:13

Hi, thanks for your answer! My idea was that "__device__ uchar4 charset_global" is a kind of master array. Each thread block should fetch the current value of charset_global into shared charset[128], compute the next combination (fill in some computation with the char set here), and finally write the combination it has just computed back to charset_global, so the next thread can use the already-computed combination as its offset. I hope you got me right ;)) PS: "Your if statements are not correctly resetting your loop iterators" - should be right-working on userland - origin: combfunc aocp

sead 2009-11-24 11:58:56

I have no idea what 'right-working in userland' means, but you can see using the code in my edit that there are indeed problems with the loop iteration.

Eric 2009-11-24 14:07:26

The algorithm you are describing (in your comment) is a serial algorithm. That is, no thread can calculate a unique password until it gets the result from a previous thread. No threads can operate in parallel because they would start with the same initial password and permute it in the same way, producing duplicate output. The way to parallelize this is to understand that you will be generating 74^N possible combinations and each thread will generate 74^N/M of those combinations completely independent of what any other thread does.
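To make the split concrete, here is a small sketch of how the index space could be divided, assuming M is the total number of threads and that the total need not be divisible by M. The names Range, total_combinations and thread_range are hypothetical, and the code is plain C that could be adapted to run in a kernel.

```c
#include <assert.h>

/* Half-open range [begin, end) of combination indices owned by one thread. */
typedef struct { long long begin, end; } Range;

/* charsetSize^length, e.g. 74^N for a 74-character set. */
static long long total_combinations(int charsetSize, int length)
{
    assert(charsetSize > 0 && length >= 0);
    long long total = 1;
    for (int i = 0; i < length; i++)
        total *= charsetSize;
    return total;
}

/* Split [0, total) into numThreads contiguous chunks of nearly equal size.
   The last chunks may be shorter (or empty) when total is not a multiple. */
static Range thread_range(long long total, int numThreads, int threadId)
{
    assert(total >= 0 && numThreads > 0);
    assert(threadId >= 0 && threadId < numThreads);
    long long chunk = (total + numThreads - 1) / numThreads;  /* ceiling division */
    Range r;
    r.begin = threadId * chunk;
    r.end = r.begin + chunk;
    if (r.begin > total) r.begin = total;
    if (r.end > total) r.end = total;
    return r;
}
```

Because every range is derived only from the thread's own id, no thread needs to read another thread's result before it starts.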

Eric 2009-11-24 14:13:22

Answer 2

A:

Incidentally, your loop bound is overly complex. You don't need to do all that work to compute endIdx; the following loop is simpler:

for(int idx = myThreadIdx ; idx < N ; idx += totalThreads)
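To see why this strided loop covers every index exactly once, here is a host-side simulation in plain C that runs the loop for each thread id in turn. The function name simulate_strided is made up for illustration. In a kernel, myThreadIdx and totalThreads would presumably be derived from the built-ins, for example blockIdx.x * blockDim.x + threadIdx.x and gridDim.x * blockDim.x; that mapping is an assumption, since the original code is not shown here.

```c
#include <assert.h>

/* Run the strided loop once per simulated thread and count, for each
   index in [0, N), how many threads visited it. hits must hold N ints. */
static void simulate_strided(int totalThreads, int N, int *hits)
{
    assert(totalThreads > 0 && N >= 0);
    for (int i = 0; i < N; i++)
        hits[i] = 0;
    for (int myThreadIdx = 0; myThreadIdx < totalThreads; myThreadIdx++) {
        /* Same loop as above: start at the thread's own index,
           then step by the total number of threads. */
        for (int idx = myThreadIdx; idx < N; idx += totalThreads)
            hits[idx]++;
    }
}
```

After the call, every entry of hits should be 1, whether or not N is a multiple of totalThreads.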

Tom 2009-11-24 14:35:36
