views: 76

answers: 1
Hi everybody,

I've been working on an AES CUDA application and I have a kernel which performs ECB encryption on the GPU. To verify that the logic of the algorithm is not altered when running in parallel, I send a known input test vector provided by NIST and then, from host code, compare the output against the known test-vector output provided by NIST with an assert. I have run this test on my NVIDIA GPU, an 8600M GT, under Windows 7 with driver version 3.0. In this scenario everything works perfectly and the assert succeeds.

Now, when the application is run on a Quadro FX 770M, the same application is launched and the same test vectors are sent, but the result obtained is incorrect and the assert fails! This runs on Linux with the same driver version. The kernels are executed by 256 threads. Within the kernels, to avoid arithmetic, precomputed lookup tables of 256 elements are used. These tables are originally loaded in global memory; each of the 256 threads launching the kernel collaborates in loading one element of the lookup table and moves it into a new lookup table in shared memory, so access latency is decreased.
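A cooperative load like the one described is only safe if every thread reaches a barrier between writing the shared table and reading from it. A minimal sketch of the pattern, assuming names such as `d_sbox` and `TABLE_SIZE` that are illustrative and not taken from the original code:

```cuda
#define TABLE_SIZE 256

__global__ void AES_encrypt_sketch(unsigned char *data, const unsigned char *d_sbox)
{
    // 256 threads cooperatively stage the lookup table, one element each.
    __shared__ unsigned char s_sbox[TABLE_SIZE];

    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // flatten the 4x64 block
    if (tid < TABLE_SIZE)
        s_sbox[tid] = d_sbox[tid];

    // Without this barrier, a thread may read s_sbox entries that another
    // thread has not written yet -- a race whose outcome can differ per GPU.
    __syncthreads();

    // ... AES rounds using s_sbox[...] ...
}
```

If the barrier is already there, the race explanation becomes less likely and the host-side data (keys, vectors) becomes the prime suspect.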

Originally, I thought about synchronization problems due to clock-speed differences between the GPUs: maybe some threads were using values not yet loaded into shared memory, or values that had not yet been processed, corrupting the output and making it incorrect.

Here the known test vectors are declared; they are sent to AES_set_encrption, which is in charge of setting up the kernel:

void test_vectors ()
{
    // NIST SP 800-38A AES-256 ECB test vector
    unsigned char testPlainText[]  = {0x6b, 0xc1, 0xbe, 0xe2, 0x2e, 0x40, 0x9f, 0x96, 0xe9, 0x3d, 0x7e, 0x11, 0x73, 0x93, 0x17, 0x2a};
    unsigned char testKeyText[]    = {0x60, 0x3d, 0xeb, 0x10, 0x15, 0xca, 0x71, 0xbe, 0x2b, 0x73, 0xae, 0xf0, 0x85, 0x7d, 0x77, 0x81,
                                      0x1f, 0x35, 0x2c, 0x07, 0x3b, 0x61, 0x08, 0xd7, 0x2d, 0x98, 0x10, 0xa3, 0x09, 0x14, 0xdf, 0xf4};
    unsigned char testCipherText[] = {0xf3, 0xee, 0xd1, 0xbd, 0xb5, 0xd2, 0xa0, 0x3c, 0x06, 0x4b, 0x5a, 0x7e, 0x3d, 0xb1, 0x81, 0xf8};

    unsigned char out[16] = {0x0};

    // AES encryption
    AES_set_encrption(testPlainText, out, 16, (u32*)testKeyText);

    // Display encrypted data (%02x keeps the leading zero of bytes such as 0x06)
    printf("\n  GPU Encryption: ");
    for (int i = 0; i < AES_BLOCK_SIZE; i++)
        printf("%02x", out[i]);

    // Assert that the encrypted output matches the NIST testCipherText vector
    assert(memcmp(out, testCipherText, 16) == 0);
}

Here the setup function is in charge of allocating memory, calling the kernel, and sending the results back to the host. Notice I synchronize before copying back to the host, so at that point everything should be finished, which makes me think the problem is within the kernel.

__host__ double AES_set_encrption (... *input_data, ... *output_data, ... input_length, ... ckey)
{
    // Allocate memory on the device and copy the input buffer from the host to the GPU
    CUDA_SAFE_CALL( cudaMalloc( (void **) &d_input_data, input_length ) );
    CUDA_SAFE_CALL( cudaMemcpy( (void*)d_input_data, (void*)input_data, input_length, cudaMemcpyHostToDevice ) );

    dim3 dimGrid(1);
    dim3 dimBlock(THREAD_X, THREAD_Y); // THREAD_X = 4 & THREAD_Y = 64
    AES_encrypt<<<dimGrid, dimBlock>>>(d_input_data);

    CUDA_SAFE_CALL( cudaThreadSynchronize() );

    // Copy the data processed by the GPU back to the host
    CUDA_SAFE_CALL( cudaMemcpy(output_data, d_input_data, input_length, cudaMemcpyDeviceToHost) );

    // Free CUDA resources
    CUDA_SAFE_CALL( cudaFree(d_input_data) );
}
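When two machines disagree, it is also worth checking the launch itself: a kernel that fails to launch (for example, because the block shape exceeds a device limit) leaves the output buffer full of stale data without any visible crash. A hedged sketch of checking both the launch and its completion; `CUDA_CHECK` is a hypothetical macro, not part of the original code:

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with the CUDA error string on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After a kernel launch, check both the launch and the execution:
AES_encrypt<<<dimGrid, dimBlock>>>(d_input_data);
CUDA_CHECK(cudaGetLastError());        // launch/configuration errors
CUDA_CHECK(cudaThreadSynchronize());   // errors raised while the kernel ran
```

`cudaGetLastError()` catches asynchronous launch failures that the later `cudaMemcpy` would otherwise mask.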

And finally, within the kernel, a set of AES rounds is computed. Since I thought the synchronization problem was within the kernel, I placed __syncthreads(); after each round or computational operation to make sure all threads move together, so no uncomputed values would be evaluated. But that still did not solve the problem.

Here is the output when I use the 8600M GT GPU, which works fine:

AES 256 Bit key

NIST Test Vectors:

PlaintText: 6bc1bee22e409f96e93d7e117393172a

Key: 603deb1015ca71be2b73aef0857d7781

CipherText: f3eed1bdb5d2a03c64b5a7e3db181f8

GPU Encryption: f3eed1bdb5d2a03c64b5a7e3db181f8

Test status: Passed

And here is the output when I use the Quadro FX 770M, where it fails:

AES 256 Bit key NIST Test Vectors:

PlaintText: 6bc1bee22e409f96e93d7e117393172a

Key: 603deb1015ca71be2b73aef0857d7781

CipherText: f3eed1bdb5d2a03c64b5a7e3db181f8

GPU Encryption: c837204eb4c1063ed79c77946893b0

Generic assert memcmp (out, testCipherText, 16) == 0 has thrown an error

Test status: Failed

What might be the reason why two GPUs compute different results even when they run the same kernels? I would appreciate any hint or troubleshooting step any of you could give me to fix this issue.

Thanks in advance!

+1  A: 

disclaimer: I don't know anything about AES encryption.

Do you use double precision? You are probably aware, but just to be sure - I believe that both of the cards you are using are compute capability 1.1, which does not support double precision. Perhaps the cards or the platforms convert to single precision in different ways...? Anyone know? Truthfully, the IEEE floating-point deviations are well specified, so I'd be surprised.

jmilloy
Also, I just joined.... how do I add a comment to the original post (as opposed to adding an answer)?
jmilloy
I think you need more `rep` to edit. But I gave you a +1 to help you along :)
leppie
Hi, thank you very much for your advice; as a matter of fact we could track the bug and it was indeed not related to device code. Overview: OpenSSL is in charge of processing the round keys. For some reason the OpenSSL Windows version processes a different round key than the Linux version, so the AES round keys are already different when sent, which leads to a different device output. We're trying to catch this bug as well. As soon as we get it I'll post the answer. Nevertheless, hints are still welcome. Thanks!
Bartzilla