I would ask this in the CUDA forums, but for some reason I can't get past the first page of the registration, so here goes:
NVIDIA card: 9800 GT
CUDA toolkit: 3.0
Compiled for: compute capability 1.1
Scenario 1:
float result = 0;
float f1 = tex2D( tex, u, v );
float f2 = tex2D( tex, u + 1, v + 1 );
long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );
__syncthreads();
long long temp = ll1 + ll2;
__syncthreads();
if( temp == 0 ) // any test that uses the value of temp
{
result = 1.0f; // whatever value
}
__syncthreads();
if( threadIdx.x != 0 ) // some condition that causes within block thread divergence
{
result = __ll2float_rn( __float2ll_rn( f1 ) );
}
// write result to device memory array at (u,v)
Lots of __syncthreads() for my paranoia. If I run the above, result always comes back as 4.593672e+18f no matter what input I put in the texture (googling the number doesn't turn up anything special). Here are some observations; the changed lines in each scenario are highlighted with -->.
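(Before the variations: for anyone who wants to try this without my texture setup, here is a stripped-down sketch of the pattern. The buffers in and out are hypothetical, in needs blockDim.x + 1 elements, and the tex2D reads are replaced with plain global loads; I have not verified that this variant still reproduces the bug, it is only meant to show the shape.)
__global__ void repro( const float* in, float* out )
{
float result = 0.0f;
float f1 = in[ threadIdx.x ]; // stand-in for tex2D( tex, u, v )
float f2 = in[ threadIdx.x + 1 ]; // stand-in for tex2D( tex, u + 1, v + 1 )
long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );
__syncthreads();
long long temp = ll1 + ll2;
__syncthreads();
if( temp == 0 ) // any test that uses the value of temp
{
result = 1.0f;
}
__syncthreads();
if( threadIdx.x != 0 ) // within-block thread divergence
{
result = __ll2float_rn( __float2ll_rn( f1 ) );
}
out[ threadIdx.x ] = result; // write result at this thread's index
}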
Scenario 2:
float result = 0;
float f1 = tex2D( tex, u, v );
float f2 = tex2D( tex, u + 1, v + 1 );
--> long long ll1 = __float2int_rn( f1 );
--> long long ll2 = __float2int_rn( f2 );
__syncthreads();
long long temp = ll1 + ll2;
__syncthreads();
if( temp == 0 ) // any test that uses the value of temp
{
result = 1.0f; // whatever value
}
__syncthreads();
if( threadIdx.x != 0 ) // some condition that causes within block thread divergence
{
result = __ll2float_rn( __float2ll_rn( f1 ) );
}
// write result to device memory array at (u,v)
This produced correct output: replacing the first two __float2ll_rn conversions with __float2int_rn fixes it, while the final __ll2float_rn( __float2ll_rn( f1 ) ) remains as is.
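Worth noting as a caveat: this is not a drop-in replacement. __float2int_rn returns a 32-bit int that is then sign-extended into the long long, so values outside int range behave differently (as far as I know, the float-to-int conversion saturates). Presumably my data fits in 32 bits, which is why Scenario 2 works at all. An illustrative sketch (the value 3.0e10f and the kernel are made up for the example):
__global__ void convert_demo( long long* out )
{
float big = 3.0e10f; // well outside the 32-bit int range
out[0] = __float2ll_rn( big ); // full 64-bit result, roughly 3.0e10
out[1] = __float2int_rn( big ); // converts to int first; as far as I know this
// saturates to 2147483647, then sign-extends to 64 bits
}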
Scenario 3:
float result = 0;
float f1 = tex2D( tex, u, v );
float f2 = tex2D( tex, u + 1, v + 1 );
long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );
__syncthreads();
long long temp = ll1 + ll2;
__syncthreads();
--> // if( temp == 0 ) // any test that uses the value of temp
--> // {
--> // result = 1.0f; // whatever value
--> // }
__syncthreads();
if( threadIdx.x != 0 ) // some condition that causes within block thread divergence
{
result = __ll2float_rn( __float2ll_rn( f1 ) );
}
// write result to device memory array at (u,v)
If I comment out the branch that reads the value of temp (which depends on the texture reads and the float-to-long-long conversions), then the kernel returns the correct value.
Scenario 4:
float result = 0;
float f1 = tex2D( tex, u, v );
float f2 = tex2D( tex, u + 1, v + 1 );
long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );
__syncthreads();
long long temp = ll1 + ll2;
__syncthreads();
if( temp == 0 ) // any test that uses the value of temp
{
--> // result = 1.0f; // whatever value
--> float a = 0; // change to a variable other than 'result'
}
__syncthreads();
if( threadIdx.x != 0 ) // some condition that causes within block thread divergence
{
result = __ll2float_rn( __float2ll_rn( f1 ) );
}
// write result to device memory array at (u,v)
Also produces correct output. (Presumably the branch body is now dead code, so the compiler can drop the read of temp entirely, which fits the Scenario 3 result.)
Scenario 5:
float result = 0;
float f1 = tex2D( tex, u, v );
float f2 = tex2D( tex, u + 1, v + 1 );
long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );
__syncthreads();
long long temp = ll1 + ll2;
__syncthreads();
--> for( unsigned int i = 0; i < 200; ++i )
--> {
--> result *= 1.2f; // wait till temp is loaded properly
--> }
__syncthreads();
if( temp == 0 ) // any test that uses the value of temp
{
result = 1.0f; // whatever value
}
__syncthreads();
if( threadIdx.x != 0 ) // some condition that causes within block thread divergence
{
result = __ll2float_rn( __float2ll_rn( f1 ) );
}
// write result to device memory array at (u,v)
This also works; it looks like temp needs some time to load.
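Putting Scenarios 3 to 5 together, my guess is that the compiler is mis-scheduling or dropping the 64-bit load of temp around the divergent branch. One workaround I plan to try (untested, so treat it as a sketch) is declaring temp volatile, to force it through memory so its read can't be moved or eliminated. Only the marked line changes from Scenario 1:
long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );
__syncthreads();
--> volatile long long temp = ll1 + ll2; // force temp through local memory
__syncthreads();
if( temp == 0 ) // any test that uses the value of temp
{
result = 1.0f; // whatever value
}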
Any ideas?
Thanks in advance.