I would ask this on the CUDA forums, but for some reason I can't get past the first page of the registration, so here goes:

nVidia Card: 9800 GT

CUDA toolkit 3.0

Compiled for: compute capability 1.1

Scenario 1:

float result = 0;

float f1 = tex2D( tex, u,     v     );
float f2 = tex2D( tex, u + 1, v + 1 );

long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );

__syncthreads();

long long temp = ll1 + ll2;

__syncthreads();

if( temp == 0 )  // any test that uses the value of temp
{
    result = 1.0f;  // whatever value
}

__syncthreads();

if( threadIdx.x != 0 )  // some condition that causes within block thread divergence
{
    result = __ll2float_rn( __float2ll_rn( f1 ) );
}

// write result to device memory array at (u,v)

Lots of __syncthreads() calls for my paranoia. If I run the above, result always comes back as 4.593672e+18f no matter what input I put in the texture (googling the number does not turn up anything special). Here are some observations; changes are highlighted with -->

Scenario 2:

float result = 0;

float f1 = tex2D( tex, u,     v     );
float f2 = tex2D( tex, u + 1, v + 1 );

--> long long ll1 = __float2int_rn( f1 );
--> long long ll2 = __float2int_rn( f2 );

__syncthreads();

long long temp = ll1 + ll2;

__syncthreads();

if( temp == 0 )  // any test that uses the value of temp
{
    result = 1.0f;  // whatever value
}

__syncthreads();

if( threadIdx.x != 0 )  // some condition that causes within block thread divergence
{
    result = __ll2float_rn( __float2ll_rn( f1 ) );
}

// write result to device memory array at (u,v)

Produces correct output if I replace the first two __float2ll_rn calls with __float2int_rn; the final __ll2float_rn( __float2ll_rn( f1 ) ) remains as is.

Scenario 3:

float result = 0;

float f1 = tex2D( tex, u,     v     );
float f2 = tex2D( tex, u + 1, v + 1 );

long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );

__syncthreads();

long long temp = ll1 + ll2;

__syncthreads();

--> // if( temp == 0 )  // any test that uses the value of temp
--> // {
--> //     result = 1.0f;  // whatever value
--> // }

__syncthreads();

if( threadIdx.x != 0 )  // some condition that causes within block thread divergence
{
    result = __ll2float_rn( __float2ll_rn( f1 ) );
}

// write result to device memory array at (u,v)

If I comment out the branch condition that reads the value of temp (which depends on the texture read and the float-to-long-long conversion), then it returns the correct value.

Scenario 4:

float result = 0;

float f1 = tex2D( tex, u,     v     );
float f2 = tex2D( tex, u + 1, v + 1 );

long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );

__syncthreads();

long long temp = ll1 + ll2;

__syncthreads();

if( temp == 0 )  // any test that uses the value of temp
{

--> // result = 1.0f;  // whatever value
--> float a = 0;  // change to a variable other than 'result'
}

__syncthreads();

if( threadIdx.x != 0 )  // some condition that causes within block thread divergence
{
    result = __ll2float_rn( __float2ll_rn( f1 ) );
}

// write result to device memory array at (u,v)

Also produces correct output.

Scenario 5:

float result = 0;

float f1 = tex2D( tex, u,     v     );
float f2 = tex2D( tex, u + 1, v + 1 );

long long ll1 = __float2ll_rn( f1 );
long long ll2 = __float2ll_rn( f2 );

__syncthreads();

long long temp = ll1 + ll2;

__syncthreads();

--> for( unsigned int i = 0; i < 200; ++i )
--> {
-->     result *= 1.2f;  // wait till temp is loaded properly
--> }

__syncthreads();

if( temp == 0 )  // any test that uses the value of temp
{
    result = 1.0f;  // whatever value
}

__syncthreads();

if( threadIdx.x != 0 )  // some condition that causes within block thread divergence
{
    result = __ll2float_rn( __float2ll_rn( f1 ) );
}

// write result to device memory array at (u,v)

This also works; it looks as if temp needs some time before its value can be read reliably.

Any ideas?

Thanks in advance.