Given the following piece of code, which generates a kind of code dictionary with CUDA using Thrust (a C++ template library for CUDA):

thrust::device_vector<float> dCodes(codes->begin(), codes->end());
thrust::device_vector<int> dCounts(counts->begin(), counts->end());
thrust::device_vector<int> newCounts(counts->size());

for (int i = 0; i < dCodes.size(); i++) {
    float code = dCodes[i];
    int count = thrust::count(dCodes.begin(), dCodes.end(), code);

    newCounts[i] = dCounts[i] + count;

    // Did we already have a count from one of the earlier runs?
    if (dCounts[i] > 0) {
        newCounts[i]--;
    }

    // Remove all later duplicates of this code
    thrust::device_vector<float>::iterator newEnd = thrust::remove(dCodes.begin()+i+1, dCodes.end(), code);
    int dist = thrust::distance(dCodes.begin(), newEnd);
    dCodes.resize(dist);
    newCounts.resize(dist);
}

codes->resize(dCodes.size());
counts->resize(newCounts.size());

thrust::copy(dCodes.begin(), dCodes.end(), codes->begin());
thrust::copy(newCounts.begin(), newCounts.end(), counts->begin());

The problem is that the CUDA Visual Profiler shows many separate 4-byte copies. IMO these are caused by:

  1. The loop counter i
  2. float code, int count and dist
  3. Every access to i and the variables noted above

This seems to slow everything down (sequentially copying 4 bytes at a time is no fun...).

So how do I tell Thrust that these variables should be handled on the device? Or are they already?

Using thrust::device_ptr does not seem sufficient to me, because I'm not sure whether the surrounding for loop runs on the host or on the device (which could be another reason for the slowness).

+2  A: 

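The for loop runs on the host, so on every iteration values such as the size, index, and code have to be copied between host and device. Each element access like dCodes[i] is its own small transfer. The way your program is structured, there is not much you can do about that. For best results, consider moving the entire i loop onto the device; that way you avoid the per-iteration host/device copies.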

Thrust is great for some things. However, where performance matters and the algorithm does not quite fit the available functions, you may have to rewrite it without using Thrust algorithms explicitly to get the best performance.

aaa
Solved it with completely new device code, using Thrust to copy everything I need.
macs