I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.

There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss to figure out both what the right questions to ask are and which tools can answer them.

How do you identify ways to make your CUDA kernels perform faster?

A: 

The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
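To make the memory-access-pattern point concrete, here is a minimal sketch (kernel names and sizes are illustrative, not from the question's code) contrasting a coalesced global load with a strided one. On current hardware, the coalesced version turns a half-warp's reads into a single 64-byte transaction, while the strided version issues many separate transactions and wastes bandwidth:

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads read consecutive floats, so each
// half-warp's loads combine into one memory transaction.
__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}

// Strided: consecutive threads read elements 'stride' apart, so the
// hardware cannot combine the loads and issues one transaction each.
__global__ void strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride] * 2.0f;
}
```

In the profiler, the strided version shows up as a much higher count of uncoalesced (32-byte) loads for the same amount of useful data.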

Maybe you could post your kernel code here and get some feedback?

The nVidia CUDA developer forum is also a good place to go for help with this kind of problem.

Paul R
+1  A: 

If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, although knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus which integrates nicely with Visual Studio and gives you combined host and GPU profile information.

Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:

  • Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
  • Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization; the presentation goes into more detail on what to do about this, as does the SDK (e.g. the reduction sample)
  • Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents), if you have a large amount of data transfer you want to overlap the compute and the I/O
  • Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput)
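
As a hedged sketch of the "measure bandwidth" idea above, cudaEvents can bracket a kernel launch and convert bytes moved into an effective bandwidth figure to compare against the device's theoretical peak (the kernel and sizes here are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: one read and one write per element.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * data[i] + 1.0f;
}

// Time one launch with cudaEvents and report effective bandwidth,
// which can then be compared against the device's theoretical peak.
float time_work(float *d, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Bytes moved = one read + one write per element.
    double gbps = (2.0 * n * sizeof(float)) / (ms * 1.0e6);
    printf("%.3f ms, %.1f GB/s effective\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Commenting out the kernel body and re-timing gives the pure memory-traffic cost, which is exactly the expected-vs.-measured comparison the last bullet describes.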

This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.

Tom
Thanks, Tom. Is that presentation available for download?
John Dibling
It's the same one I linked to before! http://www.nvidia.com/content/GTC/videos/GTC09-1086.flv http://www.nvidia.com/content/GTC/videos/GTC09-1086.mp4
Tom
Thanks, Tom. I didn't see your response to my comment the other day, I guess.
John Dibling
A: 

If you are using Windows... Check Nexus:

http://developer.nvidia.com/object/nexus.html

crick3r
+1 Thx for the link. I had just gone to check that out when you posted.
John Dibling
A: 

I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.

To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:

  • Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as OProfile or RotateRight/Zoom that can give you equivalent information.

  • Run it on a CUDA processor, and do the same thing, if possible.

What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.

Mike Dunlavey
You're right, that is a big IF. And a major concern of mine is how to determine that you have selected the correct algorithm in the first place. Correct algorithms on the CPU may be very different from correct algorithms on the GPU.
John Dibling
@John-Dibling: Well, in that case I would see how I could get stackshots on a CUDA processor. I would hunt pretty hard for a debugger that could step or pause at least one processor, and show its state. I've been tuning code, embedded and otherwise, for 30 years, and that's the method I use. The only profilers that can come close to standing up to it are the ones that take stack samples and summarize at line/instruction level.
Mike Dunlavey