ansaurus

Question

My OpenCL kernel is slower on faster hardware.. But why?

Answer 1

A:

I ran into the same issue when I was testing OpenCL on my MacBook. I believe it's because the GeForce 9400M has a higher bus speed to main memory than the GeForce 9600M GT. So even though the GeForce 9600M GT has much more compute power than the GeForce 9400M, the time required to copy the data to the GPU is too long for you to see the benefit of the more powerful GPU in your situation. It could also be caused by inappropriate work-group sizes.

Also, I found this site very helpful while learning OpenCL.

http://www.macresearch.org/opencl

Kendall Hopkins 2010-04-12 13:38:47

Thanks Kendall, but the macresearch.org tutorial is actually what I based my code on :P The work-group size is set automatically by passing NULL for that parameter.

matdumsa 2010-04-12 17:52:16

Try using different sizes. The default isn't always the best.

Kendall Hopkins 2010-04-12 20:02:54

OK, so I've tried different sizes. The auto-detected size is 16 on both cards. I can go up to 17, but that decreases performance on both cards, and I get an error above 17. Weird, weird, weird...

matdumsa 2010-04-12 23:38:25

I'm sorry, I meant the local work size, not the global one. The local one should be able to range from about 64 to 512.
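A minimal sketch of how the two sizes relate, assuming a one-dimensional range. In OpenCL 1.x the global work size must be an exact multiple of the local work size, so a common approach is to clamp the local size to what the device allows and round the global size up. The helper names `clamp_local_size` and `round_up_global` are hypothetical, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a preferred local work size to the device/kernel limit
   (e.g. the value reported for CL_KERNEL_WORK_GROUP_SIZE). */
size_t clamp_local_size(size_t preferred, size_t max_work_group_size)
{
    assert(preferred > 0 && max_work_group_size > 0);
    return preferred < max_work_group_size ? preferred : max_work_group_size;
}

/* Round global_size up to the next multiple of local_size, so that
   global_size % local_size == 0 as clEnqueueNDRangeKernel requires. */
size_t round_up_global(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    if (remainder == 0) {
        return global_size;
    }
    return global_size + (local_size - remainder);
}
```

If the global size is padded this way, the kernel needs a bounds check (for example, returning early when `get_global_id(0)` is past the real element count) so the extra work-items do not touch memory outside the buffers.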

Kendall Hopkins 2010-04-13 00:21:58

Answer 2

+1 A:

Performance is not the only difference between a GeForce 9400M and a GeForce 9600M GT. A big one is that the 9600M GT is a discrete GPU. With this comes a slew of differences, among which the following can have an impact:

tendency of drivers to batch more commands
memory is not uniform: the GPU generally only accesses its own memory, and the driver moves data back and forth over the PCI-E bus.

I'm sure I'm missing some...

Here are a bunch of ideas you can try:

avoid calling clFinish. The way you call it between the memory load and the execution forces the driver to do more work than necessary, and it stalls the GPU.
profile your code to see what is taking the time. I'm not aware of support for CL performance analysis yet, but with your clFinish calls in place, you can get a first-order estimate by simply measuring on the CPU side. Note that it's hard in general to distinguish what is due to latency from what is due to throughput.
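The CPU-side measurement mentioned above could look roughly like this. It is a sketch only, assuming a POSIX system with `clock_gettime`; the helper names `now_monotonic` and `elapsed_ms` are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Read a monotonic clock, so wall-clock adjustments cannot skew the result. */
struct timespec now_monotonic(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* keeps the build warning-free when asserts are disabled */
    return ts;
}

/* Elapsed time between two timestamps, in milliseconds. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

To time one stage, take a timestamp, enqueue the buffer write or the kernel, call clFinish, then take a second timestamp. Because enqueue calls return before the work is done, the clFinish is what makes the interval cover the actual transfer or execution. Timing the copy and the kernel separately on each GPU should show which stage accounts for the difference.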

Bahbar 2010-04-13 12:13:45

Thanks Bahbar, I tried removing the suggested clFinish calls, but with no success. Then I tried removing them all (even the unsafe one) and I still get the same run time. An interesting thing, though, is that the OpenCL run time doubles on both GeForces if I unplug the power cord and let the computer run on battery.

matdumsa 2010-04-13 14:38:24

Answer 3

A:

I get the same results, and I'm unsure why. My kernel involves very minimal copying to and from the device (I pre-send all the data needed for all kernel calls, and only return a 512x512 image). It's a raytracer, so the kernel work vastly outweighs the copy back (400+ ms versus 10 ms). Still, the 9600M GT is about 1.5x-2x slower.

According to NVIDIA's listing, the 9600M GT should have 32 SPs (twice the number of the 9400M). It's presumably clocked higher too.

The 9600M GT does seem faster in some cases, e.g. games. See this link: http://www.videocardbenchmark.net/video_lookup.php?cpu=GeForce+9600M+GT

According to Ars Technica:

Furthermore, an interesting tidbit about Snow Leopard's implementation is revealed by early tests. Though Snow Leopard doesn't seem to enable dual GPUs or on-the-fly GPU switching for machines using the NVIDIA GeForce 9400M chipset—a limitation carried over from Leopard—it does appear that the OS can use both as OpenCL resources simultaneously. So even if you have the 9600M GT enabled on your MacBook Pro, if OpenCL code is encountered in an application, Snow Leopard can send that code to be processed by the 16 GPU cores sitting pretty much dormant in the 9400M. The converse is not true, though—when running a MacBook Pro with just the 9400M enabled, the 9600M GT is shut down entirely to save power, and can't be used as an OpenCL resource.

This seems to be the opposite of what we are seeing. Also, I am explicitly setting up a CL context on only one device at a time.

There are some suggestions in the Ars Technica forums that the 9600M GT doesn't support doubles as well, which would explain this problem. I might try to write a synthetic benchmark to test this hypothesis.
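Before writing a benchmark, one quick check is whether each device advertises the `cl_khr_fp64` extension at all. The sketch below assumes the space-separated extension string has already been fetched with clGetDeviceInfo and CL_DEVICE_EXTENSIONS; the helper name `has_extension` is hypothetical, and it matches whole tokens only so that a longer name with the same prefix does not count.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if `name` appears as a complete, space-delimited token
   in the device's extension string. */
bool has_extension(const char *extensions, const char *name)
{
    assert(extensions != NULL && name != NULL);
    size_t name_len = strlen(name);
    if (name_len == 0) {
        return false;
    }
    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        bool starts_token = (p == extensions) || (p[-1] == ' ');
        bool ends_token = (p[name_len] == '\0') || (p[name_len] == ' ');
        if (starts_token && ends_token) {
            return true;
        }
        p += 1; /* keep searching after a partial match */
    }
    return false;
}
```

Calling `has_extension(ext_string, "cl_khr_fp64")` for each device would show whether double precision is advertised. It says nothing about how fast doubles are, so a benchmark is still needed to compare throughput.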

Benjamin Horstman 2010-05-19 15:21:28
