Hi folks,

As I was finishing the code for my multicore programming class project, I came upon something really weird that I wanted to discuss with you.

We were asked to create any program that shows a significant improvement when written for a multi-core platform. I decided to code something on the GPU to try out OpenCL. I chose the matrix convolution problem since I’m quite familiar with it (I’ve parallelized it before with open_mpi, with great speedup for large images).
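For readers who have not met the problem, here is a minimal sketch of a sequential 2D convolution in C. It is not the code from this project: the function names `convolve2d` and `clamp_index`, the single-channel float layout, and the clamp-to-edge border handling are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index into [0, n-1] so border pixels reuse the nearest edge pixel.
   (Border handling is an assumption; other schemes pad with zeros or wrap.) */
static size_t clamp_index(long i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* Apply a square k x k filter (k odd) to a single-channel, row-major image.
   The filter is applied without flipping (cross-correlation), which gives
   the same result as convolution for symmetric filters. */
void convolve2d(const float *in, float *out, size_t width, size_t height,
                const float *filter, size_t k)
{
    assert(in && out && filter);
    assert(width > 0 && height > 0);
    assert(k % 2 == 1);

    long r = (long)(k / 2); /* filter radius */
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (long fy = -r; fy <= r; ++fy) {
                for (long fx = -r; fx <= r; ++fx) {
                    size_t sy = clamp_index((long)y + fy, height);
                    size_t sx = clamp_index((long)x + fx, width);
                    acc += in[sy * width + sx]
                         * filter[(size_t)(fy + r) * k + (size_t)(fx + r)];
                }
            }
            out[y * width + x] = acc;
        }
    }
}
```

Every output pixel depends only on the input image and the filter, so each (x, y) iteration can run independently. That independence is what makes the problem a natural fit for one OpenCL work-item per pixel.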

So here it is: I select a large GIF file (2.5 MB, 2816×2112), run the sequential version (original code), and get an average of 15.3 seconds.

I then run the new OpenCL version I just wrote on my MBP's integrated GeForce 9400M and I get timings of 1.26 s on average. So far so good, it’s a speedup of 12X!!

But now I go into my Energy Saver panel to turn on the “Graphic Performance Mode”. That mode turns off the GeForce 9400M and turns on the GeForce 9600M GT my system has. Apple says this card is twice as fast as the integrated one.

Guess what, my timings using the kick-ass graphics card are 3.2 seconds on average… My 9600M GT seems to be more than two times slower than the 9400M.

For those of you who are OpenCL inclined, I copy all data to remote buffers before starting, so the actual computation doesn’t require round trips to main RAM. Also, I let OpenCL determine the optimal local work size, as I’ve read that the implementation does a pretty good job of figuring that parameter out.

Does anyone have a clue?

edit: full source code with makefiles here http://www.mathieusavard.info/convolution.zip

cd gimage
make
cd ../clconvolute
make
put a large input.gif in clconvolute and run it to see results
A: 

I ran into the same issue when I was testing out OpenCL on my MacBook. I believe it's because the GeForce 9400M has a higher bus speed to the main memory bank than the GeForce 9600M GT. So even though the GeForce 9600M GT has much more power than the GeForce 9400M, the time required to copy the memory to the GPU is too long to see the benefit of the more powerful GPU in your situation. It could also be caused by inappropriate work-group sizes.

Also I found this site very helpful in my OpenCL experience.

http://www.macresearch.org/opencl

Kendall Hopkins
Thanks Kendall, but the macresearch.org thing is what I based my code on actually :P The work-group size is automatically set by passing NULL for that parameter.
matdumsa
Try using different sizes. The default isn't always the best.
Kendall Hopkins
OK, so I’ve tried different sizes. The auto-detected size is 16 on both cards. I can go up to 17, but that decreases performance on both cards, and I get an error above 17. Weird, weird, weird...
matdumsa
I'm sorry, I meant the local work size, not the global one. The local one should be able to range from about 64 to 512.
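One thing to watch when setting the local work size by hand: in OpenCL 1.x, the global work size passed to `clEnqueueNDRangeKernel` has to be evenly divisible by the local work size, and the local size cannot exceed the limit the device reports for the kernel (`CL_KERNEL_WORK_GROUP_SIZE` via `clGetKernelWorkGroupInfo`). A common workaround is to round the global size up to the next multiple. The sketch below is plain C with a hypothetical helper name, `round_up_global`; it is not taken from the posted project.

```c
#include <assert.h>
#include <stddef.h>

/* Round `global` up to the next multiple of `local`, so the pair is
   acceptable as global/local work sizes for one NDRange dimension.
   `round_up_global` is a hypothetical helper name. */
size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    size_t rem = global % local;
    return rem == 0 ? global : global + (local - rem);
}
```

With a padded global size, the kernel has to compare its global ID against the real image dimensions and return early for the extra work-items, otherwise it reads and writes out of bounds.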
Kendall Hopkins
+1  A: 

Performance is not the only difference between a GeForce 9400M and a GeForce 9600M GT. A big one is that the 9600M GT is a discrete GPU. With this comes a slew of differences, among which the following can have an impact:

  • tendency of drivers to batch more commands
  • memory is not uniform: the GPU generally only accesses its own memory, and the driver moves data back and forth over the PCI-E bus.

I'm sure I'm missing some...

Here are a bunch of ideas you can try:

  • avoid calling clFinish. The way you call it between the memory load and the execution forces the driver to do more work than necessary. It stalls the GPU.
  • profile your code to see what is taking the time. I'm not aware of support for CL performance analysis yet, but with your clFinish calls in place, simply measuring on the CPU side gives you a first-order estimate. Note that it's hard in general to distinguish what is due to latency and what is due to throughput.
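A first-order CPU-side measurement of that kind could look like the sketch below. The names `stage_fn`, `time_stage_seconds`, and `example_stage` are hypothetical, and the code assumes a POSIX system that provides `clock_gettime`. In an OpenCL program, the stage being timed would enqueue its commands and then call `clFinish`, so that the timer covers the GPU work and not just the enqueue.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A stage is any blocking piece of work, e.g. "upload buffers, then clFinish". */
typedef void (*stage_fn)(void *user_data);

/* Return the wall-clock time, in seconds, that one call to `stage` takes. */
double time_stage_seconds(stage_fn stage, void *user_data)
{
    assert(stage != NULL);
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    stage(user_data);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical stand-in for a real stage: increments a counter a million
   times. A real stage would enqueue OpenCL commands and wait for them. */
void example_stage(void *user_data)
{
    volatile unsigned long *counter = user_data;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        *counter += 1;
}
```

Timing the upload, the kernel, and the read-back as separate stages on each card would show which of the three accounts for the difference.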
Bahbar
Thanks Bahbar, I tried removing the suggested clFinish call, but with no success. Then I tried removing them all (even the unsafe one) and I still get the same run time… An interesting thing, though, is that the OpenCL runs take twice as long (on both GeForce cards) if I unplug the power cord of my computer and let it run on battery.
matdumsa
A: 

I get the same results, and I'm unsure why. My kernel involves very minimal copying to and from the GPU (I pre-send all needed data for all kernel calls, and only return a 512x512 image). It's a raytracer, so the kernel work vastly outweighs the copy back (400+ ms versus 10 ms). Still, the 9600M GT is about 1.5x-2x slower.

According to NVIDIA's listing, the 9600M GT should have 32 SPs (twice as many as the 9400M). It's presumably clocked higher too.

The 9600M GT does seem faster in some cases, e.g. games. See this link: http://www.videocardbenchmark.net/video_lookup.php?cpu=GeForce+9600M+GT

According to Ars Technica:

Furthermore, an interesting tidbit about Snow Leopard's implementation is revealed by early tests. Though Snow Leopard doesn't seem to enable dual GPUs or on-the-fly GPU switching for machines using the NVIDIA GeForce 9400M chipset—a limitation carried over from Leopard—it does appear that the OS can use both as OpenCL resources simultaneously. So even if you have the 9600M GT enabled on your MacBook Pro, if OpenCL code is encountered in an application, Snow Leopard can send that code to be processed by the 16 GPU cores sitting pretty much dormant in the 9400M. The converse is not true, though—when running a MacBook Pro with just the 9400M enabled, the 9600M GT is shut down entirely to save power, and can't be used as an OpenCL resource.

This seems to be the opposite of what we are seeing. Also, I am explicitly setting up a CL context on only one device at a time.

There are some suggestions in the ars forums that the 9600M GT doesn't support doubles as well, which would explain this problem. I might try to write up a synthetic benchmark to test this hypothesis.
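Before benchmarking, a cheaper first check is whether a device reports double support at all. In OpenCL 1.x, `double` is optional and is advertised through the `cl_khr_fp64` extension, which appears in the space-separated string returned by `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`. The sketch below covers only the string check, in plain C, with a hypothetical helper name `has_extension`. It says nothing about how fast doubles run on a device that does report them.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if `name` appears as a whole, space-delimited token in
   `extensions` (the format of the CL_DEVICE_EXTENSIONS string).
   `has_extension` is a hypothetical helper name. */
bool has_extension(const char *extensions, const char *name)
{
    assert(extensions && name);
    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        /* Reject partial matches such as "cl_khr_fp64_extra". */
        bool starts = (p == extensions) || (p[-1] == ' ');
        bool ends = (p[len] == ' ') || (p[len] == '\0');
        if (starts && ends)
            return true;
        p += len;
    }
    return false;
}
```

Running this against the extension string of each card, with `"cl_khr_fp64"` as the name, would show whether either device claims double support.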

Benjamin Horstman