What is the best NVIDIA video card for CUDA development? A single GTX 295 has 2 GPUs; is it possible to have 2 GTX 295s and use all 4 GPUs in my CUDA code?
Is it better to get two GTX 480 cards rather than two 295s? Would a Fermi be better than both cards?

+3  A: 

what is the best nvidia Video Card for cuda development.

Whatever fits in your budget and suits your needs. I know this is a bit vague, but after all it really is as simple as that ;)

a single GTX 295 has 2 GPUs, is it possible to have 2 GTX 295 and use the 4 GPUs in my cuda code?

Sure, it is. The only drawback is that the 2 GPUs on the GTX 295 share a single PCI Express slot. Whether this is relevant for you depends on whether your application needs intensive communication with the host.

is it better to get two 480 cards rather than two 295? would a fermi be better than both cards?

From the point of view of raw peak performance, a GTX 295 (which is almost 2x GTX 280, not considering the shared PCI Express link) is better than a 480. However, the GF10x series architecture improved on many points compared to the GT200; for details see the "Fermi whitepaper" and the "Fermi Tuning Guide".

If you're planning to use double precision, the GF10x series has much improved double precision support, but it's good to know that on GeForce cards this is capped at 1/8 of the single precision performance (without the cap it's about half).
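
To put that cap into numbers, here is a small worked calculation. The single precision peak used below is a hypothetical round figure picked purely for illustration, not a quoted spec; only the 1/8 and 1/2 ratios come from the paragraph above.

```python
def peak_double_gflops(peak_single_gflops, dp_ratio):
    """Peak double precision throughput, given the single precision
    peak and the double/single throughput ratio."""
    return peak_single_gflops * dp_ratio


# ASSUMPTION: a made-up single precision peak, for illustration only.
ASSUMED_SP_PEAK_GFLOPS = 1000.0

capped = peak_double_gflops(ASSUMED_SP_PEAK_GFLOPS, 1 / 8)    # GeForce cap
uncapped = peak_double_gflops(ASSUMED_SP_PEAK_GFLOPS, 1 / 2)  # without the cap

print(f"capped:   {capped} GFLOPS")
print(f"uncapped: {uncapped} GFLOPS")
print(f"the cap costs a factor of {uncapped / capped}")
```

Whatever the real single precision peak of a given card is, the factor between the capped and uncapped case stays at 4.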

Therefore, I would suggest that unless you have a strong reason to get lots of GFlops (Folding@Home?) in the form of soon-to-be-outdated hardware, get a GTX 480, or a 470 if you want to save ~25%.

pszilard
A: 

is it possible to have 2 GTX 295 and use the 4 GPUs in my cuda code?

Yes. Or quad, if you're totally insane.

is it better to get two 480 cards rather than two 295?

Arguable. The 295, as a dual-GPU card, has slightly more raw oomph, but the 480, as a 40nm-process card without the dual-GPU overhead, may use its resources better. Benchmarks vary. Of course, the Fermi 4xx range has more modern feature support (3D, DirectX, OpenCL, etc.).

But a dual-295 setup is going to have seriously huge PSU and cooling requirements, and a dual-480 setup runs almost as hot. Not to mention the expense. What are you working on that you think you're going to need this? Have you considered the more mainstream parts, e.g. the 460, which is generally considered to offer better price/performance than the troubled 470–480 (GF100) part?

bobince
+1  A: 

Direct answer: I would go with one or maybe two GTX 480's. But I think my reasoning is a bit different from @bobince or @pszilard.

Background: I just made the same decision you're facing, but our situations may be vastly different.

I'm a statistics graduate student in a department with minimal funding for GPU computing resources. The campus does have one Fermi box hooked up to two nodes that I have access to, but these run Linux. I love Linux, but I really want to use Nsight to benchmark and tune my code, so I need Windows. So I decided to purchase a development box which I dual boot: Ubuntu x64 for production runs, and Win 7 with VS 2010 (a battle which I'm presently fighting) and Nsight 1.5 for development. That said, back to the reason why I bought two GTX 480s (EVGA is awesome!!) and not two GTX 285s or 295s.

I've spent the past two years developing a couple of CUDA kernels. The trickiest part of the development, for me, is the memory management. I spent the better part of three months trying to squeeze a Cholesky decomposition & back substitution into 16 single-precision registers -- the max you can use before either the GTX 285 or 295 incurs a 50% performance penalty (literally 3 weeks going from 17 to 16 registers). For me, the fact that all Fermi architectures have double the registers means that those three months would've gained me about a 10% improvement on a GTX 480 instead of 50% on a GTX 285, and hence were probably not worth my time -- in truth it's a bit more subtle than that, but you get the drift.
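
The 16-versus-17 register cliff described above can be reproduced with a rough occupancy calculation. This is only a sketch under stated assumptions: the per-multiprocessor limits are the ones I recall from the compute capability tables (1.3 for GT200, 2.0 for Fermi), the block size of 512 threads is my own choice, and the model ignores register allocation granularity, shared memory and the cap on resident blocks. Occupancy is also not the same thing as performance, so read the result as an indication rather than a prediction.

```python
# ASSUMPTION: per-multiprocessor limits as listed for compute
# capability 1.3 (GT200) and 2.0 (Fermi).
LIMITS = {
    "GT200": {"registers": 16384, "max_threads": 1024},
    "Fermi": {"registers": 32768, "max_threads": 1536},
}


def occupancy(arch, regs_per_thread, threads_per_block=512):
    """Fraction of a multiprocessor's thread slots that can be filled.

    Simplified model: whole blocks only, limited either by the register
    file or by the maximum number of resident threads.
    """
    lim = LIMITS[arch]
    blocks_by_regs = lim["registers"] // (regs_per_thread * threads_per_block)
    blocks_by_threads = lim["max_threads"] // threads_per_block
    blocks = min(blocks_by_regs, blocks_by_threads)
    return blocks * threads_per_block / lim["max_threads"]


for arch in ("GT200", "Fermi"):
    for regs in (16, 17):
        print(arch, regs, "registers/thread ->", occupancy(arch, regs))
```

Under these assumptions, going from 16 to 17 registers on GT200 drops from two resident blocks to one, while the same kernel on Fermi still fits three blocks.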

If you're fairly new to CUDA -- which you probably are since you're asking -- I would say 32 registers is HUGE. Second, I think the L1 cache of the Fermi architecture can directly translate to faster global memory accesses -- of course it does, but I haven't measured the impact directly yet. If you don't need the global memory as much, you can trade the bigger L1 cache for triple the shared memory -- which was also a tight squeeze for me as the matrix sizes increased.

Then I would agree with @pszilard that if you need double precision, Fermi is definitely the way to go -- though I'd still write your code in single precision first, tune it, and then migrate to double.

I don't think that concurrent kernel execution will matter for you -- it's really cool, the delays to kernel completion can be orders of magnitude less -- but you're probably going to focus on one kernel first, not parallel kernels. If you want to run kernels in parallel, then you need Fermi -- the 285 / 295's simply can't do it.

And lastly, the drawback of going with the 295's is that you have to write two layers of parallelism: (1) one to distribute blocks (or kernels?) across the cards and (2) the gpu kernel itself. If you're just starting out, it's much easier to keep the parallelism in one place (on a single card) as opposed to fighting two battles at once.
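
To give a feel for the first of those two layers, here is a minimal sketch of the host-side bookkeeping: splitting a batch of independent work items into one contiguous chunk per GPU. The function name `split_work` and the item count are hypothetical, made up for illustration; the actual device selection and kernel launch for each chunk are not shown.

```python
def split_work(n_items, n_devices):
    """Split range(n_items) into n_devices contiguous, near-equal chunks.

    Returns a list of (start, stop) pairs, one per device; the first
    (n_items % n_devices) devices get one extra item.
    """
    base, extra = divmod(n_items, n_devices)
    chunks = []
    start = 0
    for dev in range(n_devices):
        size = base + (1 if dev < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


# Two GTX 295 cards show up as four devices.
for dev, (lo, hi) in enumerate(split_work(1000, 4)):
    print(f"device {dev}: items [{lo}, {hi})")
```

With a single card this whole layer disappears, which is the point of keeping the parallelism in one place while you're starting out.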

PS. If you haven't written your kernels yet, you might consider getting only one card and waiting six months to see if the landscape changes again -- though I have no idea when the next cards are to be released.

PPS. I absolutely loved running my CUDA kernel, which I had debugged / designed on a Tesla C1070, on the GTX 480 and instantly realizing a 2x speed improvement. Money well spent.

M. Tibbits
@M. Tibbits: good point about Fermi having double the registers and potentially triple the shared memory. However, I would guess that a beginner does not realize the *huge* potential benefits. One has to hit the register wall and suffer from it first to be able to really appreciate the changes Fermi brought :)
pszilard
True, but it's incredibly easy to use a bunch of registers, especially if you use -arch sm_20, because then the compiler will use as many as possible (unless you specifically provide the -maxrregcount flag). For example, I had a kernel which used 16 registers when compiled with -arch sm_10 through sm_13, but when I compiled it for -arch sm_20 it jumped to 48 registers!?! It took some tweaking and the -maxrregcount 20 flag to get things under control.
M. Tibbits