views:

324

answers:

3

I realize this is more of a hardware question, but it's also very relevant to software, especially when programming for multi-threaded, multi-core/multi-CPU environments.

Which is better, and why? Whether in terms of efficiency, speed, productivity, usability, etc.

1.) A computer/server with 4 quad-core CPUs?

or

2.) A computer/server with 16 single-core CPUs?

Please assume all other factors (speed, cache, bus speeds, bandwidth, etc.) are equal.

Edit:

I'm interested in the performance aspect in general. If one option is particularly better at one thing and terrible (or simply not preferable) at another, I'd like to know that as well.

And if I have to choose, I'd be most interested in which is better for I/O-bound applications and which is better for compute-bound applications.

+11  A: 

That's not an easy question to answer. Computer architecture is unsurprisingly rather complicated. Below are some guidelines but even these are simplifications. A lot of this will come down to your application and what constraints you're working within (both business and technical).

CPUs have several (generally 2-3) levels of cache on the chip. Some modern CPUs also have a memory controller on the die, which can greatly reduce the latency of memory traffic between cores. Memory I/O between CPU sockets has to go over an external bus, which tends to be slower.

Complicating all this is the bus architecture. Intel's Core 2 Duo/Quad systems use a shared bus. Think of this like Ethernet or cable internet, where there is only so much bandwidth to go around and every new participant just takes another share of the whole. AMD chips instead use HyperTransport, which is a point-to-point protocol, and Core i7 and newer Xeons use QuickPath, which is quite similar to HyperTransport.

More cores per chip will occupy less space, use less power and cost less (unless you're using really low-powered CPUs), both in per-core terms and in the cost of other hardware (eg motherboards).

Generally speaking, one CPU will be the cheapest (both in terms of hardware AND software), since commodity hardware can be used. Once you go to a second socket you tend to need different chipsets, more expensive motherboards and often more expensive RAM (eg ECC fully buffered RAM), so you take a massive cost hit going from one CPU to two. It's one reason so many large sites (including Flickr, Google and others) use thousands of commodity servers (Google's servers are somewhat customized to include things like a 9V battery, but the principle is the same).

Your edits don't really change much. "Performance" is a highly subjective concept. Performance at what? Bear in mind that if your application isn't sufficiently multithreaded (or multi-process) to take advantage of extra cores, adding more cores can actually decrease performance.
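The diminishing return from extra cores can be sketched with Amdahl's law. This is a minimal illustration only: the 50% parallel fraction is a made-up figure, and the formula ignores the synchronization overhead that can make extra cores an outright loss.

```python
# Amdahl's law: overall speedup is limited by the serial fraction of the work.
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup on n_cores when only parallel_fraction of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

# A program that is only 50% parallelizable can never run more than 2x faster,
# no matter how many cores (or sockets) you add.
for cores in (1, 4, 16):
    print(cores, round(amdahl_speedup(0.5, cores), 2))
# 1 -> 1.0, 4 -> 1.6, 16 -> 1.88
```

So whether 16 cores arrive as 4 quad-core chips or 16 single-core chips, the first question is how much of the workload can actually use them.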

I/O bound applications probably won't prefer one over the other. They are, after all, bound by I/O not CPU.
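To illustrate why: in an I/O-bound workload the threads spend their time waiting rather than computing, so overlapping the waits matters far more than the core or socket layout. A minimal sketch, where the 0.1 s sleep stands in for a hypothetical network or disk call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(i):
    # Simulates a blocking I/O call; the CPU is idle while it waits.
    time.sleep(0.1)
    return i

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))
elapsed = time.time() - start
# The four 0.1 s waits overlap, so wall time is ~0.1 s rather than 0.4 s,
# regardless of how the cores are distributed across sockets.
print(results, round(elapsed, 1))
```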

For compute-bound applications, it depends on the nature of the computation. If you're doing lots of floating point, you may benefit far more by using a GPU to offload calculations (eg using Nvidia CUDA). You can get a huge performance benefit from this; take a look at the GPU client for Folding@Home for an example.

In short, your question doesn't lend itself to a specific answer because the subject is complicated and there's just not enough information. Technical architecture is something that has to be designed for the specific application.

cletus
Without considering cost and space, can you elaborate on which is better with regard to different aspects? Please consider the edits on my post.
Sev
If you have memory bandwidth-hungry tasks, a multi-socket system may be able to offer more bandwidth per task if the OS is NUMA aware. However, if the tasks are synchronization-heavy (ie share a large amount of frequently modified data) then the higher memory latency of a multi-socket system could hurt.
@cletus: my choice of the word "performance" was to further imply that I'm looking for an all-encompassing answer. But to make things easier, I later specified options in case a broad answer is difficult to achieve. As for the floating-point answer you gave: you're suggesting a GPU to offload calcs, which is fine, but it doesn't answer my question directly. Understandably so; probably not much research has been done directly on this topic. In general though, thank you for the excellent answer!
Sev
Sev, I think you may not be understanding that your question's simple answer is "it depends", as there are numerous factors to consider, and cletus does a good job of taking an initial stab at it.
JB King
To elaborate on JB King's note: this stuff is not only complicated, it is always in flux. Engineers look at each piece of state-of-the-art hardware and ask "Where are the bottlenecks, and how can I improve them, consistent with my choice of (good, fast, cheap)?" And the answers may be different for the next generation.
dmckee
Oh ya, I understand. And cletus did do a good job of taking an initial stab at it, which I noted, and I also accepted his answer for that very reason :)
Sev
+3  A: 

Well, the point is that all other factors can't really be equal.

The main problem with multiple CPU sockets is the latency and bandwidth cost when the two sockets have to intercommunicate, and this happens constantly to keep their local caches in sync. The extra latency can sometimes be the bottleneck of your code. (Not always, of course.)

SPWorley
+1  A: 

It depends on the architecture to some extent, BUT a quad-core CPU is pretty much the same as (or better than) 4 physically separate CPUs, due to reduced communication (ie signals don't have to go off-die or travel very far, which is a factor) and shared resources.

Mitch Wheat
So you're saying more cores are for sure better than more single-core CPUs? I wish there were some benchmarks available to prove this.
Sev
However, a single processor may share caches between some of its cores. If those cores are working on different parts of memory, the processor will spend most of its time invalidating cache lines and fetching data from main memory over the bus.
Ben