That's not an easy question to answer. Computer architecture is unsurprisingly rather complicated. Below are some guidelines but even these are simplifications. A lot of this will come down to your application and what constraints you're working within (both business and technical).
CPUs have several levels of caching on the die (generally two or three). Some modern CPUs also have the memory controller on the die, which can greatly improve the speed of moving data between cores. Memory I/O between CPUs has to go over an external bus, which tends to be slower.
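As a rough illustration of how much the memory hierarchy dominates (a sketch assuming NumPy is installed; exact numbers vary a lot by machine): reading every 8th element of a big array touches just as many cache lines as reading all of them, so it is nowhere near 8x faster.

```python
import time
import numpy as np

# ~512 MiB, far bigger than any CPU cache
a = np.random.rand(2**26)

start = time.perf_counter()
a.sum()                             # sequential: every element
seq = time.perf_counter() - start

start = time.perf_counter()
a[::8].sum()                        # strided: 1/8 of the elements, but with
strided = time.perf_counter() - start   # 8-byte floats each read still pulls
                                        # in a whole 64-byte cache line

print(f"sequential: {seq:.3f}s for {a.size} elements")
print(f"every 8th : {strided:.3f}s for {a.size // 8} elements")
```

The time is dominated by how much memory crosses the bus, not by how much arithmetic the cores do, which is why cache and bus design matter so much here.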
Complicating all this, however, is the bus architecture. Intel's Core 2 Duo/Quad systems use a shared bus. Think of this like Ethernet or cable internet, where there is only so much bandwidth to go around and every new participant just takes another share of the whole. AMD/ATI chips instead use HyperTransport, which is a point-to-point protocol. Core i7 and newer Xeons use QuickPath, which is pretty similar to HyperTransport.
More cores in a single package will occupy less space, use less power and cost less (unless you're using really low-powered CPUs), both in per-core terms and in the cost of other hardware (e.g. motherboards).
Generally speaking, one CPU will be the cheapest option (in terms of both hardware and software), since commodity hardware can be used. Once you go to a second socket you tend to need different chipsets, more expensive motherboards and often more expensive RAM (e.g. ECC fully buffered RAM), so you take a massive cost hit going from one CPU to two. It's one reason so many large sites (including Flickr, Google and others) use thousands of commodity servers (although Google's servers are somewhat customized to include things like a 9V battery, the principle is the same).
Your edits don't really change much. "Performance" is a highly subjective concept. Performance at what? Bear in mind, though, that if your application isn't sufficiently multithreaded (or multiprocess) to take advantage of extra cores, adding more cores can actually decrease performance, since the individual cores in a many-core part tend to be slower than a single fast core at the same price point.
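To make that concrete, here's a minimal sketch (plain Python standard library; the workload is made up) showing that extra cores only help when the work is actually split up. With one worker, a multi-core box buys you nothing:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def burn(n):
    # Purely CPU-bound busy work standing in for a real computation.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    chunks = [2_000_000] * 8        # 8 equal chunks of work
    for workers in (1, 2, 4, 8):
        start = time.perf_counter()
        with ProcessPoolExecutor(max_workers=workers) as pool:
            list(pool.map(burn, chunks))
        print(f"{workers} worker(s): {time.perf_counter() - start:.2f}s")
```

The speedup from 1 to 8 workers only appears because the work was written to be divisible in the first place; a single-threaded application stays stuck at the single-worker number no matter how many cores you add.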
I/O bound applications probably won't prefer one over the other. They are, after all, bound by I/O not CPU.
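A toy illustration of that point (a sketch; `time.sleep` stands in for a disk or network wait): threads waiting on I/O overlap just fine even on a single core, so extra CPUs buy you nothing here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    time.sleep(0.5)                 # stands in for a disk or network wait

if __name__ == "__main__":
    for workers in (1, 4, 16):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(fake_io, range(16)))
        print(f"{workers} thread(s): {time.perf_counter() - start:.2f}s")
```

Sixteen threads finish the sixteen waits in roughly 0.5s regardless of core count; the limiting factor is the I/O latency, not the CPUs.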
For compute-bound applications, it depends on the nature of the computation. If you're doing lots of floating-point work you may benefit far more from offloading calculations to a GPU (e.g. using Nvidia CUDA); you can get a huge performance benefit from this. Take a look at the GPU client for Folding@Home for an example.
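CUDA kernels themselves are written in C/C++, but as a rough sketch of the offload idea, here's the same matrix multiply on the CPU and the GPU using NumPy and CuPy (an assumption on my part: this needs an Nvidia GPU and the `cupy` package, which your setup may not have):

```python
import time
import numpy as np
import cupy as cp                   # NumPy-like API backed by CUDA

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
np.matmul(a, b)                     # runs on the CPU
print(f"CPU: {time.perf_counter() - start:.2f}s")

a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)   # copy data to GPU memory
start = time.perf_counter()
cp.matmul(a_gpu, b_gpu)             # runs on the GPU (first call includes
cp.cuda.Stream.null.synchronize()   # some setup cost); GPU calls are async,
                                    # so wait for the result before timing
print(f"GPU: {time.perf_counter() - start:.2f}s")
```

For dense floating-point work like this the GPU typically wins by a wide margin, but copying data to and from GPU memory is itself a cost, so small problems may not benefit at all.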
In short, your question doesn't lend itself to a specific answer because the subject is complicated and there's just not enough information. Technical architecture is something that has to be designed for the specific application.