Intel has just unveiled a new 48-core CPU. Beyond the sheer number of cores, this new architecture seems to introduce a lot of interesting features, such as this one:

Things get interesting here - Intel is saying that they have removed hardware cache coherency which effectively means each "tile" will be completely separate in what it stores in local L2 cache. All cache communication between cores and tiles will thus be handled by the mesh data communication system and the dedicated "message buffer" on each tile.

What will these new architectures imply for us programmers? How will we tackle the complexity of tomorrow's CPUs?

+1  A: 

Herb Sutter has published a series of articles on Effective Concurrency.

As an introduction, you can read his excellent article "A Fundamental Turn Toward Concurrency in Software".

Marcin Gil
A: 

We will need to write many small tasks that each solve a small problem and are chained together. It sounds very Unix-ish.
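
A minimal sketch of that shape using C++11 futures (the stages here are invented): each small task consumes the previous task's result, chained together like a shell pipeline.

    #include <future>
    #include <iostream>
    #include <vector>

    int main() {
        // Stage 1: a small task that produces data.
        std::future<std::vector<int>> produced = std::async([] {
            return std::vector<int>{3, 1, 4, 1, 5};
        });

        // Stage 2: another small task that consumes stage 1's output,
        // the in-process equivalent of piping one program into another.
        std::future<int> summed = std::async([&] {
            int sum = 0;
            for (int v : produced.get()) sum += v;
            return sum;
        });

        // Stage 3: the final consumer.
        std::cout << "sum = " << summed.get() << "\n";
    }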

Big, monolithic, sequential code will fall more and more out of favor.

Simeon Pilgrim
+2  A: 

The ideal situation would be for it not to mean anything at all for 90% of programmers.

As with almost everything, we should have a programming paradigm, supported by one or several programming languages, that hides the complexities of multitasking from the usual programmer. It's quite difficult to program anything with threads, and more so if you have to use 20 or 30 of them to make the program really use up the CPU.

There are several proposals to start with, such as Parallel Extensions, Microsoft Axum, or Erlang; Erlang especially has a long history of success over the years.
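
As one illustration of that kind of complexity-hiding (using C++17's parallel algorithms, a later arrival than the proposals above): the programmer states what to compute, and the library decides how to spread the work across cores.

    #include <execution>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> data(10'000'000, 1.5);

        // No threads, mutexes, or work queues in sight: the execution
        // policy asks the library to parallelize across cores as it
        // sees fit.
        double total = std::reduce(std::execution::par,
                                   data.begin(), data.end(), 0.0);

        std::cout << "total = " << total << "\n";
    }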

Jorge Córdoba
+4  A: 

It means we should all be learning Erlang or some other functional language designed for concurrency.

Programming for concurrency by manually coordinating the creation and interaction of multiple threads, using mutexes and other techniques, is as cumbersome and error-prone as attempting OO design in a language that doesn't natively support OO. Immutable data and a functional programming style are naturally suited to concurrency (as evidenced by the numerous functional languages designed specifically for it). I suspect that as processor performance expands through improvements in parallelism more than in clock speed, developers will increasingly look toward functional languages to make use of that processing power, until ultimately functional languages predominate.
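
A toy example of why this works, in imperative terms: when the input is immutable and each thread writes only to its own output slot, there is nothing to lock.

    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const std::vector<double> input{1.0, 4.0, 9.0, 16.0};  // never mutated
        std::vector<double> output(input.size());

        // A pure function: its result depends only on its argument.
        auto f = [](double x) { return std::sqrt(x); };

        // One thread per element (illustrative only). No mutexes are
        // needed: the input is read-only and each thread owns one slot.
        std::vector<std::thread> workers;
        for (std::size_t i = 0; i < input.size(); ++i)
            workers.emplace_back([&, i] { output[i] = f(input[i]); });
        for (std::thread& t : workers) t.join();

        for (double v : output) std::cout << v << ' ';
        std::cout << '\n';
    }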

Wedge
Exactly. The primarily procedural, sequential model of computing just doesn't work well in highly parallel environments. A shift to a computing model and language that are designed around parallelism seems pretty much inevitable.
kyoryu
A: 

We'll probably need to postpone our utopian dream of a magical compiler that parallelizes code for you without your having to think about it.

Crashworks
+1  A: 

The quote sounds like you are talking about Larrabee? On that kind of architecture you just can't ignore multicore, because the processing cores are simpler and would be slower at running the code that today's compilers generate - especially because they execute instructions in-order (which saves die size and results in a simpler core).

In general, multicore is great for servers, and typical server code scales well without modification.

On the desktop, most single-threaded code is fine on current-generation hardware. If a program is computationally expensive, developers can fairly easily offload work to multiple cores.

But I think many-core systems are not suited for desktop usage. Once applications are optimized for them, a new kind of lag will be introduced, which in sum may produce noticeable UI delays in applications. To the user it may feel less responsive, which in turn will be perceived as "slow"! The cause will be the extra scheduling work that appears when developers slice the workload into smaller packets, queue them up in work queues, send signals to start processing them, and so on.
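
A minimal work-queue sketch (names invented) showing where that scheduling work lives: every packet pays for a lock, a queue operation, and a wakeup before any useful work begins.

    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    class WorkQueue {
        std::queue<std::function<void()>> jobs_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
        std::thread worker_{[this] { run(); }};

        void run() {
            for (;;) {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;        // done_ and drained
                std::function<void()> job = std::move(jobs_.front());
                jobs_.pop();
                lock.unlock();
                job();                            // the actual useful work
            }
        }

    public:
        void submit(std::function<void()> job) {
            // Per-packet overhead: lock, queue push, condvar wakeup.
            { std::lock_guard<std::mutex> lock(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
        ~WorkQueue() {
            { std::lock_guard<std::mutex> lock(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();
        }
    };

    int main() {
        WorkQueue q;
        for (int i = 0; i < 4; ++i)
            q.submit([i] { std::cout << "packet " << i << "\n"; });
    }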

IMHO this even applies to message-passing-based code (Erlang & co.); the delays there will be shorter than in lock-and-signal-based code, but they add up too.

This will probably be solved after a few generations of many-core desktop architectures. But the start will lag: the systems will feel slow most of the time (and fast only for heavy, long-running processing jobs).

frunsi
Try to remember your old DOS system, or early graphical operating systems. They usually felt more responsive than current systems (when no background tasks were running). The initial cost of doing something (before anything real happens, from the user's perspective) has grown (scheduling, event queues, process separation, ...), but raw processing tasks are faster. I think that with many-core systems this trend will continue on another level.
frunsi
Why must optimizing a program for multicore introduce lag?
Crashworks
Anytime one core communicates with another core, those cores have to synchronize somehow: either by high-level locks + condition variables + signals, or by low-level facilities, e.g. the "sender" writes to a memory location and the "receiver" regularly polls that location for changes. The former method adds lag; the latter adds lag too (usually less, but it requires busy-waiting, which is often not desired). It's a difference in clock cycles whether one core executes "A, B, ..." itself, or a core C first instructs two other cores to do the A and B work simultaneously - the work is parallel then, but the startup costs grow. This sums up.
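
A compressed sketch of the first mechanism (lock + condition variable + signal); every call on this path is overhead the single-core sequence "A, B, ..." never pays:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    int main() {
        std::mutex m;
        std::condition_variable cv;
        bool ready = false;
        int result = 0;

        // "Receiver" core: blocks until signalled (latency, no busy-wait).
        std::thread receiver([&] {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return ready; });
            std::cout << "received " << result << "\n";
        });

        // "Sender" core: publishes a value, then signals.
        {
            std::lock_guard<std::mutex> lock(m);
            result = 42;
            ready = true;
        }
        cv.notify_one();  // the wakeup path adds the lag being discussed

        receiver.join();
    }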
frunsi
But surely you'd only bother to do this if the speed gain of multithreading exceeded the cost, for a net improvement in performance. From the user's point of view, UI lag is from the moment of input until the completion of the task (or at least until the next message pump with an "in progress" bar), not from the moment of input until the processors start to work on it.
Crashworks
Yes, that's true for current-generation multi-core processors. But AFAIK the new many-core processors have fast but simple in-order cores, and those single cores will (may?) be slower than current cores. So we'll have to bother about parallelizing code (much?) earlier. UI: well, yes, maybe the lag won't be noticeable (using one input-handling thread that just instructs other cores and stays responsive).
frunsi
These are sort of leading questions, because I actually do write latency-sensitive UI code for systems of multiple in-order cores (e.g. Xenon). It's completely possible to get a decrease in total lag from threading up a task - you just need to remember that "lag" is measured from start to end of the task as a whole, and decide how much it can be parallelized based on that. There's a tradeoff between startup time and execution time, with an optimal point at some number of threads.
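
That tradeoff is easy to see in a toy cost model (all constants invented): total lag is per-thread startup cost plus the divided work, and it bottoms out at some intermediate thread count.

    #include <iostream>

    int main() {
        const double work = 1000.0;   // total work in microseconds (invented)
        const double startup = 50.0;  // per-thread spawn/signal cost (invented)

        // lag(n) = startup * n + work / n: adding threads shrinks the
        // work term but grows the startup term, so total lag is lowest
        // at some intermediate thread count (here around 4-5 threads).
        for (int n = 1; n <= 8; ++n)
            std::cout << n << " threads -> "
                      << startup * n + work / n << " us\n";
    }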
Crashworks
+3  A: 

removed hardware cache coherency which effectively means each "tile" will be completely separate in what it stores in local L2 cache.

That's nice: it probably means that programs built from low-coupling subsystems doing mostly independent work can give each subsystem its own processor and cache, leading to a higher rate of cache hits and better individual performance. The bottleneck will shift to the communication between these modules, so efficient low-level tools will probably have to be written specifically to optimize it before most programmers gain real benefit from the architecture.

Programming will have to focus on writing modules with very low coupling, which expose their functionality and offer it as a service. As compilers become aware of scalable processor architectures, they will "know" to assign each of these modules to its own processor, with the possibility of module redundancy across processors - running multiple copies of the same module, each on its own processor, with a dynamically chosen number of duplicates, in order to scale with the number of requests per unit of time that each module receives.
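
A sketch of that "module as a service" shape (the module and its state are made up): the module owns its data outright, so its working set can stay in one tile's local cache, and callers reach it only through a call boundary rather than through shared mutable state.

    #include <future>
    #include <iostream>
    #include <string>

    // A hypothetical low-coupling module: it owns its data outright and
    // exposes one service entry point. Nothing outside touches its
    // state, so the state can live entirely in one tile's local cache.
    class SpellChecker {
        std::string known_word_ = "architecture";  // private, never shared
    public:
        bool check(const std::string& word) const {
            return word == known_word_;
        }
    };

    int main() {
        SpellChecker module;

        // The call boundary stands in for the mesh communication system:
        // a request goes in, a reply comes out, and no shared mutable
        // state crosses it.
        std::future<bool> reply = std::async(std::launch::async, [&] {
            return module.check("architecture");
        });
        std::cout << std::boolalpha << reply.get() << '\n';
    }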

So basically, load balancing will become an option for the desktop as well, especially for core OS functionality that is called frequently by a large number of processes at the same time - having redundant copies of the kernel will let more processes get more work done at the same time, without having to wait for the OS.

Also, virtualization will become more of a tool of the trade, being integrated into the compiler. Code written natively for different platforms will be able to cooperate from within the same program unit, as different bits will run atop different processors and the communication architecture will seamlessly integrate them. Different parts of the same application will be written in different programming languages, since they will be compiled separately and deployed as services.

luvieere
I strongly disagree; losing cache coherency is only "nice" if you're designing the hardware. It causes a huge headache for the programmers. Low-coupling systems should run equally well on a cache-coherent system as on a non-CC system, given a sensible scheduler. Non-CC is not a feature, it's a trade-off for something else (such as many cores on the same bus). I cannot think of a single commercially successful non-CC shared-memory machine (but I'm willing to be proven wrong).
Per Ekman
Everything is a trade-off, and new situations imply new ways of doing things. It's not going to help to write software that assumes cache coherency; it is going to help if you understand the limits and use them to your advantage.
luvieere
A: 

"Unveiled" meaning "showed a prototype": the press release says Intel will get up to 6- and 8-core CPUs in 2010. 48-core chips aren't going to hit production for years, if ever.

Anyway, I've been hearing this "oh no dozens of cores" for years. As an industry we have enough trouble making single-threaded sequential code work correctly. When Firefox crashes or IE can't render right, you realize that adding more cores to the situation would not help one bit. At work I've got a 4-core box and it's already more cores than any of my software can use. Unless you're a gamer or doing HPC or cloud hosting, you're probably not CPU-bound very much of the time, if at all.

If you look at a typical system, the bottleneck is in user interaction. So why aren't they putting all their energy into optimizing that? I guess because they're a chip company, and making faster CPUs is the hammer they hit everything with, no matter how non-nail-like the problem is.

It's a press release about an experimental design. It's like Ferrari boasting about a new 10L 18-cylinder engine: it sounds cool, and it might sell more street Ferraris, but 99.999% of us aren't actually going to have to worry about driving at 130mph.

Ken