removed hardware cache coherency, which effectively means each "tile" will be completely separate in what it stores in its local L2 cache.
That's nice: it will probably mean that in programs built from loosely coupled subsystems doing mostly independent work, each subsystem will get its own processor and cache, leading to a higher cache-hit rate and better individual performance. The bottleneck will shift to the communication between these modules, so efficient low-level communication tools will probably have to be written specifically to optimize it before most programmers see real benefit from the architecture.
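For a concrete (and entirely hypothetical) picture of what that share-nothing style could look like, here is a small Go sketch: each "tile" is modeled as a goroutine that owns its data outright and exchanges messages instead of sharing memory, since nothing in hardware would keep shared state coherent. All names here are invented for illustration.

```go
package main

import "fmt"

// A request crosses tile boundaries as an explicit message, never as
// shared memory: the input plus a channel to send the answer back on.
type request struct {
	input int
	reply chan int
}

// tile owns its local state (standing in for a private L2 working set)
// and is the only code that ever touches it.
func tile(work chan request) {
	localCache := map[int]int{} // private to this tile; never shared
	for req := range work {
		if v, ok := localCache[req.input]; ok {
			req.reply <- v // local hit: no cross-tile traffic at all
			continue
		}
		v := req.input * req.input // stand-in for real computation
		localCache[req.input] = v
		req.reply <- v
	}
}

func main() {
	work := make(chan request)
	go tile(work)

	reply := make(chan int)
	work <- request{input: 7, reply: reply}
	fmt.Println(<-reply) // 49
}
```

The point is the shape of the design: cross-tile traffic happens only at explicit message boundaries, which is exactly where those low-level communication tools would have to earn their keep.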
Programming will have to focus on writing modules with very low coupling, modules that expose their functionality and offer it as a service. As compilers become aware of scalable processor architectures, they will "know" to assign each such module to its own processor, possibly with module redundancy across processors: running multiple copies of the same module, each on its own processor, with a dynamic number of duplicates, so as to scale with the number of requests per time unit that each module receives.
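As a rough sketch of that dynamic-duplication idea (nothing here is a real compiler feature, and all names are invented), the dispatcher below spawns another copy of a module, with one goroutine standing in for one processor, whenever its request queue backs up, so the replica count tracks the request rate:

```go
package main

import (
	"fmt"
	"sync"
)

// module is one running copy of a service; each replica would ideally
// sit on its own processor.
func module(id int, requests chan int, wg *sync.WaitGroup) {
	defer wg.Done()
	for r := range requests {
		fmt.Printf("replica %d handled request %d\n", id, r)
	}
}

func main() {
	requests := make(chan int, 8)
	var wg sync.WaitGroup
	replicas := 0

	spawn := func() {
		replicas++
		wg.Add(1)
		go module(replicas, requests, &wg)
	}
	spawn() // start with a single copy of the module

	for r := 0; r < 32; r++ {
		select {
		case requests <- r:
		default:
			// Queue is saturated: duplicate the module instead of blocking.
			spawn()
			requests <- r
		}
	}
	close(requests)
	wg.Wait()
}
```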
So basically, load balancing will become an option for the desktop as well, especially for the core OS functionality that is called frequently by a large number of processes at the same time: keeping redundant copies of the kernel will ensure that more processes get more work done in the same amount of time, without having to wait on the OS.
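A minimal sketch of that desktop load balancing, again with invented names: several identical copies of a kernel-like service run side by side, and a trivial round-robin front end spreads calls across them so no single copy becomes the choke point.

```go
package main

import "fmt"

// syscallRequest is a stand-in for a call any process might make into
// the replicated OS service.
type syscallRequest struct {
	pid   int
	reply chan string
}

// kernelReplica is one of several identical copies of the service.
func kernelReplica(id int, in chan syscallRequest) {
	for req := range in {
		req.reply <- fmt.Sprintf("replica %d served pid %d", id, req.pid)
	}
}

func main() {
	const replicas = 3
	queues := make([]chan syscallRequest, replicas)
	for i := range queues {
		queues[i] = make(chan syscallRequest)
		go kernelReplica(i, queues[i])
	}

	// Round-robin dispatch: process N is routed to replica N mod replicas.
	for pid := 0; pid < 6; pid++ {
		reply := make(chan string)
		queues[pid%replicas] <- syscallRequest{pid: pid, reply: reply}
		fmt.Println(<-reply)
	}
}
```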
Also, virtualization will become more of a tool of the trade, integrated right into the compiler. Code written natively for different platforms will be able to cooperate from within the same program unit, as different pieces run atop different processors and the communication architecture integrates them seamlessly. Different parts of the same application will be written in different programming languages, since each part will be compiled separately and deployed as a service.
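One way to picture that language-neutral boundary, purely as a sketch under assumed conventions: two separately built "services" talk over a byte stream in plain JSON, so either end could just as well have been produced by a different compiler for a different language. Here net.Pipe merely stands in for whatever the real inter-processor fabric would be.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// call and result define the wire format; this neutral byte-level
// contract is the only thing the two sides share.
type call struct {
	Method string `json:"method"`
	Arg    int    `json:"arg"`
}

type result struct {
	Value int `json:"value"`
}

// service sees only bytes on the wire, never the caller's in-memory
// types, so it could be implemented in any language.
func service(conn net.Conn) {
	defer conn.Close()
	var c call
	if err := json.NewDecoder(conn).Decode(&c); err != nil {
		return
	}
	json.NewEncoder(conn).Encode(result{Value: c.Arg * 2})
}

func main() {
	client, server := net.Pipe() // stand-in for the communication fabric
	go service(server)

	json.NewEncoder(client).Encode(call{Method: "double", Arg: 21})
	var r result
	json.NewDecoder(client).Decode(&r)
	fmt.Println(r.Value) // 42
	client.Close()
}
```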