We've been using PARLANSE, a parallel programming language with explicit partial-order specification of concurrency, for the last decade to implement a scalable program analysis and transformation system (the DMS Software Reengineering Toolkit) that does mostly symbolic rather than numeric computation. PARLANSE is a compiled, C-like language with traditional scalar data types (character, integer, float), dynamic data types (string and array), compound data types (structure and union), and lexically scoped functions. While most of the language is vanilla (arithmetic expressions over operands, if-then-else statements, do loops, function calls), the parallelism is not. Parallelism is expressed by defining a "precedes" relation over blocks of code (e.g., a before b, a before c, d before c),
written as
(|; a ( ... a's computation ... )
    (<< a) b ( ... b's computation ... )
    (<< a) c ( ... c's computation ... )
    (>> c) d ( ... d's computation ... )
)|;
where the << and >> operators refer to "order in time". The PARLANSE compiler can see these parallel computations, preallocate all the structures necessary to run grains a, b, c, and d, and generate custom code to start and stop each one, thus minimizing the overhead of launching these parallel grains.
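For readers who don't know PARLANSE, here is a rough C++20 analogue of the partial order above (a before b, a before c, d before c), sketched with threads and latches. This is my illustration, not anything the PARLANSE compiler produces; PARLANSE compiles the ordering directly with far lower start/stop cost than spawning OS threads like this.

    // Sketch only: same dependency structure, expressed with C++20 threads/latches.
    #include <thread>
    #include <latch>
    #include <cstdio>

    int main() {
        std::latch a_done(1);      // released when grain a finishes
        std::latch d_done(1);      // released when grain d finishes

        std::thread a([&] { std::puts("a's computation"); a_done.count_down(); });
        std::thread d([&] { std::puts("d's computation"); d_done.count_down(); });
        std::thread b([&] { a_done.wait();                // a before b
                            std::puts("b's computation"); });
        std::thread c([&] { a_done.wait(); d_done.wait(); // a before c, d before c
                            std::puts("c's computation"); });

        a.join(); b.join(); c.join(); d.join();
    }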
See this link for a parallel iterative-deepening search for optimal solutions to the 15-puzzle, which is the 4x4 big brother of the 8-puzzle. It uses only pure potential parallelism as a parallelism construct, (|| a b c d ), which says there are no partial-order constraints among the computations a, b, c, d, but it also uses speculation and asynchronously aborts tasks that won't find solutions. That's a lot of ideas in a pretty small bit of code.
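The speculation-with-abort idea looks roughly like the following C++ sketch (again mine, not the linked PARLANSE code): sibling searches run with no ordering constraints, and the first one to find a solution aborts the rest through a shared flag. The branch IDs and the "pretend" solution point are made up for illustration.

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    std::atomic<bool> solved{false};

    // Hypothetical search over one branch; polls the abort flag as it goes.
    void search_branch(int branch_id) {
        for (int depth = 0; depth < 1000000; ++depth) {
            if (solved.load(std::memory_order_relaxed)) return;  // speculatively aborted
            if (branch_id == 2 && depth == 500) {                // pretend branch 2 finds it
                if (!solved.exchange(true))                      // first finder wins
                    std::printf("branch %d found a solution\n", branch_id);
                return;
            }
        }
    }

    int main() {
        std::vector<std::thread> grains;
        for (int i = 0; i < 4; ++i)          // like (|| a b c d): no ordering constraints
            grains.emplace_back(search_branch, i);
        for (auto& t : grains) t.join();
    }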
PARLANSE runs on multicore PCs. A big PARLANSE program (we've built many with more than a million lines) will contain thousands of these partial orders, some of which call functions that contain others.
So far we've had good results with up to 8 CPUs, modest payoff with up to 16, and we're still tuning the system. (We think a real problem with larger numbers of cores on current PCs is memory bandwidth: 16 cores pounding the memory subsystem create a huge bandwidth demand.)
Most other languages don't expose the parallelism, so it is hard to find, and their runtime systems pay a high price for scheduling computation grains through general-purpose scheduling primitives. We think that's a recipe for disaster, or at least for poor performance, because of Amdahl's law: if the number of machine instructions needed to schedule a grain is large compared to the work in the grain, you can't be efficient. OTOH, if you insist on computation grains with many machine instructions to keep the relative scheduling cost low, you can't find computation grains that are independent, and so you don't have any useful parallelism to schedule. So the key idea behind PARLANSE is to minimize the cost of scheduling grains, so that grains can be small, so that many of them can be found in real code. The insight into this tradeoff came from the abject failure of the pure dataflow paradigm, which did everything in parallel with tiny parallel chunks (e.g., the add operator).
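To put rough, purely illustrative numbers on that tradeoff (these are not measurements of PARLANSE or anything else): if a grain does W instructions of useful work and costs S instructions to schedule, then

    efficiency ≈ W / (W + S)

Hitting 95% efficiency requires W ≈ 19*S. With a general-purpose scheduler costing, say, S = 1000 instructions per grain, grains must be around 19,000 instructions, and independent grains that big are scarce in real code; with compiled-in start/stop code costing, say, S = 100 instructions, grains of about 1,900 instructions suffice, and those are far easier to find.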
We've been working on this on and off for a decade. It's hard to get right. I don't see how folks who haven't been building parallel languages, and using and tuning them over that kind of time frame, have any serious chance of building effective parallel systems.