views:

68

answers:

3

Hi,

I have a serial(nonparallel) application written in C. I have modified and re-written it using Intel Threading Building Blocks. When I run this parallel version on an AMD Phenom II machine which is a quad-core machine, I get a performance gain of more than 4X which conflicts with the Amdahl's law. Can anyone give me a reason why this is happening?

Thanks, Rakesh.

+4  A: 

If you rewrite the program, you could make it more efficient. Amdahl's law only limits the amount of speedup due to parallelism, not how much faster you can make your code by improving it.

You could be realizing the effects of having 4x the cache, since now you can use all four procs. Or maybe having less contention with other processes running on your machine. Or you accidentally fixed a mispredicted branch.

TL/DR: it happens.

Borealid
A: 

Can anyone give me a reason why this is happening?

In a word, caches.

Each core has its own L1 cache and, therefore, simply by using more cores you have increased the amount of cache in-play which has, in turn, brought more of your data closer to where it will be processed. That alone can improve performance significantly (as if you had a bigger cache on a single core). When combined with near linear speedup from effective parallelization, you can see superlinear performance improvements overall.

Jon Harrop
+1  A: 

It's known as "super linear speedup", and can occur for a variety of reasons, though the most common root cause is probably cache behaviour. Usually when superlinear speedup occurs, it's a clue that you could make the sequential version more efficient.

For example, suppose you have a processor where some of the cores share an L2 cache (a common architecture these days), and suppose your algorithm makes multiple traversals of a large data structure. If you perform the traversals in sequence, then each traversal will have to pull the data into the L2 cache afresh, whereas if you perform the traversals in parallel then you may well avoid a large number of those misses, as long as the traversals run in step (getting out of step is a good source of unpredictable performance here). To make the sequential verison more efficient you could interleave the traversals, thereby improving locality.

Simon Marlow