views: 405
answers: 7
Today I was doing the thread-ring exercise from the Programming Erlang book and googled for other solutions to compare. I found that the language shootout has exactly the same problem as a benchmark. I had the impression that this is an area where Erlang should be fastest, but it turns out that C and C++ are again on top. My suspicion is that the C/C++ programs are not following the rules, which say "pass the token from thread to thread". After reading them, it seems that they both manipulate shared memory and global variables, which is different from the Erlang code, but I could be wrong.

My question is: are they doing the same thing, or is the C/C++ code conceptually different (and faster) than the Erlang one?

And another question: why is Haskell faster than Erlang when the solutions are very similar?

+6  A: 

The C version is using LWP, which I think is a user-space threading library. To what extent this is "unfair" is up for debate: I'd look at things like whether it supports true pre-emptive concurrency in the sense that you can make blocking system calls in a thread without blocking all the other threads (you can do this in Haskell, can you in Erlang?).

Haskell's threads are slightly more lightweight than Erlang's, because as I understand it an Erlang thread comes with a local heap (in the standard implementation) whereas Haskell threads all share the same heap.

Simon Marlow
What is the memory overhead of a Haskell(GHC) thread?
Daniel Velkov
Minimal. You can create thousands of threads without running into memory problems.
jrockway
How much, compared with Erlang's 300 bytes?
Daniel Velkov
A GHC thread is represented by a thread state object: http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#ThreadStateObjects. You can do the math yourself by looking at the definition: http://darcs.haskell.org/ghc/includes/rts/storage/TSO.h
sclv
@djv 68 bytes on a 32-bit machine, 112 bytes on a 64-bit machine (may vary slightly depending on the version of GHC). Plus the stack, which is variable and expanded on demand: by default we allocate 1k stacks.
Simon Marlow
+4  A: 

Ultimately, message-passing on modern machines is implemented using some form of shared memory to pass the messages (along with either locks or atomic instructions). So all the C and C++ implementations are really doing is inlining the implementation of message-passing straight into their code. A similar benchmark that uses a fast message-passing library in C, also benchmarked against Haskell and Erlang, can be found in this paper: http://www.cs.kent.ac.uk/pubs/2009/2928/index.html (section 5.1)

The speed of the various approaches is really determined by the concurrent run-time systems involved. Haskell has had a lot of good work done in this area, which leaves it ahead of Erlang. Of course, measuring speed on micro-benchmarks is often misleading and leaves out important factors like the readability of the code. A question to bear in mind might be: which of the solutions in the shootout would you be happy to maintain?

Neil Brown
depends on which one is making more money :)
Tom
>> Of course, measuring speed on micro-benchmarks is often mis-leading, and leaves out important factors like the readability of the code. << Which is one of the reasons why the benchmarks game shows the program source code ;-)
igouy
+4  A: 

I don't think I'd call it cheating. The primary, fundamental difference between multiple threads and multiple processes is that multiple threads share a single address space. As such, specifying multiple threads (rather than multiple processes) seems to me like tacit permission to take advantage of the shared address space (at least in the absence of some very specific definition of "passing" that prohibited this).

What it comes down to is this: Erlang doesn't really have threads, as such -- it has processes with asynchronous communications. The processes are (intentionally) isolated from each other to a large degree. On one hand, this makes development considerably easier -- in particular, one process can only affect another via specific, well-defined channels of communication. Under the hood, it uses lots of tricks (almost certainly including shared memory) to optimize its processes and take advantage of what's possible in a specific implementation/situation (such as all the processes running in a single, shared address space). Nonetheless, having to keep all the tricks hidden keeps it from being quite as efficient as something like the C version where the "tricks" are all explicit and completely exposed.

I'd use a real-life analogy to explain the difference. Think of the threads/processes as people. Erlang enforces a professional working relationship where communications are all carefully recorded in memos. The C and C++ versions are more like a husband and wife who might communicate with a single word that doesn't mean anything to anybody else, or even just a single quick glance.

The latter is extremely efficient when it works -- but it's a lot more open to subtle misunderstandings, and if the two have a fight you probably don't want to be in the same room. For the manager, people in purely professional relationships are a lot easier to manage, even if their peak efficiency isn't quite as high.

Jerry Coffin
+2  A: 

not following the rules

Given how many really quite different approaches there are to programming concurrency, I found it very difficult to be inclusive enough to bring in different language implementations while still retaining some vague comparability.

Now look at the performance of the same programs measured under different run-time configurations, and note how much that matters -

SMP quad core
Ratio  Program        CPU s   Elapsed s  Memory KB  Code B  CPU load
1.0    C GNU gcc #3   4.16    4.17       16,952     607     100% 0% 1% 0%
4.2    Haskell GHC    17.99   17.42      4,020      307     100% 1% 1% 1%
7.5    Erlang HiPE    31.12   31.13      10,504     273     0% 0% 0% 100%

No SMP, one core
Ratio  Program        CPU s   Elapsed s  Memory KB  Code B  CPU load
1.0    C GNU gcc #3   4.16    4.16       16,952     607     1% 0% 0% 100%
1.4    Haskell GHC    5.89    5.89       3,204      307     0% 0% 0% 100%
2.6    Erlang HiPE    10.64   10.63      9,276      273     0% 0% 0% 100%
igouy
How about requiring that a fully pre-emptive thread abstraction is used? (no thread can prevent any other from making progress)
Simon Marlow
Without considering whether or not that would be a "good thing" which of the first 10 programs would be excluded?
igouy
Regarding the slower 4 core results for GHC, are we allowed to use processor affinity to improve that?
Simon Marlow
I have no idea how many programs, if any, would be excluded by requiring pre-emptive threads. It seems like a sensible requirement though - the benchmark has most value if it measures the performance of an abstraction that is useful to the majority of users. Non pre-emptive threads have their uses, but I think most users would need pre-emption.
Simon Marlow
GHC threads still don't pre-empt until an allocation, do they?
Ganesh Sittampalam
@Simon Marlow - allowed to use processor affinity to improve that? - That's a good question. Can you set processor affinity and still use +RTS -N4 -RTS ?
igouy
@Simon Marlow >> I have no idea how many programs, if any, would be excluded by requiring pre-emptive threads. << And you think I'm going to be able to figure it out? :-)
igouy
@Ganesh that's true. I consider it a bug. Most programs affected are benchmarks, though.
Simon Marlow
@igouy for this benchmark we just need to fix threads to a single core with forkOnIO, no flags should be necessary. You can also fix OS threads to cores with +RTS -qa, which might help a little.
Simon Marlow
The C++ program sets processor affinity.
Daniel Velkov
@Simon Marlow >> forkOnIO << Let me ask a stupid question - Can you have part of the program pinned to a particular core and the rest of the program using multiple cores?
igouy
@igouy sure. forkOnIO creates a Haskell thread that is pinned to one core for its lifetime, and forkIO creates a Haskell thread that is migrated automatically by the runtime for load-balancing.
Simon Marlow
@Simon Marlow >> are we allowed to use processor affinity to improve that? << Yes use forkOnIO.
igouy
@igouy I see it already uses forkOnIO :) The slowdown relative to the 1-core version is therefore due to the overhead of the extra locking in the MVar and thread operations in the threaded version of the runtime.
Simon Marlow
@Simon Marlow - Oh! So it's a "genuine" difference.
igouy
@igouy in a sense yes - however the C program here runs within a single OS thread, so if that is allowed then the Haskell program should be allowed to drop the -threaded flag, bringing its performance back up. However, I think another defensible position is to disqualify this C program on the grounds that it isn't using pre-emptive threads (I checked: LWP is a cooperative threading abstraction), and move it to the "interesting alternative" programs.
Simon Marlow
@Simon Marlow >> the grounds that it isn't using pre-emptive threads << Except that there isn't a requirement to use pre-emptive threads.
igouy
@Simon Marlow - Instead, what task would pre-emptive threads perform faster?
igouy
@igouy "Except that there isn't a requirement to use pre-emptive threads" sure - it was just a suggestion. If you decide to keep the C program as is, then I think it's reasonable for us to request that the Haskell program be compiled without -threaded.
Simon Marlow
@igouy "Instead, what task would pre-emptive threads perform faster" I'm not sure I understand what you're asking here. The point of pre-emption is that it's a more useful abstraction: the abstraction gives stronger fairness guarantees, so it's easier to program with in practice. Without pre-emption it's easy to accidentally write programs that deadlock. Most people want pre-emption when they use threads.
Simon Marlow
@Simon Marlow - Reluctant as I am to exclude a whole class of language implementations, I think your suggestion has enough merit (even if it not-coincidentally pushes a Haskell program to the top).
igouy
+3  A: 

why is Haskell faster than Erlang when the solutions are very similar?

Haskell GHC is a compiled, native-code optimized language implementation with very fast threads. It is generally much faster than Erlang/HIPE. Erlang doesn't have a monopoly on lightweight threads :-)

Don Stewart
+4  A: 

I would answer with another question: how is the Erlang runtime implemented under the hood?

Chances are it's implemented in C or a similar systems language (I doubt they did it all in assembly). Or at the very least, the concepts it expresses can be expressed just as efficiently in C.

Now, why do you find it so strange that an optimized C version (the shootout certainly does not show average-level code) would beat the Erlang version, considering that Erlang adds its own level of complexity/indirection?

Whatever the type of benchmark, it's always possible for a C implementation to beat the most polished program in another language built on top of C, simply because you can take the C version it generates and remove the bits that you don't need.

On the other hand:

  • how much time did it take you to write your code?
  • what's your degree of trust that it does the right thing? That it won't deadlock?
  • which one would you rather maintain?

Sometimes "faster" just isn't worth the cost.

Matthieu M.
Yes, that was my question: whether the different solutions are on the same level of abstraction. I think we all agree that they aren't in this case. As for your questions - definitely not the C/C++ programs.
Daniel Velkov
A: 

One thing to note in that benchmark is that there is only one token to pass around, which means that in practice it's just a single-threaded program reading and writing from/to memory.

I would expect the result to come out different on a multiprocessor machine (or make that a cluster) where the threads/processes have to pass M tokens around in some random order.

Hmm. Also, give the developers of the benchmark solutions an equal number of hours to finish their solutions. Then I would expect Erlang to come out on top (or at least close to the top).

Daniel Luna
"One thing to note" +1 "Then I would expect" -1
igouy
You are probably right. That's assuming a bit too much. Thank you for your comment.
Daniel Luna