Updated: This question contains an error which makes the benchmark meaningless. I will attempt a better benchmark comparing F# and Erlang's basic concurrency functionality and inquire about the results in another question.

I am trying to understand the performance characteristics of Erlang and F#. I find Erlang's concurrency model very appealing, but am inclined to use F# for interoperability reasons. While out of the box F# doesn't offer anything like Erlang's concurrency primitives -- from what I can tell, async and MailboxProcessor only cover a small portion of what Erlang does well -- I've been trying to understand what is possible in F# performance-wise.

In Joe Armstrong's Programming Erlang book, he makes the point that processes are very cheap in Erlang. He uses (roughly) the following code to demonstrate this fact:

-module(processes).
-export([max/1]).

%% max(N) 
%%   Create N processes then destroy them
%%   See how much time this takes

max(N) ->
    statistics(runtime),
    statistics(wall_clock),
    L = for(1, N, fun() -> spawn(fun() -> wait() end) end),
    {_, Time1} = statistics(runtime),
    {_, Time2} = statistics(wall_clock),
    lists:foreach(fun(Pid) -> Pid ! die end, L),
    U1 = Time1 * 1000 / N,
    U2 = Time2 * 1000 / N,
    io:format("Process spawn time=~p (~p) microseconds~n",
          [U1, U2]).

wait() ->
    receive
        die -> void
    end.

for(N, N, F) -> [F()];
for(I, N, F) -> [F()|for(I+1, N, F)].

On my MacBook Pro, spawning and killing 100 thousand processes (processes:max(100000)) takes about 8 microseconds per process. I can raise the number of processes a bit further, but a million seems to break things pretty consistently.

Knowing very little F#, I tried to implement this example using async and MailboxProcessor. My attempt, which may be wrong, is as follows:

#r "System.dll"
open System.Diagnostics

type waitMsg =
    | Die

let wait =
    MailboxProcessor.Start(fun inbox ->
        let rec loop =
            async { let! msg = inbox.Receive()
                    match msg with 
                    | Die -> return() }
        loop)

let max N =
    printfn "Started!"
    let stopwatch = new Stopwatch()
    stopwatch.Start()
    let actors = [for i in 1 .. N do yield wait]
    for actor in actors do
        actor.Post(Die)
    stopwatch.Stop()
    printfn "Process spawn time=%f microseconds." (stopwatch.Elapsed.TotalMilliseconds * 1000.0 / float(N))
    printfn "Done."

Using F# on Mono, starting and killing 100,000 actors/processes takes under 2 microseconds per process, roughly 4 times faster than Erlang. More importantly, perhaps, is that I can scale up to millions of processes without any apparent problems. Starting 1 or 2 million processes still takes about 2 microseconds per process. Starting 20 million processes is still feasible, but slows to about 6 microseconds per process.

I have not yet taken the time to fully understand how F# implements async and MailboxProcessor, but these results are encouraging. Is there something I'm doing horribly wrong?

If not, is there some place Erlang will likely outperform F#? Is there any reason Erlang's concurrency primitives can't be brought to F# through a library?

EDIT: The above numbers are wrong, due to the error Brian pointed out. I will update the entire question when I fix it.

+11  A: 

In your original code, you only started one MailboxProcessor. Make wait() a function, and call it with each yield. Also you are not waiting for them to spin up or receive the messages, which I think invalidates the timing info; see my code below.
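
Roughly, that first change on its own looks like the sketch below (reusing the waitMsg type from the question; still no synchronization, so the timing would remain unreliable):

// Hypothetical minimal fix: wait is now a function, so every call
// starts a new MailboxProcessor instead of reusing a single one.
let wait () =
    MailboxProcessor.Start(fun inbox ->
        async { let! msg = inbox.Receive()
                match msg with
                | Die -> return () })

// ...and in max, create one agent per iteration:
//     let actors = [for i in 1 .. N do yield wait ()]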

That said, I had some success; on my box I can do 100,000 at about 25us each. Much beyond that, I think you start fighting the allocator/GC as much as anything, but I was able to do a million too (at about 27us each, though at that point it was using something like 1.5G of memory).

Basically each 'suspended async' (which is the state when a mailbox is waiting on a line like

let! msg = inbox.Receive()

) only takes some number of bytes while it's blocked. That's why you can have way, way, way more asyncs than threads; a thread typically takes like a megabyte of memory or more.
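
To get a rough feel for that, here's a small sketch (hypothetical, not carefully benchmarked) that starts a batch of agents blocked in Receive and then looks at approximate managed memory per agent and the OS thread count; exact numbers will vary with the runtime and GC state:

#r "System.dll"
open System
open System.Diagnostics

type suspendMsg =
    | Stop

// Start n agents that immediately block in Receive(), then compare
// approximate managed-heap usage and the OS thread count.
let measureSuspendedAgents n =
    let before = GC.GetTotalMemory(true)
    let agents =
        [ for _ in 1 .. n ->
            MailboxProcessor.Start(fun inbox ->
                async { let! msg = inbox.Receive()
                        match msg with
                        | Stop -> return () }) ]
    let after = GC.GetTotalMemory(true)
    let osThreads = Process.GetCurrentProcess().Threads.Count
    printfn "%d suspended agents: roughly %d bytes each, %d OS threads" n ((after - before) / int64 n) osThreads
    for agent in agents do agent.Post(Stop)

measureSuspendedAgents 100000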

Ok, here's the code I'm using. You can use a small number like 10, and define DEBUG (--define:DEBUG) to ensure the program semantics are what is desired (printf outputs may be interleaved, but you'll get the idea).

open System.Diagnostics 

let MAX = 100000

type waitMsg = 
    | Die 

let mutable countDown = MAX
let mre = new System.Threading.ManualResetEvent(false)

let wait(i) = 
    MailboxProcessor.Start(fun inbox -> 
        let rec loop = 
            async { 
#if DEBUG
                printfn "I am mbox #%d" i
#endif                
                if System.Threading.Interlocked.Decrement(&countDown) = 0 then
                    mre.Set() |> ignore
                let! msg = inbox.Receive() 
                match msg with  
                | Die -> 
#if DEBUG
                    printfn "mbox #%d died" i
#endif                
                    if System.Threading.Interlocked.Decrement(&countDown) = 0 then
                        mre.Set() |> ignore
                    return() } 
        loop) 

let max N = 
    printfn "Started!" 
    let stopwatch = new Stopwatch() 
    stopwatch.Start() 
    let actors = [for i in 1 .. N do yield wait(i)] 
    mre.WaitOne() |> ignore // ensure they have all spun up
    mre.Reset() |> ignore
    countDown <- MAX
    for actor in actors do 
        actor.Post(Die) 
    mre.WaitOne() |> ignore // ensure they have all got the message
    stopwatch.Stop() 
    printfn "Process spawn time=%f microseconds." (stopwatch.Elapsed.TotalMilliseconds * 1000.0 / float(N)) 
    printfn "Done." 

max MAX

All this said, I don't know Erlang, and I have not thought deeply about whether there's a way to trim down the F# any more (though it's pretty idiomatic as-is).

Brian
uh oh. I suspected I might be making some sort of error like this.
Tristan
Just to confirm, I should replace both wait with wait()? I get much, much worse results this way, in the range of 500 microseconds. I haven't digested what is really happening, so your second point is probably an even bigger issue.
Tristan
The above code gives me about 1000 microseconds per process, which is 100x worse than Erlang. But I'm not sure these two examples are the same. Decrementing a mutable variable seems very un-Erlang. My understanding is that the actors should only talk through messages.
Tristan
Well, the issue is that in F#, starting a mailbox and sending a message are completely fire-and-forget. Without those synchronization points, the main program just races through, and potentially only a handful of mailboxes are actually started or messages actually received before the final output is printed.
Brian
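
For reference, another way to get a synchronization point (instead of the ManualResetEvent/Interlocked counter in Brian's code) is MailboxProcessor.PostAndReply, which blocks the caller until the agent replies. A rough sketch with hypothetical names, not the code benchmarked above; note that it processes the shutdowns serially, so its timing characteristics differ:

type syncMsg =
    | DieAndConfirm of AsyncReplyChannel<unit>

// Each agent replies on its channel once it has handled the shutdown
// message, so the caller knows the message was really received.
let syncWait () =
    MailboxProcessor.Start(fun inbox ->
        async { let! msg = inbox.Receive()
                match msg with
                | DieAndConfirm reply -> reply.Reply() })

let shutDownAll n =
    let agents = [ for _ in 1 .. n -> syncWait () ]
    // PostAndReply blocks until the agent calls Reply, which gives a
    // built-in synchronization point but serializes the shutdowns.
    for agent in agents do
        agent.PostAndReply(fun channel -> DieAndConfirm channel)
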
I see, although I'm not positive Erlang is doing something different. Perhaps this is a bad benchmark. I will try a Ring example, where actors pass messages around in a circle. That will ensure I understand what is actually alive and processing.
Tristan
+7  A: 

Erlang's VM doesn't use OS threads or processes to switch to a new Erlang process. The VM simply counts function calls (reductions) in your code and jumps to another of its processes after a certain number of them, staying within the same OS process and the same OS thread.

The CLR uses mechanisms based on OS processes and threads, so F# has a much higher overhead cost for each context switch.

So the answer to your question is "No, Erlang is much faster at spawning and killing its processes".

P.S. You may find the results of that practical contest interesting.

ssp
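
One quick way to observe the ThreadPool underneath F# async (a small sketch of my own, not from the answer above): print the managed thread id before and after a non-blocking wait; the continuation typically resumes on a different pool thread.

open System.Threading

// The code before the sleep runs on the calling thread; the continuation
// after the non-blocking sleep is scheduled onto a ThreadPool thread.
let hop =
    async { printfn "before: thread %d" Thread.CurrentThread.ManagedThreadId
            do! Async.Sleep 100
            printfn "after:  thread %d" Thread.CurrentThread.ManagedThreadId }

Async.RunSynchronously hop
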
That is what I thought, which is why I found my results confusing. But they were wrong. However, is this an inherent limitation or just the way things are currently designed? I haven't looked at it fully, but a single-threaded async is described here: http://cs.hubfs.net/blogs/hell_is_other_languages/archive/2008/08/03/6506.aspx
Tristan
It's a nice example of userspace threads. If you look at F#'s sources, you'll see their async uses the CLR ThreadPool.
ssp
F# asyncs are userspace threads. They don't have the overhead associated with OS threads. They can employ the thread pool to exploit multiple cores, but there are never more threads than the hardware supports natively, and the common kinds of context switch remain within a single OS thread. With a single (non-hyperthreading) core they are purely userspace threads, like Erlang's.
RD1
Neither F# nor .NET has a userspace thread implementation. You can ask for the thread's id in your async { let! ... } code and see for yourself. Exactly how many system threads your code uses depends on many factors. For example, my code creates >800 threads when using async {} for TCP sockets and about 50 threads for raw sockets.
ssp
@ssp I ran Brian's code above, adding all the thread IDs within the async to a set and printing them at the end. There were exactly 5 threads, even though 100,000 agents (and asyncs) were started. Even though the thread pool is used, scheduling is done almost entirely in user space. The only exception is if you use blocking I/O, requiring the thread pool to be replenished while worker threads are blocked. But asyncs are intended to be used with non-blocking I/O. I'd guess that your example uses blocking I/O? Can you post a link to the code?
RD1
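
As a rough way to reproduce that kind of check (my own sketch, not RD1's exact code, and assuming .NET 4's ConcurrentDictionary is available): record the managed thread id inside each agent's body and count the distinct values at the end. With non-blocking agents the count stays small.

open System.Collections.Concurrent
open System.Threading

type countMsg =
    | Finish

// Record which managed thread each agent's body runs on, both when it
// starts and when it handles its message.
let threadIds = ConcurrentDictionary<int, bool>()

let trackingAgent () =
    MailboxProcessor.Start(fun inbox ->
        async { threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true) |> ignore
                let! msg = inbox.Receive()
                match msg with
                | Finish -> threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true) |> ignore })

let agents = [ for _ in 1 .. 100000 -> trackingAgent () ]
for agent in agents do agent.Post(Finish)
Thread.Sleep 2000   // crude: give the pool time to drain before counting
printfn "Distinct threads used: %d" threadIds.Count
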
ssp
@ssp Yeah, I tried your code - interesting. It looks like the extra threads block due to a race condition in the implementation of timeouts. When I changed it to use Observable.Timeout instead, these extra threads were avoided (but the cost involved with them was tiny anyway). So, if you're basing your answer on this example, then I think it is not representative at all. Normally async tasks swap amongst a small number of threads.
RD1
@RD1: Can you send your patch (or a description of the race condition) to [email protected]? I'm not saying each async {} creates a new thread. I'm saying that under the hood they use system threads, and when you call async {} your computation will be sent to another thread; whether that thread is a new one or an already created one depends on many factors. Have you seen source\fsharp\FSharp.Core\control.fs from the F# distro? The source of the async builder shows the threads/threadpool usage.
ssp
@ssp: Yeah, I've been looking at control.fs trying to understand why there's an excessive number of threads when a timeout is included. I'll post to fsbugs when I've properly isolated the issue. Note that the extra threads show as being blocked with 0.00000ms CPU time, and the threadpool only reports 4 worker threads having been created - so I think the extra threads are outside the threadpool.
RD1