I recently ran into a bug where an entire Erlang application died, yielding a log message that looked like this:

=INFO REPORT==== 11-Jun-2010::11:07:25 ===
     application: myapp
     exited: shutdown
     type: temporary

I have no idea what triggered this shutdown, but the real problem I have is that it didn't restart itself. Instead, the now-empty Erlang VM just sat there doing nothing.

Now, from the research I've done, it looks like there are other "start types" you can give an application: 'transient' and 'permanent'.

If I start a Supervisor within an application, I can tell it to make a particular process transient or permanent, and it will automatically restart it for me. However, according to the documentation, if I make an application transient or permanent, it doesn't restart it when it dies, but rather it kills all the other applications as well.

What I really want to do is somehow tell the Erlang VM that a particular application should always be running, and if it goes down, restart it. Is this possible to do?

(I'm not talking about implementing a supervisor on top of my application, because then it's a catch 22: what if my supervisor process crashes? I'm looking for some sort of API or setting that I can use to have Erlang monitor and restart my application for me.)

Thanks!

+3  A: 

You should be able to fix this in the top-level supervisor: set the maximum restart intensity to allow one million restarts per second, and the supervisor should never hit the limit and take the application down with it. Something like:

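init(_Args) ->
    %% {Strategy, MaxRestarts, MaxSeconds}: the supervisor gives up only if
    %% more than 1000000 restarts occur within 1 second.
    {ok, {{one_for_one, 1000000, 1},
          [{ch3, {ch3, start_link, []},
            permanent, brutal_kill, worker, [ch3]}]}}.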

(Example adapted from the OTP Design Principles User Guide.)

legoscia
Great, thanks very much for your answer. I see now that it died because the max restart limit was hit. I don't necessarily want to just disable that limit though, since if it actually gets into a restart loop then we may need to restart the entire app. Is there a way to have it restart the app if the AllowedRestarts/MaxSeconds limit is hit, instead of shutting down the app?
Nick
In the case you describe you would add a supervisor to your supervisor. The behavior OTP uses is this: when an exit signal reaches the process that made the start call for the application (i.e. when the top-level supervisor dies), it assumes that the application has failed to fix the error, and it will shut down the application and possibly the node, depending on the config. I guess the point is that your applications should not crash, and if they do, the error is serious enough to be resolved only by a node restart.
Lukas
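The "supervisor on top of your supervisor" arrangement from the comment above can be sketched as follows. This is only a minimal sketch under assumptions: the module name myapp_sup, the registered names, and the restart limits are hypothetical, and ch3 is the worker from the earlier example. When the inner supervisor exceeds its restart limit it exits, and the outer supervisor restarts the whole inner tree; only when the outer limit is also exceeded does the application itself go down.

```erlang
-module(myapp_sup).
-behaviour(supervisor).
-export([start_link/0, start_inner/0, init/1]).

%% Called from the application's start/2 callback (hypothetical name).
start_link() ->
    supervisor:start_link({local, myapp_sup}, ?MODULE, outer).

%% Started as a child of the outer supervisor.
start_inner() ->
    supervisor:start_link({local, myapp_inner_sup}, ?MODULE, inner).

%% Outer supervisor: restarts the inner supervisor (and with it the whole
%% worker tree) up to 3 times per hour before giving up.
init(outer) ->
    {ok, {{one_for_one, 3, 3600},
          [{myapp_inner_sup, {?MODULE, start_inner, []},
            permanent, infinity, supervisor, [?MODULE]}]}};
%% Inner supervisor: the usual limit for the workers. Exceeding it
%% terminates this supervisor, which the outer one then restarts.
init(inner) ->
    {ok, {{one_for_one, 10, 60},
          [{ch3, {ch3, start_link, []},
            permanent, brutal_kill, worker, [ch3]}]}}.
```

The limits are illustrative; the point is that the outer limit should be much looser than the inner one, so a restart loop in the workers triggers a full restart of the tree before it triggers an application shutdown.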
+3  A: 

You can use heart to restart the entire VM if it goes down, then use a permanent application type to make sure that the VM exits when your application exits.

Ultimately you need something above your application that you can trust, whether it is a supervisor process, the Erlang VM, or some shell script you wrote. It will always be a problem if that happens to fail as well.

Greg Rogers
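As a rough illustration of the permanent-application half of this answer, here is a minimal sketch. The module name myapp_boot is hypothetical, and myapp is the application name taken from the log in the question. It assumes the node itself is started with the -heart flag and that the HEART_COMMAND environment variable points at whatever script restarts the node.

```erlang
-module(myapp_boot).
-export([start/0]).

%% Hypothetical boot helper, e.g. run via: erl -heart -s myapp_boot start
start() ->
    %% permanent: if myapp terminates for any reason, all other
    %% applications and the runtime system are terminated as well,
    %% which lets heart (or an external monitor) restart the node.
    ok = application:start(myapp, permanent).
```

With the default type (temporary), the application's exit is only reported, which matches the idle VM described in the question.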
Okay, thanks. That sort of solution will work fine for me in this case. However, what if I wanted to run more than one application at once, and have them restart independently as needed? With all the fancy process supervision features Erlang includes, I find it amazing that I can't seem to do something as simple as restart an application when it goes down...
Nick
+1  A: 

Use Monit, then set up your application to terminate by using a supervisor for the whole application with a reasonable restart frequency. If the application terminates, the VM terminates, and Monit restarts everything.

I could never get Heart to be reliable enough, as it only restarts the VM once, and it doesn't deal well with a kill -9 of the Erlang VM.

inaka