I have an application distributed over two nodes. When I halt() the first node, the failover works perfectly, but (sometimes?) when I restart the first node the takeover fails and the application crashes because start_link returns already_started.

SUPERVISOR REPORT  <0.60.0>                                 2009-05-20 12:12:01
===============================================================================
Reporting supervisor                          {local,twitter_server_supervisor}

Child process
   errorContext                                                     start_error
   reason                                         {already_started,<2415.62.0>}
   pid                                                                undefined
   name                                                                    tag1
   start_function                                {twitter_server,start_link,[]}
   restart_type                                                       permanent
   shutdown                                                               10000
   child_type                                                            worker

ok

My app:

start(_Type, Args)->
    twitter_server_supervisor:start_link( Args ).

stop( _State )->
    ok.

My supervisor:

start_link( Args ) ->
    supervisor:start_link( {local,?MODULE}, ?MODULE, Args ).

Both nodes are using the same sys.config file.

What am I misunderstanding about this process that would explain why the above does not work?

A: 

It seems your problem stems from twitter_server_supervisor trying to start one of its children. The error report complains about the child with start_function

{twitter_server,start_link,[]}

Since you are not showing that code, I can only guess that it is trying to register a name for itself, but there is already a process registered under that name.
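To make that guess concrete, here is a minimal sketch of a gen_server that registers itself under a global name. It is not your twitter_server: the module name global_name_demo and its empty state are made up for illustration, and I am assuming your start_link looks roughly like this one.

```erlang
%% Hypothetical stand-in for the unseen twitter_server module.
-module(global_name_demo).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Register under a global (cluster-wide) name instead of a local one.
%% If any connected node already has a process registered under this
%% name, this returns {error, {already_started, Pid}}.
start_link() ->
    gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).

init([]) ->
    {ok, no_state}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Calling global_name_demo:start_link() twice shows the same error shape as in your report: the first call returns {ok, Pid}, the second returns {error, {already_started, Pid}}. A supervisor treats that second result as a start error for the child.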

Guessing further: the reason includes a pid, which belongs to the process that already holds the name we tried to claim:

{already_started,<2415.62.0>}

The first integer of that pid is non-zero; if it were zero, the pid would refer to a process on the local node. From this I deduce that you are registering a global name while connected to another node where a process is already globally registered under that name.
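If you want to check this from code rather than by reading the printed pid, something along these lines should work. The module name pid_origin and both function names are hypothetical helpers of mine; node/1 and global:whereis_name/1 are the standard calls.

```erlang
%% Hypothetical diagnostic helpers, not part of the asker's application.
-module(pid_origin).
-export([is_local/1, describe_global/1]).

%% true when the pid lives on the node evaluating this call.
is_local(Pid) when is_pid(Pid) ->
    node(Pid) =:= node().

%% Look up a globally registered name and report which node owns it.
describe_global(Name) ->
    case global:whereis_name(Name) of
        undefined ->
            not_registered;
        Pid when is_pid(Pid) ->
            {registered, Pid, node(Pid)}
    end.
```

Running pid_origin:describe_global/1 on the restarted node, with whatever name your server registers, would show whether the other node still owns that name at the moment the takeover starts.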

Christian