We have clustered MSMQ for a set of NServiceBus services, and everything runs great until it doesn't. Outgoing queues on one server start filling up, and pretty soon the whole system is hung.

More details:

We have a clustered MSMQ instance shared between servers N1 and N2. The only other clustered resources are services that operate directly on the clustered queues as local queues, i.e. the NServiceBus distributors.

All of the worker processes live on separate servers, Services3 and Services4.

For those unfamiliar with NServiceBus, work goes into a clustered work queue managed by the distributor. Worker apps on Services3 and Services4 send "I'm Ready for Work" messages to a clustered control queue managed by the same distributor, and the distributor responds by sending a unit of work to the worker process's input queue.
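To make the flow concrete, here is a minimal in-memory sketch of that handshake. It is only a toy model under my own assumptions: the `Distributor` class and its method names are hypothetical and are not the NServiceBus API, and plain Python deques stand in for the MSMQ queues. It does show one property of the pattern: work only leaves the work queue when a ready message is outstanding.

```python
from collections import deque


class Distributor:
    """Toy model of the distributor (hypothetical, not the NServiceBus API)."""

    def __init__(self):
        self.work_queue = deque()     # stands in for the clustered work queue
        self.ready_workers = deque()  # stands in for the clustered control queue
        self.worker_inputs = {}       # worker name -> that worker's input queue

    def submit(self, job):
        # New work arrives in the work queue.
        self.work_queue.append(job)
        self._dispatch()

    def ready(self, worker):
        # A worker announces "I'm Ready for Work".
        self.worker_inputs.setdefault(worker, deque())
        self.ready_workers.append(worker)
        self._dispatch()

    def _dispatch(self):
        # Hand out one unit of work per outstanding ready message.
        while self.work_queue and self.ready_workers:
            worker = self.ready_workers.popleft()
            self.worker_inputs[worker].append(self.work_queue.popleft())


d = Distributor()
d.submit("job-1")
d.submit("job-2")
waiting_before = len(d.work_queue)  # nothing is dispatched: no worker is ready yet

d.ready("Services3")
d.ready("Services4")
waiting_after = len(d.work_queue)   # both jobs have been handed to workers
print(waiting_before, waiting_after)
```

In this model, if ready messages stop arriving, work simply accumulates in the work queue. That is consistent with a hang, but it does not by itself explain why messages pile up in the outgoing queues.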

At some point, this process can get completely hung. Here is a picture of the outgoing queues on the clustered MSMQ instance when the system is hung:

[Image: Clustered MSMQ Outgoing Queues in Hung State]

If I fail over the cluster to the other node, it's like the whole system gets a kick in the pants. Here is a picture of the same clustered MSMQ instance shortly after a failover:

[Image: Clustered MSMQ Outgoing Queues After Failover]

Can anyone explain this behavior, and what I can do to avoid it, to keep the system running smoothly?

A: 

How are your endpoints configured to persist their subscriptions?

What if one (or more) of your services encounters an error and is restarted by the Failover Cluster Manager? In that case, the service would never again receive an "I'm Ready for Work" message from the other services.

When you fail over to the other node, I guess that all your services send these messages again and, as a result, everything starts working again.

To test this behavior, do the following:

  1. Stop and restart all your services.
  2. Stop only one of the services.
  3. Restart the stopped service.
  4. If your system does not hang, repeat steps 2 and 3 with each of the other services, one at a time.

If your system now hangs again, check your configuration. In this scenario at least one of your services, if not all of them, loses its subscriptions between restarts. If you have not done so already, persist the subscriptions in a database.
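The test steps above can be sketched as a small dry-run script. This is only an illustration: the service names `WorkerA` and `WorkerB` are hypothetical placeholders, the helper names are my own, and the script only prints `sc.exe` commands instead of running them. The "check" step is left manual, since whether the system hangs has to be observed on the queues.

```python
def restart_plan(services):
    """Return the ordered test steps as (action, service) tuples."""
    # Step 1: stop and restart all services.
    plan = [("stop", s) for s in services]
    plan += [("start", s) for s in services]
    # Steps 2-4: stop and restart one service at a time, checking after each.
    for s in services:
        plan += [("stop", s), ("start", s), ("check", s)]
    return plan


def to_command(action, service, server):
    # sc.exe takes an optional \\server argument before the verb.
    return ["sc.exe", rf"\\{server}", action, service]


# Hypothetical service names; replace them with the real Windows service names.
for action, service in restart_plan(["WorkerA", "WorkerB"]):
    if action == "check":
        print(f"# check manually: does the system hang after restarting {service}?")
    else:
        print(" ".join(to_command(action, service, "Services3")))
```

Printing the plan first makes it easy to review the order before running anything against a live system.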

Sensei76
Subscriptions are already persisted in a shared database. The clustered distributor stores its state in a clustered MSMQ queue. If a worker is restarted by the failover cluster manager, one of the first things it does (on any startup) is to send the ReadyMessage.
David
It is true that the worker sends the ReadyMessage on startup. I am asking about the persisted subscriptions because I had a similar problem. One of the subscriptions was not saved correctly in the DB, so after a restart, although the service sent its message, the others ignored it completely because they only checked the DB. The only exception was when all services were restarted together; then the messages from the service in question were received again. On the next service restart, the messages failed again.
Sensei76