Windows NLB works great and removes a computer from the cluster when that computer is dead.

But what happens if the application dies but the server still works fine? How have you solved this issue?

Thanks

+1  A: 

By not using NLB.

Hardware load balancers often have configurable "probe" functions to determine whether a server is responding to requests. A probe can access the real application port/URL, or a dedicated "healthcheck" URL that returns successfully only if the application is healthy.
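As a rough illustration of the "healthcheck" URL idea, here is a minimal sketch using only the Python standard library. The `/healthcheck` path, port 8080, and the names `evaluate_health` and `make_handler` are my own assumptions for the example, not anything a particular load balancer requires; the probe is assumed to treat any non-200 response as unhealthy.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every named check and return (HTTP status, response body)."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that blows up counts as a failed check.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "FAIL: " + ", ".join(failed)
    return 200, "OK"


def make_handler(checks: Dict[str, Callable[[], bool]]):
    """Build a request handler that serves the (hypothetical) /healthcheck URL."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthcheck":
                self.send_error(404)
                return
            status, body = evaluate_health(checks)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            # Keep probe traffic out of the console log.
            pass

    return HealthHandler


if __name__ == "__main__":
    # Placeholder check; a real one would test the database, queue, etc.
    checks = {"always_ok": lambda: True}
    HTTPServer(("0.0.0.0", 8080), make_handler(checks)).serve_forever()
```

The point of returning 503 rather than just an error page with status 200 is that most probes look at the status code first, so the failure is visible without parsing the body.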

Other options on these devices look at the request queue or the time taken to respond to requests.

Cisco puts it like this:

The Cisco CSM continually monitors server and application availability using a variety of probes, in-band health monitoring, return code checking, and the Dynamic Feedback Protocol (DFP). When a real server or gateway failure occurs, the Cisco CSM redirects traffic to a different location. Servers are added and removed without disrupting service—systems easily are scaled up or down.

(from here: http://www.cisco.com/en/US/products/hw/modules/ps2706/products_data_sheet09186a00800887f3.html#wp1002630)

Paul
A: 

Presumably with Windows NLB there is some way to programmatically set the weight of nodes? The nodes should self-monitor and if there is some problem (e.g. a particular node is low on disc space), set its weight to zero so it receives no further traffic.

However, this needs to be carefully engineered and have further human monitoring to ensure that you don't end up with a situation where one fault causes the entire cluster to announce itself down.
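A minimal sketch of that self-monitoring logic, including the guard against the whole cluster draining itself, might look like the following. Everything here is an assumption for illustration: the function names, the thresholds, and the idea that a node can learn how many healthy peers remain. The actual step of taking the node out of rotation is deliberately left out, since it depends on whatever mechanism your cluster exposes.

```python
import shutil


def has_enough_disk(path: str, min_free_fraction: float) -> bool:
    """Return True if the volume holding `path` has at least the given free fraction."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


def should_drain(node_healthy: bool, healthy_peers: int, min_healthy_peers: int) -> bool:
    """Decide whether this node should stop accepting traffic.

    A healthy node never drains. An unhealthy node drains only if enough
    healthy peers remain, so a fault shared by every node cannot take the
    entire cluster offline at once.
    """
    if node_healthy:
        return False
    return healthy_peers >= min_healthy_peers


if __name__ == "__main__":
    # Hypothetical thresholds: 10% free disk, at least one healthy peer left.
    healthy = has_enough_disk(".", 0.10)
    if should_drain(healthy, healthy_peers=2, min_healthy_peers=1):
        print("node would remove itself from rotation and alert an operator")
    else:
        print("node stays in rotation")
```

When the guard refuses to drain an unhealthy node, that is exactly the case that should page a human.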

You can't really hope to deal with a "byzantine general" situation in network load balancing; an appropriately broken node may think it's fine and appear fine while being completely unable to do any actual work. The trick is to try to minimise the possibility of these situations happening in production.

MarkR
A: 

There are multiple levels of health check for a network application.

  1. is the server machine up?
  2. is the application (service) running?
  3. is the service accepting network connections?
  4. does the service respond appropriately to an "are you ok" request?
  5. does the service perform real work? (this will also check back-end systems behind the service you are probing)
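To make levels 3 and 4 concrete, here is a small sketch of a probe written with the Python standard library. The function names, the 2-second timeout, and the rule that only HTTP 200 counts as "ok" are assumptions for the example; level 5 is not shown because it depends entirely on what "real work" means for your application.

```python
import http.client
import socket
import urllib.error
import urllib.request


def accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Level 3: can a TCP connection be opened to the service port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def answers_health_request(url: str, timeout: float = 2.0) -> bool:
    """Level 4: does the service answer an "are you ok" URL with HTTP 200?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def probe(host: str, port: int, health_url: str) -> dict:
    """Run both probes; skip level 4 if level 3 already failed."""
    level3 = accepts_connections(host, port)
    level4 = level3 and answers_health_request(health_url)
    return {"accepts_connections": level3, "answers_health_request": level4}
```

A process can pass level 3 and still fail level 4, for example when its listener thread is alive but its worker pool is deadlocked, which is why the two are checked separately.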

My experience with NLB may be incomplete, but I'll describe what I know. NLB can do 1 and 2. With custom coding you can add the other levels with varying difficulty. With some network architectures this can be very difficult.

Most hardware load balancers from vendors like Cisco or F5 can be easily configured to do 3 or 4. Level 5 testing still requires custom coding.

Darron