I know that with Network Load Balancing and Failover Clustering we can make passive services highly available. But what about active apps?

Example: One of my apps retrieves some content from an external resource at a fixed interval. I have imagined the following scenarios:

  1. Run it on a single machine. Problem: if this instance fails, the content won't be retrieved.
  2. Run it on each machine of the cluster. Problem: the content will be retrieved multiple times.
  3. Have it on each machine of the cluster, but run it on only one of them. Each instance will have to check some sort of common resource to decide whether it is its turn to do the task or not.

While thinking about solution #3, I wondered what the common resource should be. I thought of creating a table in the database that we could use to take a global lock.

Is this the best solution? How do people usually do this?

By the way, it's a C# .NET WCF app running on Windows Server 2008.

+3  A: 

Message queues were invented for exactly this kind of problem. Imagine the case where your clustered applications all listen to a message queue (clustered itself :-)). At some point in time, one instance gets the command to download your external resource. If successful, that instance removes the message and posts another one with a later execution time equal to 'run time' + 'interval'. If the instance dies during processing, that's not a problem: the message is rolled back onto the queue (after a timeout) and some other instance can pick it up. A bit of transactions, a bit of message queues.
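
Here is a minimal sketch of that pattern in C# using MSMQ (System.Messaging); the queue path and FetchExternalResource are hypothetical, the queue must be created as transactional, and since MSMQ has no native delayed delivery, the 'later execution time' part is only hinted at in comments:

```csharp
using System;
using System.Messaging;   // reference System.Messaging.dll

class FetchWorker
{
    // Hypothetical transactional queue, created once with:
    //   MessageQueue.Create(@".\private$\fetch-jobs", true);
    const string QueuePath = @".\private$\fetch-jobs";

    static void Main()
    {
        var queue = new MessageQueue(QueuePath)
        {
            Formatter = new XmlMessageFormatter(new[] { typeof(string) })
        };

        while (true)
        {
            using (var tx = new MessageQueueTransaction())
            {
                tx.Begin();

                // Blocks until a trigger message arrives; MSMQ delivers any
                // given message to exactly one receiver in the cluster.
                Message trigger = queue.Receive(tx);

                FetchExternalResource();   // your actual download logic

                // Re-post the trigger for the next run. MSMQ has no native
                // delayed delivery, so a real implementation would carry a
                // due time in the body and have early receivers abort.
                queue.Send("fetch", tx);

                // If this process dies before Commit(), the receive rolls
                // back and another instance picks the message up.
                tx.Commit();
            }
        }
    }

    static void FetchExternalResource() { /* ... */ }
}
```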

I am on the JEE side of the world, so I can't help you with the coding details.

peter
Up-voted because this is a good pattern to follow; however, I think your answer isn't quite applicable to the OP, since he's looking at availability options specific to NLB and clustering, not enterprise architecture.
Josh E
Take a look at Amazon Simple Queue Service; you can use a similar implementation (or even buy their service).
dwery
A: 

I think there are two concepts here: availability and processing.

The principle should be to dispatch the work across multiple services and host them on a platform that ensures availability.

Check out the Azure platform (and Windows Azure Queue); it may help you conceptualize service dispatch:

http://www.microsoft.com/windowsazure/windowsazure/

http://msdn.microsoft.com/en-us/azure/cc994380.aspx

NB: Don't focus on the cloud computing aspect, but on the Compute & Storage concepts.

JoeBilly
My question is: how do cloud solutions like Google App Engine execute periodic tasks? I don't know if they have periodic tasks in Azure.
Jader Dias
A: 

In some cases people find it useful to have three machines performing all of the requests, and then compare the results at the end, to make sure that the result is absolutely correct and that no hardware failure caused any problems while processing it. This is what they do on airplanes, for instance.

At other times, you can live with a single bad result and a small downtime while switching to a new service, and just want the next one to be OK. In that case, solution #3 with a heartbeat monitor is an excellent setup.

Other times again, people just need to be notified by SMS that their service is down, and the application will simply use some obsolete data until you manually perform some kind of failover.

In your case, I would say the latter is probably the most useful for you. Since you cannot really depend on the service at the other end being available, you would still have to come up with a solution for what to do in that case. Returning obsolete data may or may not be good enough for you. Sorry to have to say it: it depends.

Cine
I'm already sure that solution 3 is the one for me; what I am unsure about is the synchronization method.
Jader Dias
The question does not mention what type of content is being retrieved but it is probably a safe assumption that it varies with time (e.g. stock quotes) and there may be no guarantee that 3 servers making requests at slightly different times will receive the same data.
Tuzo
@Tuzo in my case the data is updated only every 2 minutes and is fetched once every 1m50s
Jader Dias
+1  A: 

I once implemented something similar using your solution #3.

Create a table called something like resource_lock, with a column (e.g. locking_key) that will contain a locking key.

Then at each interval, all instances of your app will (a code sketch follows below):

  1. Run a query like 'update resource_lock set locking_key = 1 where locking_key is null'. (You can of course also insert a server-specific id, a timestamp, etc.)
  2. If 0 rows updated: do nothing - another app instance is already fetching the resource.
  3. If 1 row updated: fetch the resource and set locking_key back to null.

There are two advantages to this:

  • If one of your servers fails, the resource will still be fetched by the servers that are still running.
  • You leave the locking to the database, this saves you from implementing it yourself.
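
A minimal C# sketch of this approach, assuming SQL Server; the connection string, single-row resource_lock table, and FetchExternalResource are hypothetical, and a timestamp is used instead of 1/null so a crashed holder's lock eventually expires (see the comment discussion below):

```csharp
using System;
using System.Data.SqlClient;

class LockedFetcher
{
    // Hypothetical connection string and single-row table:
    //   CREATE TABLE resource_lock (locking_key DATETIME NULL);
    //   INSERT INTO resource_lock VALUES (NULL);
    const string ConnStr = "Server=.;Database=app;Integrated Security=true";

    // Called by a timer on every node at each interval.
    static void TryFetch()
    {
        using (var conn = new SqlConnection(ConnStr))
        {
            conn.Open();

            // Atomically claim the lock; only one node's UPDATE will match.
            // The timestamp condition lets the lock expire if the holder
            // crashed mid-fetch.
            var claim = new SqlCommand(
                @"update resource_lock set locking_key = getutcdate()
                  where locking_key is null
                     or locking_key < dateadd(hour, -2, getutcdate())", conn);

            if (claim.ExecuteNonQuery() == 0)
                return;   // another instance is already fetching

            try
            {
                FetchExternalResource();   // your actual download logic
            }
            finally
            {
                // Release the lock so the next interval can be claimed.
                new SqlCommand(
                    "update resource_lock set locking_key = null", conn)
                    .ExecuteNonQuery();
            }
        }
    }

    static void FetchExternalResource() { /* ... */ }
}
```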
Eric Eijkelenboom
What if a failure occurs during the execution of the process?
Jader Dias
Then ask yourself: is it realistic to expect that the resource will be fetched successfully when trying again? If yes: implement some sort of retry mechanism. If no: just skip and wait for the next interval. I guess it also depends on how important it is that the resource is fetched each and every time.
Eric Eijkelenboom
@Eric I was asking about the row value. If the process that updated it to `1` stops, the value will probably stay that way and no process will ever fetch that resource again.
Jader Dias
Well, of course you should always have some sort of finally block where you release the resource. But you mean some sort of crash that you cannot recover from? In that case a timestamp may be better than zeros and ones. Then the condition for checking the lock could e.g. be (locking_key is null or locking_key < (current time - 2 hours)).
Eric Eijkelenboom
+1  A: 

From the simplicity point of view, the quickest/easiest way to accomplish what you're looking for would be to 'round-robin' your cluster so that for every request, a machine is selected (by a cluster management service or some such) to process it. Actual client requests don't go directly to the machine that handles them; they instead point to a single endpoint, which acts as a proxy to distribute incoming requests to machines based on availability and load. To quote the below-referenced link:

Network Load Balancing is a way to configure a pool of machines so they take turns responding to requests. It’s most commonly seen implemented in server farms: identically configured machines that spread out the load for a web site, or maybe a Terminal Server farm. You could also use it for a firewall (ISA) farm, VPN access points, really, any time you have TCP/IP traffic that has become too much load for a single machine, but you still want it to appear as a single machine for access purposes.

As for your application being "active", that requirement does not factor into this equation since whether 'active' or 'passive', the application still makes a request to your servers.

Commercial load balancers exist for serving HTTP-style requests, so that may be worth looking into, but with the load balancing features of W2k8, you may be best served tapping into those.

For more info on how to configure that in Win2k8, see this article.

This article is much more technical and focuses on using NLB with Exchange, but the principles should still apply to your situation.

See here for another detailed walk-through of NLB setup and configuration.

Failing that, you may be well served by searching / posting on ServerFault, since your application code is not (and should not be) strictly aware that the NLB even exists.

EDIT: added another link.

EDIT (the 2nd): The OP has corrected my erroneous conclusion about the 'active' vs. 'passive' concept. My answer to that is very similar to my original answer, save that the 'active' service (which, since you're using WCF, could easily be a Windows service) could be split into two parts: the actual processing portion and the management portion. The management portion would run on a single server and act as a round-robin load balancer for the other servers doing the actual processing. It's slightly more complicated than the original scenario, but I believe it would provide a good deal of flexibility as well as offer a clean separation between your processing and management logic.

Josh E
You didn't understand what I meant by active. In the active scenario, my servers wouldn't receive any requests. Instead, they would generate them.
Jader Dias
my apologies - I'll update my answer to reflect that
Josh E
@Josh thank you for your update
Jader Dias
@Josh but I have a problem with your solution. The "management portion" is already separated in my case, and I am concerned about its high availability. I don't want to run it on a single server.
Jader Dias
Hmm. In that case, you could have logic that allows for a 'master' server responsible for retrieving and disseminating updates. The master would be 'elected' by the servers in the cluster, with a new election taking place in the event that the master doesn't respond in a timely fashion to dissemination events. This is similar to how some Windows network services operate.
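
A rough sketch of such an election using a shared database table (the table, connection string, and 30-second staleness threshold are all assumptions; any shared store with an atomic conditional update would do):

```csharp
using System;
using System.Data.SqlClient;

class MasterElection
{
    // Hypothetical single-row table:
    //   CREATE TABLE master (node VARCHAR(64) NULL, beat DATETIME NULL);
    //   INSERT INTO master VALUES (NULL, NULL);
    const string ConnStr = "Server=.;Database=app;Integrated Security=true";
    static readonly string NodeId = Environment.MachineName;

    // Each node calls this periodically; returns true only on the master.
    static bool TryBecomeMaster()
    {
        using (var conn = new SqlConnection(ConnStr))
        {
            conn.Open();

            // One atomic UPDATE: refresh our heartbeat if we are already
            // master, or take over if the current master has gone quiet.
            var cmd = new SqlCommand(
                @"update master set node = @node, beat = getutcdate()
                  where node = @node
                     or node is null
                     or beat < dateadd(second, -30, getutcdate())", conn);
            cmd.Parameters.AddWithValue("@node", NodeId);
            return cmd.ExecuteNonQuery() == 1;
        }
    }
}
```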
Josh E
+1  A: 

There are some requirements that you probably know, but that are not described in the question, which makes giving an informed answer challenging. Some of these questions are:

  • Does the task have to complete successfully?
  • If the task does/does not complete successfully, "who" needs to know and what type of actions need to be performed?
  • What is the behavior if the task has not completed when the time comes to run the task again? Should it run or not?
  • How important is it that jobs run at the specified interval? If the interval is every 5 minutes, does it have to be exactly every 5 minutes, or could the task run after 5 minutes and 10 seconds?

The first step is to answer how the periodic task will be scheduled to run. One option is a Windows Scheduled Task, but that is not inherently highly available, although it may be possible to work around that. If you are using SQL Server, another alternative would be to use SQL Server Agent as the scheduler, since it will fail over as part of SQL Server.

The next step is to determine how to invoke the WCF application. The easiest option would be to trigger a job that invokes the WCF service through an NLB IP address. This could be considered a no-no if the database server (or another server in that zone) is calling into the application zone (of course, there are always exceptions, such as MSDTC).

Another option would be to use the queue model. This would be the most reliable in most situations. For example, SQL Server Agent could execute a stored procedure to enter a record in a queue table. Then, on each application server, a service could poll looking for a queued record to process. Access to the record in the queue would be serialized by the database, so the first server in would run the job (and that job would run only once).

Depending on the answers to the opening questions in this answer, you may have to add some more error handling. If the retrieval of the external resource is usually fairly short, you may want to simply keep the queue record locked with a select for update and, when the task is completed, update the status (or delete the record if you wish). This will block other service instances from processing the record while it is being processed on another server, and if a crash occurs during processing, the transaction should be rolled back and another service in the cluster can pick up the record. (Although you could increase the transaction timeout to as long as you think you need.)
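
A sketch of that polling service in C#, assuming SQL Server and a hypothetical job_queue table; the UPDLOCK and READPAST hints serialize access so competing pollers skip a locked record instead of blocking on it:

```csharp
using System;
using System.Data.SqlClient;

class QueuePoller
{
    const string ConnStr = "Server=.;Database=app;Integrated Security=true";

    // Run periodically by the service on every application server.
    static void PollOnce()
    {
        using (var conn = new SqlConnection(ConnStr))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                // UPDLOCK holds the row lock for this transaction; READPAST
                // makes other pollers skip the locked row, so the first
                // server in runs the job and the others see nothing.
                var pick = new SqlCommand(
                    @"select top 1 id from job_queue with (updlock, readpast)
                      where status = 'queued'", conn, tx);

                object id = pick.ExecuteScalar();
                if (id == null) { tx.Rollback(); return; }   // queue empty

                FetchExternalResource();   // your actual download logic

                var done = new SqlCommand(
                    "delete from job_queue where id = @id", conn, tx);
                done.Parameters.AddWithValue("@id", id);
                done.ExecuteNonQuery();

                // A crash before Commit rolls everything back and the row
                // becomes visible to another server's next poll.
                tx.Commit();
            }
        }
    }

    static void FetchExternalResource() { /* ... */ }
}
```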

If keeping a database lock for a long time is not viable, then you could change the logic and add some monitoring to the services. Now, when a job starts processing, its status would be changed from queued to running, and the record would be updated with the server that is processing it. Some sort of service status table could be created, and each service instance would update it with the current time every time it polls. This would allow other services in the cluster to reprocess jobs that show as running but whose assigned service hasn't "checked in" within a certain period.

This approach also has limitations: what if the task actually completed but database connectivity was somehow lost? The job could potentially run again. Of course, I don't think the problem of combining atomic database actions with other non-transactional resources (e.g. web requests, the file system) is going to be solved easily. I'm assuming you are writing a file or something; if the external content is also placed into the database, then a single transaction will guarantee that everything is consistent.

Tuzo
I liked the SQL Server Agent suggestion. I am sure many RDBMSs have similar features.
Jader Dias