I work on a small-scale application (about 5,000 users), but we do maintain some important user preference data. Whenever we release an upgrade, we check whether there are users online (we do it after hours, and usually there are none), then put up an outage page and apply the new build (both UI and DB changes). It all takes about half an hour and is usually pain-free.

But I always wonder how sites like Amazon or eBay or Google push upgrades to production. I know it's phased over time and across servers, but thousands of users are logged in at any moment and are continuously updating data. I know there is load balancing such that if one server is taken down, the user's session is seamlessly transferred to another machine, and that there are similar database options, but it still seems overwhelming to keep everything running (even on fewer servers) while smoothly upgrading the UI, DB and functionality.

Are there specific guidelines and strategies somewhere about deployment for large websites? Any whitepapers? What are the best practices in that area?

EDIT: I said half an hour, but that covers everything from when we take the app down to when we get it back up, including UI and functionality smoke tests, DB consistency checks and a small load test. The time to actually 'deploy' is in fact less than two minutes.

+1  A: 

Firstly, I think the important thing to remember is that this is not an "all or nothing" problem. Depending on your application, you have to weigh the cost of development against the cost of outages.

That said, there is plenty of low-hanging fruit you could go for without too much cost or effort. A deploy time of half an hour, best case, seems long even for a relatively complex application.

Automating the deploy steps can reduce your outage significantly. We've used Capistrano (www.capify.org) extensively for web applications and it works really well. Capistrano started as a Ruby on Rails deployment tool but it works equally well for other environments.

Add some non-destructive automated tests to verify a new production deployment; call them what you like, but "smoke tests" seems to describe them well. Capistrano has built-in rollback semantics that will quickly roll back the deployment if the smoke tests fail.
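
For illustration, here's a minimal sketch (in Python, not from the answer) of such a deploy-verify-rollback loop; deploy_release.sh, rollback_release.sh and the health URLs are hypothetical stand-ins for whatever your tooling provides:

import subprocess
import sys
import urllib.request

# Assumed smoke-test endpoints; replace with your app's key pages.
SMOKE_URLS = [
    "http://localhost:8080/health",
    "http://localhost:8080/login",
]

def smoke_test() -> bool:
    """Non-destructive checks: every URL must answer 200 quickly."""
    for url in SMOKE_URLS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            return False
    return True

subprocess.run(["./deploy_release.sh"], check=True)        # hypothetical deploy step
if not smoke_test():
    print("smoke tests failed, rolling back", file=sys.stderr)
    subprocess.run(["./rollback_release.sh"], check=True)   # hypothetical rollback
    sys.exit(1)
print("deploy verified")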

These techniques are relatively simple to get going and can reduce your outage to a few minutes. That may be good enough for many applications.

Upgrading without any outage is more challenging. Google and Amazon use a lot of sharding and asynchronous technology to scale up and roll out big applications.

For most of us mere mortals, it simply means coding each new version of the application to be backwards compatible with the previous version of the database: use default values for new columns, or hide a feature if its table doesn't exist yet. That allows you to deploy the app first and apply the database changes afterward.
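
A sketch of what that feature detection might look like (my illustration, using sqlite3 and a made-up user_tags table purely as an example):

import sqlite3

def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Probe the schema so new code can run against an old database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?",
        (name,),
    ).fetchone()
    return row is not None

conn = sqlite3.connect("app.db")

# The new feature's table may not exist on a not-yet-migrated database.
HAVE_TAGS = table_exists(conn, "user_tags")

def user_tags(conn, user_id):
    if not HAVE_TAGS:
        return []   # hide the feature instead of crashing
    return [r[0] for r in conn.execute(
        "SELECT tag FROM user_tags WHERE user_id = ?", (user_id,))]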

To deploy application code without an outage, you can remove half the nodes from the load balancer, upgrade the disabled nodes and then repeat the process for the remaining nodes. You still need a way to carry user sessions and the like across the switch.
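
In pseudo-Python, that half-by-half rolling upgrade might look like this (lb_ctl.sh and upgrade_node.sh are hypothetical wrappers around your load balancer and deploy tooling):

import subprocess

NODES = ["web1", "web2", "web3", "web4"]   # assumed node names

def lb_set(node: str, state: str) -> None:
    # Hypothetical script that marks a node up/down in the balancer.
    subprocess.run(["./lb_ctl.sh", state, node], check=True)

def upgrade(node: str) -> None:
    subprocess.run(["./upgrade_node.sh", node], check=True)  # hypothetical

half = len(NODES) // 2
for batch in (NODES[:half], NODES[half:]):
    for node in batch:
        lb_set(node, "disable")   # traffic now flows to the other batch
    for node in batch:
        upgrade(node)
        lb_set(node, "enable")    # node rejoins the pool on the new version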

leonm
RE making the app backwards compatible with previous versions of the database: I've always gone the other way and made the database backwards compatible with previous versions of the app. It seems to me that 90% of the time, DB changes are ADDING columns or tables; rarely do we take something away. So wouldn't it make more sense to deploy the new schema first, then deploy the new app? We did once do some restructuring, and then we actually did it in three steps: deploy a DB with the new stuff, deploy the new version of the app, then remove the stuff that was no longer used.
Jay
+1  A: 

You might want to take a look at Viget Deployment: http://github.com/vigetlabs/viget_deployment

It's a set of Capistrano recipes. The idea is to have the following tree:

app
|- current
|- revisions
|  |- 20090901xxxxx
|  |- 20090902yyyyy

Every time you deploy, the task creates a new revision directory named after the current date and time. "current" is a symlink to the live version. So you can easily upgrade, check that it works and then change the symlink; and you can just as easily revert to a previous version.
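
The symlink swap itself can be done atomically; here's a small Python sketch (the paths are illustrative, not from the recipes):

import os
import time

APP = "/var/www/app"   # assumed application root

def activate(revision: str) -> None:
    """Point app/current at the given revision, replacing it atomically."""
    target = os.path.join(APP, "revisions", revision)
    tmp = os.path.join(APP, "current.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, os.path.join(APP, "current"))  # atomic rename on POSIX

def new_revision_name() -> str:
    return time.strftime("%Y%m%d%H%M%S")

# Deploy: activate("20090902yyyyy"); roll back: activate("20090901xxxxx")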

As said before, there's no magic answer to this. What I'm saying here is only one solution.

Damien MATHIEU
+1  A: 

Half an hour seems like an awfully long time to deploy a new version. How big is your application? Most of my work is JSP/servlets, so we drop a new WAR file on, the servlet container unpacks it, and the new version is up in, well, I never timed it, but I'd think a minute, tops.

I have nothing to add on the substantive question. I don't do the actual deploys to production, only to test servers, so there I've simply relied on "drop it on, and if we catch you in the middle of a transaction, oops, too bad". :-) But for the apps I've worked on, I think shutting the server down for ten minutes to do a new deploy wouldn't deeply disturb anyone. I did work on a system that had several hundred thousand users from all around the world, so a deploy in the middle of the night Eastern Standard Time would fall at a pretty inconvenient time for our users in Afghanistan and Australia. But as I say, going offline for ten minutes wouldn't cause anybody great anxiety, so I don't think it was ever a big deal.

Actually, Google, despite its huge size, is probably not the best example of a deploy requiring careful control. If they dropped a new version and a few thousand users got a 500 error or whatever, so what? Those users would just say, "Huh, I wonder what happened", resubmit, and everything would be fine. I'd be more worried about an online order site, where we want to be very sure that a service interruption doesn't mean we take the customer's money but fail to record that we need to ship them merchandise, or vice versa. But then, that's what transaction processing and rollbacks are all about. If the worst thing that happened was that you clicked checkout, got an error message, clicked again, and everything worked, I'll bet few users would think twice about it.

Jay
+1  A: 

If you have a redundant environment (you should have one), say two frontends plus two databases (master and slave), an upgrade is as simple as moving all traffic to one of the "functional units", upgrading the other, then switching. Persistence of user sessions should be guaranteed by default, using the database or some persistence manager (we use Terracotta, which works like a charm), because you don't want your customers to lose their data if a server suddenly goes down in an unexpected outage.

The steps depend on the specific features and structure of your services, but in a "standard" Apache + Tomcat + MySQL scenario they look something like this:

  1. Switch your DNS/load balancer to a single frontend.
  2. Wait for all users to stop connecting to the other one (based on the DNS cache timeout, sticky sessions or whatever; see the drain sketch after this list).
  3. Stop synchronization/replication between the databases if you also have to alter the schema.
  4. Upgrade the "paused" frontend and DB.
  5. Bring the upgraded servers back up.
  6. Test that everything is working fine.
  7. Switch your DNS/load balancer to the other frontend and repeat.
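
A sketch of the drain wait in step 2 (my illustration; the use of iproute2's ss and the port number are assumptions about your setup):

import subprocess
import time

def established_connections(port: int = 443) -> int:
    """Count ESTABLISHED sockets on the given local port via `ss`."""
    out = subprocess.run(
        ["ss", "-tn", "state", "established", f"( sport = :{port} )"],
        capture_output=True, text=True, check=True,
    ).stdout
    return max(0, len(out.splitlines()) - 1)   # subtract the header line

# Run on the frontend that was just pulled out of rotation.
while established_connections() > 0:
    time.sleep(10)
print("frontend drained; safe to stop and upgrade it")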

Depending on your environment, you might also have to promote the slave to master (and vice versa) and perform a manual database re-sync during the upgrade procedure.

I think this approach is unfit for a Google-scale datacenter but it works well for small environments with just a few servers.

Of course, all of these tasks can be automated with shell scripts or a tool like Capistrano (as leonm suggested).

MariusPontmercy
+1  A: 

The tools, techniques and infrastructure required to make a large site fault-tolerant are useful for upgrades and downgrades too. Automation is key in these cases.

The datastores they use support record/object-level schema upgrades (see Google's Bigtable or Amazon's SimpleDB).
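
The answer doesn't spell out the mechanism, but one common pattern behind record-level schema upgrades is lazy migration on read, sketched here with made-up field names and a dict standing in for the datastore:

CURRENT_VERSION = 2

def upgrade_v1_to_v2(rec: dict) -> dict:
    # Hypothetical change: v2 split "name" into first/last name fields.
    first, _, last = rec.pop("name", "").partition(" ")
    rec.update(first_name=first, last_name=last, version=2)
    return rec

UPGRADES = {1: upgrade_v1_to_v2}   # version N -> upgrader to N+1

def read_record(store: dict, key: str) -> dict:
    """Upgrade a record the first time new code reads it."""
    rec = store[key]
    while rec.get("version", 1) < CURRENT_VERSION:
        rec = UPGRADES[rec.get("version", 1)](rec)
        store[key] = rec   # write back so the upgrade happens only once
    return rec

store = {"u1": {"name": "Ada Lovelace", "version": 1}}
print(read_record(store, "u1"))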

Roberto Lupi
+1  A: 

I agree with the other answers: 30 minutes to upgrade an app with only 5,000 users seems unusually long, and automation may shrink your downtime enough to make any other solution unnecessary. That said, downtime is not your only problem; you also need to think about rollback. IMHO, enabling rollback (of code and data) is more important than reducing deployment downtime, since DB rollbacks usually take longer than deployments do, and (unless you think it through ahead of time) a rollback can sometimes mean permanent loss of data written since your last backup.

I'm not a fan of building web apps to handle both older and newer database schemas (or databases to handle both older and newer apps), because you vastly increase test cost by adding codepaths that are only exercised during upgrades.

As a cheaper alternative to backwards compatibility, I like using a "read-only mode" during DB-change deployments. Once you put the app into read-only mode, you can migrate to a new DB while still being able to roll back quickly to the old one (without losing data) if something goes wrong. Since most web apps read data much more than they write it, you can often turn off writes with minimal user disruption.

With read-only mode, a no-downtime deployment can look like this:

  1. Deploy a new version of the app to run side by side with the old one (e.g. on a different IP address).
  2. Put the app into read-only mode; users who attempt writes get a friendly error message (a sketch of a simple read-only switch follows this list).
  3. Run DB migration scripts which create the new DB, migrate data from the old DB, and make any schema changes.
  4. Validate that the new app works against the new DB.
  5. Put the new app into read-only mode too. This prevents users from seeing inconsistent data and UI as web servers are switched from the old to the new app version. If a little inconsistency is OK, skip this step.
  6. Switch live traffic over to the new app version (e.g. by changing configuration at the load balancer to point to the new IPs).
  7. If anything goes wrong in the steps above, simply turn off read-only mode. Your app is now working normally against the old DB. You can fix the problem and retry the deployment later.
  8. If the new app is working OK, put it into read-write mode.
  9. If anything goes wrong after that, switch the load balancer back to the old app version. This reverts you to the old DB and you'll lose data written since the upgrade, but that's usually OK (and you can always insert the lost data as part of your migration script when you retry the deployment after fixing the problem).
  10. At your leisure, clean up the old DB and the old app version.
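
A minimal sketch of the read-only switch from steps 2 and 8 (the flag-file path and exception type are my inventions; a real app might keep the flag in a config service or the DB instead):

import os

READONLY_FLAG = "/var/run/app/readonly"   # assumed flag-file path

class ReadOnlyMode(Exception):
    """Raised so the UI can show a friendly 'try again shortly' page."""

def guard_write() -> None:
    if os.path.exists(READONLY_FLAG):
        raise ReadOnlyMode("maintenance in progress, writes are paused")

def save_preferences(user_id: int, prefs: dict) -> None:
    guard_write()   # reads are unaffected; only writes are blocked
    ...             # normal write path goes here

Flipping the mode is then just creating or removing the flag file on every web server.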

The main advantage of the approach above is cheap/easy rollback, plus reduced test cost since apps only need to know about one version of the DB.

Obviously I'm simplifying here (e.g. if you have a partitioned or sharded DB, it's harder) but I suspect your 5000-user app has a single DB to worry about! :-)

Justin Grant