views: 726
answers: 11

When you roll out changes to a live web site, how do you go about checking that the live system is working correctly? Which tools do you use? Who does it? Do you block access to the site for the testing period? What amount of downtime is acceptable?

+11  A: 

I tend to do all of my testing in another environment (not the live one!). This allows me to push the updates to the live site knowing that the code should be working, and then I just do sanity testing on the live data - making sure I didn't forget a file somewhere and that nothing weird went wrong.

So proper testing in a testing or staging environment, then just trivial sanity checking. No need for downtime.

zigdon
+2  A: 

If you have a set of load-balanced servers, you can take them offline one at a time and update each one separately. No downtime for the users!

NimsDotNet
+1  A: 

Some of that depends on whether you're updating a database as well. In the past, if the DB was being updated, we took the site down for a planned (and published) maintenance period - usually something really off-hours, where impact was minimal. If the update didn't involve the DB then, in a load-balanced environment, we'd take one box out of the mix, deploy, and test. If that was successful, it went back into the mix and the other box (assuming two boxes) was taken out and updated/tested.

Note: we're NOT testing the code at this point, just that the deployment went smoothly, so downtime either way was minimal. As has been mentioned, the code should have already passed testing in another environment.
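
To make that sequence concrete, here's a rough Python sketch of the rolling update. Every helper in it (remove_from_pool, deploy_to, smoke_test, add_to_pool) is a hypothetical stand-in for whatever your load balancer and deployment tooling actually provide:

    # Hypothetical rolling update across a load-balanced pool.
    # Each helper is a placeholder for your real LB API / deploy scripts.
    import sys

    SERVERS = ["web1.example.com", "web2.example.com"]

    def remove_from_pool(host):   # tell the load balancer to stop sending traffic here
        print("draining", host)

    def deploy_to(host):          # push the new build to this box
        print("deploying to", host)

    def smoke_test(host):         # hit a few key URLs, return False on any failure
        print("smoke testing", host)
        return True

    def add_to_pool(host):        # put the box back into rotation
        print("re-enabling", host)

    for host in SERVERS:
        remove_from_pool(host)
        deploy_to(host)
        if not smoke_test(host):
            sys.exit(host + " failed smoke tests; leaving it out of the pool")
        add_to_pool(host)         # only then move on to the next box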

Chuck
A: 

Run your main server on a port other than 80. Stick a lightweight server (e.g. nginx) in front of it on port 80. When you update your site, start another instance on a new port. Test. When you are satisfied that it has been deployed correctly, edit your proxy config to point at the new port and reload it. In nginx's case this results in zero downtime and no failed requests, and it can also provide performance improvements over the more typical Apache-only hosting option.

Of course, this is no substitute for a proper staging server, it is merely a 'polite' way of performing the handover with limited resources.
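
As a rough illustration (not the author's actual config), the nginx side of that handover can be a fragment like the one below. The ports and upstream name are invented, and a real config needs the usual events/logging sections as well:

    # nginx.conf (fragment) -- ports are made up for the example
    http {
        upstream backend {
            server 127.0.0.1:8001;    # current release of the main app server
            # server 127.0.0.1:8002;  # next release: bring it up on 8002, test it
                                      # directly, then swap these two lines and run
                                      # `nginx -s reload` for a graceful switch
        }

        server {
            listen 80;

            location / {
                proxy_pass http://backend;
            }
        }
    }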

Jim
+3  A: 

At work, we spend a period of time with the code frozen in the test environment. Then, after a few weeks of notice, we take the site down at midnight on Friday, work through the night deploying and validating, and bring it back up late Saturday morning. Traffic statistics showed us this was the best time frame to do it.

Tom Ritter
+1  A: 

Have a cute, disarming image and/or backup page. Some sites implement simple javascript games to keep you busy while waiting for the update.

E.g., Twitter's fail whale.

Adam Davis
+1  A: 

IMHO long downtimes (hours) are acceptable for a free site. If you educate your users enough, they'll understand that it's a necessity. Maybe give them something to play with until the website comes back up (e.g. a flash game, or a webcam live feed showing the dev team at work). For a website that people pay to access, a lot of people are going to waste your time with complaints if you feed them regular downtime. I'd avoid downtime like the plague and roll out updates really slowly and carefully if I were running a service that charges users.

In my current setup I have a secondary website connected to the same database and cache as the live copy to test my changes.

I also have several "page watcher" scripts running on cron jobs that use regular expressions to check that the website is rendering key pages properly.
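
A minimal sketch of that kind of page-watcher script, assuming Python, with entirely made-up URLs and patterns (the real ones would be whatever key pages matter on your site):

    # Fetch each key page and check it against a regex; exit non-zero on
    # failure so cron (or a wrapper) can send an alert.
    import re
    import sys
    import urllib.request

    CHECKS = {
        "https://www.example.com/":      r"<title>My Site</title>",
        "https://www.example.com/login": r'name="password"',
    }

    failures = []
    for url, pattern in CHECKS.items():
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            if not re.search(pattern, html):
                failures.append(url + ": expected pattern not found")
        except Exception as exc:
            failures.append(url + ": " + str(exc))

    if failures:
        print("\n".join(failures))
        sys.exit(1)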

Gilles
+1  A: 

The answer is "it depends". First of all, on the kind of environment you are releasing into. Is it a "hello, world" type of website on a shared host somewhere, or a google.com with half a million servers? Is there typically one user per day, or more like a couple million? Are you publishing HTML/CSS/JPG, or is there a big hairy backend with SQL servers, middle-tier servers, distributed caches, etc.?

In general -- if you can afford to have separate environments for development, QA, staging, and production -- do have those. If you have the resources, create the ecosystem so that you can build the complete installable package with 1 (one) click, and make sure that the same binary install can be successfully installed in DEV/QA/STAGE/PROD with another single click. There's tons of stuff written on this subject, and you need to be more specific with your question to get a reasonable answer.
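
As a sketch of the "one click" idea -- the commands, paths, and host names below are placeholders, not a real pipeline -- build and install can each be a single parameterised script that is identical for every environment:

    # Placeholder one-click build and deploy sketch.
    # Hypothetical usage: python release.py 1.4.2 prod
    import subprocess
    import sys
    import tarfile

    ENVIRONMENTS = {"dev", "qa", "stage", "prod"}

    def build(version):
        subprocess.run(["python", "-m", "pytest"], check=True)   # fail the build if tests fail
        artifact = "mysite-%s.tar.gz" % version
        with tarfile.open(artifact, "w:gz") as tar:
            tar.add("src")                                       # package the same bits for every env
        return artifact

    def install(artifact, env):
        assert env in ENVIRONMENTS
        # the same artifact goes to every environment; only the config differs
        subprocess.run(["scp", artifact, "deploy@%s.example.com:/srv/releases/" % env], check=True)

    if __name__ == "__main__":
        install(build(sys.argv[1]), sys.argv[2])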

A: 

To test everything as well as possible on a separate dev site before going live, I use Selenium (a web page tester) to run through all the navigable parts of the site, fill dummy values into forms, check that those values appear in the right places as a result, etc.

It's powerful enough to check a lot of javascript or dynamic stuff too.

Then a quick run-through with Selenium again after upgrading the live site verifies that the update worked and that there are no missing links or database errors.

It's saved me a few times by catching subtle errors that I would have missed just manually flicking through.
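
For anyone who hasn't used it, a run-through like that boils down to a short script along these lines. This sketch assumes Selenium's Python WebDriver bindings and an invented contact form; the URL, field names, and expected output are made up:

    # Drive the staging (or freshly upgraded live) site: fill a form with
    # dummy values and check they appear on the resulting page.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()   # or webdriver.Chrome()
    try:
        driver.get("https://staging.example.com/contact")
        driver.find_element(By.NAME, "name").send_keys("Test User")
        driver.find_element(By.NAME, "message").send_keys("dummy message 123")
        driver.find_element(By.CSS_SELECTOR, "form button[type=submit]").click()

        # the site is expected to echo the submission back on the next page
        assert "dummy message 123" in driver.page_source
    finally:
        driver.quit()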

Also, if you put the live site behind some sort of "reverse proxy" or load balancer (if it's big), that makes it easy to switch back to the previous version if there are problems.

Colin Coghill
A: 

The only way to make it transparent to your users is to put the site behind a load-balanced proxy. You take one server down while you update another server. Then, when you're done updating, you put the one you updated online and take the other one down. That's how we do it.

If you have any sort of 'beta' build, don't roll it out on the live server. If you have a live, busy site, chances are people are going to pound on it and break something.

This is a typical high-availability setup; to maintain high availability you'll need three servers minimum: two live ones and one testing server, plus any other extra servers if you want a dedicated DB or something.

paan
+4  A: 

Lots of good advice already.

As people have mentioned, if you don't have a single shared point involved, it's simple to just phase in changes by upgrading one app server at a time. But that's rarely the case, so let's ignore that and focus on the difficult bits.

Usually there is a db in there which is common to everything else. So that means downtime for the whole system. How do you minimize that?

Automation. Script the entire deployment procedure. This (especially) includes any database schema changes. This (especially) includes any data migration you need between versions of the schema.
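
As a sketch of what "script the schema changes" can look like -- numbered SQL files applied exactly once, tracked in a version table. sqlite3 is used here only because it ships with Python; substitute your real database driver or a proper migration tool:

    # Apply any migrations/NNNN_*.sql files newer than the recorded
    # schema version, in order, and record each one as it is applied.
    import sqlite3
    from pathlib import Path

    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = db.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0

    for script in sorted(Path("migrations").glob("*.sql")):   # e.g. 0001_add_users.sql
        version = int(script.name.split("_")[0])
        if version <= current:
            continue                                          # already applied
        db.executescript(script.read_text())                  # schema change + data migration
        db.execute("INSERT INTO schema_version VALUES (?)", (version,))
        db.commit()
        print("applied", script.name)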

Quality control. Make sure there are tests. Automated acceptance tests (what the user sees and expects from a business logic / experience perspective). Consider having test accounts in the production system which you can script to test readonly activities. If you don't interact with other external systems, consider doing write activities too. You may need to filter out test account activity in certain parts of the system, especially if they deal with money and accounting. Bean counters get upset, for good reasons, when the beans don't match up.

Rehearse. Deploy in a staging environment that is as close to identical to production as possible. Do this with production data volumes, and production data. You need to get a feel for how long an alter table takes. And you need to check that an alter table works both structurally and with all foreign keys in the actual data.

If you have massive data volumes, schema changes will take time. Maybe more time than you can afford to be down. One solution is to use phased data migrations, so that the changed schema is populated with only "recent" or "current" data (let's say one to three months old) during the downtime, and the data for the remaining five years can trickle in after you are online again. To the end user things look ok, but some features can't be accessed for another couple of hours/days/whatever.
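
A sketch of that post-go-live trickle, again with invented table names and sqlite3 standing in for the real database; the point is just small batches with pauses so the live system stays responsive while the historical data catches up:

    # Backfill the remaining historical rows in small batches.
    import sqlite3
    import time

    db = sqlite3.connect("app.db")
    BATCH = 1000

    while True:
        rows = db.execute(
            "SELECT id, amount, created_at FROM orders_old "
            "WHERE migrated = 0 ORDER BY created_at DESC LIMIT ?", (BATCH,)
        ).fetchall()
        if not rows:
            break                                    # backlog is done
        db.executemany(
            "INSERT OR IGNORE INTO orders_new (id, amount, created_at) VALUES (?, ?, ?)",
            rows,
        )
        db.execute(
            "UPDATE orders_old SET migrated = 1 WHERE id IN (%s)" % ",".join("?" * len(rows)),
            [r[0] for r in rows],
        )
        db.commit()
        time.sleep(0.5)                              # stay gentle on the live DB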

jplindstrom