views: 386
answers: 7
Yesterday's Stack Overflow downtime got me thinking about this a bit...

I live in Australia (though this is probably true for most people in a non-US timezone) and am constantly greeted with "... is down for maintenance" right in the middle of my work day. In fact, in the last week we've had Google Wave, SO and Campfire all take turns going down. (Sad, Sad Panda :()

Being in Australia, the middle of the day on Monday (one of the busiest times of the week) is normally when US-based service operators decide to do maintenance, since it's Sunday night there. I realise that services like SO and Google Wave are free, so fair's fair, but especially when Campfire went down I thought, "Surely we pay the same as any other client for this application and can therefore expect the same level of service?"

While I've been developing web applications for a number of years, I've almost always worked on projects involving internal systems for a highly localised user base, so I've never had the "when's the best time to take the system down?" issue.

But I wonder: is there a way to perform graceful maintenance of a web application? (Let's assume, for simplicity, that it's already in production.) I'm sure there are SO members out there who tackle this issue often... How do you do it? Is it possible not to adversely affect overseas users of your service?

Also, if anyone has any insight into how big players such as Google or Twitter handle this kind of thing, I'd be really interested to hear about it. I know they DO have downtime, but it's not as much as you'd expect for the number of users they support and the features they release. So they must have some way of handling maintenance (at least the minor stuff)...

Edit: I've taken on board advice from a couple of members here and posted this question on Server Fault to see what the SysAdmin perspective on this issue is. You can find the question here:

http://serverfault.com/questions/116543/graceful-maintenance-of-web-applications

+4  A: 

One way would be to have extra web servers. Take one offline to update it, then put it back up, and do the same to the rest, one at a time. This only works for the web servers, not the database servers. With the database it gets a lot more complicated.
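That rolling approach can be sketched roughly like this. The `LoadBalancer` class and server names here are purely illustrative; a real setup would drive haproxy, nginx, or a cloud load balancer's API instead:

```python
# Sketch of a rolling restart: drain and update one web server at a time
# so the rest of the pool keeps serving traffic throughout.

class LoadBalancer:                     # hypothetical stand-in
    def __init__(self, servers):
        self.active = set(servers)

    def drain(self, server):
        self.active.discard(server)     # stop routing new requests to it

    def restore(self, server):
        self.active.add(server)         # put it back in rotation

def rolling_update(lb, servers, update):
    for server in servers:
        lb.drain(server)
        assert lb.active, "never drain the last server in the pool"
        update(server)                  # deploy new code while it's out
        lb.restore(server)

lb = LoadBalancer(["web1", "web2", "web3"])
updated = []
rolling_update(lb, ["web1", "web2", "web3"], updated.append)
print(updated)  # ['web1', 'web2', 'web3']
```

The key invariant is that at least one server stays in the pool at every step, so users never see an outage.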

Josh Curren
Thanks Josh, this is the kind of thing I was looking for. Obviously the applications I described are complex, database-driven applications... I wonder how you get around the DB issue
Ganesh Shankar
Plenty of possibilities, depending on your application and the level of database access desired during maintenance. Maintaining a temporary read-only snapshot is one; logging database queries and replaying them (or a modified version of them) into the "real" database afterwards is another; a third would be using versioned stored procedures and/or views for database access, so that two schemas could appear to coexist in the same database.
Nicholas Riley
DB migration: how about migrating one instance of the DB, then the other, with migration scripts to transfer the transactions between them that replication cannot? Once they are back in sync, replication can come back on.
Tim Williscroft
+2  A: 

It's been my experience that serious maintenance requires that users not be making changes to the database, so the site has to go down at some time. The logical time to do this would be when usage is least heavy, but that doesn't solve the problem.

The last time we had to do some major maintenance, I scheduled it during a time when everyone in our office was watching the end of the Olympic torch relay. We practised on a test server beforehand and figured out how to do the upgrade as fast as possible by parallelizing the tasks. The whole thing was down for less than an hour. Other good times would be statutory holidays in the United States that are not celebrated elsewhere, but those may not occur often enough.

Probably the best solution would be to make a read-only copy of the database and redirect traffic to that database server during the upgrade.

Scott
Global 'stop-work' events are good. These aren't always predictable, though (e.g. there was a massive lull on local servers when MJ was in the news, as everyone was on news sites instead. That couldn't have been accounted for in advance, though.)
glasnt
+4  A: 

I use a variety of US-based services that tend to do the same thing - shut down for maintenance in the middle of my working day. It's annoying, but it's tolerable when they:

  1. Communicate the upcoming maintenance activities ahead of time. 24 hours' warning is good. Some warn 48 hours in advance.
  2. Make the downtime window as short as possible. One service I use has shortened their offline time from two hours to half an hour in recent months, and they're often back up in 15 minutes or so.
  3. If it's unplanned maintenance, they show an approximate estimate for when things will be back up so I don't sit there hitting the "Refresh" button again and again.
  4. If it's a planned upgrade, announce what the new features are. It's great to know that banging my head on the keyboard while the service was down was not in vain, and that I ('the user') have been rewarded with some new functionality :).

I myself try to keep #2 as short as humanly possible; that is my priority. I communicate planned downtime in advance and apologise afterwards if the site had to go down unexpectedly.

I also keep the site open, but read-only, during maintenance activities. At the very least users can look things up if they just need some information, and it keeps search engines happy, too.
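A read-only switch can be as simple as rejecting non-safe HTTP methods while maintenance is on. This sketch assumes a made-up request handler, not any particular framework:

```python
# Minimal "read-only during maintenance" gate: reads pass through,
# writes get a 503 with an explanatory message.

maintenance_mode = True
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}   # methods that don't modify data

def handle(method, path):
    if maintenance_mode and method not in SAFE_METHODS:
        return 503, "Read-only: maintenance in progress, writes disabled"
    return 200, f"{method} {path} ok"

print(handle("GET", "/questions/42"))   # reads still work
print(handle("POST", "/questions"))     # writes are refused with a 503
```

Returning 503 (rather than 200 with an error page) also tells search engine crawlers the condition is temporary.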

Rowlf
+1  A: 

One solution would be to not use a relational database. You see, the majority of downtimes are because of schema updates, and those can take some time.

Don't get me wrong, the math theory behind the relational model is amazing, but all those advantages are not free: your application is always married to a particular schema, and an upgrade to the application often means an upgrade to the database schema as well.

If you use something like BerkeleyDB, CouchDB, SimpleDB or another document or key-value database, schema updates largely disappear (because there is no enforced schema anyway), and this simply forces you to write code that is more forgiving of the data coming from the database. This is one reason Google has so little downtime: Bigtable is not relational, and their applications are forced to be written in such a way that they must pretty much expect anything coming from the database.

Should everyone switch to non-relational databases to eliminate downtimes? Absolutely not; it's not worth it unless you are Google or Amazon.

lubos hasko
Nonsense. It is possible to have a large database-driven system with no downtime due to schema changes; NASDAQ would be an obvious example. Replication, combined with careful control of which database the sites are pointed at, would be one solution. The vast majority of schema changes take fractions of a second even on large database systems, and for those that do not, there are solutions to greatly accelerate the change.
Thomas
@thomas, I don't see how NASDAQ is an "obvious" example; it's a read-only website. The largest read-write website based on a relational database is Wikipedia, and do you know what happens during a schema upgrade? No edits are allowed. Pretty much all websites (big and small) based on relational databases have at least read-only downtimes. Is it possible to have no downtime with a relational database during upgrades? Sure, but usually at a very high cost... so high that nobody is doing it, not even Wikipedia.
lubos hasko
How do you figure that NASDAQ is read-only? I'm not talking (just) about NASDAQ's website; I'm talking about the NASDAQ exchange itself. Wikipedia is non-profit. Perhaps they are unable to afford a more expensive database setup, perhaps they do not care about downtime, or perhaps things other than database changes cause Wikipedia to go down. I stand by what I said: it is possible to build a database-driven system with no downtime due to schema changes, but it takes careful planning during the design phase of the system.
Thomas
@thomas, the NASDAQ stock exchange is "open" only a few hours a day; there is no trading during the night, on weekends, or on public holidays. That is the perfect time for them to perform upgrades of their systems. Even my bank, my insurance company, and the government agencies I deal with take their systems down while performing upgrades. Just give me one example of a website running a relational database that doesn't limit functionality at all during upgrades... I'm not asking whether it's possible (everything is possible); it's just that nobody is doing it, because it's not worth the extra cost and complexity.
lubos hasko
+2  A: 

If you are concerned about downtime for international users (whether from the US or from Australia), one solution is separate databases for each of the major timezones you plan on supporting (say, the database's timezone ±3-5 hours). I would definitely suggest putting a service layer in front of your databases to abstract the schema and structure of the database from its consumers.
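As a rough illustration of the idea, users could be routed to a regional database whose maintenance window follows that region's quiet hours. The region names, offset boundaries, and function below are entirely hypothetical:

```python
# Map a user's UTC offset (in whole hours) to a regional database shard,
# so each shard can be maintained during its own region's night.

REGIONS = {
    "APAC": range(7, 13),    # e.g. Australia, East Asia (UTC+7..+12)
    "EMEA": range(-1, 4),    # e.g. Europe, Africa (UTC-1..+3)
    "AMER": range(-9, -3),   # e.g. the Americas (UTC-9..-4)
}

def region_for_offset(utc_offset_hours):
    for name, offsets in REGIONS.items():
        if utc_offset_hours in offsets:
            return name
    return "AMER"            # arbitrary default shard

print(region_for_offset(10))   # Sydney  -> APAC
print(region_for_offset(0))    # London  -> EMEA
print(region_for_offset(-5))   # New York -> AMER
```

The hard part this sketch glosses over is data that crosses regions, which is exactly where the suggested service layer earns its keep.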

Thomas
I was wondering if something like this was possible... definitely an interesting concept
Ganesh Shankar
+1  A: 

The best way to choose a downtime window is to look at your traffic stats and see when the fewest users are active on your application.

This will vary with the site - one Australian site might have mostly Australian users, another might have mostly US or European users.

Some sites might be used mostly during their users' office hours; others have their peak in their users' evenings.

If you are lucky there will be an hour or two per day when not too many users are inconvenienced. Perhaps you need to wait until some quiet time during the weekend.

Sometimes it is possible to give the users prior warning about the downtime, so they can work around it.

Best of all is to craft the changes to the app so as to minimise downtime. For example, the database schema changes required for the new version can sometimes be made while the old version continues to run; by then switching over your load balancer, you can avoid downtime altogether in those cases.
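A minimal sketch of such an additive ("expand") change, using Python's bundled sqlite3 purely as a stand-in for a production database: old code that names its columns explicitly keeps working after the new column appears.

```python
# Additive schema change made while the "old version" keeps writing.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def old_version_insert(name):
    # Old code names its columns explicitly, so it is unaffected
    # by columns added later.
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))

old_version_insert("alice")

# Migration for the new version: additive and nullable, so the old
# code path needs no change and nothing is locked out for long.
db.execute("ALTER TABLE users ADD COLUMN email TEXT")

old_version_insert("bob")   # old code still works mid-migration

rows = db.execute("SELECT name, email FROM users ORDER BY id").fetchall()
print(rows)  # [('alice', None), ('bob', None)]
```

The destructive half of the change (dropping or renaming the old columns) is what gets deferred to a later release, once no running code depends on them.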

gnibbler
+1  A: 

The database side isn't that complicated; your app just has to be prepared for it, to some degree. So avoiding terrible things like SELECT * and INSERTs that rely on column order is a must, to name a few.

Basically, database changes that are additive and don't destroy data are fine to do whenever. You can have a slave, take it off replication, make all the changes, then start it replicating again. Once it's caught up, do a really fast failover from the master to the slave. The slave is now the master, and the old master can be upgraded in turn.
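As a toy model of that sequence (everything here is an in-memory stand-in; a real setup would use MySQL or PostgreSQL replication commands instead):

```python
# Stop replication, migrate the slave, replay the backlog, promote it.

master = {"rows": ["a", "b"], "schema": 1}
slave = {"rows": ["a", "b"], "schema": 1}
backlog = []          # writes made on the master while replication is off

def write(row):
    master["rows"].append(row)
    backlog.append(row)

# 1. Take the slave off replication and apply the schema change there.
slave["schema"] = 2

# 2. Writes keep landing on the master in the meantime.
write("c")

# 3. Let the slave replay the backlog, then fail over to it.
slave["rows"].extend(backlog)
master, slave = slave, master   # promote the upgraded slave

print(master["schema"], master["rows"])  # 2 ['a', 'b', 'c']
```

The only user-visible window is step 3's failover, which is why the answer above focuses on making it as fast as possible.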

For contractive changes that remove columns, or similarly destructive things, I would typically apply them with a one-release lag; this makes failover easier. You still want to take a backup of the DB pre-upgrade, just in case.

For things like column splits and merges, you can have triggers that sync the data in the interim, so you have a fallback strategy and the data is captured in both places.
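For example, a column split kept in sync by a trigger might look like this, again using sqlite3 purely for brevity (production engines have richer trigger syntax, but the shape is the same):

```python
# A trigger backfills the new split columns while old code still
# writes only the original full_name column.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (full_name TEXT, first TEXT, last TEXT)")
db.execute("""
    CREATE TRIGGER split_name AFTER INSERT ON people
    WHEN NEW.first IS NULL
    BEGIN
        UPDATE people
        SET first = substr(NEW.full_name, 1, instr(NEW.full_name, ' ') - 1),
            last  = substr(NEW.full_name, instr(NEW.full_name, ' ') + 1)
        WHERE rowid = NEW.rowid;
    END
""")

# Old code path: writes only the legacy column.
db.execute("INSERT INTO people (full_name) VALUES ('Ada Lovelace')")
print(db.execute("SELECT first, last FROM people").fetchone())
# ('Ada', 'Lovelace')
```

Once all code paths write the new columns, the trigger and the legacy column can both be dropped in a later release.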

Also, for the brief window of downtime during the failover itself, you could put a queue in front of the database to hold queries until the failover completes; a connection proxy, perhaps?

Saem