In my years of working in multiple teams, I've met several infrastructure managers who instituted a policy of weekly server reboots. As a developer, I was always against the policy - it seems that this is a hack to work around software bugs and hardware instabilities, instead of correcting them.

What are people's opinions of the policy? What are its positive and negative points?

A: 

Our servers are all Linux servers at work, and we don't ever reboot and haven't had any problems. I agree that it's a hack at best, and I also think it probably has something to do with the first response people used to always give when supporting Windows issues: "Have you rebooted your computer?"

Now as to why it might be beneficial, you may have applications that get into a weird state or that have memory leaks that a restart would resolve.

A big negative to me is that you've got to schedule weekly downtime for the servers. For some that's not an issue, and for others that's a huge issue.

commondream
A: 

Obviously if the source of a problem cannot be fixed in a timely fashion, it has to be worked around. Scheduling a reboot to fix it is an easy way out to save the business if that works.

Sure, it mentally hurts and shouldn't be needed and it would be best to work against such a solution, especially if one's in control of the problematic software or in a position to bitch-slap the producers for a fix or simply replace it. But if not..?

I remember doing it for the servers in a Citrix farm. In the end they were rebooted every night by a half-complicated script that waited for users to log off, locked logins to specific servers and then rebooted the free ones. The reason was an old 16-bit 4GL client application that we simply couldn't get rid of, which tended to severely degrade overall user responsiveness after a few days of uptime.

I agree though that mostly it seems to be based on not being smart enough to figure out the cause and fix it - not everyone is as well-versed in maintenance or motivated as we'd like.
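The lock, wait and reboot sequence described above can be sketched roughly as follows. This is only an illustration of the idea, not the original script: the `Server` record, the `drain_and_reboot` function and the `min_accepting` parameter are all hypothetical names, and the actual reboot is replaced by setting a flag.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    sessions: int                 # users currently logged on
    logins_enabled: bool = True   # False means new logins are locked out
    rebooted: bool = False        # already handled tonight


def drain_and_reboot(servers, min_accepting=1):
    """Run one pass over the farm and return the names rebooted in this pass.

    A server is locked against new logins only while more than
    `min_accepting` servers are still accepting users, and it is
    rebooted only once it is locked and has no sessions left.
    """
    rebooted = []
    for server in servers:
        if server.rebooted:
            continue
        accepting = sum(1 for s in servers if s.logins_enabled)
        if server.logins_enabled and accepting > min_accepting:
            server.logins_enabled = False   # lock new logins
        if not server.logins_enabled and server.sessions == 0:
            server.rebooted = True          # stand-in for the real reboot
            server.logins_enabled = True    # back into rotation
            rebooted.append(server.name)
    return rebooted


# Hypothetical farm: "b" still has users, so it is locked but left running.
farm = [Server("a", 0), Server("b", 2), Server("c", 0)]
first_pass = drain_and_reboot(farm)
```

Calling the function again later in the night, once the remaining users have logged off, picks up the servers that were locked but still busy. Note that with `min_accepting=1` a single-server farm is never locked, which is the intended safeguard in this sketch.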

Oskar Duveborn
+3  A: 

This is a foolish policy.

Here's why:

  • If you need to reboot a server weekly (and somehow it adds to your infrastructure's stability), you are covering up the real problem with a server or its software. A memory leak? A bad driver? The solution to these problems is to fix them, not cover them up with a lazy policy.

  • Servers often get rebooted for updates, at least in the Windows world. Rebooting for critical kernel updates happens anyway.

  • Database servers cache a lot of information in RAM. When you reboot your server, this cache gets empty and very cold. Assuming you have a typical usage pattern, a cold, empty cache will result in slow performance for users when they attempt their queries after a reboot. It may also increase the time needed to perform some types of maintenance like backups because the disk may need to be accessed more.

  • Your servers go down! Your maintenance windows for backups and other things get shortened because your server is off for some nonzero period of time. You also may end up having to tell your users that you will have downtime, depending on your systems' architecture.

  • Assuming you have some sort of notification system for alerting, you will have to configure it to ignore your downtime window. This can mask problems that happen around the time your server reboots, and adds to the amount of configuration you will need to do on your servers.

That being said, reboots sometimes are beneficial as a last resort on resources that you don't necessarily have full control over (old vendor-written software, "black box" devices where explicitly prescribed by the vendor, etc...). But this should be handled on a case by case basis, and not with a naive blanket policy.

Dave Markle
A: 

It is a hack really but it might be the most efficient hack. It is an 80:20 type problem where you can solve 80% of the problem with 20% of the effort. If you can survive the downtime or the downtime costs you less than actually fixing the root cause then this is a good solution. I personally don't like it but that is only because it isn't a clean solution.

stimms
+1  A: 

Answering my own question: One of the benefits that I see from the policy is when it is applied to a server cluster, and the processes are failed over from one node to another. That way all nodes are constantly tested for the correct software install.

Timur Fanshteyn
A: 

Another possibility to consider is that in some environments, such as retail stores that are open 24 hours a day, there is still a "store close" event so that servers can be updated, backed up, etc.

Even though the servers need to run "24x7", they're really offline for at least a few minutes every day.

That effectively makes a server reboot every day, even though the store is still operating when it happens.

warren
+4  A: 

If you reboot your servers occasionally, you can be sure they will come back up. Though weekly sounds like a serious overkill, I have seen this problem on Linux machines with long uptimes.

Someone didn't bother to set up a critical service to start automatically on boot. Or the order of services coming up is wrong. Or someone upgraded libraries, added/removed software, etc. and the executable no longer works (it was started up with the old libraries, and continued using them; now it gets a dynamic linker error). Or it turns out service A depends on service B and service B depends on service A (oops).

At some point, when you least want to, you will take a reboot. The colo will drop the power on you; the server's power supplies will fail; someone will pull the cord/hit the reset button on the wrong server; etc. Now, when you can least afford downtime, your bloody server won't come back up.

Just like software, system configurations need testing. How often you need to do this testing depends on how your boxes are administered.
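One of the failure modes above, service A depending on service B while B depends on A, can be checked for without a reboot. A minimal sketch, assuming the dependencies have been written down by hand as a mapping from each service to the services it needs; the service names and the `boot_order` helper are made up for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def boot_order(depends_on):
    """Return a start order in which every service follows its dependencies.

    `depends_on` maps a service name to the set of services it needs.
    Raises ValueError describing the loop if the dependencies are circular.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # The second element of args is the list of nodes forming the cycle.
        cycle = " -> ".join(exc.args[1])
        raise ValueError(f"circular dependency: {cycle}") from exc


# Hypothetical service map, not taken from any real machine.
services = {
    "app": {"db", "network"},
    "db": {"network"},
    "network": set(),
}
order = boot_order(services)
```

This only validates the order on paper. It does not catch a service that was never enabled at boot or a binary broken by a library upgrade, which is why an actual test reboot still has value.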

derobert
A: 

Apologies for dusting off an old thread.

I think everyone's missing the point, especially the die-hard 'reboot? I'd rather sell my commodore!' Nix admins.

The point is that a weekly window should be SCHEDULED. Doesn't mean it has to be used, in fact the preference is that it isn't used as it's inevitably at some forsaken hour of the morning.

But if it's there, you can use it.

Personally, I think a quarterly reboot is a very good idea - it can give you a heads up on problems (hardware and software), and as the most forward thinking other poster pointed out, makes you aware of changes that prevent smooth startup that only become apparent after a reboot. Rather than having the situation arise after a 4hr power cut when taking another 2hrs to bring your box up becomes really quite embarrassing....

There are other upsides..

  • It gets the management used to reboots, and you have their confidence when you actually do need a reboot (e.g. physically moving it). If you never reboot a box, your manager's gonna be pretty darned nervous when you say it needs rebooting after 4yrs and no downtime.

  • You yourself get used to reboots, and know what can\does go wrong when it's offline.

  • You KNOW how long reboots take, so when it's coming back up and takes 10mins longer than usual, you're straight into the logs.

  • If you get knocked down by a bus tomorrow, there's CURRENT (not 4yr old) documentation on what happens when a reboot occurs (assuming you're a nice admin and write things down)

  • A 30-minute reboot per quarter fits well within 99.9% uptime SLAs.

  • Finally it clears out the proverbial cobwebs.
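The SLA point in the list above is easy to check with arithmetic. A small sketch, assuming a quarter of 91 days and counting every reboot minute as downtime; both function names are invented for this example. At 99.9% the budget for such a quarter comes to roughly 131 minutes.

```python
def downtime_budget_minutes(sla_percent, period_days):
    """Minutes of downtime allowed by an uptime SLA over the given period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def fits_sla(reboot_minutes, reboots_per_period, sla_percent, period_days):
    """True if the planned reboots alone stay within the downtime budget."""
    planned = reboot_minutes * reboots_per_period
    return planned <= downtime_budget_minutes(sla_percent, period_days)


QUARTER_DAYS = 91  # assumption: 13 weeks

budget = downtime_budget_minutes(99.9, QUARTER_DAYS)   # roughly 131 minutes
quarterly_ok = fits_sla(30, 1, 99.9, QUARTER_DAYS)     # one 30-minute reboot
weekly_ok = fits_sla(30, 13, 99.9, QUARTER_DAYS)       # thirteen of them
```

Under the same assumptions, one 30-minute reboot per quarter fits comfortably, while a 30-minute reboot every week (390 minutes per quarter) does not. A weekly reboot would have to stay at about 10 minutes to remain inside the same budget, before counting any unplanned outages.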

To answer some points AGAINST regular rebooting..

  • The one about covering up a bad driver\memory leak etc is hilarious. How do you know it's a memory leak\bad driver unless you reboot the server? Not only that but what if you don't manage to fix it in your planned downtime? If you have a weekly scheduled window it's no problem! You just try again next week....

  • Notification system - if you have a planned window, you can set a planned exception. If your software\script doesn't do this, then I suggest modern software\better script writing.

  • As for the planned exception window hiding problems that 'happen to occur during the planned exception window' that's just laughable. Your other server stats will show this issue up very quickly if you review them at all.
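A planned exception of the kind described above could look roughly like this. It is a sketch only: the window (Sunday 03:00 for 60 minutes), the `in_window` and `route_alert` names and the "page" versus "log-only" outcomes are assumptions, not the behaviour of any particular monitoring product. Alerts raised inside the window are logged rather than dropped, so they can still be reviewed afterwards.

```python
from datetime import datetime, time, timedelta


def in_window(now, weekday, start, duration):
    """True if `now` falls inside the weekly maintenance window.

    `weekday` follows datetime.weekday(): Monday is 0, Sunday is 6.
    Works for windows that run past midnight.
    """
    days_back = (now.weekday() - weekday) % 7
    window_start = datetime.combine(now.date() - timedelta(days=days_back), start)
    if window_start > now:
        # Same weekday but earlier than the start: use last week's window.
        window_start -= timedelta(days=7)
    return window_start <= now < window_start + duration


def route_alert(raised_at, weekday, start, duration):
    """Page a human outside the window; inside it, only log the alert."""
    if in_window(raised_at, weekday, start, duration):
        return "log-only"
    return "page"


# Assumed window: Sundays at 03:00 for one hour.
WINDOW = dict(weekday=6, start=time(3, 0), duration=timedelta(minutes=60))
decision = route_alert(datetime(2024, 1, 7, 3, 30), **WINDOW)
```

Keeping the suppressed alerts in a log is one way to answer the objection that a window hides problems: nothing is lost, it just does not wake anyone up.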

Of course a blanket policy is not recommended, and you should have criteria for exceptions (e.g. disk space over a certain size etc)

Having said that, the bottom line is that even if your server shouldn't need to be rebooted, it's incredibly naive to think that you shouldn't reboot it....

Edit:

I'm not sure I made this clear enough, but rebooting should NOT be used for plastering over a problem. The window should be weekly so that you have repeated attempts at RESOLVING the issue, not 'living with it'.

Rebooting as a method of dealing with a problem on a server is poor sysadmin practice. Nothing is learnt and it wastes people's valuable time and (rightly) lowers the management's opinion of you.

My point is

  • It is difficult to ensure you resolve a problem without an accepted, scheduled, weekly maintenance window in place.
  • With a weekly window you have an ongoing opportunity to sort things out properly, and avoid the situation where you have half-a-dozen jerry-rigged workarounds on as many different servers.
HD