UPS Outside Your Head
Does lateral thinking mean you need to look outside your own head instead of just accepting the most obvious solution? If so, I might as well plead guilty in terms of managing the backup power supply for my servers.
Like a great many people I depend on APC UPSs to handle mains power fluctuations and interruptions for my servers. Since Windows NT, through Server 2000, 2003, and now 2008 R2 I've blithely installed the default power management utilities provided by APC. Everything was hunky dory until I went virtual and set up the servers using Hyper-V. That's when the problems started.
Mind you, I can't say I really noticed the problems at the start. OK, so the latest versions of the APC software don't seem to install on a machine configured to host Hyper-V VMs, but the earlier version did and I continued to use that. The only thing I noticed was the occasional message that the server had lost connection with the UPS, but then immediately restored it again.
I had the software set up to shut down the server gracefully, well before the battery ran out in the UPS, and reboot it when back to 60% charge. In the past on Server 2000 etc. this has worked fine. So I reckoned that, because the Hyper-V system manages graceful shutdown of the hosted VMs, it would also work fine on 2008 and 2008 R2, and initial tests proved this to be the case.
However, during a recent power outage (I was rewiring a ring main socket) I came back to find the UPS fully charged but the server stopped. Pressing the power button initiated a reboot sequence but the machine just shut down again. It looked very much like the recent episode when a motherboard failed in another machine, and here I was on a busy Saturday morning pondering another visit from the Dell man. Thankfully I took out a full onsite warranty this time!
However, after shutting down the UPS, then restarting it and the server, everything came back to life again. The event log showed a graceful shutdown and reboot, and all the VMs that are set up to auto-start were running fine. Interestingly, the one that doesn't auto-start showed up in Hyper-V Manager as "Suspended" rather than "Off". It didn't take long to figure that Hyper-V had suspended the VMs rather than closing down and then restarting them.
But why had the server not restarted automatically? In fact, as the power was off for only a few minutes, why did it shut down in the first place? The answer, as evident in the Event Log, was a "Communication Lost" event from the APC software; followed by "Runtime Limit Reached" and then "Shutting Down the System". If the software can't see the UPS it assumes it's broken and shuts down the server automatically, even though there was at least an hour left in the batteries.
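If you want to check an exported log for the same pattern, a rough sketch along these lines would do. The three message strings are the ones quoted above; the (timestamp, message) tuple layout, the function name, and the sample entries are made up for illustration and are not taken from my actual log or from any APC log format.

```python
from datetime import datetime

# The three messages quoted above, in the order they appeared.
SHUTDOWN_SEQUENCE = [
    "Communication Lost",
    "Runtime Limit Reached",
    "Shutting Down the System",
]


def find_comm_loss_shutdown(events):
    """Return the timestamps at which each message of the sequence was
    first matched in order, or None if the sequence never completes.

    `events` is an iterable of (timestamp, message) tuples sorted by time.
    Unrelated events in between are simply skipped.
    """
    matched = []
    for timestamp, message in events:
        if message == SHUTDOWN_SEQUENCE[len(matched)]:
            matched.append(timestamp)
            if len(matched) == len(SHUTDOWN_SEQUENCE):
                return matched
    return None


# Hypothetical sample data, invented purely for illustration.
sample_events = [
    (datetime(2010, 1, 2, 10, 0), "Communication Lost"),
    (datetime(2010, 1, 2, 10, 1), "Some unrelated event"),
    (datetime(2010, 1, 2, 10, 2), "Runtime Limit Reached"),
    (datetime(2010, 1, 2, 10, 3), "Shutting Down the System"),
]

print(find_comm_loss_shutdown(sample_events))
```

A match tells you only that the three events occurred in that order. It doesn't prove that the lost connection caused the shutdown, though in my case the timing left little doubt.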
According to the APC site, the free software doesn't support Hyper-V because it can't guarantee to safely shut down each VM individually. As many people regularly attempt to point out on the APC forum, surely it doesn't need to. Server 2008 and 2008 R2 can quite happily respond to a shutdown message and safely manage the VMs it hosts. The suggestion from aggravated forum posters is that it's just a cynical way for APC to sell the network version of the management software.
Oh well, I don't mind paying a bit for the real thing, but it seems that to make it work I also have to buy and install a special network management card, and install a ton more drivers and stuff. Do I really want to do that? So I look at the Open Source alternative, apcupsd, but it looks complicated enough to need more than what remains of the afternoon to sort out. I'll need a day to read and understand the manual.
But that's when the "outside your head" thing struck me. A quick Bing located a post by Ben Armstrong (the Virtual PC Guy) that says that the built-in power management stuff in Server 2008 R2 can manage your server and UPS automatically. In fact, as I discovered when installing the APC software, battery management is part of the O/S and all you install from APC is the service that manages the UPS and interacts with Windows. Without the APC software, Server 2008's default settings will monitor battery power and can initiate a server hibernate and shutdown when it's low, though you probably want to tweak the Low and Critical level settings in the Advanced Power Management dialog to something less optimistic than 10% and 5%.
Then I read somewhere else that installing the Hyper-V service changes the server's behavior by disabling hibernate mode, because hibernating a server that hosts VMs is not recommended. When I checked the advanced power configuration settings in my box the Critical Battery Action was still set to "Hibernate", but opening the drop-down list showed that the only options available now are "Do Nothing" and "Shut Down". Obviously installing Hyper-V does not change the current settings. I selected "Shut Down" and set the Critical Battery Level to 50% to make sure that the O/S has plenty of time to shut down all the VMs. I also set the Low Battery Level to 75% and the Low Battery Notification to "On" so that I can see when (and if) the server detects a power failure.
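To make the intent of those two thresholds concrete, here is a simplified model of the behavior I was expecting. It is not how Windows implements it. The function name, the returned strings, and the choice of "at or below" for the comparisons are all assumptions for illustration; only the 75% and 50% figures come from my settings.

```python
def battery_action(charge_pct, low_level=75, critical_level=50):
    """Simplified model of the Low / Critical battery settings.

    Returns "shut down" at or below the critical level, "notify" at or
    below the low level, and "none" otherwise. Whether Windows triggers
    exactly at the threshold or just below it is an assumption here.
    """
    if not 0 <= charge_pct <= 100:
        raise ValueError("charge_pct must be between 0 and 100")
    if charge_pct <= critical_level:
        return "shut down"
    if charge_pct <= low_level:
        return "notify"
    return "none"


for pct in (100, 75, 60, 50, 20):
    print(pct, battery_action(pct))
```

With the defaults mentioned earlier (10% and 5%), the same model would not react at all until the battery was almost flat, which is why I raised both levels.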
Since uninstalling the APC software and allowing Windows to manage its power requirements directly I've had no Event Log warnings and the power icon in the system tray seems to work, as a quick shutoff of the mains feed to the UPS demonstrated. Of course, where the APC software and the Open Source apcupsd service have an advantage is that they can restart the server when power is restored. And without the APC software I can't monitor the UPS, or configure the EEPROM settings inside it (although apcupsd provides a utility that can do this). So before I uninstalled the APC software I set up the UPS to do a shutdown only (not turn off) and allow 15 minutes for the server to shut down when the low battery warning occurs.
I also configured the UPS to turn on the power again when the charge reaches 60% after a power failure, and the server BIOS is configured to auto-start when power is restored. Therefore it should, in theory, all work by powering up the server again automatically. The real test was a few days later when the electrician arrived to rewire the kitchen as part of our ongoing modernization plan. Unfortunately, while it kind of worked, there are some issues.
The server did shut down, and restart again. But examining the Event Logs after the restart revealed that, despite the Power settings in Windows Server being set to notify when the battery charge drops to 75%, there was no matching Event Log message. Maybe the warning just pops up in the notification area of the screen. But the Event Log messages did indicate that the server correctly shut down, and restarted with no unexpected errors.
Things were different with the VMs, however. I had configured a combination of different settings in Hyper-V Manager to experiment with the behavior. One VM was set to "Turn Off" and restart if previously running, one was set to "Save" and restart if previously running, and one was set to "Shut Down" and restart if previously running. The fourth was set to "Save" and always restart. Hyper-V Manager revealed that they had all started automatically, so that's OK. The "Turn Off", "Save", and "Shut Down" actions when the host server shuts down all work as expected and allow for automatic restart if previously running.
The problem was that the Event Logs in all of the VMs indicated that they had all shut down unexpectedly. There was the System log message saying just this, and the Critical system error message to confirm it on every one. While the host server had shut down correctly, it seems that the VMs had not.
When you shut down the server manually this doesn't happen, so it must be that the shutdown initiated by the battery power management system does something different from the "Shut Down" command on the Start menu. I wondered if it was just that the UPS had switched off the power to the host server before it had a chance to shut down, turn off, or save the VMs, but the fact that the host server had shut down properly without error seems to indicate this isn't the case.
From the times recorded in the host server and VM Event Logs and by my NAS (which also logs power failure events), it seems that the shutdown occurred only 30 minutes after the power failure, whereas the UPS reckons it has more than 90 minutes of battery life. So it does look like the shutdown occurred when Windows power management detected only 50% battery life remaining.
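As a rough sanity check on those numbers, here is the arithmetic under a deliberately crude assumption: that the charge falls in a straight line from 100% to 0% over the full runtime. Real batteries don't discharge linearly and the UPS's runtime estimate depends on the load, so treat this as a ballpark only. The function names are mine, invented for the example.

```python
def minutes_until_threshold(full_runtime_min, threshold_pct):
    """Minutes on battery until charge falls to threshold_pct,
    assuming a linear discharge from 100% over full_runtime_min."""
    return full_runtime_min * (100 - threshold_pct) / 100


def implied_full_runtime(elapsed_min, threshold_pct):
    """Full runtime implied by reaching threshold_pct after elapsed_min,
    under the same linear-discharge assumption."""
    if threshold_pct >= 100:
        raise ValueError("threshold_pct must be below 100")
    return elapsed_min * 100 / (100 - threshold_pct)


print(minutes_until_threshold(90, 50))   # 45.0
print(implied_full_runtime(30, 50))      # 60.0
```

So with a 90 minute runtime, a linear model puts the 50% point at about 45 minutes, and a shutdown at 30 minutes would imply a real runtime nearer 60 minutes. That is close enough to fit the idea that the 50% Critical Battery Level triggered the shutdown, but it also suggests the UPS's estimate of "more than 90 minutes" may be on the generous side.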
Does the UPS send some signal to the Windows power management system that initiates a shutdown? Or perhaps Windows power management sends a signal to the UPS to hibernate until the power is restored? Or maybe it's just that there's some setting hidden away somewhere that I haven't found yet...