How Hyper-V responds to disk failure
I have talked to a handful of people over the years who have had to deal with Hyper-V servers that have suffered disk failure. This can be quite problematic to diagnose and troubleshoot – so I wanted to spend some time digging into what happens and what you should look for.
To simulate this scenario, I created a virtual machine with two virtual hard disks. The first virtual hard disk was stored locally, while the second was stored on a USB flash drive. I then put the operating system on the first virtual hard disk and an application on the second virtual hard disk (yes – the application was Plants vs. Zombies). Once it was up and running – I pulled the USB flash drive out of the server – and recorded what happened.
The first thing to note is that – because it was not the operating system disk that disappeared – the virtual machine continued to run, and so did the guest operating system (Windows XP in this case). Surprisingly, the application also ran… for a while. After about 30 seconds the application silently exited and I started to see Windows alerts about write failures:
If I tried to start the application again – I would get a cryptic error message:
Exploring the guest operating system shows some significant confusion at this point. Disk Management believes that the disk (E:) has disappeared:
But Device Manager and Windows Explorer believe that the disk is still there:
I can even explore the missing E: disk:
But trying to open a file fails:
At this stage, the virtual machine will continue to run indefinitely in this mode. However, if the virtual machine tries to reboot it will fail:
Given all of this information – the real trap that I have seen people fall into is that if the missing physical disk is reconnected while the virtual machine is still running – the virtual hard disk will *not* be reconnected to the virtual machine. Where I have seen this cause problems is in a scenario like this:
You have a bunch of virtual machines whose virtual hard disks are stored on remote, network-based storage (e.g. iSCSI). Late one night there is a large glitch in the network (e.g. a switch reboots) and the remote disks disappear for a minute or two and then come back. This causes all of the virtual machines to lose their virtual hard disks.
When the server administrator comes in in the morning – he / she finds that a number of the virtual machines are in a strange state. They seem to be having problems accessing their disks – but the disks are not gone completely. Furthermore – if he / she looks at the parent partition, the disks are there and everything looks fine.
Nine times out of ten the server administrator decides to reboot the affected virtual machines – which does reconnect the missing virtual hard disks – and everything goes back to normal. However, the server administrator has no idea what actually happened.
So how do you figure out if this is what happened in your environment?
The first thing to do is to look at the Windows System event log. If there is an underlying hardware problem that is causing disks to come and go – there will be event log entries here to indicate that this is happening. However, there will be nothing there in the case of a disk removal (e.g. a USB disk getting pulled, an iSCSI disk going away for a moment). For these scenarios you can look at the event log entries under Applications and Services Logs –> Microsoft –> Windows –> VHDMP –> Operational. If a virtual hard disk is ever removed from a virtual machine due to an underlying storage error – you will see an error event log entry here:
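If you would rather check for these entries from a script than click through Event Viewer, here is a minimal sketch (in Python, purely as an illustration) that shells out to the built-in wevtutil tool on the parent partition and pulls back the most recent error entries. The channel name used below is an assumption on my part, inferred from the Event Viewer path above; run "wevtutil el" to confirm the exact name on your own host before relying on it:

```python
# Minimal sketch: query the VHDMP Operational log on the Hyper-V parent partition
# for error entries that indicate a virtual hard disk was dropped due to an
# underlying storage error. Assumes Windows, administrative rights, and that the
# channel is named "Microsoft-Windows-VHDMP-Operational" (an assumption inferred
# from the Event Viewer path; confirm with "wevtutil el" on your own host).
import subprocess

CHANNEL = "Microsoft-Windows-VHDMP-Operational"  # assumed channel name

def recent_vhdmp_errors(max_events: int = 20) -> str:
    """Return the newest error-level (Level=2) events from the VHDMP log as text."""
    result = subprocess.run(
        [
            "wevtutil", "qe", CHANNEL,
            f"/c:{max_events}",       # limit the number of events returned
            "/rd:true",               # newest events first
            "/f:text",                # human-readable text instead of XML
            "/q:*[System[Level=2]]",  # XPath filter: error-level entries only
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    output = recent_vhdmp_errors()
    print(output if output.strip() else "No error entries found in the VHDMP log.")
```

If the query comes back with errors time-stamped around when the virtual machines started misbehaving, that lines up with the storage glitch scenario described above.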
Of course – once you have confirmed that this is what is causing the problem – the next task is to go and fix the hardware.
Cheers,
Ben
Comments
Anonymous
October 07, 2010
AFAIK Windows caches the listings of recently used directories... I've seen network drives or "asleep" drives respond quickly to a root directory listing access, but then on navigating elsewhere I have to wait for the drive to spin up. So this might be what you're seeing for the directory listing? Though I'm thinking it might be Hyper-V doing some caching itself... except that AFAIK virtual machine host software generally doesn't care about file system structures on the disk (except when you compact disks, where you need to know about such structures to locate unused space to zero it out).
Anonymous
October 07, 2010
Dan Bugglin - Hyper-V does not do any extra caching on the disk - but you are correct that Windows caches directory information. Cheers, Ben
Anonymous
October 07, 2010
The comment has been removed
Anonymous
October 08, 2010
Bernhard - Correct, there should be no extra risk of data loss. Cheers, Ben
Anonymous
October 12, 2010
Hi Ben, I have a question about CSV that I think would be a great topic to go into in depth. By default all of the CSVs are mounted under C:\ClusteredStorage. Is there any way to move them to D:\ClusteredStorage? I hope you can help me with this. Regards
Anonymous
October 13, 2010
RomeroEJ - As far as I know, CSV will always use the system drive - and there is no way to change that. Cheers, Ben
Anonymous
January 09, 2011
There are more problems than that, and I still couldn't find a way to resolve this one. For example, this scenario: you have a server using two virtual disks that are stored on two different physical disks, and you mirror them with RAID 1 using Windows software. If the disk that fails is the one used for the mirror, there's no problem, as Windows will continue working and just show the error in the disk administration console. BUT if the disk that fails is the one used to run the system, and for example it reboots, Hyper-V will hang your virtual machine with the message you showed before. That doesn't happen in a real (physical) scenario. In a real scenario, after the machine boots again, if the secondary plex is the default boot disk (as in my default RAID configuration), that Windows Server will boot again with no problems other than the missing disk. Cheers, Akuma.