The Dark Side of Virtualization

Over the years I've been engaged in several AD disaster recovery scenarios where things ultimately boiled down to the same root cause: a single point of failure had been introduced into the IT environment. When that single point of failure failed catastrophically, it took the entire environment down with it.

With good backups that can actually be restored, this may not be an End of Days scenario - but as the 3rd principle of Murphy's Law dictates, chances are the available backups are unusable, unrestorable or non-existent when you actually need them (in the same sense that they will always work when you don't need them).

Now... in most virtualization scenarios the admin responsible for the virtual server is completely removed from the storage layer. This has been a conscious push by most of the virtualization providers: virtualization is meant to simplify IT environments by making the storage medium unimportant.

In itself that may be a valid selling point - but even if the admin is removed from the storage medium, the virtual hard disk of the virtual server still needs to be physically stored somewhere. Even the Cloud has mechanical moving parts...
For large hosting providers this "somewhere" is typically a centralized SAN with redundant gizmos, thingamagicks and bells and whistles.

SANs are tried and tested storage devices that had been around for years before anyone thought of using them to store virtual machine images. But today's storage capacity far outweighs today's backup and restore capability, and a decent SAN with full redundancy is relatively expensive. So it becomes very tempting to build SANs large enough to hold every virtual machine you are hosting, to save money and increase the ROI from that SAN.

Consider the following hypothetical but all too likely disaster recovery scenario in today's SAN-based virtualization environments:

- You store 2000 virtual machines on the same SAN.
- The SAN fails catastrophically.

Even at best - with a perfect backup strategy in place, a bulletproof Disaster Recovery plan and a small army of trained ninjas that spring from the shadows and start Disaster Recovery procedures the very instant the failure has been detected and quantified - you're still looking at a lengthy recovery process.

If you're missing any one of these... you're looking at an even longer process.
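To put a rough number on "lengthy", here is a back-of-the-envelope sketch. Only the 2000 virtual machines come from the scenario above; the average image size (50 GB) and the sustained restore throughput (500 MB/s) are figures I've made up purely for illustration, so plug in your own. It also assumes the restore runs flat out with no failures, retries or post-restore verification - in other words, the ninja scenario.

```python
def restore_hours(vm_count, avg_vm_size_gb, restore_throughput_mb_s):
    """Best-case hours needed to stream every VM image back from backup.

    Assumes a single sustained throughput figure and ignores retries,
    verification and the time needed to get replacement storage online.
    """
    total_mb = vm_count * avg_vm_size_gb * 1024  # GB -> MB
    seconds = total_mb / restore_throughput_mb_s
    return seconds / 3600


# 2000 VMs is from the scenario; 50 GB and 500 MB/s are illustrative guesses.
hours = restore_hours(vm_count=2000, avg_vm_size_gb=50, restore_throughput_mb_s=500)
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")  # 56.9 hours (~2.4 days)
```

Even with those fairly generous assumptions the best case is more than two days of pure data transfer, before anyone has logged on to check that the restored machines actually boot.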

Moral: Every Cloud has a Silver lining - even Private Clouds :)

Comments

  • Anonymous
    January 01, 2003
    Virtualization is here to stay and is generally a good thing, but it does present a new set of challenges. The danger is that whereas physical servers came with definite physical space limitations, the limits for virtual machines are more blurred and subject to interpretation. Being able to add a "limitless" number of virtual servers into the fray with a few mouse clicks introduces the danger of exceeding the capacity of the host, the SAN or your recovery capability without realizing it. With physical servers you'd be running out of rack space at that point :)

  • Anonymous
    January 01, 2003
    That is one of the reasons why I only like the idea of virtualization in the largest and most advanced environments: the things of which you speak are only too likely to be just waiting to happen.

  • Anonymous
    January 14, 2012
    Your point is valid and noted, and is a warning for VM admins as well as storage admins. However, although I'm sure it has been done, as your article indicates, I never see one single SAN, or more specifically one single array, in which all VMs are placed. Without exception, the places I've done work for have multiple arrays (often even from different SAN vendors) with mission critical VMs placed strategically on different LUNs. It simply would not be possible, outside of a complete site failure, for the "SAN" as an entity to fail in such a way that the VMs would not be accessible, or at least able to fail over to other arrays. A number of places have even had 'stretch clusters', where the SAN spans two or more sites and the VM platform's cluster is placed across those sites, which mitigates a site failure.