Disaster Recovery vs. Fault Tolerance

Microsoft offers (for Premier support customers only) a Microsoft Office SharePoint Server Risk Assessment product… basically a way to analyze your SharePoint farm (including WSS, product naming aside) for 450+ known potential issues, paired with real, road-tested solutions to those problems. A portion of this is some investigation (we call it a survey) to find out all the things that may not sit on your computer… business process, operational type stuff… and the topic of disaster recovery is addressed… and the conversation frequently goes something like this:

| Me | Customer |
| --- | --- |
| “So… tell me a little about your disaster recovery strategy…” | “Oh, we’re fine… we have clustered SQL.” |
| “That’s a great fault tolerance tool for SQL, but what about SharePoint?” | “We have 2 SharePoint servers and they’re load balanced.” |
| “Good! What are the server names?” | “’ASDF123MOSSWFE’ and ‘ASDF123MOSSAPP’” |
| “Hmm… and what is the name of your portal? …the URL?” | “It’s ‘https://asd123mosswfe’” |
| “So how is the load balancer ever going to use the APP server?” | “That’s the network team’s concern… but we know we’re fine.” |
| “Hmm… in any case, this is all great for fault tolerance… but what about your disaster recovery strategy?” | “We take backups to tape.” |
| “Where are the tapes stored?” | “In a box next to the tape drive.” |
| “What if there is a flood?” | “Oh, that won’t happen.” |
| “You hope…” | |

I am being a bit flippant in the example… but this isn’t far from conversations I have actually had with customers, and it demonstrates that we need to understand the differences between a backup strategy, a disaster recovery plan, and fault tolerant system design.

From Wikipedia:

Disaster Recovery
“…the process, policies and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after a natural or human-induced disaster”
OR: the process of recovering from failure.

Backup
“…making copies of data so that these additional copies may be used to restore the original after a data loss event” “A backup is only as useful as its associated restore strategy.”
OR: a tool to facilitate recovery of data after failure or loss.

Fault Tolerance
“…enables a system (often computer-based) to continue operating properly in the event of the failure of… some of its components.”
OR: the ability of a system to continue operating despite failure with no or little perceivable user impact.

High Availability
“…ensures a certain degree of operational continuity during a given measurement period.”
OR: the measure of availability (and associated significant design and cost), including unplanned and planned unavailability.
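
To make the High Availability definition a little more concrete, here is a minimal Python sketch of availability as a measure over a given period, counting both planned and unplanned unavailability. The function name and the sample numbers are my own illustrations, not part of any Microsoft tool or assessment:

```python
from datetime import timedelta


def availability(period: timedelta,
                 planned_down: timedelta,
                 unplanned_down: timedelta) -> float:
    """Fraction of the measurement period during which the service was usable.

    Both planned (maintenance) and unplanned (failure) downtime count
    against availability, matching the definition above.
    """
    if period <= timedelta(0):
        raise ValueError("measurement period must be positive")
    downtime = planned_down + unplanned_down
    if downtime > period:
        raise ValueError("downtime cannot exceed the measurement period")
    # Dividing one timedelta by another yields a plain float.
    return (period - downtime) / period


# Hypothetical month: 30 days, 4 hours of patching, 2 hours of outage.
month = availability(timedelta(days=30),
                     planned_down=timedelta(hours=4),
                     unplanned_down=timedelta(hours=2))
print(f"{month:.2%}")
```

With those made-up numbers, 6 of 720 hours are lost, so the result is 714/720, or a little under 99.2%. Whether that is "highly available" depends entirely on the target you agreed on beforehand, which is the point of treating HA as an objective rather than a product.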

Hopefully it is clear that though these words get mixed together frequently, they are NOT synonymous. Each of these words/ideas is an attempt to solve a different kind of problem. For example, it is perfectly acceptable to have excellent disaster recovery plans on systems that are NOT fault tolerant… as long as you accept that you may experience service unavailability should a server fail or need maintenance. It is also perfectly acceptable to have a fault tolerant system that has no true disaster recovery strategy… though that would be a little like buying car insurance that was only valid while your car was in the garage.

So… what are your options? That’s mostly been covered in other papers (e.g., here, and here, and here)… but here’s a quick (and possibly incomplete) chart:

| Product | Type of Tool | Protects | Does NOT protect |
| --- | --- | --- | --- |
| Microsoft Data Protection Manager | Backup | EVERYTHING (servers, farm, databases, data) | SAN |
| SQL Backups | Backup | Content/SSP Databases | OS, Web/App Servers, Config DB, Customizations, Search Indexes, SAN |
| STSADM FULL Backups | Backup | Content/SSP/Config DB | OS, Web/App Servers, Manually deployed customizations, SAN |
| STSADM Site Collection Backup | Backup | A single site collection | OS, Web/App Servers, Farm Customizations, Config DB, Search Indexes, SAN |
| STSADM Export | Backup | The data in a web and sub-webs | OS, Web/App Servers, Farm Customizations, Config DB, Search Indexes, SAN |
| SQL Mirroring or Log Shipping | Backup | Content/SSP Databases | OS, Web/App Servers, Config DB, Customizations, Search Indexes, SAN |
| Load Balancing (NLB/Hardware) | Fault Tolerance | Failure of a single web server (or n-1) | Failure of All web servers, SQL Cluster, Index server, or SAN. |
| SQL Clustering | Fault Tolerance | Failure of a SQL server machine | Failure of All SQL Cluster nodes, Web Servers, Index server, or SAN. |
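
One way to use the chart is to treat the “Does NOT protect” column as data and ask what is still exposed for a given combination of backup tools. The Python sketch below does that for four of the backup rows. It is only an illustration under stated assumptions: the function name is hypothetical, and the chart’s three customization labels are collapsed into a single “Customizations” item, which is a simplification that overstates the gap for STSADM full backups.

```python
# "Does NOT protect" column for four backup rows of the chart above.
# Assumption of this sketch: "Customizations", "Farm Customizations" and
# "Manually deployed customizations" are merged into one "Customizations" item.
NOT_PROTECTED = {
    "Microsoft Data Protection Manager": {"SAN"},
    "SQL Backups": {
        "OS", "Web/App Servers", "Config DB",
        "Customizations", "Search Indexes", "SAN",
    },
    "STSADM FULL Backups": {
        "OS", "Web/App Servers", "Customizations", "SAN",
    },
    "STSADM Site Collection Backup": {
        "OS", "Web/App Servers", "Customizations",
        "Config DB", "Search Indexes", "SAN",
    },
}


def remaining_gaps(tools):
    """Return the items that every selected tool lists as NOT protected."""
    if not tools:
        raise ValueError("select at least one tool")
    # An item stays exposed only if none of the selected tools covers it,
    # i.e. it appears in every tool's "does not protect" set.
    return set.intersection(*(NOT_PROTECTED[tool] for tool in tools))


print(sorted(remaining_gaps(["SQL Backups", "STSADM FULL Backups"])))
print(sorted(remaining_gaps(list(NOT_PROTECTED))))
```

Even with all four tools selected, the SAN is still in the result. A list like that is only the starting point for a DR conversation: it tells you what the tools leave exposed, not whether that exposure is acceptable.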

Notice anything? NOTHING in the above list offers “Disaster Recovery” or “High Availability”. That is because DR and HA are strategic objectives that require planning, documentation, coordination, and yes, tools. Backup tools are valuable only in the context of a well designed Disaster Recovery strategy. Fault tolerance is most valuable when deployed in alignment with a High Availability strategy and an HA goal, objective, or target. Yes, having backup and fault tolerance can be helpful even if you haven’t put together a complete strategy, but in the absence of such a strategy you don’t truly know what you’re protecting yourself against.

One more thing… a note on virtualization, “snapshots”, and SAN capabilities…

First, virtualization snapshots. These may provide a possible back-out strategy for changes in SharePoint… but only in one hotly debated scenario… and even then, Microsoft still does not recommend the use of snapshots for SharePoint products. While we specify Hyper-V in the linked article, the same reasons would apply to any technology that performs snapshots of active machines, including non-Microsoft virtualization products and SAN products. Because these methods are clearly not recommended and are supported by Microsoft only questionably, they should absolutely not be included in any DR or HA strategy and should generally be avoided.

Also, this doesn’t address 3rd party solutions, but the fundamental requirements would be the same… no technology solution should exist that doesn’t directly support and meet the needs of an underlying strategic objective. Knowing what your 3rd party solution will or won’t protect you from should be critical to your decision making process. (With SharePoint, the best solutions will integrate directly with the Windows Volume Shadow Copy Service (VSS) and the SharePoint VSS writer… ask your vendor!!)

Post a comment if you have any questions, want more detailed information or something specific to your needs, or even disagree with me… technical debates are fun and informative! :)

-Chris

Comments

  • Anonymous
    April 21, 2010
    I was just in one of your SP Developers Class and did not expect on Day 3 of my New Job MOSS would crash and no one knew how to recover it.  So this post is very timely.  I was able to troubleshoot and bring the system back up (VM Networking problem).  So now my focus is making sure that I never have to do this again.  Thanks