Reliability in Virtual Machines
This article contains detailed information on VM regional resiliency with availability zones and cross-region disaster recovery and business continuity.
Availability zone support
Availability zones are physically separate groups of datacenters within each Azure region. When one zone fails, services can fail over to one of the remaining zones.
For more information on availability zones in Azure, see What are availability zones?.
Virtual machines support availability zones with three availability zones per supported Azure region and are also zone-redundant and zonal. For more information, see Azure services with availability zones. The customer is responsible for configuring and migrating their virtual machines for availability.
To learn more about availability zone readiness options, see:
- See availability options for VMs
- Review availability zone service support and region support
- Migrate existing VMs to availability zones
Prerequisites
Your virtual machine SKUs must be available across the zones in for your region. To review which regions support availability zones, see the list of supported regions.
Your VM SKUs must be available across the zones in your region. To check for VM SKU availability, use one of the following methods:
- Use PowerShell to Check VM SKU availability.
- Use the Azure CLI to Check VM SKU availability.
- Go to Azure services with availability zone support.
SLA improvements
Because availability zones are physically separate and provide distinct power source, network, and cooling, SLAs (Service-level agreements) increase. For more information, see the SLA for Virtual Machines.
Create a resource with availability zones enabled
Get started by creating a virtual machine (VM) with availability zone enabled from the following deployment options below:
Zonal failover support
You can set up virtual machines to fail over to another zone using the Site Recovery service. For more information, see Site Recovery.
Fault tolerance
Virtual machines can fail over to another server in a cluster, with the VM's operating system restarting on the new server. You should refer to the failover process for disaster recovery, gathering virtual machines in recovery planning, and running disaster recovery drills to ensure their fault tolerance solution is successful.
For more information, see the site recovery processes.
Zone down experience
During a zone-wide outage, you should expect a brief degradation of performance until the virtual machine service self-healing rebalances underlying capacity to adjust to healthy zones. Self-healing isn't dependent on zone restoration; it's expected that the Microsoft-managed service self-healing state compensates for a lost zone, using capacity from other zones.
You should also prepare for the possibility that there's an outage of an entire region. If there's a service disruption for an entire region, the locally redundant copies of your data would temporarily be unavailable. If geo-replication is enabled, three other copies of your Azure Storage blobs and tables are stored in a different region. When there's a complete regional outage or a disaster in which the primary region isn't recoverable, Azure remaps all of the DNS entries to the geo-replicated region.
Zone outage preparation and recovery
The following guidance is provided for Azure virtual machines during a service disruption of the entire region where your Azure virtual machine application is deployed:
- Configure Azure Site Recovery for your VMs
- Check the Azure Service Health Dashboard status if Azure Site Recovery hasn't been configured
- Review how the Azure Backup service works for VMs
- See the support matrix for Azure VM backups
- Determine which VM restore option and scenario works best for your environment
Low-latency design
Cross Region (secondary region), Cross Subscription (preview), and Cross Zonal (preview) are available options to consider when designing a low-latency virtual machine solution. For more information on these options, see the supported restore methods.
Important
By opting out of zone-aware deployment, you forego protection from isolation of underlying faults. Use of SKUs that don't support availability zones or opting out from availability zone configuration forces reliance on resources that don't obey zone placement and separation (including underlying dependencies of these resources). These resources shouldn't be expected to survive zone-down scenarios. Solutions that leverage such resources should define a disaster recovery strategy and configure a recovery of the solution in another region.
Safe deployment techniques
When you opt for availability zones isolation, you should utilize safe deployment techniques for application code and application upgrades. In addition to configuring Azure Site Recovery, and implement any one of the following safe deployment techniques for VMs:
As Microsoft periodically performs planned maintenance updates, there may be rare instances when these updates require a reboot of your virtual machine to apply the required updates to the underlying infrastructure. To learn more, see availability considerations during scheduled maintenance.
Before you upgrade your next set of nodes in another zone, you should perform the following tasks:
- Check the Azure Service Health Dashboard for the virtual machines service status for your expected regions.
- Ensure that replication is enabled on your VMs.
Migrate to availability zone support
To learn how to migrate a VM to availability zone support, see Migrate Virtual Machines and Virtual Machine Scale Sets to availability zone support.
- Move a VM to another subscription or resource group
- Azure Resource Mover
- Move Azure VMs to availability zones
- Move region maintenance configuration resources
Cross-region disaster recovery and business continuity
Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR. Before you begin to think about creating your disaster recovery plan, see Recommendations for designing a disaster recovery strategy.
When it comes to DR, Microsoft uses the shared responsibility model. In a shared responsibility model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data or fall back from a failed region to cross-replicate to another enabled region. For those services, you are responsible for setting up a disaster recovery plan that works for your workload. Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR and you can use service-specific features to support fast recovery to help develop your DR plan.
You can use Cross Region restore to restore Azure VMs via paired regions. With Cross Region restore, you can restore all the Azure VMs for the selected recovery point if the backup is done in the secondary region. For more information on Cross Region restore, refer to the Cross Region table row entry in our restore options.
Disaster recovery in multi-region geography
In the case of a region-wide service disruption, Microsoft works diligently to restore the virtual machine service. However, you still must rely on other application-specific backup strategies to achieve the highest level of availability. For more information, see the section on Data strategies for disaster recovery.
Outage detection, notification, and management
Hardware or physical infrastructure for the virtual machine may fail unexpectedly. Unexpected failures can include local network failures, local disk failures, or other rack level failures. When detected, the Azure platform automatically migrates (heals) your virtual machine to a healthy physical machine in the same data center. During the healing procedure, virtual machines experience downtime (reboot) and in some cases loss of the temporary drive. The attached OS and data disks are always preserved.
For more detailed information on virtual machine service disruptions, see disaster recovery guidance.
Set up disaster recovery and outage detection
When setting up disaster recovery for virtual machines, understand what Azure Site Recovery provides. Enable disaster recovery for virtual machines with the below methods:
- Set up disaster recovery to a secondary Azure region for an Azure VM
- Create a Recovery Services vault
- Enable disaster recovery for Linux virtual machines
- Enable disaster recovery for Windows virtual machines
- Fail over virtual machines to another region
- Fail over virtual machines to the primary region
Disaster recovery in single-region geography
With disaster recovery setup, Azure VMs continuously replicate to a different target region. If an outage occurs, you can fail over VMs to the secondary region, and access them from there.
When you replicate Azure VMs using Site Recovery, all the VM disks are continuously replicated to the target region asynchronously. The recovery points are created every few minutes, which grants you a Recovery Point Objective (RPO) in the order of minutes. You can conduct disaster recovery drills as many times as you want, without affecting the production application or the ongoing replication. For more information, see Run a disaster recovery drill to Azure.
For more information, see Azure VMs architectural components and region pairing.
Capacity and proactive disaster recovery resiliency
Microsoft and its customers operate under the Shared Responsibility Model. Shared responsibility means that for customer-enabled DR (customer-responsible services), you must address DR for any service they deploy and control. To ensure that recovery is proactive, you should always pre-deploy secondaries because there's no guarantee of capacity at time of impact for those who haven't preallocated.
For deploying virtual machines, you can use flexible orchestration mode on Virtual Machine Scale Sets. All VM sizes can be used with flexible orchestration mode. Flexible orchestration mode also offers high availability guarantees (up to 1000 VMs) by spreading VMs across fault domains either within a region or within an availability zone.