Reliability in Elastic SAN

This article describes reliability support in Azure Elastic SAN and covers both regional resiliency with availability zones and disaster recovery and business continuity.

Availability zone support

Availability zones are physically separate groups of datacenters within each Azure region. When one zone fails, services can fail over to one of the remaining zones.

For more information on availability zones in Azure, see What are availability zones?.

Azure Elastic SAN supports availability zone deployment with locally redundant storage (LRS) and regional deployment with zone-redundant storage (ZRS).

Prerequisites

LRS and ZRS Elastic SAN are currently only available in a subset of regions. For a list of regions, see Scale targets for Elastic SAN.

Create a resource using availability zones

To create an Elastic SAN with an availability zone enabled, see Deploy an Elastic SAN.

Zone down experience

When deploying an Elastic SAN, if you select ZRS for your SAN's redundancy option, zonal failover is supported by the platform without manual intervention. An elastic SAN using ZRS is designed to self-heal and rebalance itself to take advantage of healthy zones automatically.

If you deployed an LRS elastic SAN, you may need to deploy a new SAN, using snapshots exported to managed disks.

Low-latency design

The latency difference between an elastic SAN on LRS and an elastic SAN on ZRS is small. However, for workloads that are sensitive to latency spikes, consider an elastic SAN on LRS, since it offers the lowest latency.

Availability zone migration

To migrate an elastic SAN on LRS to ZRS, you must snapshot your elastic SAN's volumes, export them to managed disk snapshots, deploy an elastic SAN on ZRS, and then create volumes on the ZRS SAN from those disk snapshots. To learn how to use snapshots (preview), see Snapshot Azure Elastic SAN volumes (preview).
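The order of the migration steps above matters, so it can help to write them down as an explicit plan before running anything. The following is a minimal Python sketch that only builds and prints such a plan. It doesn't call any Azure API, and the names `Step`, `lrs_to_zrs_plan`, `san-lrs`, `san-zrs`, `vol1`, and `vol2` are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    order: int    # 1-based position in the plan
    action: str   # what to do
    target: str   # which resource the action applies to


def lrs_to_zrs_plan(volumes, source_san, target_san):
    """Build the ordered LRS-to-ZRS migration steps for the given volumes.

    This only describes the work; it performs no Azure operations.
    """
    actions = []
    # 1. Snapshot every volume on the LRS SAN.
    for v in volumes:
        actions.append(("snapshot volume", f"{source_san}/{v}"))
    # 2. Export each volume snapshot to a managed disk snapshot.
    for v in volumes:
        actions.append(("export volume snapshot to managed disk snapshot",
                        f"{source_san}/{v}"))
    # 3. Deploy the new SAN with ZRS redundancy.
    actions.append(("deploy elastic SAN on ZRS", target_san))
    # 4. Create volumes on the ZRS SAN from the managed disk snapshots.
    for v in volumes:
        actions.append(("create volume from managed disk snapshot",
                        f"{target_san}/{v}"))
    return [Step(i, action, target)
            for i, (action, target) in enumerate(actions, start=1)]


plan = lrs_to_zrs_plan(["vol1", "vol2"], "san-lrs", "san-zrs")
for step in plan:
    print(step.order, step.action, step.target)
```

A plan like this can be reviewed or turned into a checklist before you carry out the actual operations with the portal, Azure PowerShell, or the Azure CLI.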

Disaster recovery and business continuity

Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR. Before you begin to think about creating your disaster recovery plan, see Recommendations for designing a disaster recovery strategy.

When it comes to DR, Microsoft uses the shared responsibility model. In a shared responsibility model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data or fail over from a failed region to another enabled region. For those services, you're responsible for setting up a disaster recovery plan that works for your workload. Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR, and you can use service-specific features that support fast recovery to help develop your DR plan.

Single-region and multi-region disaster recovery

For Azure Elastic SAN, you're responsible for the DR experience. You can take snapshots of your volumes and export them to managed disk snapshots. Then, you can copy an incremental snapshot to a new region so that your data is stored in a region other than the one your elastic SAN is in. You should export to regions that are geographically distant from your primary region to reduce the possibility of multiple regions being affected by the same disaster.
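As a rough illustration of "geographically distant", the following Python sketch picks the candidate region farthest from the primary by great-circle (haversine) distance. It's a sketch under stated assumptions: the region names and coordinates are made-up placeholders, not real Azure region locations, and distance is only one of several factors (data residency, capacity, latency) you'd weigh in practice.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def farthest_region(primary, candidates):
    """Return the name of the candidate region farthest from the primary location."""
    return max(candidates, key=lambda name: haversine_km(primary, candidates[name]))


# Hypothetical placeholder coordinates, NOT real Azure region locations.
primary_location = (0.0, 0.0)
candidate_regions = {
    "region-near": (0.0, 10.0),
    "region-far": (0.0, 90.0),
}

dr_region = farthest_region(primary_location, candidate_regions)
print(dr_region)
```

Replace the placeholder coordinates with approximate locations for the regions you're actually considering, and confirm that Elastic SAN and managed disk snapshots are available there before you commit to a DR region.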

Outage detection, notification, and management

You can find outage declarations in Service Health - Microsoft Azure.

Capacity and proactive disaster recovery resiliency

Microsoft and its customers operate under the shared responsibility model. Shared responsibility means that for customer-enabled DR (customer-responsible services), you must address DR for any service you deploy and control. You should prevalidate that any service you deploy works with Elastic SAN. To ensure that recovery is proactive, always predeploy secondaries, because there's no guarantee of capacity at the time of impact for those who haven't preallocated.

Next steps