How customer-managed (unplanned) failover works

Article
08/05/2024

Customer-managed (unplanned) failover enables you to fail over your entire geo-redundant storage account to the secondary region if the storage service endpoints for the primary region become unavailable. During failover, the original secondary region becomes the new primary region. All storage service endpoints are then redirected to the new primary region. After the storage service endpoint outage is resolved, you can perform another failover operation to fail back to the original primary region.

This article describes what happens during a customer-managed (unplanned) failover and failback at every stage of the process.

Important

Customer-managed (unplanned) failover for accounts that have Azure Data Lake Storage Gen2 enabled is currently in PREVIEW and supported in all public GRS/GZRS regions.

See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.

Important

Customer-managed (unplanned) failover for accounts that have SSH File Transfer Protocol (SFTP) enabled is currently in PREVIEW and only supported in the following regions:

(Asia Pacific) Central India
(Asia Pacific) Southeast Asia
(Europe) North Europe
(Europe) Switzerland North
(Europe) Switzerland West
(Europe) West Europe
(North America) Canada Central
(North America) East US 2
(North America) South Central US

See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.

In the event of a significant disaster that affects the primary region, Microsoft will manage the failover for accounts with a hierarchical namespace. For more information, see Microsoft-managed failover.

Redundancy management during unplanned failover and failback

Tip

To understand the various redundancy states during the unplanned failover and failback process in detail, see Azure Storage redundancy for definitions of each.

When a storage account is configured for geo-redundant storage (GRS) or read-access geo-redundant storage (RA-GRS), data is replicated three times within the primary region and three times within the secondary region, using locally redundant storage (LRS) in each. When a storage account is configured for geo-zone-redundant storage (GZRS) or read-access geo-zone-redundant storage (RA-GZRS), data is replicated across availability zones in the primary region using zone-redundant storage (ZRS) and replicated three times within the secondary region using LRS. If the account is configured for read access (RA), you can read data from the secondary region as long as the storage service endpoints for that region are available.
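The redundancy layouts described above can be written down as a small lookup table. The following is only an illustrative sketch: `RedundancyLayout`, `LAYOUTS`, and `can_read_secondary` are hypothetical names chosen for this example and are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyLayout:
    primary: str               # redundancy used inside the primary region
    secondary: str             # redundancy used inside the secondary region
    secondary_readable: bool   # True only for the read-access (RA) variants


# Layouts as described in the paragraph above (hypothetical helper table).
LAYOUTS = {
    "GRS": RedundancyLayout("LRS", "LRS", False),
    "RA-GRS": RedundancyLayout("LRS", "LRS", True),
    "GZRS": RedundancyLayout("ZRS", "LRS", False),
    "RA-GZRS": RedundancyLayout("ZRS", "LRS", True),
}


def can_read_secondary(sku: str, secondary_endpoints_available: bool) -> bool:
    """Reads from the secondary need an RA configuration and reachable endpoints."""
    layout = LAYOUTS[sku]
    return layout.secondary_readable and secondary_endpoints_available
```

For example, `can_read_secondary("RA-GZRS", True)` is true, while the same call for `"GZRS"` is false because that configuration doesn't expose a readable secondary.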

During the customer-managed (unplanned) failover process, the Domain Name System (DNS) entries for the storage service endpoints are switched. Your storage account's secondary endpoints become the new primary endpoints, and the original primary endpoints become the new secondary endpoints. After failover, the copy of your storage account in the original primary region is deleted, and your storage account continues to be replicated three times locally within the new primary region. At that point, your storage account is locally redundant and uses LRS.

The original and current redundancy configurations are stored within the storage account's properties. This functionality allows you to return to your original configuration when you fail back. For a complete list of resulting redundancy configurations, read Recovery planning and failover.

To regain geo-redundancy after a failover, you need to reconfigure your account as GRS. After the account is reconfigured for geo-redundancy, Azure immediately begins copying data from the new primary region to the new secondary. If you configure your storage account for read access to the secondary region, that access is available. However, replication from the primary to the secondary region might take some time to complete.

Warning

After your account is reconfigured for geo-redundancy, it may take a significant amount of time before existing data in the new primary region is fully copied to the new secondary.

To avoid major data loss, check the value of the Last Sync Time property before failing back. To evaluate potential data loss, compare the last sync time to the last time at which data was written to the new primary. Any data written after the last sync time might not yet exist in the secondary region and could be lost.
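The comparison described in this warning can be sketched in a few lines of Python. This is a minimal illustration, not an Azure API: the function names are hypothetical, and it assumes you have already obtained the Last Sync Time value and that your application tracks the time of its most recent write itself.

```python
from datetime import datetime, timedelta, timezone


def potential_data_loss_window(last_sync_time: datetime,
                               last_write_time: datetime) -> timedelta:
    """Return how far the most recent write is ahead of the last sync.

    Writes made after the last sync time might not have been replicated,
    so this window approximates the data at risk. Returns zero if the
    last write happened at or before the last sync time.
    """
    gap = last_write_time - last_sync_time
    return max(gap, timedelta(0))


def safe_to_fail_back(last_sync_time: datetime,
                      last_write_time: datetime,
                      tolerance: timedelta = timedelta(0)) -> bool:
    """True if the at-risk window is within a tolerance you choose (an assumption)."""
    return potential_data_loss_window(last_sync_time, last_write_time) <= tolerance


# Illustrative timestamps only; real values come from your account and application.
sync = datetime(2024, 8, 5, 12, 0, tzinfo=timezone.utc)
write = datetime(2024, 8, 5, 12, 30, tzinfo=timezone.utc)
window = potential_data_loss_window(sync, write)
```

With the illustrative timestamps above, the last write is 30 minutes ahead of the last sync, so up to 30 minutes of writes could be missing from the secondary if you failed back at that moment.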

The failback process is essentially the same as the failover process, except that the replication configuration is restored to its original, pre-failover state.

After failback, you can reconfigure your storage account to take advantage of geo-redundancy. If the original primary was configured as ZRS, you can configure it to be GZRS or RA-GZRS. For more options, see Change how a storage account is replicated.

How to initiate an unplanned failover

To learn how to initiate an unplanned failover, see Initiate an account failover.

Caution

Unplanned failover usually involves some data loss, and potentially file and data inconsistencies. It's important to understand the impact that an account failover would have on your data before initiating this type of failover.

For details about potential data loss and inconsistencies, see Anticipate data loss and inconsistencies.

The unplanned failover and failback process

This section summarizes the failover process for a customer-managed (unplanned) failover.

Unplanned failover transition summary

After a customer-managed (unplanned) failover:

The secondary region becomes the new primary
The copy of the data in the original primary region is deleted
The storage account is converted to LRS
Geo-redundancy is lost

This table summarizes the resulting redundancy configuration at every stage of a customer-managed (unplanned) failover and failback:

Original configuration	After failover	After re-enabling geo-redundancy (post-failover)	After failback	After re-enabling geo-redundancy (post-failback)
GRS	LRS	GRS ¹	LRS	GRS ¹
GZRS	LRS	GRS ¹	ZRS	GZRS ¹

¹ Geo-redundancy is lost during a customer-managed (unplanned) failover and must be manually reconfigured.
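The table can also be read as a fixed sequence of states per original configuration. The sketch below encodes only the two rows shown in the table. The names `AFTER_FAILBACK` and `lifecycle` are hypothetical, and the read-access variants aren't modeled because the table doesn't list them.

```python
# Redundancy after failback depends on the original configuration (from the table).
AFTER_FAILBACK = {"GRS": "LRS", "GZRS": "ZRS"}


def lifecycle(original: str) -> list:
    """Return (stage, redundancy) pairs for a full failover and failback cycle."""
    if original not in AFTER_FAILBACK:
        raise ValueError(f"configuration not covered by the table: {original}")
    return [
        ("original configuration", original),
        ("after failover", "LRS"),                          # geo-redundancy is lost
        ("after re-enabling geo-redundancy", "GRS"),        # manual reconfiguration
        ("after failback", AFTER_FAILBACK[original]),
        ("after re-enabling geo-redundancy", original),     # manual reconfiguration
    ]
```

Calling `lifecycle("GZRS")` walks through GZRS, LRS, GRS, ZRS, and back to GZRS, which matches the second row of the table.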

Unplanned failover transition details

The following diagrams show the customer-managed (unplanned) failover and failback process for a storage account configured for geo-redundancy. The transition details for GZRS and RA-GZRS are slightly different from GRS and RA-GRS.


Normal operation (GRS/RA-GRS)

Under normal circumstances, a client writes data to a storage account in the primary region via storage service endpoints (1). The data is then copied asynchronously from the primary region to the secondary region (2). The following image shows the normal state of a storage account configured as GRS when the primary endpoints are available:

The storage service endpoints become unavailable in the primary region (GRS/RA-GRS)

If the primary storage service endpoints become unavailable for any reason (1), the client is no longer able to write to the storage account. Depending on the underlying cause of the outage, replication to the secondary region might no longer be functioning (2), so some data loss should be expected. The following image shows the scenario where the primary endpoints become unavailable, but before recovery occurs:

The unplanned failover process (GRS/RA-GRS)

To restore write access to your data, you can initiate a failover. The storage service endpoint URIs for blobs, tables, queues, and files remain unchanged, but their DNS entries are changed to point to the secondary region as shown:

Customer-managed (unplanned) failover typically takes about an hour.
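To illustrate why clients don't need new URIs after a failover, the following sketch models an account whose endpoint URIs depend only on the account name, while the region behind them changes. It's a simplified model under stated assumptions: the `Account` class and `fail_over` function are hypothetical, the account name and regions are placeholders, and the URI pattern assumes standard public-cloud endpoints.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

SERVICES = ("blob", "table", "queue", "file")


@dataclass(frozen=True)
class Account:
    name: str
    primary_region: str
    secondary_region: Optional[str]
    sku: str


def endpoints(account: Account) -> Dict[str, str]:
    """Endpoint URIs are derived from the account name only, never the region."""
    return {svc: f"https://{account.name}.{svc}.core.windows.net" for svc in SERVICES}


def fail_over(account: Account) -> Account:
    """Model the GRS/RA-GRS outcome: the secondary becomes primary, account becomes LRS."""
    if account.secondary_region is None:
        raise ValueError("account has no secondary region to fail over to")
    return replace(
        account,
        primary_region=account.secondary_region,
        secondary_region=None,  # the copy in the original primary is deleted
        sku="LRS",
    )


# Placeholder account name and regions, for illustration only.
before = Account("contoso", "eastus2", "centralus", "GRS")
after = fail_over(before)
```

In this model, `endpoints(before)` and `endpoints(after)` are identical even though `after.primary_region` is now the original secondary. In the real service, the DNS entries behind those URIs are what change.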

After the failover is complete, the original secondary becomes the new primary (1), and the copy of the storage account in the original primary is deleted (2). The storage account is configured as LRS in the new primary region, and is no longer geo-redundant. Users can resume writing data to the storage account (3), as shown in this image:

To resume replication to a new secondary region, reconfigure the account for geo-redundancy.

Important

Keep in mind that converting a locally redundant storage account to use geo-redundancy incurs both cost and time. For more information, see The time and cost of failing over.

After reconfiguring the account to utilize GRS, Azure begins copying your data asynchronously to the new secondary region (1) as shown in this image:

Read access to the new secondary region isn't available again until the issue causing the original outage is resolved.

The unplanned failback process (GRS/RA-GRS)

Warning

After your account is reconfigured for geo-redundancy, it might take a significant amount of time before the data in the new primary region is fully copied to the new secondary.

To avoid major data loss, check the value of the Last Sync Time property before failing back. To evaluate potential data loss, compare the last sync time to the last time at which data was written to the new primary.

After the issue causing the original outage is resolved, you can initiate failback to the original primary region. This process is described in the following image:

The current primary region becomes read only.
With customer-managed (unplanned) failover and failback, your data isn't allowed to finish replicating to the secondary region during the failback process. Therefore, it's important to check the value of the Last Sync Time property before failing back.
The DNS entries for the storage service endpoints are switched. The endpoints within the current secondary region (the original primary region) become the new primary endpoints for your storage account.

After the failback is complete, the original primary region becomes the current primary region again (1), and the copy of the storage account in the original secondary is deleted (2). The storage account is configured as locally redundant in the primary region, and is no longer geo-redundant. Users can resume writing data to the storage account (3), as shown in this image:

To resume replication to the original secondary region, reconfigure the account for geo-redundancy.

Important

Keep in mind that converting a locally redundant storage account to use geo-redundancy incurs both cost and time. For more information, see The time and cost of failing over.

After reconfiguring the account as GRS, replication to the original secondary region resumes as shown in this image: