Customer enabled disaster recovery

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

To maximize your uptime, plan ahead to maintain business continuity and prepare for disaster recovery with Azure AI Foundry. Because Azure AI Foundry builds on the Azure Machine Learning architecture, it helps to refer to that foundational architecture as well.

Microsoft strives to ensure that Azure services are always available. However, unplanned service outages might occur. We recommend having a disaster recovery plan in place for handling regional service outages. In this article, you learn how to:

  • Plan for a multi-regional deployment of Azure AI Foundry and associated resources.
  • Maximize your chances of recovering logs, notebooks, Docker images, and other metadata.
  • Design for high availability of your solution.
  • Initiate a failover to another region.

Important

Azure AI Foundry itself does not provide automatic failover or disaster recovery.

Understand Azure services for Azure AI Foundry

Azure AI Foundry depends on multiple Azure services. Some of these services are provisioned in your subscription. You're responsible for the high-availability configuration of these services. Microsoft manages some services, which are created in a Microsoft subscription.

Azure services include:

  • Azure AI Foundry infrastructure: A Microsoft-managed environment for the Azure AI Foundry hub and project. Azure Machine Learning provides the underlying architecture; see the Azure AI Foundry architecture documentation.

  • Required associated resources: Resources provisioned in your subscription during Azure AI Foundry hub and project creation. These resources include Azure Storage and Azure Key Vault.

    • Default storage has data such as model, training log data, and references to data assets.
    • Key Vault has credentials for Azure Storage and connections.
  • Optional associated resources: Resources you can attach to your Azure AI Foundry hub. These resources include Azure Container Registry and Application Insights.

    • Container Registry stores the Docker images for training and inferencing environments.
    • Application Insights is for monitoring Azure AI Foundry.
  • Compute instance: A resource you create after hub deployment. It provides a Microsoft-managed model development environment.

  • Connections: Azure AI Foundry can connect to various other services. You're responsible for configuring their high-availability settings.

The following table shows the Azure services that Microsoft manages and the ones you manage. It also indicates the services that are highly available by default.

| Service | Managed by | High availability by default |
| --- | --- | --- |
| Azure AI Foundry infrastructure | Microsoft | |
| Associated resources | | |
| Azure Storage | You | |
| Key Vault | You | |
| Container Registry | You | |
| Application Insights | You | NA |
| Compute resources | | |
| Compute instance | Microsoft | |
| Any connection to external services such as Azure AI Services | You | |

The rest of this article describes the actions you need to take to make each of these services highly available.

Plan for multi-regional deployment

A multi-regional deployment relies on creation of Azure AI Foundry and other resources (infrastructure) in two Azure regions. If a regional outage occurs, you can switch to the other region. When planning on where to deploy your resources, consider:

  • Regional availability: If possible, use a region in the same geographic area, not necessarily the one that is closest. To check regional availability for Azure AI Foundry, see Azure products by region.

  • Azure paired regions: Paired regions coordinate platform updates and prioritize recovery efforts where needed. However, not all regions have a paired region. For more information, see Azure paired regions.

  • Service availability: Decide whether the resources used by your solution should be hot/hot, hot/warm, or hot/cold.

    • Hot/hot: Both regions are active at the same time, with one region ready to begin use immediately.
    • Hot/warm: Primary region active, secondary region has critical resources (for example, deployed models) ready to start. Noncritical resources would need to be manually deployed in the secondary region.
    • Hot/cold: Primary region active, secondary region has Azure AI Foundry and other resources deployed, along with needed data. Resources such as models, model deployments, or pipelines would need to be manually deployed.

Tip

Depending on your business requirements, you may decide to treat different Azure AI Foundry resources differently.

Azure AI Foundry builds on top of other services. Some services can be configured to replicate to other regions. Others you must manually create in multiple regions. The following table provides a list of services, who is responsible for replication, and an overview of the configuration:

| Azure service | Geo-replicated by | Configuration |
| --- | --- | --- |
| AI Foundry hub and projects | You | Create a hub/projects in the selected regions. |
| AI Foundry compute | You | Create the compute resources in the selected regions. For compute resources that can dynamically scale, make sure that both regions provide sufficient compute quota for your needs. |
| Key Vault | Microsoft | Use the same Key Vault instance with the Azure AI Foundry hub and resources in both regions. Key Vault automatically fails over to a secondary region. For more information, see Azure Key Vault availability and redundancy. |
| Storage Account | You | Azure Machine Learning doesn't support default storage-account failover using geo-redundant storage (GRS), geo-zone-redundant storage (GZRS), read-access geo-redundant storage (RA-GRS), or read-access geo-zone-redundant storage (RA-GZRS). Configure a storage account according to your needs and then use it for your hub. All subsequent projects use the hub's storage account. For more information, see Azure Storage redundancy. |
| Container Registry | Microsoft | Configure the Container Registry instance to geo-replicate registries to the paired region for Azure AI Foundry. Use the same instance for both hub instances. For more information, see Geo-replication in Azure Container Registry. |
| Application Insights | You | Create Application Insights for the hub in both regions. To adjust the data-retention period and details, see Data collection, retention, and storage in Application Insights. |

To enable fast recovery and restart in the secondary region, we recommend the following development practices:

  • Use Azure Resource Manager templates. Templates are 'infrastructure-as-code,' and allow you to quickly deploy services in both regions.
  • To avoid drift between the two regions, update your continuous integration and deployment pipelines to deploy to both regions.
  • Create role assignments for users in both regions.
  • Create network resources such as Azure Virtual Networks and private endpoints for both regions. Make sure that users have access to both network environments. For example, set up VPN and DNS configurations for both virtual networks.
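
As a rough illustration of deploying one template to both regions, the following sketch builds an `az deployment group create` command for each region from a single template file. The resource group names, regions, template file name, and the assumption that the template declares a `location` parameter are all hypothetical; adapt them to your own templates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegionTarget:
    """One deployment target. The values used below are hypothetical placeholders."""
    resource_group: str
    location: str


def build_deploy_commands(template_file: str, targets: List[RegionTarget]) -> List[List[str]]:
    """Return one `az deployment group create` command per region.

    The same template is used for every region so the regions don't drift apart;
    only the resource group and the (assumed) `location` parameter differ.
    """
    commands = []
    for target in targets:
        commands.append([
            "az", "deployment", "group", "create",
            "--resource-group", target.resource_group,
            "--template-file", template_file,
            "--parameters", f"location={target.location}",
        ])
    return commands


if __name__ == "__main__":
    targets = [
        RegionTarget("rg-foundry-primary", "eastus2"),      # hypothetical names
        RegionTarget("rg-foundry-secondary", "centralus"),
    ]
    for command in build_deploy_commands("hub.json", targets):
        # Printed rather than executed; a pipeline step could pass each list to subprocess.run.
        print(" ".join(command))
```

Generating both commands from one list of targets means a new region or a template change is applied in one place, which is one way to keep the two environments aligned.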

Design for high availability

Availability zones

Certain Azure services support availability zones. In regions that support availability zones, if a zone goes down, any project pauses and its data should be saved. However, the data can't be refreshed until the zone is back online.

For more information, see Availability zone service support.

Deploy critical components to multiple regions

Determine the level of business continuity that you're aiming for. The level might differ between the components of your solution. For example, you might want to have a hot/hot configuration for production pipelines or model deployments, and hot/cold for development.
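
The per-component decision described above can be recorded as data so that runbooks and pipelines read from the same source. This is only a sketch: the component names are hypothetical, and the tiers mirror the example in the preceding paragraph.

```python
from enum import Enum


class Tier(Enum):
    HOT_HOT = "hot/hot"
    HOT_WARM = "hot/warm"
    HOT_COLD = "hot/cold"


# Hypothetical component names; the tiers follow the example in the text.
CONTINUITY_PLAN = {
    "production-pipelines": Tier.HOT_HOT,
    "model-deployments": Tier.HOT_HOT,
    "development": Tier.HOT_COLD,
}


def needs_manual_deployment(component: str) -> bool:
    """True when failing over this component involves manual deployment steps.

    In hot/warm and hot/cold setups, some resources must be deployed by hand
    in the secondary region; in hot/hot, both regions are already active.
    """
    return CONTINUITY_PLAN[component] is not Tier.HOT_HOT


if __name__ == "__main__":
    for name, tier in CONTINUITY_PLAN.items():
        print(name, tier.value, "manual steps:", needs_manual_deployment(name))
```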

Azure AI Foundry is a regional service and stores data both service-side and on a storage account in your subscription. If a regional disaster occurs, service-side data can't be recovered. However, you can recover the data that the service stores on the storage account in your subscription, provided that storage redundancy is configured. Service-side data is mostly metadata (tags, asset names, descriptions). Data on your storage account is typically non-metadata, for example, uploaded data.

For connections, we recommend creating two separate resources in two distinct regions and then creating two connections for the hub. For example, if AI Services is critical for business continuity, create two AI Services resources and two connections for the hub. With this configuration, if one region goes down, the other region is still operational.
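
On the application side, two connections only help if the calling code can fall back from one to the other. The following is a minimal sketch of that idea, assuming your code reaches each regional resource through its own endpoint. The helper name `call_with_failover`, the example URLs, and the choice of which exceptions count as an outage are assumptions, not part of any Azure SDK.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_failover(
    endpoints: Sequence[str],
    request: Callable[[str], T],
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Try each regional endpoint in order and return the first successful result."""
    last_error = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except retriable as error:
            last_error = error  # This region looks unavailable; try the next one.
    if last_error is None:
        raise ValueError("no endpoints configured")
    raise last_error


if __name__ == "__main__":
    def fake_request(endpoint: str) -> str:
        # Stand-in for a real service call; simulates an outage in the primary region.
        if "primary" in endpoint:
            raise ConnectionError("region unavailable")
        return f"handled by {endpoint}"

    print(call_with_failover(
        ["https://primary.example", "https://secondary.example"],  # hypothetical URLs
        fake_request,
    ))
```

Only errors listed in `retriable` trigger a fallback, so application-level failures (for example, a malformed request) still surface immediately instead of being retried in the other region.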

For any hubs that are essential to business continuity, deploy resources in two regions.

Isolated storage

When you connect data to customize your AI application, your datasets are typically used both in Azure AI and outside of it. Dataset volume can be quite large, so it might be good practice to keep this data in a separate storage account. Evaluate which data replication strategy makes the most sense for your use case.

In the AI Foundry portal, make a connection to your data. If you have multiple AI Foundry instances in different regions, you can still point to the same storage account because connections work across regions.

Initiate a failover

Continue work in the failover hub

When your primary hub becomes unavailable, you can switch over to the secondary hub to continue development. Azure AI Foundry doesn't automatically submit jobs to the secondary hub if there's an outage. Update your code configuration to point to the new hub or project resources. We recommend that you avoid hardcoding hub or project references.
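
One way to avoid hardcoded references is to resolve the hub from configuration at startup, so that failing over means changing a setting rather than editing code. The sketch below reads the values from environment variables. All variable names (`FOUNDRY_ACTIVE_REGION`, `FOUNDRY_PRIMARY_*`, `FOUNDRY_SECONDARY_*`) and the `HubReference` type are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HubReference:
    subscription_id: str
    resource_group: str
    hub_name: str


def load_hub_reference(env: Optional[Mapping[str, str]] = None) -> HubReference:
    """Select the primary or secondary hub from configuration instead of code.

    The environment variable names are hypothetical; use whatever your
    configuration system provides.
    """
    env = os.environ if env is None else env
    active = env.get("FOUNDRY_ACTIVE_REGION", "primary").upper()
    if active not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"unknown region selector: {active}")
    prefix = f"FOUNDRY_{active}_"
    return HubReference(
        subscription_id=env[prefix + "SUBSCRIPTION_ID"],
        resource_group=env[prefix + "RESOURCE_GROUP"],
        hub_name=env[prefix + "HUB_NAME"],
    )


if __name__ == "__main__":
    example_env = {
        "FOUNDRY_ACTIVE_REGION": "secondary",
        "FOUNDRY_SECONDARY_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
        "FOUNDRY_SECONDARY_RESOURCE_GROUP": "rg-foundry-secondary",
        "FOUNDRY_SECONDARY_HUB_NAME": "hub-secondary",
    }
    print(load_hub_reference(example_env))
```

With this layout, switching to the failover hub is a matter of setting `FOUNDRY_ACTIVE_REGION` to `secondary` and restarting the workload.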

Azure AI Foundry can't sync or recover artifacts or metadata between hubs. Depending on your application deployment strategy, you might have to move or re-create artifacts in the failover hub to continue. If you configure your primary and secondary hubs to share associated resources with geo-replication enabled, some objects might be directly available to the failover hub, for example, when both hubs share the same Docker images, configured datastores, and Azure Key Vault resources.

Note

Any jobs that are running when a service outage occurs will not automatically transition to the secondary hub. It is also unlikely that the jobs will resume and finish successfully in the primary hub once the outage is resolved. Instead, these jobs must be resubmitted, either in the secondary hub or in the primary (once the outage is resolved).
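
Because resubmission is manual, it helps to keep your own record of what was in flight. The following sketch assumes that your tooling tracks each job's last known status and keeps the job definition in source control. The `JobRecord` type, the status labels, and the `submit` callback are hypothetical and don't correspond to an Azure API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical status labels recorded by your own tooling, not by an Azure API.
INTERRUPTED_STATUSES = {"Queued", "Running"}


@dataclass(frozen=True)
class JobRecord:
    name: str
    status: str     # last status seen before the outage
    spec_path: str  # path to the job definition kept in source control


def resubmit_interrupted(jobs: Iterable[JobRecord], submit: Callable[[str], None]) -> List[str]:
    """Resubmit every job that was still in flight and return the job names."""
    resubmitted = []
    for job in jobs:
        if job.status in INTERRUPTED_STATUSES:
            submit(job.spec_path)  # submit() should target the hub you failed over to
            resubmitted.append(job.name)
    return resubmitted


if __name__ == "__main__":
    jobs = [
        JobRecord("train-a", "Running", "jobs/train-a.yml"),
        JobRecord("train-b", "Completed", "jobs/train-b.yml"),
    ]
    print(resubmit_interrupted(jobs, lambda path: print("resubmitting", path)))
```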

Recovery options

Resource deletion

Hubs and projects don't support soft delete, so a hub or project that is deleted can't be recovered. However, if a hub and its associated resources are accidentally deleted, some of the underlying resources support soft delete and could potentially be recovered. The following table shows which services have a soft delete option.

| Service | Soft delete enabled |
| --- | --- |
| Azure AI Foundry hub | Unsupported |
| Azure AI Foundry project | Unsupported |
| Azure AI Services resource | Yes |
| Azure Storage | See Recover a deleted storage account. |
| Azure Key Vault | Yes |

Next steps