Azure Local baseline reference architecture

Azure Stack HCI
Azure Arc
Azure Blob Storage
Azure Monitor
Microsoft Defender for Cloud

This baseline reference architecture provides workload-agnostic guidance and recommendations for configuring Azure Local, version 23H2, release 2311 and later infrastructure to ensure a reliable platform that can deploy and manage highly available virtualized and containerized workloads. This architecture describes the resource components and cluster design choices for the physical nodes that provide local compute, storage, and networking features. It also describes how to use Azure services to simplify and streamline the day-to-day management of Azure Local.

For more information about workload architecture patterns that are optimized to run on Azure Local, see the content located in the Azure Local workloads navigation menu.

This architecture is a starting point for how to use the storage switched network design to deploy a multinode Azure Local instance. The workload applications deployed on an Azure Local instance should be well architected. Well-architected workload applications must be deployed using multiple instances or high availability of any critical workload services and have appropriate business continuity and disaster recovery (BCDR) controls in place. These BCDR controls include regular backups and disaster recovery failover capabilities. To focus on the HCI infrastructure platform, these workload design aspects are intentionally excluded from this article.

For more information about guidelines and recommendations for the five pillars of the Azure Well-Architected Framework, see the Azure Local Well-Architected Framework service guide.

Article layout

  • Architecture: Architecture, Potential use cases, Scenario details, Platform resources, Platform-supporting resources, Deploy this scenario

  • Design decisions: Cluster design choices, Physical disk drives, Network design, Monitoring, Update management

  • Well-Architected Framework approach: Reliability, Security, Cost optimization, Operational excellence, Performance efficiency

Tip

The Azure Local template demonstrates how to use an Azure Resource Manager template (ARM template) and parameter file to deploy a switched multi-server deployment of Azure Local. Alternatively, the Bicep example demonstrates how to use a Bicep template to deploy an Azure Local instance and its prerequisite resources.

Architecture

Diagram that shows a multinode Azure Local instance reference architecture with dual Top-of-Rack (ToR) switches for external north-south connectivity.

For more information, see Related resources.

Potential use cases

Typical use cases for Azure Local include the ability to run high availability (HA) workloads in on-premises or edge locations, which provides a solution to address workload requirements. You can:

  • Provide a hybrid cloud solution that's deployed on-premises to address data sovereignty, regulation and compliance, or latency requirements.

  • Deploy and manage HA-virtualized or container-based edge workloads that are deployed in a single location or in multiple locations. This strategy enables business-critical applications and services to operate in a resilient, cost-effective, and scalable manner.

  • Lower the total cost of ownership (TCO) by using solutions that are certified by Microsoft, cloud-based deployment, centralized management, and monitoring and alerting.

  • Provide a centralized provisioning capability by using Azure and Azure Arc to deploy workloads across multiple locations consistently and securely. Tools like the Azure portal, Azure CLI, or infrastructure as code (IaC) templates use Kubernetes for containerization or traditional workload virtualization to drive automation and repeatability.

  • Adhere to strict security, compliance, and audit requirements. Azure Local is deployed with a hardened security posture configured by default, or secure-by-default. Azure Local incorporates certified hardware, Secure Boot, Trusted Platform Module (TPM), virtualization-based security (VBS), Credential Guard, and enforced Windows Defender Application Control policies. It also integrates with modern cloud-based security and threat-management services like Microsoft Defender for Cloud and Microsoft Sentinel.

Scenario details

The following sections provide more information about the scenarios and potential use cases for this reference architecture. These sections include a list of business benefits and example workload resource types that you can deploy on Azure Local.

Use Azure Arc with Azure Local

Azure Local directly integrates with Azure by using Azure Arc to lower the TCO and operational overhead. Azure Local is deployed and managed through Azure, which provides built-in integration of Azure Arc through deployment of the Azure Arc resource bridge component. This component is installed during the HCI cluster deployment process. Azure Local cluster nodes are enrolled with Azure Arc for servers as a prerequisite to initiate the cloud-based deployment of the cluster. During deployment, mandatory extensions are installed on each cluster node, such as Lifecycle Manager, Microsoft Edge Device Management, and Telemetry and Diagnostics. You can use Azure Monitor and Log Analytics to monitor the HCI cluster after deployment by enabling Insights for Azure Local. Feature updates for Azure Local are released periodically to enhance the customer experience. Updates are controlled and managed through Azure Update Manager.

You can deploy workload resources such as Azure Arc virtual machines (VMs), Azure Arc-enabled Azure Kubernetes Service (AKS), and Azure Virtual Desktop session hosts by using the Azure portal and selecting an Azure Local instance custom location as the target for the workload deployment. These components provide centralized administration, management, and support. If you have active Software Assurance on your existing Windows Server Datacenter core licenses, you can reduce costs further by applying Azure Hybrid Benefit to Azure Local, Windows Server VMs, and AKS clusters. This optimization helps manage costs effectively for these services.

Azure and Azure Arc integration extend the capabilities of Azure Local virtualized and containerized workloads. Azure Arc-connected workloads provide enhanced Azure consistency and automation for Azure Local deployments, like automating guest OS configuration with Azure Arc VM extensions or evaluating compliance with industry regulations or corporate standards through Azure Policy. You can activate Azure Policy through the Azure portal or IaC automation.

Take advantage of the Azure Local default security configuration

The Azure Local default security configuration provides a defense-in-depth strategy to simplify security and compliance costs. The deployment and management of IT services for retail, manufacturing, and remote office scenarios presents unique security and compliance challenges. Securing workloads against internal and external threats is crucial in environments that have limited IT support or a lack of dedicated datacenters. Azure Local has default security hardening and deep integration with Azure services to help you address these challenges.

Azure Local-certified hardware ensures built-in Secure Boot, Unified Extensible Firmware Interface (UEFI), and TPM support. Use these technologies in combination with VBS to help protect your security-sensitive workloads. You can use BitLocker Drive Encryption to encrypt boot disk volumes and Storage Spaces Direct volumes at rest. Server Message Block (SMB) encryption provides automatic encryption of traffic between servers in the cluster (on the storage network) and signing of SMB traffic between the cluster nodes and other systems. SMB encryption also helps prevent relay attacks and facilitates compliance with regulatory standards.

You can onboard Azure Local VMs in Defender for Cloud to activate cloud-based behavioral analytics, threat detection and remediation, alerting, and reporting. Manage Azure Local VMs in Azure Arc so that you can use Azure Policy to evaluate their compliance with industry regulations and corporate standards.

Components

This architecture consists of physical server hardware that you can use to deploy Azure Local instances in on-premises or edge locations. To enhance platform capabilities, Azure Local integrates with Azure Arc and other Azure services that provide supporting resources. Azure Local provides a resilient platform to deploy, manage, and operate user applications or business systems. Platform resources and services are described in the following sections.

Platform resources

The architecture requires the following mandatory resources and components:

  • Azure Local is a hyperconverged infrastructure (HCI) solution that's deployed on-premises or in edge locations by using physical server hardware and networking infrastructure. Azure Local provides a platform to deploy and manage virtualized workloads such as VMs, Kubernetes clusters, and other services that are enabled by Azure Arc. Azure Local instances can scale from a single-node deployment to a maximum of sixteen nodes using Validated Node, Integrated System, or Premier hardware categories that are provided by original equipment manufacturer (OEM) partners.

  • Azure Arc is a cloud-based service that extends the management model based on Azure Resource Manager to Azure Local and other non-Azure locations. Azure Arc uses Azure as the control and management plane to enable the management of various resources such as VMs, Kubernetes clusters, and containerized data and machine learning services.

  • Azure Key Vault is a cloud service that you can use to securely store and access secrets. A secret is anything that you want to tightly restrict access to, such as API keys, passwords, certificates, cryptographic keys, local admin credentials, and BitLocker recovery keys.

  • Cloud witness is a feature of Azure Storage that acts as a failover cluster quorum. Azure Local cluster nodes use this quorum for voting, which ensures high availability for the cluster. The storage account and witness configuration are created during the Azure Local cloud deployment process.

  • Update Manager is a unified service designed to manage and govern updates for Azure Local. You can use Update Manager to manage workloads that are deployed on Azure Local, including guest OS update compliance for Windows and Linux VMs. This unified approach streamlines patch management across Azure, on-premises environments, and other cloud platforms through a single dashboard.

Platform-supporting resources

The architecture includes the following optional supporting services to enhance the capabilities of the platform:

  • Monitor is a cloud-based service for collecting, analyzing, and acting on diagnostic logs and telemetry from your cloud and on-premises workloads. You can use Monitor to maximize the availability and performance of your applications and services through a comprehensive monitoring solution. Deploy Insights for Azure Local to simplify the creation of the Monitor data collection rule (DCR) and quickly enable monitoring of Azure Local instances.

  • Azure Policy is a service that evaluates Azure and on-premises resources. Azure Policy evaluates resources through integration with Azure Arc by comparing the properties of those resources to business rules, called policy definitions, to determine compliance. You can also use policy settings to apply VM Guest Configuration.

  • Defender for Cloud is a comprehensive infrastructure security management system. It enhances the security posture of your datacenters and delivers advanced threat protection for hybrid workloads, whether they reside in Azure or elsewhere, and across on-premises environments.

  • Azure Backup is a cloud-based service that provides a simple, secure, and cost-effective solution to back up your data and recover it from the Microsoft Cloud. Azure Backup Server is used to back up VMs that are deployed on Azure Local and store the backups in the Backup service.

  • Site Recovery is a disaster recovery service that provides BCDR capabilities by enabling business apps and workloads to fail over if there's a disaster or outage. Site Recovery manages replication and failover of workloads that run on physical servers and VMs between their primary site (on-premises) and a secondary location (Azure).

Cluster design choices

It's important to understand the workload performance and resiliency requirements when you design an Azure Local instance. These requirements include recovery time objective (RTO) and recovery point objective (RPO) times, compute (CPU), memory, and storage requirements for all workloads that are deployed on the Azure Local instance. Several characteristics of the workload affect the decision-making process and include:

  • Central processing unit (CPU) architecture capabilities, including hardware security technology features, the number of CPUs, the GHz frequency (speed) and the number of cores per CPU socket.

  • Graphics processing unit (GPU) requirements of the workload, such as for AI or machine learning, inferencing, or graphics rendering.

  • The memory per node, or the quantity of physical memory required to run the workload.

  • The number of physical nodes in the cluster, which can range from 1 to 16 nodes. The maximum number of nodes is three when you use the storage switchless network architecture.

    • To maintain compute resiliency, you need to reserve at least N+1 nodes worth of capacity in the cluster. This strategy enables node draining for updates or recovery from sudden outages like power outages or hardware failures.

    • For business-critical or mission-critical workloads, consider reserving N+2 nodes worth of capacity to increase resiliency. For example, if two nodes in the cluster are offline, the workload can remain online. This approach provides resiliency for scenarios in which a node that's running a workload goes offline during a planned update procedure and results in two nodes being offline simultaneously.

  • Storage resiliency, capacity, and performance requirements:

    • Resiliency: We recommend that you deploy three or more nodes to enable three-way mirroring, which provides three copies of the data, for the infrastructure and user volumes. Three-way mirroring provides the highest performance and reliability for storage.

    • Capacity: The total required usable storage after fault tolerance, or copies, is taken into consideration. This number is approximately 33% of the raw storage space of your capacity tier disks when you use three-way mirroring.

    • Performance: Input/output operations per second (IOPS) of the platform that determines the storage throughput capabilities for the workload when multiplied by the block size of the application.
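The capacity and performance bullets above reduce to simple arithmetic, which the following minimal Python sketch illustrates. The function names and the example node, drive, and IOPS figures are hypothetical and chosen only for illustration. The sketch also assumes that usable capacity is simply raw capacity divided by the number of data copies, and it ignores pool reserve capacity and metadata overhead, so treat the results as rough planning estimates rather than sizing tool output.

```python
def usable_capacity_tb(raw_capacity_tb: float, data_copies: int = 3) -> float:
    """Estimate usable capacity for a mirrored layout.

    Three-way mirroring keeps three copies of the data, so roughly one third
    (about 33%) of the raw capacity is usable. Assumption: reserve capacity
    and metadata overhead are ignored.
    """
    if raw_capacity_tb < 0:
        raise ValueError("raw_capacity_tb must not be negative")
    if data_copies < 1:
        raise ValueError("data_copies must be at least 1")
    return raw_capacity_tb / data_copies


def throughput_mib_per_s(iops: float, block_size_kib: float) -> float:
    """Estimate throughput as IOPS multiplied by the application block size."""
    return iops * block_size_kib / 1024


# Hypothetical cluster: 4 nodes, 6 capacity drives per node, 3.84 TB per drive.
raw_tb = 4 * 6 * 3.84
print(f"Raw capacity: {raw_tb:.2f} TB")
print(f"Usable with three-way mirroring: {usable_capacity_tb(raw_tb):.2f} TB")

# Hypothetical workload: 100,000 IOPS at an 8 KiB block size.
print(f"Throughput: {throughput_mib_per_s(100_000, 8):.2f} MiB/s")
```

The same IOPS figure yields very different throughput at different block sizes, which is why the application block size matters when you compare storage options.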

To design and plan an Azure Local deployment, we recommend that you use the Azure Local sizing tool and create a New Project for sizing your HCI clusters. Using the sizing tool requires that you understand your workload requirements. When considering the number and size of workload VMs that run on your cluster, make sure to consider factors such as the number of vCPUs, memory requirements, and necessary storage capacity for the VMs.

The sizing tool Preferences section guides you through questions that relate to the system type (Premier, Integrated System, or Validated Node) and CPU family options. It also helps you select your resiliency requirements for the cluster. Make sure to:

  • Reserve a minimum of N+1 nodes worth of capacity, or one node, across the cluster.

  • Reserve N+2 nodes worth of capacity across the cluster for extra resiliency. This option enables the system to withstand a node failure during an update or other unexpected event that affects two nodes simultaneously. It also ensures that there's enough capacity in the cluster for the workload to run on the remaining online nodes.
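As a rough illustration of the N+1 and N+2 reservations described above, the following Python sketch computes how much capacity remains for workloads after one or two nodes' worth of capacity is held back. The function names and the example figures are hypothetical. The sketch assumes identical nodes and a single resource dimension (memory in this example), and a real sizing exercise would repeat the check for CPU and storage.

```python
def schedulable_capacity(node_count: int, per_node_capacity: float,
                         reserved_nodes: int = 1) -> float:
    """Capacity left for workloads after reserving whole nodes' worth of capacity.

    reserved_nodes=1 models N+1 and reserved_nodes=2 models N+2.
    Assumption: all nodes have identical capacity.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if not 0 <= reserved_nodes < node_count:
        raise ValueError("reserved_nodes must be between 0 and node_count - 1")
    return (node_count - reserved_nodes) * per_node_capacity


def workload_fits(demand: float, node_count: int, per_node_capacity: float,
                  reserved_nodes: int = 1) -> bool:
    """Return True if the workload demand fits within the unreserved capacity."""
    return demand <= schedulable_capacity(node_count, per_node_capacity, reserved_nodes)


# Hypothetical cluster: 4 nodes with 512 GiB of memory each,
# and a workload that needs 1,200 GiB in total.
for reserve in (1, 2):
    available = schedulable_capacity(4, 512, reserve)
    fits = workload_fits(1200, 4, 512, reserve)
    print(f"N+{reserve}: {available:.0f} GiB schedulable, workload fits: {fits}")
```

In this hypothetical case the workload fits with an N+1 reservation but not with N+2, which shows how a stricter resiliency target can translate into an extra node or larger nodes.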

This scenario requires use of three-way mirroring for user volumes, which is the default for clusters that have three or more physical nodes.

The output from the Azure Local sizing tool is a list of recommended hardware solution SKUs that can provide the required workload capacity and platform resiliency requirements based on the input values in the Sizer Project. For more information about available OEM hardware partner solutions, see Azure Local Solutions Catalog. To help rightsize solution SKUs to meet your requirements, contact your preferred hardware solution provider or system integration (SI) partner.

Physical disk drives

Storage Spaces Direct supports multiple physical disk drive types that vary in performance and capacity. When you design an Azure Local instance, work with your chosen hardware OEM partner to determine the most appropriate physical disk drive types to meet the capacity and performance requirements of your workload. Examples include spinning hard disk drives (HDDs), solid-state drives (SSDs) and NVMe drives, which are often called flash drives, and persistent memory (PMem) storage, which is also known as storage-class memory (SCM).

The reliability of the platform depends on the performance of critical platform dependencies, such as physical disk types. Make sure to choose the right disk types for your requirements. Use all-flash storage solutions such as NVMe or SSD drives for workloads that have high-performance or low-latency requirements. These workloads include but aren't limited to highly transactional database technologies, production AKS clusters, or any mission-critical or business-critical workloads that have low-latency or high-throughput storage requirements. Use all-flash deployments to maximize storage performance. All-NVMe drive or all-SSD drive configurations, especially at a small scale, improve storage efficiency and maximize performance because no drives are used as a cache tier. For more information, see All-flash based storage.

For general purpose workloads, a hybrid storage configuration, like NVMe drives or SSDs for cache and HDDs for capacity, might provide more storage space. The tradeoff is that spinning disks have lower performance if your workload exceeds the cache working set, and HDDs have a lower mean time between failures (MTBF) compared to NVMe and SSD drives.

The performance of your cluster storage is influenced by the physical disk drive type, which varies based on the performance characteristics of each drive type and the caching mechanism that you choose. The physical disk drive type is an integral part of any Storage Spaces Direct design and configuration. Depending on the Azure Local workload requirements and budget constraints, you can choose to maximize performance, maximize capacity, or implement a mixed-drive type configuration that balances performance and capacity.

Storage Spaces Direct provides a built-in, persistent, real-time, read and write, server-side cache that maximizes storage performance. The cache should be sized and configured to accommodate the working set of your applications and workloads. Storage Spaces Direct virtual disks, or volumes, are used in combination with cluster shared volume (CSV) in-memory read cache to improve Hyper-V performance, especially for unbuffered I/O access to workload virtual hard disk (VHD) or virtual hard disk v2 (VHDX) files.

Tip

For high-performance or latency-sensitive workloads, we recommend that you use an all-flash storage (all NVMe or all SSD) configuration and a cluster size of three or more physical nodes. Deploying this design with the default storage configuration settings uses three-way mirroring for the infrastructure and user volumes. This deployment strategy provides the highest performance and resiliency. When you use an all-NVMe or all-SSD configuration, you benefit from the full usable storage capacity of each flash drive. Unlike hybrid or mixed NVMe + SSD setups, there's no capacity reserved for caching. This ensures optimal utilization of your storage resources. For more information about how to balance performance and capacity to meet your workload requirements, see Plan volumes - When performance matters most.

Network design

Network design is the overall arrangement of components within the network's physical infrastructure and logical configurations. You can use the same physical network interface card (NIC) ports for all combinations of management, compute, and storage network intents. Using the same NIC ports for all intent-related purposes is called a fully converged networking configuration.

Although a fully converged networking configuration is supported, the optimal configuration for performance and reliability is for the storage intent to use dedicated network adapter ports. Therefore, this baseline architecture provides example guidance for how to deploy a multinode Azure Local instance by using the storage switched network architecture with two network adapter ports that are converged for management and compute intents and two dedicated network adapter ports for the storage intent. For more information, see Network considerations for cloud deployments of Azure Local.

This architecture requires two or more physical nodes and up to a maximum of 16 nodes in scale. Each node requires four network adapter ports that are connected to two Top-of-Rack (ToR) switches. The two ToR switches should be interconnected through multi-chassis link aggregation group (MLAG) links. The two network adapter ports that are used for the storage intent traffic must support Remote Direct Memory Access (RDMA). These ports require a minimum link speed of 10 Gbps, but we recommend a speed of 25 Gbps or higher. The two network adapter ports used for the management and compute intents are converged using switch embedded teaming (SET) technology. SET technology provides link redundancy and load-balancing capabilities. These ports require a minimum link speed of 1 Gbps, but we recommend a speed of 10 Gbps or higher.
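The per-node adapter requirements in the preceding paragraph can be expressed as a small pre-deployment check. The following Python sketch is only an illustration: the `AdapterPort` class, the intent labels, and the `validate_node_ports` function are hypothetical names and aren't part of Network ATC or any Azure Local tooling. It checks the minimum link speeds and the RDMA requirement stated above, not the recommended speeds.

```python
from dataclasses import dataclass

# Minimum link speeds (Gbps) for each intent group in this baseline design.
MIN_LINK_SPEED_GBPS = {"storage": 10, "management_compute": 1}
PORTS_PER_INTENT = 2  # two storage ports and two management/compute ports per node


@dataclass(frozen=True)
class AdapterPort:
    """Hypothetical description of one physical network adapter port."""
    name: str
    intent: str  # "storage" or "management_compute"
    link_speed_gbps: float
    rdma_capable: bool


def validate_node_ports(ports: list[AdapterPort]) -> list[str]:
    """Return a list of problems found; an empty list means the node passes."""
    problems = []
    for intent, min_speed in MIN_LINK_SPEED_GBPS.items():
        intent_ports = [p for p in ports if p.intent == intent]
        if len(intent_ports) != PORTS_PER_INTENT:
            problems.append(
                f"{intent}: expected {PORTS_PER_INTENT} ports, found {len(intent_ports)}"
            )
        for port in intent_ports:
            if port.link_speed_gbps < min_speed:
                problems.append(f"{port.name}: link speed is below {min_speed} Gbps")
            if intent == "storage" and not port.rdma_capable:
                problems.append(f"{port.name}: storage ports must support RDMA")
    return problems


# Hypothetical node that follows the recommended speeds (10 Gbps and 25 Gbps).
node_ports = [
    AdapterPort("pNIC1", "management_compute", 10, rdma_capable=False),
    AdapterPort("pNIC2", "management_compute", 10, rdma_capable=False),
    AdapterPort("pNIC3", "storage", 25, rdma_capable=True),
    AdapterPort("pNIC4", "storage", 25, rdma_capable=True),
]
print(validate_node_ports(node_ports) or "Node meets the baseline port requirements")
```

A check like this could run against an inventory export before deployment so that an undersized or non-RDMA adapter is caught before the cloud deployment starts.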

Physical network topology

The following physical network topology shows the actual physical connections between nodes and networking components.

Diagram that shows the physical networking topology for a multinode Azure Local instance that uses a storage switched architecture with dual ToR switches.

You need the following components when you design a multinode storage switched Azure Local deployment that uses this baseline architecture:

  • Dual ToR switches:

    • Dual ToR network switches are required for network resiliency and for the ability to service the switches, or apply firmware updates to them, without incurring downtime. This strategy prevents a single point of failure (SPoF).

    • The dual ToR switches are used for the storage, or east-west, traffic. These switches use two dedicated Ethernet ports that have specific storage virtual local area networks (VLANs) and priority flow control (PFC) traffic classes that are defined to provide lossless RDMA communication.

    • These switches connect to the nodes through Ethernet cables.

  • Two or more physical nodes and up to a maximum of 16 nodes:

    • Each node is a physical server that runs Azure Stack HCI OS.

    • Each node requires four network adapter ports in total: two RDMA-capable ports for storage and two network adapter ports for management and compute traffic.

    • Storage uses the two dedicated RDMA-capable network adapter ports that connect with one path to each of the two ToR switches. This approach provides link-path redundancy and dedicated prioritized bandwidth for SMB Direct storage traffic.

    • Management and compute traffic uses two network adapter ports that provide one path to each of the two ToR switches for link-path redundancy.

  • External connectivity:

    • Dual ToR switches connect to the external network, such as your internal corporate LAN, to provide access to the required outbound URLs by using your edge border network device. This device can be a firewall or router. These switches route traffic that goes in and out of the Azure Local instance, or north-south traffic.

    • External north-south traffic connectivity supports the cluster management intent and compute intents. This is achieved by using two switch ports and two network adapter ports per node that are converged through switch embedded teaming (SET) and a virtual switch within Hyper-V to ensure resiliency. These components work to provide external connectivity for Azure Arc VMs and other workload resources deployed within the logical networks that are created in Resource Manager using Azure portal, CLI, or IaC templates.

Logical network topology

The logical network topology shows an overview of how network data flows between devices, regardless of their physical connections.

Diagram that shows the logical networking topology for a multinode Azure Local instance using the storage switched architecture with dual ToR switches.

The logical setup for this multinode storage switched baseline architecture for Azure Local is summarized as follows:

  • Dual ToR switches:

    • Before you deploy the cluster, the two ToR network switches need to be configured with the required VLAN IDs, maximum transmission unit settings, and datacenter bridging configuration for the management, compute, and storage ports. For more information, see Physical network requirements for Azure Local, or ask your switch hardware vendor or SI partner for assistance.

  • Azure Local uses the Network ATC approach to apply network automation and intent-based network configuration.

    • Network ATC is designed to ensure optimal networking configuration and traffic flow by using network traffic intents. Network ATC defines which physical network adapter ports are used for the different network traffic intents (or types), such as for the cluster management, workload compute, and cluster storage intents.

    • Intent-based policies simplify the network configuration requirements by automating the node network configuration based on parameter inputs that are specified as part of the Azure Local cloud deployment process.

  • External communication:

    • When the nodes or workload need to communicate externally by accessing the corporate LAN, internet, or another service, they route using the dual ToR switches. This process is outlined in the previous physical network topology section.

    • When the two ToR switches act as Layer 3 devices, they handle routing and provide connectivity beyond the cluster to the edge border device, such as your firewall or router.

    • Management network intent uses the converged SET team virtual interface, which enables the cluster management IP address and control plane resources to communicate externally.

    • For the compute network intent, you can create one or more logical networks in Azure with the specific VLAN IDs for your environment. The workload resources, such as VMs, use these logical networks to access the physical network. The logical networks use the two physical network adapter ports that are converged by using a SET team for the compute and management intents.

  • Storage traffic:

    • The physical nodes communicate with each other by using two dedicated network adapter ports that are connected to the ToR switches to provide high bandwidth and resiliency for storage traffic.

    • The SMB1 and SMB2 storage ports connect to two separate nonroutable (or Layer 2) networks. Each network has a specific VLAN ID configured that must match the switch port configuration on the ToR switches. The default storage VLAN IDs are 711 and 712.

    • There's no default gateway configured on the two storage intent network adapter ports within the Azure Stack HCI OS.

    • Each node can access Storage Spaces Direct capabilities of the cluster, such as remote physical disks that are used in the storage pool, virtual disks, and volumes. Access to these capabilities is facilitated through the SMB Direct RDMA protocol over the two dedicated storage network adapter ports that are available in each node. SMB Multichannel is used for resiliency.

    • This configuration provides sufficient data transfer speed for storage-related operations, such as maintaining consistent copies of data for mirrored volumes.
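The storage traffic rules listed above (two dedicated ports, one separate VLAN per port that matches the switch configuration, and no default gateway) can be captured in a short consistency check. The following Python sketch is a hedged illustration only: the `StoragePort` class and `check_storage_ports` function are hypothetical names, and the sketch assumes that the ToR switches carry the default storage VLAN IDs 711 and 712 unless you pass different values.

```python
from dataclasses import dataclass
from typing import Optional

# Default storage VLAN IDs named in this article; override them if your switches differ.
DEFAULT_STORAGE_VLAN_IDS = (711, 712)


@dataclass(frozen=True)
class StoragePort:
    """Hypothetical description of one storage intent adapter port on a node."""
    name: str
    vlan_id: int
    default_gateway: Optional[str] = None  # storage networks are nonroutable


def check_storage_ports(ports, switch_vlan_ids=DEFAULT_STORAGE_VLAN_IDS):
    """Return a list of problems; an empty list means the ports follow the rules."""
    problems = []
    if len(ports) != 2:
        problems.append(f"expected 2 storage ports, found {len(ports)}")
    vlan_ids = [port.vlan_id for port in ports]
    if len(set(vlan_ids)) != len(vlan_ids):
        problems.append("storage ports must use separate VLANs")
    for port in ports:
        if port.vlan_id not in switch_vlan_ids:
            problems.append(f"{port.name}: VLAN {port.vlan_id} isn't configured on the ToR switches")
        if port.default_gateway is not None:
            problems.append(f"{port.name}: storage ports must not have a default gateway")
    return problems


ports = [StoragePort("SMB1", 711), StoragePort("SMB2", 712)]
print(check_storage_ports(ports) or "Storage port configuration is consistent")
```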

Network switch requirements

Your Ethernet switches must meet the different specifications required by Azure Local and set by the Institute of Electrical and Electronics Engineers Standards Association (IEEE SA). For example, for multinode storage switched deployments, the storage network is used for RDMA via RoCE v2 or iWARP. This process requires IEEE 802.1Qbb PFC to ensure lossless communication for the storage traffic class. Your ToR switches must provide support for IEEE 802.1Q for VLANs and IEEE 802.1AB for the Link Layer Discovery Protocol.

If you plan to use existing network switches for an Azure Local deployment, review the list of mandatory IEEE standards and specifications that the network switches and configuration must provide. When purchasing new network switches, review the list of hardware vendor-certified switch models that support Azure Local network requirements.

IP address requirements

In a multinode storage switched deployment, the number of IP addresses needed increases with the addition of each physical node, up to a maximum of 16 nodes within a single cluster. For example, to deploy a two-node storage switched configuration of Azure Local, the cluster infrastructure requires a minimum of 11 IP addresses to be allocated. More IP addresses are required if you use microsegmentation or software-defined networking. For more information, see Review two-node storage reference pattern IP address requirements for Azure Local.

When you design and plan IP address requirements for Azure Local, remember to account for additional IP addresses or network ranges needed for your workload beyond the ones that are required for the Azure Local instance and infrastructure components. If you plan to deploy AKS on Azure Local, see AKS enabled by Azure Arc network requirements.

Monitoring

To enhance monitoring and alerting, enable Monitor Insights on Azure Local. Insights can scale to monitor and manage multiple on-premises clusters by using an Azure consistent experience. Insights uses cluster performance counters and event log channels to monitor key Azure Local features. Logs are collected by the DCR that's configured through Monitor and Log Analytics.

Insights for Azure Local is built using Monitor and Log Analytics, which ensures an always up-to-date, scalable solution that's highly customizable. Insights provides access to default workbooks with basic metrics, along with specialized workbooks created for monitoring key features of Azure Local. These components provide a near real-time monitoring solution and enable the creation of graphs, customization of visualizations through aggregation and filtering, and configuration of custom resource health alert rules.

Update management

Azure Local instances and the deployed workload resources, such as Azure Arc VMs, need to be updated and patched regularly. By regularly applying updates, you ensure that your organization maintains a strong security posture, and you improve the overall reliability and supportability of your estate. We recommend that you use automatic and periodic manual assessments for early discovery and application of security patches and OS updates.

Infrastructure updates

Azure Local is continuously updated to improve the customer experience and add new features and functionality. This process is managed through release trains, which deliver new baseline builds quarterly. Baseline builds are applied to Azure Local instances to keep them up to date. In addition to regular baseline build updates, Azure Local is updated with monthly OS security and reliability updates.

Update Manager is an Azure service that you can use to apply, view, and manage updates for Azure Local. This service provides a mechanism to view all Azure Local instances across your entire infrastructure and edge locations by using the Azure portal to provide a centralized management experience. For more information, see the following resources:

It's important to check for new driver and firmware updates regularly, such as every three to six months. If you use a Premier solution category version for your Azure Local hardware, the Solution Builder Extension package updates are integrated with Update Manager to provide a simplified update experience. If you use validated nodes or an integrated system category, there might be a requirement to download and run an OEM-specific update package that contains the firmware and driver updates for your hardware. To determine how updates are supplied for your hardware, contact your hardware OEM or SI partner.

Workload guest OS patching

You can enroll Azure Arc VMs that are deployed on Azure Local in Azure Update Manager (AUM) to provide a unified patch management experience that uses the same mechanism used to update the Azure Local cluster physical nodes. You can use AUM to create guest maintenance configurations. These configurations control settings such as the reboot setting (for example, Reboot if necessary), the schedule (dates, times, and repeat options), and either a dynamic (subscription) or static list of the Azure Arc VMs for the scope. These settings control when OS security patches are installed inside your workload VM's guest OS.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Reliability

Reliability ensures your application can meet the commitments you make to your customers. For more information, see Design review checklist for Reliability.

Identify the potential failure points

Every architecture is susceptible to failures. Failure mode analysis helps you anticipate failures and prepare mitigations. The following table describes four examples of potential points of failure in this architecture:

| Component | Risk | Likelihood | Effect/mitigation/note | Outage |
|---|---|---|---|---|
| Azure Local instance outage | Power, network, hardware, or software failure | Medium | To prevent a prolonged application outage caused by the failure of an Azure Local instance for business or mission-critical use cases, your workload should be architected using HA and DR principles. For example, you can use industry-standard workload data replication technologies to maintain multiple copies of persistent state data that are deployed using multiple Azure Arc VMs or AKS instances that are deployed on separate Azure Local instances and in separate physical locations. | Potential outage |
| Azure Local single physical node outage | Power, hardware, or software failure | Medium | To prevent a prolonged application outage caused by the failure of a single Azure Local machine, your Azure Local instance should have multiple physical nodes. Your workload capacity requirements during the cluster design phase determine the number of nodes. We recommend that you have three or more nodes. We also recommend that you use three-way mirroring, which is the default storage resiliency mode for clusters with three or more nodes. To prevent a SPoF and increase workload resiliency, deploy multiple instances of your workload by using two or more Azure Arc VMs or container pods that run in multiple AKS worker nodes. If a single node fails, the Azure Arc VMs and workload / application services are restarted on the remaining online physical nodes in the cluster. | Potential outage |
| Azure Arc VM or AKS worker node (workload) | Misconfiguration | Medium | Application users are unable to sign in or access the application. Misconfigurations should be caught during deployment. If these errors happen during a configuration update, the DevOps team must roll back changes. You can redeploy the VM if necessary. Redeployment takes less than 10 minutes but can take longer according to the type of deployment. | Potential outage |
| Connectivity to Azure | Network outage | Medium | The cluster needs to reach the Azure control plane regularly for billing, management, and monitoring capabilities. If your cluster loses connectivity to Azure, it operates in a degraded state. For example, it wouldn't be possible to deploy new Azure Arc VMs or AKS clusters if your cluster loses connectivity to Azure. Existing workloads that are running on the HCI cluster continue to run, but you should restore the connection within 48 to 72 hours to ensure uninterrupted operation. | None |

For more information, see Recommendations for performing failure mode analysis.

Reliability targets

This section describes an example scenario. A fictitious customer called Contoso Manufacturing uses this reference architecture to deploy Azure Local. They want to address their requirements and deploy and manage workloads on-premises. Contoso Manufacturing has an internal service-level objective (SLO) target of 99.8% that business and application stakeholders agree on for their services.

  • An SLO of 99.8% uptime, or availability, results in the following periods of allowed downtime, or unavailability, for the applications that are deployed using Azure Arc VMs that run on Azure Local:

    • Weekly: 20 minutes and 10 seconds

    • Monthly: 1 hour, 26 minutes, and 56 seconds

    • Quarterly: 4 hours, 20 minutes, and 49 seconds

    • Yearly: 17 hours, 23 minutes, and 16 seconds

  • To help meet the SLO targets, Contoso Manufacturing implements the principle of least privilege (PoLP) to restrict the number of Azure Local instance administrators to a small group of trusted and qualified individuals. This approach helps prevent downtime due to any inadvertent or accidental actions performed on production resources. Furthermore, the security event logs for on-premises Active Directory Domain Services (AD DS) domain controllers are monitored to detect and report any user account group membership changes, known as add and remove actions, for the Azure Local instance administrators group by using a security information event management (SIEM) solution. Monitoring increases reliability and improves the security of the solution.

    For more information, see Recommendations for identity and access management.

  • Strict change control procedures are in place for Contoso Manufacturing's production systems. This process requires that all changes are tested and validated in a representative test environment before implementation in production. All changes submitted to the weekly change advisory board process must include a detailed implementation plan (or link to source code), risk level score, a comprehensive rollback plan, post-release testing and verification, and clear success criteria for a change to be reviewed or approved.

    For more information, see Recommendations for safe deployment practices.

  • Monthly security patches and quarterly baseline updates are applied to production Azure Local instances only after they're validated in the preproduction environment. Update Manager and the cluster-aware updating feature automate the process of using VM live migration to minimize downtime for business-critical workloads during the monthly servicing operations. Contoso Manufacturing standard operating procedures require that security, reliability, or baseline build updates are applied to all production systems within four weeks of their release date. Without this policy, production systems are perpetually unable to stay current with monthly OS and security updates. Out-of-date systems negatively affect platform reliability and security.

    For more information, see Recommendations for establishing a security baseline.

  • Contoso Manufacturing implements daily, weekly, and monthly backups. The schedule retains the last six daily backups (Monday to Saturday), the last three weekly backups (each Sunday), and three monthly backups. The backup from the Sunday of week 4 is retained to become the month 1, month 2, and month 3 backups by using a rolling, calendar-based schedule that's documented and auditable. This approach meets Contoso Manufacturing requirements for an adequate balance between the number of data recovery points available and reducing costs for the offsite or cloud backup storage service.

    For more information, see Recommendations for designing a disaster recovery strategy.

  • Data backup and recovery processes are tested for each business system every six months. This strategy provides assurance that BCDR processes are valid and that the business is protected if a datacenter disaster or cyber incident occurs.

    For more information, see Recommendations for designing a reliability testing strategy.

  • The operational processes and procedures described previously in the article, and the recommendations in the Well-Architected Framework service guide for Azure Local, enable Contoso Manufacturing to meet their 99.8% SLO target and effectively scale and manage Azure Local and workload deployments across multiple manufacturing sites that are distributed around the world.

    For more information, see Recommendations for defining reliability targets.
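The downtime allowances listed for the 99.8% SLO can be derived from the SLO percentage and the length of the measurement period. The following Python sketch shows that arithmetic. The function name `allowed_downtime` is hypothetical and used only for illustration. The sketch verifies only the weekly figure, because monthly, quarterly, and yearly results depend on how many days you assume each period contains.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime budget that an availability SLO leaves in a period.

    Hypothetical helper for illustration only.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    unavailable_fraction = (100 - slo_percent) / 100
    # Round to whole seconds so floating-point noise doesn't affect the result.
    seconds = round(period.total_seconds() * unavailable_fraction)
    return timedelta(seconds=seconds)


# Weekly budget for the 99.8% SLO used in the Contoso Manufacturing example.
weekly_budget = allowed_downtime(99.8, timedelta(weeks=1))
print(weekly_budget)  # 0:20:10
```

Passing the period as a `timedelta` keeps the assumption about period length explicit, so you can apply your own organization's definition of a month, quarter, or year.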

Redundancy

Consider a workload that you deploy on a single Azure Local instance as a locally redundant deployment. The cluster provides high availability at the platform level, but you must deploy the cluster in a single rack. For business-critical or mission-critical use cases, we recommend that you deploy multiple instances of a workload or service across two or more separate Azure Local instances, ideally in separate physical locations.

Use industry-standard, high-availability patterns for workloads that provide active/passive replication, synchronous replication, or asynchronous replication such as SQL Server Always On. You can also use an external network load balancing (NLB) technology that routes user requests across the multiple workload instances that run on Azure Local instances that you deploy in separate physical locations. Consider using a partner external NLB device. Or you can evaluate the load balancing options that support traffic routing for hybrid and on-premises services, such as an Azure Application Gateway instance that uses Azure ExpressRoute or a VPN tunnel to connect to an on-premises service.

For more information, see Recommendations for designing for redundancy.

Security

Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Design review checklist for Security.

Security considerations include:

  • A secure foundation for the Azure Local platform: Azure Local is a secure-by-default product that uses validated hardware components with a TPM, UEFI, and Secure Boot to build a secure foundation for the Azure Local platform and workload security. When deployed with the default security settings, Azure Local has Windows Defender Application Control, Credential Guard, and BitLocker enabled. To simplify delegating permissions by using the PoLP, use Azure Local built-in role-based access control (RBAC) roles such as Azure Local Administrator for platform administrators and Azure Local VM Contributor or Azure Local VM Reader for workload operators.

  • Default security settings: Azure Local security default applies default security settings for your Azure Local instance during deployment and enables drift control to keep the nodes in a known good state. You can use the security default settings to manage cluster security, drift control, and secured core server settings on your cluster.

  • Security event logs: Azure Local syslog forwarding integrates with security monitoring solutions by retrieving relevant security event logs to aggregate and store events for retention in your own SIEM platform.

  • Protection from threats and vulnerabilities: Defender for Cloud protects your Azure Local instance from various threats and vulnerabilities. This service helps improve the security posture of your Azure Local environment and can protect against existing and evolving threats.

  • Threat detection and remediation: Microsoft Advanced Threat Analytics detects and remediates threats, such as those targeting AD DS, that provide authentication services to Azure Local instance nodes and their Windows Server VM workloads.

  • Network isolation: Isolate networks if needed. For example, you can provision multiple logical networks that use separate VLANs and network address ranges. When you use this approach, ensure that the management network can reach each logical network and VLAN so that Azure Local instance nodes can communicate with the VLAN networks through the ToR switches or gateways. This configuration is required for management of the workload, such as allowing infrastructure management agents to communicate with the workload guest OS.

    For more information, see Recommendations for building a segmentation strategy.

Cost Optimization

Cost Optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Design review checklist for Cost Optimization.

Cost optimization considerations include:

  • Cloud-style billing model for licensing: Azure Local pricing follows a monthly subscription billing model with a flat rate per physical processor core in an Azure Local instance. Extra usage charges apply if you use other Azure services. If you own on-premises core licenses for Windows Server Datacenter edition with active Software Assurance, you might choose to exchange these licenses to activate the Azure Local instance and Windows Server VM subscriptions.

  • Automatic VM Guest patching for Azure Arc VMs: This feature helps reduce the overhead of manual patching and the associated maintenance costs. Not only does this action help make the system more secure, but it also optimizes resource allocation and contributes to overall cost efficiency.

  • Cost monitoring consolidation: To consolidate monitoring costs, use Insights for Azure Local and patch using Update Manager for Azure Local. Insights uses Monitor to provide rich metrics and alerting capabilities. The lifecycle manager component of Azure Local integrates with Update Manager to simplify the task of keeping your clusters up to date by consolidating update workflows for various components into a single experience. Use Monitor and Update Manager to optimize resource allocation and contribute to overall cost efficiency.

    For more information, see Recommendations for optimizing personnel time.

  • Initial workload capacity and growth: When you plan your Azure Local deployment, consider your initial workload capacity, resiliency requirements, and future growth considerations. Consider whether a two-node or three-node storage switchless architecture could reduce costs, for example by removing the need to procure storage-class network switches. Procuring extra storage-class network switches can be an expensive part of new Azure Local instance deployments. Instead, you can use existing switches for management and compute networks, which simplifies the infrastructure. If your workload capacity and resiliency needs don't scale beyond a three-node configuration, consider whether you can use existing switches for the management and compute networks, and use the three-node storage switchless architecture to deploy Azure Local.

    For more information, see Recommendations for optimizing component costs.

Tip

You can save on costs with Azure Hybrid Benefit if you have Windows Server Datacenter licenses with active Software Assurance. For more information, see Azure Hybrid Benefit for Azure Local.

Operational Excellence

Operational Excellence covers the operations processes that deploy an application and keep it running in production. For more information, see Design review checklist for Operational Excellence.

Operational excellence considerations include:

Performance Efficiency

Performance Efficiency is the ability of your workload to meet the demands placed on it by users in an efficient manner. For more information, see Design review checklist for Performance Efficiency.

Performance efficiency considerations include:

  • Workload storage performance: Consider using the DiskSpd tool to test workload storage performance capabilities of an Azure Local instance. You can use the VMFleet tool to generate load and measure the performance of a storage subsystem. Evaluate whether you should use VMFleet to measure storage subsystem performance.

    • We recommend that you establish a baseline for your Azure Local instance's performance before you deploy production workloads. DiskSpd uses various command-line parameters that enable administrators to test the storage performance of the cluster. The main function of DiskSpd is to issue read and write operations and output performance metrics, such as latency, throughput, and IOPS.

      For more information, see Recommendations for performance testing.

  • Workload storage resiliency: Consider the benefits of storage resiliency, usage (or capacity) efficiency, and performance. Planning for Azure Local volumes includes identifying the optimal balance between resiliency, usage efficiency, and performance. You might find it difficult to optimize this balance because maximizing one of these characteristics typically has a negative effect on one or more of the other characteristics. Increasing resiliency reduces the usable capacity. As a result, the performance might vary, depending on the resiliency type selected. When resiliency and performance are the priority, and when you use three or more nodes, the default storage configuration employs three-way mirroring for both infrastructure and user volumes.

    For more information, see Recommendations for capacity planning.

  • Network performance optimization: Consider network performance optimization. As part of your design, be sure to include projected network traffic bandwidth allocation when determining your optimal network hardware configuration.

    • To optimize compute performance in Azure Local, you can use GPU acceleration. GPU acceleration is beneficial for high-performance AI or machine learning workloads that involve data insights or inferencing. These workloads require deployment at edge locations due to considerations like data gravity or security requirements. In a hybrid deployment or on-premises deployment, it's important to take your workload performance requirements, including GPUs, into consideration. This approach helps you select the right services when you design and procure your Azure Local instances.

      For more information, see Recommendations for selecting the right services.
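The trade-off between storage resiliency and usable capacity described in the workload storage resiliency consideration can be illustrated with simple arithmetic. The following Python sketch assumes only that a two-way mirror keeps two copies of the data and a three-way mirror keeps three. The function name, the dictionary, and the 96-TB raw capacity figure are hypothetical. The sketch ignores reserve capacity, cache drives, and file system overhead, so treat the result as an upper bound rather than a sizing recommendation.

```python
# Number of data copies that each mirror resiliency type keeps.
MIRROR_COPIES = {
    "two-way mirror": 2,
    "three-way mirror": 3,
}


def usable_capacity_tb(raw_capacity_tb: float, resiliency: str) -> float:
    """Estimate usable capacity by dividing raw capacity by the copy count.

    Hypothetical helper for illustration only. It ignores reserve capacity
    and other overhead.
    """
    if resiliency not in MIRROR_COPIES:
        raise ValueError(f"unknown resiliency type: {resiliency}")
    return raw_capacity_tb / MIRROR_COPIES[resiliency]


# Illustrative raw capacity, not a figure from this article.
raw_tb = 96.0
for resiliency in MIRROR_COPIES:
    print(resiliency, usable_capacity_tb(raw_tb, resiliency))
```

With the same raw capacity, three-way mirroring yields one third of the raw capacity as usable space, compared with one half for two-way mirroring. This is the capacity cost of the extra fault tolerance.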

Deploy this scenario

The following section provides an example list of the high-level tasks or typical workflow used to deploy Azure Local, including prerequisite tasks and considerations. This workflow list is intended as an example guide only. It isn't an exhaustive list of all required actions, which can vary based on organizational, geographic, or project-specific requirements.

Scenario: there is a project or use case requirement to deploy a hybrid cloud solution in an on-premises or edge location to provide local compute for data processing capabilities and a desire to use Azure-consistent management and billing experiences. More details are described in the potential use cases section of this article. The remaining steps assume that Azure Local is the chosen infrastructure platform solution for the project.

  1. Gather workload and use case requirements from relevant stakeholders. This strategy enables the project to confirm that the features and capabilities of Azure Local meet the workload scale, performance, and functionality requirements. This review process should include understanding the workload scale, or size, and required features such as Azure Arc VMs, AKS, Azure Virtual Desktop, Azure Arc-enabled data services, or the Azure Arc-enabled Machine Learning service. The workload RTO and RPO (reliability) values and other nonfunctional requirements (performance/load scalability) should be documented as part of this requirements gathering step.

  2. Use the Azure Local sizing tool to create a new project that models the workload type and scale. This project includes the size and number of VMs and their storage requirements. These details are entered together with choices for the system type, preferred CPU family, and your resiliency requirements for high availability and storage fault tolerance, as explained in the previous Cluster design choices section.

  3. Review the Azure Local Sizer output for the recommended hardware partner solution. This output includes details of the recommended physical server hardware (make and model), number of physical nodes, and the specifications for the CPU, memory, and storage configuration of each physical node that are required to deploy and run your workloads.

  4. Contact the hardware OEM or SI partner to further qualify the suitability of the recommended hardware version versus your workload requirements. If available, use OEM-specific sizing tools to determine OEM-specific hardware sizing requirements for the intended workloads. This step typically includes discussions with the hardware OEM or SI partner for the commercial aspects of the solution. These aspects include quotations, availability of the hardware, lead times, and any professional or value-add services that the partner provides to help accelerate your project or business outcomes.

  5. Deploy two ToR switches for network integration. For high availability solutions, HCI clusters require two ToR switches to be deployed. Each physical node requires four NICs, two of which must be RDMA-capable, which provides two links from each node to the two ToR switches. Two NICs, one connected to each switch, are converged for outbound north-south connectivity for the compute and management networks. The other two RDMA-capable NICs are dedicated to the storage east-west traffic. If you plan to use existing network switches, ensure that the make and model of your switches are on the approved list of network switches supported by Azure Local.

  6. Work with the hardware OEM or SI partner to arrange delivery of the hardware. The SI partner or your employees are then required to integrate the hardware into your on-premises datacenter or edge location, such as racking and stacking the hardware, physical network, and power supply unit cabling for the physical nodes.

  7. Perform the Azure Local instance deployment. Depending on your chosen solution version (Premier solution, Integrated system, or Validated Nodes), either the hardware partner, SI partner, or your employees can deploy the Azure Local software. This step starts by onboarding the Azure Stack HCI OS on each physical node into Azure Arc-enabled servers, then starting the Azure Local cloud deployment process. Customers and partners can raise a support request directly with Microsoft in the Azure portal by selecting the Support + Troubleshooting icon or by contacting their hardware OEM or SI partner, depending on the nature of the request and the hardware solution category.

    Tip

    The Azure Stack HCI OS, version 23H2 system reference implementation demonstrates how to deploy a switched multiserver deployment of Azure Local by using an ARM template and parameter file. Alternatively, the Bicep example demonstrates how to use a Bicep template to deploy an Azure Local instance, including its prerequisite resources.

  8. Deploy highly available workloads on Azure Local using Azure portal, CLI, or ARM + Azure Arc templates for automation. Use the custom location resource of the new HCI cluster as the target region when you deploy workload resources such as Azure Arc VMs, AKS, Azure Virtual Desktop session hosts, or other Azure Arc-enabled services that you can enable through AKS extensions and containerization on Azure Local.

  9. Install monthly updates to improve the security and reliability of the platform. To keep your Azure Local instances up to date, it's important to install Microsoft software updates and hardware OEM driver and firmware updates. Update Manager applies the updates and provides a centralized and scalable solution to install updates across a single cluster or multiple clusters. Check with your hardware OEM partner to determine the process for installing hardware driver and firmware updates because this process can vary depending on your chosen hardware solution category type (Premier solution, Integrated system, or Validated Nodes). For more information, see Infrastructure updates.
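The per-node network requirements in the ToR switch step (four NICs, two of which are RDMA-capable and dedicated to storage, with links to both ToR switches) can be expressed as a small pre-deployment check. The following Python sketch is illustrative only. The `Nic` class, the `validate_node_nics` function, the intent names, and the switch names are all hypothetical and aren't part of any Azure Local tooling. The sketch also assumes that the two storage NICs are split across the two switches in the same way as the management and compute NICs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    """Hypothetical description of one physical network adapter port."""
    name: str
    switch: str        # ToR switch that the port is cabled to
    intent: str        # "management_compute" or "storage"
    rdma_capable: bool


def validate_node_nics(nics, switches=("tor-1", "tor-2")):
    """Return a list of problems found in a node's NIC layout (empty if none)."""
    problems = []
    if len(nics) != 4:
        problems.append(f"expected 4 NICs, found {len(nics)}")
    for intent in ("management_compute", "storage"):
        group = [nic for nic in nics if nic.intent == intent]
        if len(group) != 2:
            problems.append(f"{intent}: expected 2 NICs, found {len(group)}")
        # Assumption: each pair has one link to each ToR switch.
        if {nic.switch for nic in group} != set(switches):
            problems.append(f"{intent}: NICs must connect to both ToR switches")
    if any(not nic.rdma_capable for nic in nics if nic.intent == "storage"):
        problems.append("storage: both NICs must be RDMA-capable")
    return problems


node = [
    Nic("nic1", "tor-1", "management_compute", rdma_capable=False),
    Nic("nic2", "tor-2", "management_compute", rdma_capable=False),
    Nic("nic3", "tor-1", "storage", rdma_capable=True),
    Nic("nic4", "tor-2", "storage", rdma_capable=True),
]
print(validate_node_nics(node))  # []
```

Returning a list of problems instead of raising on the first one lets you review every cabling or hardware issue for a node in a single pass before the deployment starts.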

Next steps

Product documentation:

Product documentation for details on specific Azure services:

Microsoft Learn modules: