Resource limits, VM sizes, and regions for AKS enabled by Azure Arc
Applies to: AKS on Azure Stack HCI 22H2, AKS on Windows Server
This article provides information about tested configurations, resource limits, VM sizes, and regions for Azure Kubernetes Service (AKS) enabled by Azure Arc. The tests used the latest release of AKS on Azure Local.
Maximum specifications
AKS enabled by Arc deployments have been validated with the following configurations, including the specified maximums. Keep in mind that exceeding these maximums is at your own risk and might lead to unexpected behaviors and failures. This article provides some guidance on how to avoid common configuration mistakes and can help you create a larger configuration. If in doubt, contact your local Microsoft office for assistance or submit a question in the Azure Local community.
Resource | Maximum |
---|---|
Physical servers per cluster | 8 |
Total number of VMs | 200 |
The recommended limits were tested with the default virtual machine (VM) sizes, shown in the following table:
System role | VM size |
---|---|
AKS-Host | Standard_A4_v2 |
Target Cluster Control Plane node | Default |
Target Cluster HAProxy load balancer (optional) | Standard_A4_v2 |
Target Cluster Linux worker node | Standard_K8S3_v1 |
Target Cluster Windows worker node | Standard_K8S3_v1 |
The hardware configuration of each physical node in the Azure Local cluster is as follows:
- Chassis: Dell PowerEdge R650 Server or similar.
- RAM: RDIMM, 3200 MT/s, Dual Rank, total of 256 GB.
- CPU: Two (2) Intel Xeon Silver 4316 2.3G, 20C/40T, 10.4 GT/s, 30M Cache, Turbo, HT (150 W) DDR4-2666.
- Disk: 8x HDDs (2 TB or larger) and 2x 1.6 TB NVMe drives to support Storage Spaces Direct (S2D) storage configurations.
- Network: Four (4) 100-Gbit NICs (Mellanox or Intel).
Microsoft engineering tested AKS enabled by Arc using the above configuration on single-node, 2-node, 4-node, and 8-node Windows failover clusters. If you need to exceed the tested configuration, see Scaling AKS on Azure Local.
Important
When you upgrade a deployment of AKS, extra resources are temporarily consumed. Each virtual machine is upgraded in a rolling update flow, starting with the control plane nodes. For each node in the Kubernetes cluster, a new node VM is created. The old node VM is cordoned so that no new workloads are scheduled to it, and is then drained of all containers so they're redistributed to other VMs in the system. The drained VM is removed from the cluster, shut down, and replaced by the new, updated VM. This process repeats until all VMs are updated.
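For reference, the following is a minimal sketch of triggering such a rolling upgrade with the AksHci PowerShell module. The cluster name and version string are placeholders, and you should confirm the available versions and parameters in your environment before running it.

```powershell
# List the Kubernetes versions available to upgrade to in this deployment.
Get-AksHciKubernetesVersion

# Trigger a rolling upgrade of the target cluster named "mycluster" (placeholder).
# Each node VM is replaced one at a time, as described above, so make sure the
# underlying Azure Local cluster has spare capacity for the extra VM created
# during each step.
Update-AksHciCluster -name mycluster -kubernetesVersion v1.25.6
```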
Available VM sizes
The following VM sizes for control plane nodes, Linux worker nodes, and Windows worker nodes are available for AKS on Azure Local. VM sizes such as Standard_K8S2_v1 and Standard_K8S_v1 are supported for testing and deployments with low resource requirements, but use them with care and test thoroughly, because they can fail unexpectedly under out-of-memory conditions.
VM Size | CPU | Memory (GB) | GPU type | GPU count |
---|---|---|---|---|
Default | 4 | 4 | ||
Standard_A2_v2 | 2 | 4 | ||
Standard_A4_v2 | 4 | 8 | ||
Standard_D2s_v3 | 2 | 8 | ||
Standard_D4s_v3 | 4 | 16 | ||
Standard_D8s_v3 | 8 | 32 | ||
Standard_D16s_v3 | 16 | 64 | ||
Standard_D32s_v3 | 32 | 128 | ||
Standard_DS2_v2 | 2 | 7 | ||
Standard_DS3_v2 | 2 | 14 | ||
Standard_DS4_v2 | 8 | 28 | ||
Standard_DS5_v2 | 16 | 56 | ||
Standard_DS13_v2 | 8 | 56 | ||
Standard_K8S_v1 | 4 | 2 | ||
Standard_K8S2_v1 | 2 | 2 | ||
Standard_K8S3_v1 | 4 | 6 | ||
Standard_NK6 | 6 | 12 | Tesla T4 | 1 |
Standard_NK12 | 12 | 24 | Tesla T4 | 2 |
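To check which of these VM sizes your own deployment exposes, you can query the AksHci PowerShell module. A minimal sketch, assuming the module is installed on a node of the Azure Local cluster:

```powershell
# List the VM sizes available to this AKS deployment for control plane,
# load balancer, and worker nodes.
Get-AksHciVmSize
```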
Supported Azure regions for Azure registration
AKS enabled by Arc is supported in the following Azure regions:
- Australia East
- East US
- Southeast Asia
- West Europe
Scaling AKS on Azure Local
Scaling an AKS deployment on Azure Local requires planning ahead: know your workloads and target cluster utilization, and consider the hardware resources in your underlying infrastructure, such as total CPU cores, total memory, storage, and IP addresses.
The following examples assume that only AKS-based workloads are deployed on the underlying infrastructure. Deploying non-AKS workloads such as stand-alone or clustered virtual machines, or database servers, reduces the resources available to AKS, which you must take into account.
Before you start, consider the following points to determine your maximum scale and the number of target clusters you need to support:
- The number of IP addresses you have available for pods in a target cluster.
- The number of IP addresses available for Kubernetes services in a target cluster (see the network settings sketch after this list).
- The number of pods you need to run your workloads.
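In AKS on Azure Local, the externally reachable service and API server addresses are drawn from the VIP pool of the virtual network you register with AKS, so the size of that pool caps the number of load-balanced services. The following is a minimal sketch using New-AksHciNetworkSetting from the AksHci PowerShell module; the network name, switch name, and all address ranges are placeholder values that you must replace with ranges from your own environment.

```powershell
# Define the virtual network AKS uses for node IP addresses and for the VIP pool
# that serves Kubernetes LoadBalancer services and API server endpoints.
# All names and address ranges below are examples only.
$vnet = New-AksHciNetworkSetting -name "aks-vnet" `
    -vSwitchName "ExternalSwitch" `
    -gateway "172.16.0.1" `
    -dnsServers "172.16.0.1" `
    -ipAddressPrefix "172.16.0.0/16" `
    -k8sNodeIpPoolStart "172.16.10.1" `
    -k8sNodeIpPoolEnd "172.16.10.254" `
    -vipPoolStart "172.16.255.1" `
    -vipPoolEnd "172.16.255.254"

# Pass $vnet to Set-AksHciConfig -vnet when you configure the AKS host.
```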
To determine the size of your Azure Kubernetes Service host VM, you need to know the total number of target clusters and nodes it manages. Specifically:
- The number of target clusters in the final deployed system.
- The number of nodes, including control plane, load balancer, and worker nodes across all target clusters.
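As an illustration only (the counts below are arbitrary placeholders, not recommendations), the total number of VMs the AKS host has to manage adds up quickly:

```powershell
# Example only: 4 target clusters, each with 3 control plane nodes,
# 1 load balancer node, and 10 worker nodes.
$targetClusters = 4
$controlPlane   = 3
$loadBalancer   = 1
$workers        = 10

$vmsPerCluster   = $controlPlane + $loadBalancer + $workers
$totalManagedVms = $targetClusters * $vmsPerCluster + 1   # +1 for the AKS host VM itself

"VMs per target cluster: $vmsPerCluster"
"Total VMs on the Azure Local cluster: $totalManagedVms"
```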
Note
A single AKS host can only manage target clusters on the same platform.
Also, to determine the size of your target cluster control plane node, you need to know the number of pods, containers, and worker nodes you're planning to deploy in each target cluster.
Default settings that currently can't be changed in AKS
Some default configurations and settings can't currently be changed by the customer during or after deployment. These settings can limit the scale of a given target cluster.
- The number of IP addresses for Kubernetes pods in a target cluster is limited to the subnet `10.244.0.0/16`. Each worker node receives a /24 from this range, so at most 254 pod IP addresses are available per worker node across all Kubernetes namespaces. To see the pod CIDR assigned to each worker node in your cluster, run the following command in PowerShell:
```powershell
kubectl get nodes -o json | findstr 'hostname podCIDR'
```

Example output:

```output
"kubernetes.io/hostname": "moc-lip6cotjt0f",
"f:podCIDR": {},
"f:podCIDRs": {
"f:kubernetes.io/hostname": {},
"podCIDR": "10.244.2.0/24",
"podCIDRs": [
"kubernetes.io/hostname": "moc-lmb6zqozk4m",
"f:podCIDR": {},
"f:podCIDRs": {
"f:kubernetes.io/hostname": {},
"podCIDR": "10.244.4.0/24",
"podCIDRs": [
"kubernetes.io/hostname": "moc-wgwhhijxtfv",
"f:podCIDR": {},
"f:podCIDRs": {
"f:kubernetes.io/hostname": {},
"podCIDR": "10.244.5.0/24",
"podCIDRs": [
```
In the example, you can see three nodes with three CIDRs, each capable of hosting 254 pods. The Kubernetes scale documentation recommends that you don't exceed 110 pods per node for performance reasons. See Considerations for large clusters.
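To confirm the per-node pod capacity that the kubelet actually advertises, rather than inferring it from the CIDR, you can query the nodes directly. A quick check from PowerShell, assuming kubectl is connected to the target cluster:

```powershell
# Show each node's name and the maximum number of pods the kubelet will schedule on it.
kubectl get nodes -o custom-columns="NAME:.metadata.name,MAXPODS:.status.capacity.pods"
```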
Other considerations:
- IP addresses for Kubernetes services, other than those drawn from the VIP pool you allocated, come from the `10.96.0.0/16` address pool. The system consumes one of these addresses for the Kubernetes API server.
- The size of the AKS host VM can only be set at installation, when you run Set-AksHciConfig for the first time. You can't change it later.
- You can only set the size of target cluster control plane and load balancer VMs at the time of target cluster creation.
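Because these sizes are fixed after creation, choose them up front. The following is a minimal sketch assuming the AksHci PowerShell module; the cluster name, pool name, node counts, and VM sizes are placeholders, and the exact parameter set can vary between module versions, so check Get-Help New-AksHciCluster before using it.

```powershell
# Create a target cluster and set the control plane and load balancer VM sizes
# at creation time, because they can't be changed afterwards.
New-AksHciCluster -name "mycluster" `
    -controlPlaneNodeCount 3 `
    -controlPlaneVmSize "Standard_D4s_v3" `
    -loadBalancerVmSize "Standard_A4_v2" `
    -nodePoolName "linuxpool1" `
    -nodeCount 3 `
    -nodeVmSize "Standard_K8S3_v1" `
    -osType Linux
```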
Scale example
The following scaling example is based on these general assumptions/use cases:
- You want to be able to completely tolerate the loss of one physical node in the Azure Local cluster.
- You want to support upgrading target clusters to newer versions.
- You want to allow for high availability of the target cluster control plane nodes and load balancer nodes.
- You want to reserve a part of the overall Azure Local capacity for these cases.
Suggestions
For optimal performance, set aside at least 15 percent of cluster capacity (one node out of eight is 100/8 = 12.5 percent), so that all resources from one physical node can be redistributed to the other seven nodes. This reserve also gives you headroom for upgrades and other AKS day-two operations.
If you want to grow beyond the 200-VM limit on a maximally sized eight-node Azure Local cluster, increase the size of the AKS host VM. Doubling its size roughly doubles the number of VMs it can manage. An 8-node Azure Local cluster can host up to 8,192 (8 x 1,024) VMs based on the recommended Azure Local resource limits documented in the maximum supported hardware specifications. If you reserve approximately 30 percent of capacity for upgrades and failover, that leaves a theoretical limit of 5,734 VMs across all nodes.
- Standard_D32s_v3 for the AKS host (32 cores, 128 GB) can support a maximum of 1,600 nodes.
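To make the arithmetic explicit, here is a small worked calculation; the 1,024 VMs-per-server figure is the Azure Local limit referenced above, and the 30 percent reserve is the suggestion from this section.

```powershell
# Theoretical VM capacity of an 8-node Azure Local cluster.
$nodes      = 8
$vmsPerNode = 1024
$reserve    = 0.30          # capacity held back for upgrades and failover

$rawCapacity    = $nodes * $vmsPerNode                          # 8 x 1024 = 8192
$usableCapacity = [math]::Floor($rawCapacity * (1 - $reserve))  # about 5734

"Raw VM capacity:    $rawCapacity"
"Usable VM capacity: $usableCapacity"
```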
Note
Since this configuration has not been tested extensively, it requires a careful approach and validation.
At a scale like this, you might want to split the environment into at least 8 target clusters with 200 worker nodes each.
To run 200 worker nodes in one target cluster, you can use the default control plane and load balancer sizes. Depending on the number of pods per node, you might need to go up at least one size on the control plane and use Standard_D8s_v3.
Depending on the number of Kubernetes services hosted in each target cluster, you might have to increase the size of the load balancer VM as well at target cluster creation to ensure that services can be reached with high performance and traffic is routed accordingly.
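If you split a large target cluster into several node pools, the following is a hedged sketch of adding a pool with the AksHci module; the cluster name, pool name, count, and VM size are placeholders, and you should verify the parameters with Get-Help New-AksHciNodePool.

```powershell
# Add a Linux node pool of 50 worker nodes to an existing target cluster.
# Repeat with additional pools to reach the total worker node count you need,
# keeping the overall physical capacity and reserve in mind.
New-AksHciNodePool -clusterName "mycluster" `
    -name "linuxpool2" `
    -count 50 `
    -osType Linux `
    -vmSize "Standard_D8s_v3"
```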
The deployment of AKS enabled by Arc distributes the worker nodes for each node pool in a target cluster across the available Azure Local nodes using the Azure Local placement logic.
Important
The node placement is not preserved during platform and AKS upgrades and will change over time. A failed physical node will also impact the distribution of virtual machines across the remaining cluster nodes.
Note
Do not run more than four target cluster creations at the same time if the physical cluster is already 50 percent full, as that can lead to temporary resource contention. When scaling up target cluster node pools by large numbers, take available physical resources into account, because AKS doesn't verify resource availability for creation or scaling processes that run in parallel. Always ensure enough reserve to allow for upgrades and failover. Especially in very large environments, these operations, when run in parallel, can lead to rapid resource exhaustion.
If in doubt, contact your local Microsoft office for assistance or post in the Azure Local community forum.