Azure Well-Architected Framework perspective on Azure Kubernetes Service (AKS)
Azure Kubernetes Service (AKS) is a managed Kubernetes service that you can use to deploy and manage containerized applications. Similar to other managed services, AKS offloads much of the operational overhead to Azure while providing high availability, scalability, and portability features to the workload.
This article assumes that, as an architect, you reviewed the compute decision tree and chose AKS as the compute for your workload. The guidance in this article provides architectural recommendations that are mapped to the principles of the Azure Well-Architected Framework pillars.
Important: How to use this guide
Each section has a design checklist that presents architectural areas of concern along with design strategies localized to the technology scope.
Also included are recommendations for the technology capabilities that can help materialize those strategies. The recommendations don't represent an exhaustive list of all configurations that are available for AKS and its dependencies. Instead, they list the key recommendations mapped to the design perspectives. Use the recommendations to build your proof-of-concept or to optimize your existing environments.
Foundational architecture that demonstrates the key recommendations: AKS baseline architecture.
Technology scope
This review focuses on the interrelated decisions for the following Azure resources:
- AKS
When you discuss the Well-Architected Framework pillars' best practices for AKS, it's important to distinguish between cluster and workload. Cluster best practices are a shared responsibility between the cluster admin and their resource provider, while workload best practices are the domain of a developer. This article has considerations and recommendations for each of these roles.
Note
The following pillars include a design checklist and a list of recommendations that indicate whether each choice is applicable to cluster architecture, workload architecture, or both.
Reliability
The purpose of the Reliability pillar is to provide continued functionality by building enough resilience and the ability to recover quickly from failures.
The Reliability design principles provide a high-level design strategy that applies to individual components, system flows, and the system as a whole.
Design checklist
Start your design strategy based on the design review checklist for Reliability. Determine its relevance to your business requirements while keeping in mind the features of AKS and its dependencies. Extend the strategy to include more approaches as needed.
(Cluster) Build redundancy to improve resiliency. Use availability zones for your AKS clusters as part of your resiliency strategy to increase availability when you deploy to a single region. Many Azure regions provide availability zones. The zones are close enough to have low-latency connections among them, but far enough apart to reduce the likelihood that local outages will affect more than one zone.
For critical workloads, deploy multiple clusters across different Azure regions. By geographically distributing AKS clusters, you can achieve higher resiliency and minimize the effects of regional failures. A multiregion strategy helps maximize availability and provide business continuity. Internet-facing workloads should use Azure Front Door or Azure Traffic Manager to route traffic globally across AKS clusters. For more information, see Multiregion strategy.
Plan the IP address space to ensure that your cluster can reliably scale and handle failover traffic in multiple-cluster topologies.
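As a minimal sketch of a zone-redundant deployment, assuming a hypothetical resource group `rg-aks` and cluster name `aks-prod` in a region that supports availability zones, the `--zones` flag spreads the default node pool across three zones:

```bash
# Create an AKS cluster whose default node pool spans three availability zones.
# rg-aks, aks-prod, and the region are placeholders; adjust for your environment.
az aks create \
  --resource-group rg-aks \
  --name aks-prod \
  --location eastus2 \
  --node-count 3 \
  --zones 1 2 3 \
  --enable-managed-identity
```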
(Cluster and workload) Monitor reliability and overall health indicators of the cluster and workloads. Collect logs and metrics to monitor workload health, identify performance and reliability trends, and troubleshoot problems. Review Best practices for monitoring Kubernetes with Azure Monitor and the Well-Architected Health modeling for workloads guide for help designing the reliability and health monitoring solution for your AKS solution.
Ensure that workloads are built to support horizontal scaling and report application readiness and health.
(Cluster and workload) Host application pods in user node pools. By isolating system pods from application workloads, you help ensure that essential AKS services are unaffected by the resource demands or potential problems caused by workloads that run in user node pools.
Ensure that your workload runs on user node pools and choose the right VM SKU size. At a minimum, include two nodes for user node pools and three nodes for the system node pool.
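The following commands sketch one way to add a dedicated user node pool and reserve the system node pool for system pods. The pool names, VM size, and node counts are illustrative placeholders.

```bash
# Add a dedicated user node pool for application pods.
az aks nodepool add \
  --resource-group rg-aks \
  --cluster-name aks-prod \
  --name userpool \
  --mode User \
  --node-count 2 \
  --node-vm-size Standard_D4s_v5

# Taint the system node pool so that only system pods are scheduled on it.
az aks nodepool update \
  --resource-group rg-aks \
  --cluster-name aks-prod \
  --name nodepool1 \
  --node-taints CriticalAddonsOnly=true:NoSchedule
```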
(Cluster and workload) Factor the AKS uptime service-level agreement (SLA) into your availability and recovery targets. To define the reliability and recovery targets for your cluster and workload, follow the guidance in Recommendations for defining reliability targets. Then formulate a design that meets those targets.
Recommendations
Recommendation | Benefit |
---|---|
(Cluster and workload) Control pod scheduling by using node selectors and affinity. In AKS, the Kubernetes scheduler can logically isolate workloads by hardware in the node. Unlike tolerations, pods that don't have a matching node selector can be scheduled on labeled nodes, but priority is given to pods that define the matching node selector. | Node affinity results in more flexibility, which allows you to define what happens if the pod can't be matched with a node. |
(Cluster) Choose the appropriate network plugin based on network requirements and cluster sizing. Different network plugins provide varying levels of functionality. Azure Container Networking Interface (Azure CNI) is required for specific scenarios, such as Windows-based node pools, some networking requirements, and Kubernetes network policies. For more information, see Kubenet versus Azure CNI. | The right network plugin can help ensure better compatibility and performance. |
(Cluster and workload) Use the AKS uptime SLA for production-grade clusters. | The workload can support higher availability targets because of the higher availability guarantees of the Kubernetes API server endpoint for AKS clusters. |
(Cluster) Use availability zones to maximize resilience within an Azure region by distributing AKS agent nodes across physically separate datacenters. If colocality requirements exist, use a regular virtual machine scale set-based AKS deployment into a single zone or use proximity placement groups to minimize internode latency. | By spreading node pools across multiple zones, nodes in one node pool continue to run even if another zone goes down. |
(Cluster and workload) Define pod resource requests and limits in application deployment manifests. Enforce those limits by using Azure Policy. | Container CPU and memory resource limits are necessary to prevent resource exhaustion in your Kubernetes cluster. |
(Cluster and workload) Keep the system node pool isolated from application workloads. System node pools require a virtual machine (VM) SKU of at least 2 vCPUs and 4 GB of memory. We recommend 4 vCPUs or more. For more information, see System and user node pools. | The system node pool hosts critical system pods that are essential for the control plane of your cluster. By isolating these system pods from application workloads, you help ensure that the essential services are unaffected by the resource demands or potential problems caused by a workload. |
(Cluster and workload) Separate applications to dedicated node pools based on specific requirements. Avoid large numbers of node pools to reduce management overhead. | Applications might share the same configuration and need GPU-enabled VMs, CPU- or memory-optimized VMs, or the ability to scale to zero. By dedicating node pools to specific applications, you can help ensure that each application gets the resources it needs without overprovisioning or underutilizing resources. |
(Cluster) Use a NAT gateway for clusters that run workloads that make many concurrent outbound connections. | Azure NAT Gateway supports reliable egress traffic at scale and helps you avoid reliability problems caused by Azure Load Balancer limitations with high concurrent outbound traffic. |
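As a minimal sketch of the pod scheduling and resource limit recommendations in the preceding table, the following deployment pins pods to a labeled user node pool and declares requests and limits. The names, image, label, and sizes are illustrative.

```bash
# Apply a deployment that targets a specific node pool via nodeSelector and
# declares CPU and memory requests and limits for each container.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-api
  template:
    metadata:
      labels:
        app: sample-api
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: userpool
      containers:
        - name: sample-api
          image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
EOF
```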
Security
The purpose of the Security pillar is to provide confidentiality, integrity, and availability guarantees to the workload.
The Security design principles provide a high-level design strategy for achieving those goals by applying approaches to the technical design of AKS.
Design checklist
Start your design strategy based on the design review checklist for Security and identify vulnerabilities and controls to improve the security posture. Familiarize yourself with AKS security concepts and evaluate the security hardening recommendations based on the CIS Kubernetes benchmark. Extend the strategy to include more approaches as needed.
(Cluster) Integrate with Microsoft Entra ID for identity and access management. Centralize identity management for your cluster by using Microsoft Entra ID. Any change in user account or group status is automatically updated in access to the AKS cluster. Establish identity as the primary security perimeter. The developers and application owners of your Kubernetes cluster need access to different resources.
Use Kubernetes role-based access control (RBAC) with Microsoft Entra ID for least privilege access. Protect configuration and secrets by minimizing the allocation of administrator privileges.
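As a sketch, assuming an existing cluster and a placeholder admin group object ID, the following command enables Microsoft Entra ID integration with Azure RBAC for Kubernetes authorization and disables local accounts:

```bash
# Enable Microsoft Entra ID integration and Azure RBAC, and disable local
# accounts so that all cluster access uses Entra-based identities.
# The group object ID is a placeholder.
az aks update \
  --resource-group rg-aks \
  --name aks-prod \
  --enable-aad \
  --enable-azure-rbac \
  --aad-admin-group-object-ids 00000000-0000-0000-0000-000000000000 \
  --disable-local-accounts
```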
(Cluster) Integrate with security monitoring and security information and event management tools. Use Microsoft Defender for Containers with Microsoft Sentinel to detect and quickly respond to threats across your clusters and the workloads that run on them. Enable AKS connector for Microsoft Sentinel to stream your AKS diagnostics logs into Microsoft Sentinel.
(Cluster and workload) Implement segmentation and network controls. To prevent data exfiltration, ensure that only authorized and safe traffic is allowed, and contain the blast radius of a security breach.
Consider using a private AKS cluster to help ensure that cluster-management traffic to your API server remains on your private network. Or use the API server allowlist for public clusters.
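The following commands sketch both options, with placeholder names and example IP ranges:

```bash
# Option 1: create a private cluster so that API server traffic stays on the
# private network.
az aks create \
  --resource-group rg-aks \
  --name aks-private \
  --enable-private-cluster \
  --enable-managed-identity

# Option 2: for a public cluster, restrict API server access to known sources,
# such as build agents and egress IP addresses (values are placeholders).
az aks update \
  --resource-group rg-aks \
  --name aks-prod \
  --api-server-authorized-ip-ranges 203.0.113.0/24,198.51.100.10/32
```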
(Workload) Use a web application firewall (WAF) to scan incoming traffic for potential attacks. A WAF can detect and mitigate threats in real time to help block malicious traffic before it reaches your applications. It provides robust protection against common web-based attacks, such as SQL injection, cross-site scripting, and other Open Web Application Security Project vulnerabilities. Some load balancers, such as Azure Application Gateway or Azure Front Door, have an integrated WAF.
(Workload) Maintain a hardened software supply chain for your workload. Ensure that your continuous integration and continuous delivery pipeline is hardened with container-aware scanning.
(Cluster and workload) Implement extra protection for specialized secure workloads. If your cluster needs to run a sensitive workload, you might need to deploy a private cluster. Here are some examples:
- Payment Card Industry Data Security Standard (PCI-DSS 3.2.1): AKS regulated cluster for PCI-DSS 3.2.1
- DoD Impact Level 5 (IL5) support and requirements with AKS: Azure Government IL5 isolation requirements.
Recommendations
Recommendation | Benefit |
---|---|
(Cluster) Use managed identities on the cluster. | You can avoid the overhead associated with managing and rotating service principals. |
(Workload) Use Microsoft Entra Workload ID with AKS to access Microsoft Entra protected resources, such as Azure Key Vault and Microsoft Graph, from your workload. | Workload identities protect access to Azure resources by using Microsoft Entra ID RBAC, without requiring you to manage credentials directly in your code. |
(Cluster) Use Microsoft Entra ID to authenticate with Azure Container Registry from AKS. | By using Microsoft Entra ID, AKS can authenticate with Container Registry without the use of imagePullSecrets secrets. |
(Cluster) Secure network traffic to your API server by using a private AKS cluster if the workload requires higher levels of segmentation. | By default, network traffic between your node pools and the API server travels the Microsoft backbone network. By using a private cluster, you can help ensure that network traffic to your API server remains on the private network only. |
(Cluster) For public AKS clusters, use API server-authorized IP address ranges. Include sources like the public IP addresses of your deployment build agents, operations management, and node pools' egress point, such as Azure Firewall. | When you use public clusters, you can significantly reduce the attack surface of your AKS cluster by limiting the traffic that can reach the API server of your clusters. |
(Cluster) Protect the API server by using Microsoft Entra ID RBAC. Disable local accounts to enforce all cluster access by using Microsoft Entra ID-based identities. | Securing access to the Kubernetes API server is one of the most important things that you can do to secure your cluster. Integrate Kubernetes RBAC with Microsoft Entra ID to control access to the API server. |
(Cluster) Use Azure network policies or Calico. | By using policies, you can secure and control network traffic between pods in a cluster. Calico provides a richer set of capabilities, including policy ordering and priority, deny rules, and more flexible match rules. |
(Cluster) Secure clusters and pods by using Azure Policy. | Azure Policy can help apply at-scale enforcement and safeguards on your clusters in a centralized, consistent manner. It can also control what functions pods are granted and detect whether anything is running against company policy. |
(Cluster) Secure container access to resources. Limit access to actions that containers can perform. Provide the least number of permissions, and avoid the use of root or privileged escalation. For Linux-based containers, see Secure container access to resources by using built-in Linux security features. | By restricting permissions and avoiding the use of root or privileged escalation, you help reduce the risk of security breaches. You can help ensure that, even if a container is compromised, the potential damage is minimized. |
(Cluster) Control cluster egress traffic by ensuring that your cluster's outbound traffic passes through a network security point such as Azure Firewall or an HTTP proxy. | By routing outbound traffic through Azure Firewall or an HTTP proxy, you can help enforce security policies that prevent unauthorized access and data exfiltration. This approach also simplifies the administration of security policies and makes it easier to enforce consistent rules across your entire AKS cluster. |
(Cluster) Use the open-source Microsoft Entra Workload ID and Secrets Store CSI Driver with Key Vault. | These features help you protect and rotate secrets, certificates, and connection strings in Key Vault by using strong encryption. They provide an access audit log and keep core secrets out of the deployment pipeline. |
(Cluster) Use Microsoft Defender for Containers. | Microsoft Defender for Containers helps you monitor and maintain the security of your clusters, containers, and their applications. |
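The following commands, with placeholder names, sketch two of the identity recommendations in the preceding table: authenticating to Azure Container Registry through the cluster's managed identity, and enabling Microsoft Entra Workload ID for pods.

```bash
# Attach a container registry so that image pulls use the cluster's managed
# identity instead of imagePullSecrets. myregistry is a placeholder.
az aks update \
  --resource-group rg-aks \
  --name aks-prod \
  --attach-acr myregistry

# Enable the OIDC issuer and Microsoft Entra Workload ID so that pods can
# access Entra-protected resources without credentials embedded in code.
az aks update \
  --resource-group rg-aks \
  --name aks-prod \
  --enable-oidc-issuer \
  --enable-workload-identity
```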
Cost Optimization
Cost Optimization focuses on detecting spend patterns, prioritizing investments in critical areas, and optimizing in others to meet the organization's budget while meeting business requirements.
The Cost Optimization design principles provide a high-level design strategy for achieving those goals and making tradeoffs as necessary in the technical design related to AKS and its environment.
Design checklist
Start your design strategy based on the design review checklist for Cost Optimization for investments. Fine-tune the design so that the workload is aligned with the budget that's allocated for the workload. Your design should use the right Azure capabilities, monitor investments, and find opportunities to optimize over time.
(Cluster) Include the pricing tiers for AKS in your cost model. To estimate costs, use the Azure pricing calculator and test different configuration and payment plans in the calculator.
(Cluster) Get the best rates for your workload. Use the appropriate VM SKU for each node pool because it directly affects the cost to run your workloads. Choosing a high-performance VM without proper utilization can lead to wasteful spending. Selecting a less powerful VM can cause performance problems and increased downtime.
If you properly planned for capacity and your workload is predictable and will exist for an extended period of time, sign up for Azure Reservations or a savings plan to reduce your resource costs.
Choose Azure Spot Virtual Machines to use unutilized Azure capacity with significant discounts. These discounts can reach up to 90% of pay-as-you-go prices. If Azure needs capacity back, the Azure infrastructure evicts the Spot nodes.
If you run AKS on-premises or at the edge, you can also use Azure Hybrid Benefit to reduce costs when you run containerized applications in those scenarios.
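As a sketch of a Spot node pool for interruption-tolerant workloads, with placeholder names and the max price capped at the pay-as-you-go rate (`-1`):

```bash
# Add a Spot node pool. Spot nodes are tainted by default so that only pods
# with a matching toleration are scheduled on them.
az aks nodepool add \
  --resource-group rg-aks \
  --cluster-name aks-prod \
  --name spotpool \
  --mode User \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-count 2
```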
(Cluster and workload) Optimize workload components costs. Choose the most cost-effective region for your workload. Evaluate the cost, latency, and compliance requirements to ensure that you run your workload cost-effectively and that it doesn't affect your customers or create extra networking charges. The region where you deploy your workload in Azure can significantly affect the cost. Because of many factors, the cost of resources varies for each region in Azure.
Maintain small and optimized images to help reduce costs, because new nodes need to download those images. Build images in a way that allows the container to start as soon as possible. User request failures or timeouts while the application starts can lead to overprovisioning.
Review the Cost Optimization recommendations in Best practices for monitoring Kubernetes with Azure Monitor to determine the best monitoring strategy for your workloads. Analyze performance metrics, starting with CPU, memory, storage, and network, to identify cost optimization opportunities by cluster, nodes, and namespace.
(Cluster and workload) Optimize workload scaling costs. Consider alternative vertical and horizontal scaling configurations to reduce scaling costs while still meeting all workload requirements. Use autoscalers to scale in when workloads are less active.
(Cluster and workload) Collect and analyze cost data. The foundation of enabling cost optimization is the spread of a cost-saving culture. Develop a cost-efficiency mindset that includes collaboration between finance, operations, and engineering teams to drive alignment on cost-saving goals and bring transparency to cloud costs.
Recommendations
Recommendation | Benefit |
---|---|
(Cluster and workload) Align AKS SKU selection and managed disk size with workload requirements. | Matching your selection to your workload demands helps ensure that you don't pay for unneeded resources. |
(Cluster) Choose the right VM instance types for your AKS node pools. To determine the right VM instance types, consider workload characteristics, resource requirements, and availability needs. | Selecting the right VM instance type is crucial because it directly affects the cost to run applications on AKS. Choosing a high-performance instance without proper utilization can lead to wasteful spending. Choosing a less powerful instance can lead to performance problems and increased downtime. |
(Cluster) Choose VMs based on the more power-efficient Arm architecture. AKS supports creating Arm64 node pools and a mix of Intel and Arm architecture nodes within a cluster. | The Arm64 architecture provides a better price-to-performance ratio because of its lower power utilization and efficient compute performance. These capabilities can bring better performance at a lower cost. |
(Cluster) Enable the cluster autoscaler to automatically reduce the number of agent nodes in response to excess resource capacity. | Automatically scaling down the number of nodes in your AKS cluster lets you run an efficient cluster when demand is low and scale up when demand increases. |
(Cluster) Enable node autoprovisioning to automate VM SKU selection. | Node autoprovisioning simplifies the SKU selection process and decides, based on pending pod resource requirements, the optimal VM configuration to run workloads in the most efficient and cost-effective manner. |
(Workload) Use HorizontalPodAutoscaler to adjust the number of pods in a deployment depending on CPU utilization or other metrics. | Automatically scaling down the number of pods when demand is low and scaling out when demand increases results in a more cost-effective operation of your workload. |
(Workload) Use VerticalPodAutoscaler (preview) to rightsize your pods and dynamically set requests and limits based on historic usage. | By setting resource requests and limits on containers for each workload, VerticalPodAutoscaler frees up CPU and memory for other pods and helps ensure effective utilization of your AKS clusters. |
(Cluster) Configure the AKS cost analysis add-on. | The cost analysis cluster extension enables you to obtain granular insight into costs that are associated with various Kubernetes resources in your clusters or namespaces. |
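As a sketch of the autoscaling and cost visibility recommendations in the preceding table, with placeholder names and illustrative bounds:

```bash
# Let the cluster autoscaler remove idle nodes from a user node pool.
az aks nodepool update \
  --resource-group rg-aks \
  --cluster-name aks-prod \
  --name userpool \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10

# Enable the AKS cost analysis add-on. This assumes the cluster is on the
# Standard or Premium pricing tier.
az aks update \
  --resource-group rg-aks \
  --name aks-prod \
  --enable-cost-analysis
```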
Operational Excellence
Operational Excellence primarily focuses on procedures for development practices, observability, and release management.
The Operational Excellence design principles provide a high-level design strategy for achieving those goals for the operational requirements of the workload.
Design checklist
Start your design strategy based on the design review checklist for Operational Excellence for defining processes for observability, testing, and deployment. See AKS best practices and Day-2 operations guide to learn about key considerations to understand and implement.
(Cluster) Implement an infrastructure as code (IaC) deployment approach. Use a declarative, template-based deployment approach by using Bicep, Terraform, or similar tools. Make sure that all deployments are repeatable, traceable, and stored in a source code repo. For more information, see the quickstarts in the AKS product documentation.
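As a minimal sketch, assuming a hypothetical `main.bicep` template stored in your repository, a pipeline step might deploy the cluster declaratively. The parameter names shown are placeholders that depend on what your template defines.

```bash
# Deploy a declarative cluster definition from source control.
az deployment group create \
  --resource-group rg-aks \
  --template-file main.bicep \
  --parameters clusterName=aks-prod nodeCount=3
```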
(Cluster and workload) Automate infrastructure and workload deployments. Use standard software solutions to manage, integrate, and automate the deployment of your cluster and workloads. Integrate deployment pipelines with your source control system and incorporate automated tests.
Build an automated process to help ensure that your clusters are bootstrapped with the necessary cluster-wide configurations and deployments. This process is typically performed by using GitOps.
Use repeatable and automated deployment processes for your workload within your software development lifecycle.
(Cluster and workload) Implement a comprehensive monitoring strategy. Collect logs and metrics to monitor the health of the workload, identify trends in performance and reliability, and troubleshoot problems. Review the Best practices for monitoring Kubernetes with Azure Monitor and the Well-Architected Recommendations for designing and creating a monitoring system to determine the best monitoring strategy for your workloads.
Enable diagnostics settings to ensure that control plane or core API server interactions are logged.
The workload should be designed to emit telemetry that can be collected, which should also include liveness and readiness statuses.
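The following sketch, with placeholder resource and workspace names, enables control plane log categories for an existing cluster so that API server interactions are captured:

```bash
# Look up the cluster and Log Analytics workspace resource IDs.
AKS_ID=$(az aks show --resource-group rg-aks --name aks-prod --query id -o tsv)
LAW_ID=$(az monitor log-analytics workspace show \
  --resource-group rg-aks --workspace-name my-law --query id -o tsv)

# Send kube-apiserver and kube-audit logs to the workspace.
az monitor diagnostic-settings create \
  --name aks-control-plane-logs \
  --resource "$AKS_ID" \
  --workspace "$LAW_ID" \
  --logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-audit","enabled":true}]'
```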
(Cluster and workload) Implement testing in production strategies. Testing in production uses real deployments to validate and measure an application's behavior and performance in the production environment. Use chaos engineering practices that target Kubernetes to identify application or platform reliability issues.
Azure Chaos Studio can help simulate faults and trigger disaster recovery situations.
(Cluster and workload) Enforce workload governance. Azure Policy helps ensure consistent compliance with organizational standards, automates policy enforcement, and provides centralized visibility and control over your cluster resources.
Review the Azure policies section to learn more about the available built-in policies for AKS.
(Cluster and workload) Use stamp-level, blue-green deployments for mission-critical workloads. A stamp-level, blue-green deployment approach can increase confidence in releasing changes and enables zero-downtime upgrades because compatibilities with downstream dependencies like the Azure platform, resource providers, and IaC modules can be validated.
Kubernetes and ingress controllers support many advanced deployment patterns for inclusion in your release engineering process. Consider patterns like blue-green deployments or canary releases.
(Cluster and workload) Make workloads more sustainable. Making workloads more sustainable and cloud efficient requires combining efforts around cost optimization, reducing carbon emissions, and optimizing energy consumption. Optimizing the application's cost is the initial step in making workloads more sustainable.
See Sustainable software engineering principles in AKS to learn how to build sustainable and efficient AKS workloads.
Recommendations
Recommendation | Benefit |
---|---|
(Cluster) Operationalize cluster and pod configuration standards by using Azure policies for AKS. | Azure policies for AKS can help you apply at-scale enforcement and safeguards on your clusters in a centralized, consistent manner. Use policies to define the permissions granted to pods and ensure compliance with company policies. |
(Workload) Use Kubernetes Event Driven Autoscaler (KEDA). | KEDA allows your applications to scale based on events, like the number of events being processed. You can choose from a rich catalog of more than 50 KEDA scalers. |
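A minimal sketch of enabling the Azure Policy add-on referenced in the preceding table, assuming placeholder resource names:

```bash
# Enable the Azure Policy add-on so that built-in AKS policy definitions can
# be assigned and enforced on the cluster.
az aks enable-addons \
  --resource-group rg-aks \
  --name aks-prod \
  --addons azure-policy
```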
Performance Efficiency
Performance Efficiency is about maintaining user experience even when there's an increase in load by managing capacity. The strategy includes scaling resources, identifying and optimizing potential bottlenecks, and optimizing for peak performance.
The Performance Efficiency design principles provide a high-level design strategy for achieving those capacity goals against the expected usage.
Design checklist
Start your design strategy based on the design review checklist for Performance Efficiency for defining a baseline based on key performance indicators for AKS.
(Cluster and workload) Conduct capacity planning. Perform and iterate on a detailed capacity plan exercise that includes SKU, autoscale settings, IP addressing, and failover considerations.
After you formalize your capacity plan, frequently update the plan by continuously observing the resource utilization of the cluster.
(Cluster) Define a scaling strategy. Configure scaling to ensure that resources are adjusted efficiently to meet workload demands without overuse or waste. Use AKS features like cluster autoscaling and HorizontalPodAutoscaler to dynamically meet your workload needs with less strain on operations. Optimize your workload to operate and deploy efficiently in a container.
Review the Scaling and partitioning guide to understand the various aspects of scaling configuration.
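As a sketch of the HorizontalPodAutoscaler configuration described above, assuming a hypothetical sample-api deployment and illustrative bounds:

```bash
# Scale the deployment between 3 and 10 replicas, targeting 70% average CPU
# utilization across pods.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
EOF
```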
(Cluster and workload) Conduct performance testing. Perform ongoing load testing activities that exercise both the pod and cluster autoscaler. Compare results against the performance targets and the established baselines.
(Cluster and workload) Scale workloads and flows independently. Separate workloads and flows into different node pools to allow independent scaling. Follow the guidance in Optimize workload design using flows to identify and prioritize your flows.
Recommendations
Recommendation | Benefit |
---|---|
(Cluster) Enable cluster autoscaler to automatically adjust the number of agent nodes in response to workload demands. Use the HorizontalPodAutoscaler to adjust the number of pods in a deployment depending on CPU utilization or other metrics. | The ability to automatically scale up or scale down the number of nodes and the number of pods in your AKS cluster lets you run an efficient, cost-effective cluster. |
(Cluster and workload) Separate workloads into different node pools and consider scaling user node pools. | Unlike system node pools that always require running nodes, user node pools allow you to scale up or scale down. |
(Workload) Use AKS advanced scheduler features to implement advanced balancing of resources for workloads that require them. | As you manage AKS clusters, you often need to isolate teams and workloads. Advanced features that the Kubernetes scheduler provides let you control which pods can be scheduled on certain nodes. They also let you control how multipod applications can be appropriately distributed across the cluster. |
(Workload) Use KEDA to build a meaningful autoscale ruleset based on signals that are specific to your workload. | Not all scale decisions can be derived from CPU or memory metrics. Scale considerations often come from more complex or even external data points. KEDA allows your applications to scale based on events, such as the number of messages in a queue or the length of a topic lag. |
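As a sketch of the KEDA recommendation in the preceding table, the following enables the managed KEDA add-on and scales a hypothetical sample-worker deployment on queue depth. The queue and storage account names are placeholders, and trigger authentication setup is omitted for brevity.

```bash
# Enable the managed KEDA add-on on the cluster.
az aks update \
  --resource-group rg-aks \
  --name aks-prod \
  --enable-keda

# ScaledObject that scales sample-worker based on an Azure Storage queue length.
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-worker-scaler
spec:
  scaleTargetRef:
    name: sample-worker
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
    - type: azure-queue
      metadata:
        queueName: orders
        accountName: mystorageaccount
        queueLength: "5"
EOF
```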
Azure policies
Azure provides an extensive set of built-in policies related to AKS. Some apply to the Azure resource, like typical Azure policies, and others apply within the cluster through the Azure Policy add-on for Kubernetes. Many of the Azure resource policies come in both Audit/Deny and Deploy If Not Exists variants. In addition to the built-in Azure Policy definitions, you can create custom policies for both the AKS resource and for the Azure Policy add-on for Kubernetes.
Some of the recommendations in this article can be audited through Azure Policy. For example, you can check the following cluster policies:
- Clusters have readiness or liveness health probes configured for your pod spec.
- Microsoft Defender for Cloud-based policies.
- Authentication mode and configuration policies, like Microsoft Entra ID, RBAC, and disable local authentication.
- API server network access policies, including private cluster.
- GitOps configuration policies.
- Diagnostics settings policies.
- AKS version restrictions.
- Prevent command invoke.
You can also check the following cluster and workload policies:
- Kubernetes cluster pod security initiatives for Linux-based workloads.
- Pod and container capability policies, such as AppArmor, sysctl, security caps, SELinux, seccomp, privileged containers, and automounting of cluster API credentials.
- Mount, volume drivers, and filesystem policies.
- Pod and container networking policies, such as host network, port, allowed external IPs, HTTPS, and internal load balancers.
- Namespace deployment restrictions.
- CPU and memory resource limits.
For comprehensive governance, review the Azure Policy built-in definitions for Kubernetes and other policies that might affect the security of the compute layer.
Azure Advisor recommendations
Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure deployments. Here are some recommendations that can help you improve the reliability, security, cost effectiveness, performance, and operational excellence of AKS.
Related content
Consider the following articles as resources that demonstrate the recommendations highlighted in this article.
- AKS baseline architecture
- Advanced AKS microservices architecture
- AKS cluster for a PCI-DSS workload
- AKS baseline for multiregion clusters
- AKS Landing Zone Accelerator
Build implementation expertise by using the following product documentation: