Traffic between node pools is blocked by a custom network security group
This article discusses how to resolve a scenario in which a custom network security group (NSG) blocks traffic between node pools in a Microsoft Azure Kubernetes Service (AKS) cluster.
Symptoms
Domain Name System (DNS) resolution fails from pods in the User node pool.
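One way to confirm this symptom is to run a lookup from a temporary pod that's scheduled on the User node pool, and then check where the CoreDNS pods are running. The following is a minimal sketch, not an official diagnostic. The pod name `dns-test`, the `busybox:1.36` image, and the function names are illustrative, and the sketch assumes that User-mode nodes carry the `kubernetes.azure.com/mode=user` label and that the CoreDNS pods carry the `k8s-app=kube-dns` label.

```shell
# Sketch: run a one-off DNS lookup from a pod pinned to the User node pool.
# Assumptions (not from the article): the pod name "dns-test", the busybox
# image, and the node label kubernetes.azure.com/mode=user on User-mode nodes.
test_dns_from_user_pool() {
  kubectl run dns-test \
    --image=busybox:1.36 \
    --restart=Never --rm -i \
    --overrides='{"spec":{"nodeSelector":{"kubernetes.azure.com/mode":"user"}}}' \
    -- nslookup kubernetes.default.svc.cluster.local
}

# Sketch: show which nodes host the CoreDNS pods.
# Assumption: the CoreDNS pods are labeled k8s-app=kube-dns.
show_coredns_placement() {
  kubectl get pods --namespace kube-system --selector k8s-app=kube-dns --output wide
}

# Usage (against a live cluster):
#   test_dns_from_user_pool
#   show_coredns_placement
```

If the lookup times out while the CoreDNS pods are running on nodes in a different subnet, an NSG between the two subnets is a likely suspect.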
Background
In scenarios that involve multiple node pools, the pods in the kube-system namespace can be put on one node pool (the System node pool) while the application pods are put on a different node pool (the User node pool). In some scenarios, pods that communicate with one another can be in different node pools.
AKS lets you place node pools in different subnets. As a result, you can also associate a different NSG with each node pool's subnet.
Cause 1: The NSG of a node pool blocks inbound traffic
The NSG of a node pool blocks inbound traffic from another node pool. For example, a custom NSG on the System node pool (which hosts the CoreDNS pods) blocks inbound traffic on User Datagram Protocol (UDP) port 53 from the subnet of the User node pool.
Solution 1: Configure the custom NSG to allow traffic between the node pools
Make sure that your custom NSG allows the required traffic between the node pools, specifically on UDP port 53. AKS doesn't update a custom NSG that's associated with a subnet, so you have to add the required rules yourself.
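As an illustration, an inbound allow rule can be added to the System node pool's NSG by using the Azure CLI. This is a sketch under stated assumptions: the resource group, NSG name, rule name, priority, and address prefixes are all hypothetical placeholders, and the prefixes have to cover whatever source and destination addresses the DNS traffic actually uses in your cluster (which depends on the network plug-in).

```shell
# Sketch: allow inbound DNS (UDP 53) on the System node pool's NSG from the
# User node pool's subnet. Every value below is a hypothetical placeholder.
RESOURCE_GROUP="${RESOURCE_GROUP:-myResourceGroup}"
SYSTEM_POOL_NSG="${SYSTEM_POOL_NSG:-system-pool-nsg}"
USER_SUBNET_PREFIX="${USER_SUBNET_PREFIX:-10.241.0.0/16}"
SYSTEM_SUBNET_PREFIX="${SYSTEM_SUBNET_PREFIX:-10.240.0.0/16}"

allow_dns_inbound() {
  # The priority must be a lower number than any Deny rule that matches
  # the same traffic, because NSG rules are evaluated in priority order.
  az network nsg rule create \
    --resource-group "$RESOURCE_GROUP" \
    --nsg-name "$SYSTEM_POOL_NSG" \
    --name AllowDnsFromUserPool \
    --priority 200 \
    --direction Inbound \
    --access Allow \
    --protocol Udp \
    --source-address-prefixes "$USER_SUBNET_PREFIX" \
    --source-port-ranges '*' \
    --destination-address-prefixes "$SYSTEM_SUBNET_PREFIX" \
    --destination-port-ranges 53
}

# Usage (requires a signed-in Azure CLI session):
#   allow_dns_inbound
```

DNS can also fall back to TCP port 53 for large responses, so a similar rule that uses `--protocol Tcp` might be worth considering if lookups still fail intermittently.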
Cause 2: The NSG of a node pool blocks outbound traffic
The NSG of a node pool blocks outbound traffic to pods in another node pool. For example, a custom NSG on the User node pool blocks outbound traffic on UDP port 53 to the System node pool.
Solution 2: Configure the custom NSG to allow traffic between the node pools
See Solution 1: Configure the custom NSG to allow traffic between the node pools.
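For the outbound case, the allow rule goes on the NSG of the node pool that originates the traffic. The following sketch mirrors that idea for the User node pool by using the Azure CLI. The resource group, NSG name, rule name, priority, and address prefixes are hypothetical placeholders, not values from a real cluster.

```shell
# Sketch: allow outbound DNS (UDP 53) on the User node pool's NSG toward the
# System node pool's subnet. Every value below is a hypothetical placeholder.
RESOURCE_GROUP="${RESOURCE_GROUP:-myResourceGroup}"
USER_POOL_NSG="${USER_POOL_NSG:-user-pool-nsg}"
USER_SUBNET_PREFIX="${USER_SUBNET_PREFIX:-10.241.0.0/16}"
SYSTEM_SUBNET_PREFIX="${SYSTEM_SUBNET_PREFIX:-10.240.0.0/16}"

allow_dns_outbound() {
  # As with inbound rules, use a priority number lower than that of any
  # Deny rule that matches the same traffic.
  az network nsg rule create \
    --resource-group "$RESOURCE_GROUP" \
    --nsg-name "$USER_POOL_NSG" \
    --name AllowDnsToSystemPool \
    --priority 200 \
    --direction Outbound \
    --access Allow \
    --protocol Udp \
    --source-address-prefixes "$USER_SUBNET_PREFIX" \
    --source-port-ranges '*' \
    --destination-address-prefixes "$SYSTEM_SUBNET_PREFIX" \
    --destination-port-ranges 53
}

# Usage (requires a signed-in Azure CLI session):
#   allow_dns_outbound
```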
Cause 3: TCP port 10250 is blocked
Another common scenario is that Transmission Control Protocol (TCP) port 10250 is blocked between the node pools.
In this situation, a user might receive an error message that resembles the following text:
$ kubectl logs nginx-57cdfd6dd9-xb7hk
Error from server: Get https://<node>:10250/containerLogs/default/nginx-57cdfd6dd9-xb7hk/nginx: net/http: TLS handshake timeout
If communication on TCP port 10250 is blocked, connections to the kubelet fail, and the logs can't be fetched.
Solution 3: Check for NSG rules that block TCP port 10250
Check whether any NSG rules block node-to-node communication on TCP port 10250.
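One way to run this check is to list the Deny rules on each node pool's NSG and review their direction, protocol, and destination ports. The following is a sketch that uses the Azure CLI. The resource group and NSG names are hypothetical placeholders, and the function name is illustrative.

```shell
# Sketch: list the custom Deny rules of an NSG so that they can be reviewed
# for anything that covers TCP port 10250. Names below are hypothetical.
RESOURCE_GROUP="${RESOURCE_GROUP:-myResourceGroup}"

list_deny_rules() {
  # $1: the name of the NSG to inspect.
  az network nsg rule list \
    --resource-group "$RESOURCE_GROUP" \
    --nsg-name "$1" \
    --query "[?access=='Deny'].{name:name, priority:priority, direction:direction, protocol:protocol, destinationPortRange:destinationPortRange, destinationPortRanges:destinationPortRanges}" \
    --output json
}

# Usage (requires a signed-in Azure CLI session):
#   list_deny_rules system-pool-nsg
#   list_deny_rules user-pool-nsg
```

A rule blocks the kubelet port if its protocol is `Tcp` or `*` and its destination port range includes 10250 (for example, `*` or `10000-11000`), provided that no Allow rule with a lower priority number matches the same traffic first.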
Contact us for help
If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to the Azure feedback community.