Basic troubleshooting of outbound connections from an AKS cluster

This article discusses how to do basic troubleshooting of outbound connections from a Microsoft Azure Kubernetes Service (AKS) cluster and identify faulty components.

Prerequisites

  • The Kubernetes kubectl tool, or a similar tool to connect to the cluster. To install kubectl by using Azure CLI, run the az aks install-cli command.

  • The apt-get command-line tool for handling packages.

  • The Client URL (cURL) tool, or a similar command-line tool.

  • The nslookup command-line (dnsutils) tool for checking DNS resolution.

Scenarios for outbound traffic in Azure Kubernetes Service

Traffic that originates from within the AKS cluster, whether from a pod or a worker node, is considered outbound traffic from the cluster. Before you troubleshoot an issue in the outbound flow of an AKS cluster, first review the scenarios for outbound traffic flow.

The outbound traffic from an AKS cluster can be classified into the following categories:

  1. Traffic to a pod or service in the same cluster (internal traffic).

  2. Traffic to a device or endpoint in the same virtual network or a different virtual network that uses virtual network peering.

  3. Traffic to an on-premises environment through a VPN connection or an Azure ExpressRoute connection.

  4. Traffic outside the AKS network through Azure Load Balancer (public outbound traffic).

  5. Traffic outside the AKS network through Azure Firewall or a proxy server (public outbound traffic).

Internal traffic

A basic request flow for internal traffic from an AKS cluster resembles the flow that's shown in the following diagram.

Diagram of a basic request flow for internal traffic from an AKS cluster.

Public outbound traffic through Azure Load Balancer

If the traffic is for a destination on the internet, the default method is to send the traffic through Azure Load Balancer.

Diagram of a request flow for external internet traffic through Azure Load Balancer from an AKS cluster.

Public outbound traffic through Azure Firewall or a proxy server

In some cases, the egress traffic has to be filtered, and it might require Azure Firewall.

Diagram of a request flow for external internet traffic through Azure Firewall from an AKS cluster.

A user might want to add a proxy server instead of a firewall, or set up a NAT gateway for egress traffic. The basic flow remains the same as shown in the diagram.

It's important to understand the nature of egress flow for your cluster so that you can continue troubleshooting.

Considerations when troubleshooting

Check your egress device

When you troubleshoot outbound traffic in AKS, it's important to know what your egress device is (that is, the device through which the traffic passes). Here, the egress device could be one of the following components:

  • Azure Load Balancer
  • Azure Firewall or a custom firewall
  • A network address translation (NAT) gateway
  • A proxy server

The flow can also differ based on the destination. For example, internal traffic (that is, traffic within the cluster) doesn't go through the egress device. It uses only the cluster networking. For public outbound traffic, determine whether the traffic passes through an egress device, and if it does, check that device.
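
One way to confirm which egress device applies is to read the outbound type from the cluster's network profile. The following is a minimal sketch that assumes Azure CLI is installed and signed in. The resource group and cluster name are placeholders, and `describe_outbound_type` is a hypothetical helper written only for this illustration.

```shell
#!/bin/sh
# describe_outbound_type is a hypothetical helper, not part of any CLI.
# It maps an AKS outboundType value to the egress device to inspect.
describe_outbound_type() {
  case "$1" in
    loadBalancer)
      echo "Azure Load Balancer" ;;
    managedNATGateway|userAssignedNATGateway)
      echo "NAT gateway" ;;
    userDefinedRouting)
      echo "Firewall, proxy, or other appliance reached through a user-defined route" ;;
    *)
      echo "Unknown outbound type: $1" ;;
  esac
}

# Usage (placeholders; requires Azure CLI sign-in):
#   outbound_type=$(az aks show --resource-group <resource-group> --name <cluster-name> \
#     --query networkProfile.outboundType --output tsv)
#   describe_outbound_type "$outbound_type"
```

The mapping is intentionally coarse. A user-defined route only indicates that traffic is steered to some next hop, so you still have to identify the appliance behind that route.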

Check each hop within traffic flow

After identifying the egress device, check the following factors:

  • The source and the destination for the request.

  • The hops in between the source and the destination.

  • The request-response flow.

  • The hops that are enhanced by extra security layers, such as the following layers:

    • Firewall
    • Network security group (NSG)
    • Network policy

To identify a problematic hop, check the response codes before and after it. To check whether the packets arrive properly in a specific hop, you can proceed with packet captures.

Check HTTP response codes

When you check each component, get and analyze HTTP response codes. These codes are useful to identify the nature of the issue. The codes are especially helpful in scenarios in which the application responds to HTTP requests.
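
As an illustration of collecting only the response code, the following sketch uses the cURL `--write-out '%{http_code}'` option and sorts the result into broad classes. The function name `classify_http_code` is hypothetical, and the example URL is only a stand-in for your endpoint.

```shell
#!/bin/sh
# classify_http_code is a hypothetical helper for this illustration.
# cURL reports 000 when it received no HTTP response at all.
classify_http_code() {
  case "$1" in
    000) echo "no HTTP response (DNS, connection, or TLS failure, or a timeout)" ;;
    2??) echo "success" ;;
    3??) echo "redirect" ;;
    4??) echo "client error" ;;
    5??) echo "server error" ;;
    *)   echo "unexpected value" ;;
  esac
}

# Usage:
#   code=$(curl --silent --output /dev/null --max-time 5 \
#     --write-out '%{http_code}' https://microsoft.com)
#   classify_http_code "$code"
```

A code of 000 points to a failure below the HTTP layer, which is usually the cue to move on to DNS checks, port checks, or packet captures.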

Take packet captures from the client and server

If other troubleshooting steps don't provide a conclusive outcome, take packet captures from the client and server. Packet captures are also useful when non-HTTP traffic is involved between the client and server. For more information about how to collect packet captures for an AKS environment, see the articles in the data collection guide.
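
As a rough illustration of what a capture on a Linux node or pod might look like, the following sketch builds a tcpdump command that is limited to one destination. It assumes that tcpdump is available where you run it. The helper name `build_capture_command`, the IP address, and the output path are placeholders for this example.

```shell
#!/bin/sh
# build_capture_command is a hypothetical helper. It only prints the
# command so that you can review it before you run it.
#   $1 = destination IP address
#   $2 = destination port
#   $3 = output file for the capture
build_capture_command() {
  echo "tcpdump -i any -nn -w $3 host $1 and port $2"
}

# Usage:
#   build_capture_command 20.53.203.50 443 /tmp/egress.pcap
#   prints: tcpdump -i any -nn -w /tmp/egress.pcap host 20.53.203.50 and port 443
```

Restricting the capture to a host and port keeps the file small and makes it easier to compare the client-side and server-side traces.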

Troubleshooting checklists

For basic troubleshooting for egress traffic from an AKS cluster, follow these steps:

  1. Make sure that the Domain Name System (DNS) resolution for the endpoint works correctly.

  2. Make sure that you can reach the endpoint through an IP address.

  3. Make sure that you can reach the endpoint from another source.

  4. Make sure that the endpoint is working.

  5. Check whether the cluster can reach any other external endpoint.

  6. Check whether a network policy is blocking the traffic.

  7. Check whether an NSG is blocking the traffic.

  8. Check whether a firewall or proxy is blocking the traffic.

  9. Check whether the AKS service principal or managed identity has the required AKS service permissions to make the network changes to Azure resources.

Note

This basic troubleshooting assumes that the cluster doesn't use a service mesh. If you use a service mesh such as Istio, it can produce unusual outcomes for TCP-based traffic.
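
Checklist steps 6 and 7 can be started from the command line. The following sketch assumes that kubectl is connected to the cluster and that Azure CLI is signed in. The function names are hypothetical, and the resource group is a placeholder that you supply.

```shell
#!/bin/sh
# Hypothetical helpers for this illustration.

# Step 6: list the network policies in every namespace.
list_network_policies() {
  kubectl get networkpolicy --all-namespaces
}

# Step 7: list the NSGs in a resource group.
#   $1 = resource group name (for NSGs that AKS manages, this is
#        typically the node resource group)
list_nsgs() {
  az network nsg list --resource-group "$1" --output table
}

# Usage (placeholders):
#   list_network_policies
#   node_rg=$(az aks show --resource-group <resource-group> --name <cluster-name> \
#     --query nodeResourceGroup --output tsv)
#   list_nsgs "$node_rg"
```

These commands only list the objects. You still have to read the policy and rule definitions to decide whether one of them blocks the traffic in question.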

Check whether the pod and node can reach the endpoint

From within the pod, you can run a DNS lookup to the endpoint.

What if you can't run the kubectl exec command to connect to the pod and install the DNS Utils package? In this situation, you can start a test pod in the same namespace as the problematic pod, and then run the tests.

Note

If the DNS resolution or egress traffic doesn't let you install the necessary network packages, you can use the rishasi/ubuntu-netutil:1.0 Docker image. In this image, the required packages are already installed.

Example procedure for checking DNS resolution of a Linux pod

  1. Start a test pod in the same namespace as the problematic pod:

    kubectl run -it --rm aks-ssh --namespace <namespace> --image=debian:stable --overrides='{"spec": { "nodeSelector": {"kubernetes.io/os": "linux"}}}'
    

    After the test pod is running, you'll gain access to the pod.

  2. Run the following apt-get commands to install other tool packages:

    # Update and install tool packages
    apt-get update && apt-get install -y dnsutils curl
    
  3. After the packages are installed, run the nslookup command to test the DNS resolution to the endpoint:

    $ nslookup microsoft.com # Microsoft.com is used as an example
    Server:         <server>
    Address:        <server IP address>#53
    ...
    ...
    Name:   microsoft.com
    Address: 20.53.203.50
    
  4. Try the DNS resolution from the upstream DNS server directly. This example uses Azure DNS:

    $ nslookup microsoft.com 168.63.129.16
    Server:         168.63.129.16
    Address:        168.63.129.16#53
    ...
    ...
    Address: 20.81.111.85
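
To compare the answers from the cluster DNS and the upstream DNS server, it can help to reduce the nslookup output to the answer addresses. The following is a minimal sketch that assumes the dnsutils output format shown in the preceding steps. The function name `answer_addresses` and the temporary file paths are hypothetical.

```shell
#!/bin/sh
# answer_addresses is a hypothetical helper. It reads nslookup output on
# stdin and prints the answer addresses in sorted order. The DNS server
# line is skipped because its address carries a "#53" port suffix.
answer_addresses() {
  awk '$1 == "Address:" && $2 !~ /#/ { print $2 }' | sort
}

# Usage inside the test pod:
#   nslookup microsoft.com | answer_addresses > /tmp/cluster-dns.txt
#   nslookup microsoft.com 168.63.129.16 | answer_addresses > /tmp/upstream-dns.txt
#   diff /tmp/cluster-dns.txt /tmp/upstream-dns.txt
```

A difference between the two lists isn't necessarily a fault, because some names return several addresses that can vary between queries. The more useful signal is whether one of the two lookups returns no addresses at all.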
    

Sometimes, there's a problem with the endpoint itself rather than a cluster DNS. In such cases, consider the following checks:

  1. Check whether the desired port is open on the remote host:

    curl -Ivm5 telnet://microsoft.com:443
    
  2. Check the HTTP response code:

    curl -Ivm5 https://microsoft.com
    
  3. Check whether you can connect to any other endpoint:

    curl -Ivm5 https://kubernetes.io
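
When one of these cURL checks fails, the exit code of the command often indicates which stage failed. The following sketch maps a few well-known cURL exit codes to a likely stage. The function name `explain_curl_exit` is hypothetical, and the suggested causes are starting points rather than a diagnosis.

```shell
#!/bin/sh
# explain_curl_exit is a hypothetical helper for this illustration.
#   $1 = exit code that cURL returned
explain_curl_exit() {
  case "$1" in
    0)  echo "request completed" ;;
    6)  echo "could not resolve host (DNS)" ;;
    7)  echo "failed to connect (port closed or traffic blocked)" ;;
    28) echo "operation timed out (traffic possibly dropped)" ;;
    35) echo "TLS handshake failed" ;;
    60) echo "TLS certificate could not be verified" ;;
    *)  echo "other curl error: $1" ;;
  esac
}

# Usage:
#   curl -Ism5 https://microsoft.com > /dev/null
#   explain_curl_exit $?
```

For example, exit code 6 points back to the DNS checks, whereas exit codes 7 and 28 suggest checking the network policy, NSG, and firewall or proxy steps in the checklist.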
    

To verify that the endpoint is reachable from the node that hosts the problematic pod, and then verify the DNS settings, follow these steps:

  1. Connect to the node that hosts the problematic pod by using a debug pod. For more information about how to do this, see Connect to Azure Kubernetes Service (AKS) cluster nodes for maintenance or troubleshooting.

  2. Test the DNS resolution to the endpoint:

    $ nslookup microsoft.com
    Server:         168.63.129.16
    Address:        168.63.129.16#53
    
    Non-authoritative answer:
    Name:   microsoft.com
    Address: 20.112.52.29
    Name:   microsoft.com
    Address: 20.81.111.85
    Name:   microsoft.com
    Address: 20.84.181.62
    Name:   microsoft.com
    Address: 20.103.85.33
    Name:   microsoft.com
    Address: 20.53.203.50
    
  3. Check the resolv.conf file to determine whether the expected name servers are added:

    cat /etc/resolv.conf
    cat /run/systemd/resolve/resolv.conf
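
To read only the name servers from these files, a short filter is enough. The following is a sketch in which `list_nameservers` is a hypothetical helper name.

```shell
#!/bin/sh
# list_nameservers is a hypothetical helper for this illustration.
# It prints the address from every "nameserver" line in the file.
#   $1 = path to a resolv.conf-style file
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "$1"
}

# Usage on the node:
#   list_nameservers /etc/resolv.conf
#   list_nameservers /run/systemd/resolve/resolv.conf
```

Compare the result with the DNS servers that you expect for the virtual network, such as custom DNS servers or the Azure DNS address that's used in the earlier example.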
    
Example procedure for checking DNS resolution of a Windows pod

  1. Run a test pod in the Windows node pool:

    # For a Windows environment, use the Resolve-DnsName cmdlet.
    kubectl run dnsutil-win --image='mcr.microsoft.com/windows/servercore:ltsc2022' --overrides='{"spec": { "nodeSelector": {"kubernetes.io/os": "windows"}}}' -- powershell "Start-Sleep -s 3600"
    
  2. Run the kubectl exec command to connect to the pod by using PowerShell:

    kubectl exec -it dnsutil-win -- powershell
    
  3. Run the Resolve-DnsName cmdlet in PowerShell to check whether the DNS resolution is working for the endpoint:

    PS C:\> Resolve-DnsName www.microsoft.com 
    
    Name                           Type   TTL   Section    NameHost
    ----                           ----   ---   -------    --------
    www.microsoft.com              CNAME  20    Answer     www.microsoft.com-c-3.edgekey.net
    www.microsoft.com-c-3.edgekey. CNAME  20    Answer     www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net
    net
    www.microsoft.com-c-3.edgekey. CNAME  20    Answer     e13678.dscb.akamaiedge.net
    net.globalredir.akadns.net
    
    Name       : e13678.dscb.akamaiedge.net 
    QueryType  : AAAA
    TTL        : 20
    Section    : Answer
    IP6Address : 2600:1408:c400:484::356e   
    
    
    Name       : e13678.dscb.akamaiedge.net 
    QueryType  : AAAA
    TTL        : 20
    Section    : Answer
    IP6Address : 2600:1408:c400:496::356e 
    
    
    Name       : e13678.dscb.akamaiedge.net
    QueryType  : A
    TTL        : 12
    Section    : Answer
    IP4Address : 23.200.197.152
    

In one unusual scenario that involves DNS resolution, DNS queries get a correct response from the node but fail from the pod. For this scenario, consider troubleshooting DNS resolution failures that occur from inside the pod but not from the worker node. To inspect DNS resolution for an endpoint across the cluster, consider checking the DNS resolution status across the cluster.

If the DNS resolution is successful, continue to the network tests. Otherwise, verify the DNS configuration for the cluster.
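
As a starting point for verifying the cluster DNS, you can check the state and recent logs of the CoreDNS pods. The following sketch assumes that kubectl is connected to the cluster and that the CoreDNS pods run in the kube-system namespace with the k8s-app=kube-dns label. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for this illustration.

# Show the CoreDNS pods and the nodes that they run on.
coredns_pods() {
  kubectl get pods --namespace kube-system --selector k8s-app=kube-dns --output wide
}

# Show recent CoreDNS log lines.
#   $1 = number of lines per pod (defaults to 50)
coredns_logs() {
  kubectl logs --namespace kube-system --selector k8s-app=kube-dns --tail="${1:-50}"
}

# Usage:
#   coredns_pods
#   coredns_logs 100
```

Pods that aren't in the Running state, or logs that show repeated errors, are a reason to look at the cluster DNS before you continue with the network tests.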

Third-party contact disclaimer

Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.
