Fix known issues and errors when configuring a network in AKS Arc

Applies to: AKS on Azure Local, AKS on Windows Server Use this topic to help you troubleshoot and resolve networking-related issues with AKS Arc.

Error: 'Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group is in the 'failed' state.'

Cloud agent may fail to successfully start when using path names with spaces in them.

When using Set-AksHciConfig to specify -imageDir, -workingDir, -cloudConfigLocation, or -nodeConfigLocation parameters with a path name that contains a space character, such as D:\Cloud Share\AKS HCI, the cloud agent cluster service will fail to start with the following (or similar) error message:

Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group os in the 'failed' state. Resources in 'failed' or 'pending' states: 'MOC Cloud Agent Service'

To work around this issue, use a path that does not include spaces, for example, C:\CloudShare\AKS-HCI.

Arc-connected clusters have empty JSON "distribution" property

Azure Arc for Kubernetes-connected clusters can have the value for the JSON property distribution set to an empty string. For AKS Arc-connected clusters, this value is set during initial installation and is never altered for the lifetime of the deployment.

To reproduce the issue, open an Azure PowerShell window and run the following commands:

az login
az account set --subscription <sub_id>
az connectedk8s show -g <rg_name> -n

The following is example output from this command:

{
"agentPublicKeyCertificate": <value>
"agentVersion": "1.8.14",
"azureHybridBenefit": "NotApplicable",
"connectivityStatus": "Connected",
"distribution": "",
"distributionVersion": null,
"id": "/subscriptions/<subscription>/resourceGroups/<resource group>/providers/Microsoft.Kubernetes/connectedClusters/<cluster name>",
"identity": {
  "principalId": "<principal id>",
  "tenantId": "<tenant id>",
  "type": "SystemAssigned"
},
"infrastructure": "azure_stack_hci",
"kubernetesVersion": "1.23.12",
"lastConnectivityTime": "2022-11-04T14:59:59.050000+00:00",
"location": "eastus",
"managedIdentityCertificateExpirationTime": "2023-02-02T14:24:00+00:00",
"miscellaneousProperties": null,
"name": "mgmt-cluster",
"offering": "AzureStackHCI_AKS_Management",
"privateLinkScopeResourceId": null,
"privateLinkState": "Disabled",
"provisioningState": "Succeeded",
"resourceGroup": "<resource group>",
"systemData": {
  "createdAt": "2022-11-04T14:29:17.680060+00:00",
  "createdBy": "<>",
  "createdByType": "Application",
  "lastModifiedAt": "2022-11-04T15:00:01.830067+00:00",
  "lastModifiedBy": "64b12d6e-6549-484c-8cc6-6281839ba394",
  "lastModifiedByType": "Application"
},
"tags": {},
"totalCoreCount": 4,
"totalNodeCount": 1,
"type": "microsoft.kubernetes/connectedclusters"
}

To resolve the issue

If the output for the JSON distribution property returns an empty string, follow these instructions to patch your cluster:

  1. Copy the following configuration into a file called patchBody.json:

    {
       "properties": {
         "distribution": "aks_management"
       }
    }
    

    Important

    If your cluster is an AKS management cluster, the value should be set to aks_management. If it's a workload cluster, it should be set to aks_workload.

  2. From the JSON output you obtained above, copy your connected cluster ID. It should appear as a long string in the following format:

    "/subscriptions/<subscription >/resourceGroups/<resource group>/providers/Microsoft.Kubernetes/connectedClusters/<cluster name>",

  3. Run the following command to patch your cluster. The <cc_arm_id> value should be the ID obtained in step 2:

    az rest -m patch -u "<cc_arm_id>?api-version=2022-10-01-preview" -b "@patchBody.json"
    
  4. Check that your cluster has been successfully patched by running az connectedk8s show -g <rg_name> -n <cc_name> to make sure the JSON property distribution has been correctly set.

The WSSDAgent service is stuck while starting and fails to connect to the cloud agent

Symptoms:

  • Proxy enabled in AKS Arc. The WSSDAgent service stuck in the starting state. Shows up as the following:
  • Test-NetConnection -ComputerName <computer IP/Name> -Port <port> from the node where the node agent is failing towards the cloud agent is working properly on the system (even when the wssdagent fails to start)
  • Curl.exe from the node on which the agent is failing towards the cloud agent reproduces the problem and is getting stuck: curl.exe https://<computerIP>:6500
  • To fix the problem, pass the --noproxy flag to curl.exe. Curl returns an error from from wssdcloudagent. This is expected since curl is not a GRPC client. Curl doesn't get stuck waiting when you send the --noproxy flag. So returning an error is considered a success here):
curl.exe --noproxy '*' https://<computerIP>:65000

It is likely the proxy settings were changed to a faulty proxy on the host. The proxy settings for AKS Arc are environment variables that are inherited from the parent process on the host. These settings only get propagated when a new service starts or an old one updates or reboots. It is possible that faulty proxy settings were set on the host, and they were propagated to the WSSDAgent after an update or reboot, which has caused the WSSDAgent to fail.

You will need to fix the proxy settings by changing the environmental variables on the machine. On the machine, change the variables with the following commands:

  [Environment]::SetEnvironmentVariable("https_proxy", <Value>, "Machine")
  [Environment]::SetEnvironmentVariable("http_proxy", <Value>, "Machine")
  [Environment]::SetEnvironmentVariable("no_proxy", <Value>, "Machine")

Reboot the machine so that the service manager, and the WSSDAgent, pick up the fixed proxy.

CAPH pod fails to renew certificate

This error occurs because every time the CAPH pod is started, a login to cloudagent is attempted and the certificate is stored in the temporary storage volume, which will clean out on pod restarts. Therefore, every time a pod is restarted, the certificate is destroyed and a new login attempt is made.

A login attempt starts a renewal routine, which renews the certificate when it nears expiry. The CAPH pod decides if a login is needed if the certificate is available or not. If the certificate is available, the login is not attempted, assuming the renewal routine is already there.

However, on a container restart, the temporary directory is not cleaned, so the file is still persisted and the login attempt is not made, causing the renewal routine to not start. This leads to certificate expiration.

To mitigate this issue, restart the CAPH pod using the following command:

kubectl delete pod pod-name

Set-AksHciConfig fails with WinRM errors but shows WinRM is configured correctly

When running Set-AksHciConfig, you might encounter the following error:

WinRM service is already running on this machine.
WinRM is already set up for remote management on this computer.
Powershell remoting to TK5-3WP08R0733 was not successful.
At C:\Program Files\WindowsPowerShell\Modules\Moc\0.2.23\Moc.psm1:2957 char:17
+ ...             throw "Powershell remoting to "+$env:computername+" was n ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (Powershell remo...not successful.:String) [], RuntimeException
    + FullyQualifiedErrorId : Powershell remoting to TK5-3WP08R0733 was not successful.

Most of the time, this error occurs as a result of a change in the user's security token (due to a change in group membership), a password change, or an expired password. In most cases, the issue can be remediated by logging off from the computer and logging back in. If the error still occurs, you can create a support ticket through the Azure portal.

AKS Arc cluster stuck in "ScalingControlPlane" provisioning state

This issue causes an AKS Arc cluster to remain in the ScalingControlPlane state for an extended period of time.

To reproduce

Get-AksHciCluster -Name <cluster name> | select *
Status                : {ProvisioningState, Details}
ProvisioningState     : ScalingControlPlane
KubernetesVersion     : v1.22.6
PackageVersion        : v1.22.6-kvapkg.0
NodePools             : {nodepoolA, nodepoolB}
WindowsNodeCount      : 0
LinuxNodeCount        : 1
ControlPlaneNodeCount : 1
ControlPlaneVmSize    : Standard_D4s_v3
AutoScalerEnabled     : False
AutoScalerProfile     :
Name                  : <cluster name>

This issue has been fixed in recent AKS Arc versions. We recommend updating your AKS Arc clusters to the latest release.

To mitigate this issue, contact support to manually patch the status of the control plane nodes to remove the condition MachineOwnerRemediatedCondition from the machine status via kubectl.

The workload cluster is not found

The workload cluster may not be found if the IP address pools of two AKS on Azure Local deployments are the same or overlap. If you deploy two AKS hosts and use the same AksHciNetworkSetting configuration for both, PowerShell and Windows Admin Center will potentially fail to find the workload cluster because the API server will be assigned the same IP address in both clusters, resulting in a conflict.

The error message you receive will look similar to the example shown below.

A workload cluster with the name 'clustergroup-management' was not found.
At C:\Program Files\WindowsPowerShell\Modules\Kva\0.2.23\Common.psm1:3083 char:9
+         throw $("A workload cluster with the name '$Name' was not fou ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (A workload clus... was not found.:String) [], RuntimeException
    + FullyQualifiedErrorId : A workload cluster with the name 'clustergroup-management' was not found.

Note

Your cluster name will be different.

To resolve the issue, delete one of the clusters and create a new AKS cluster network setting that has a non-overlapping network with the other clusters.

Get-AksHCIClusterNetwork does not show the current allocation of IP addresses

Running the Get-AksHciClusterNetwork command provides a list of all virtual network configurations. However, the command does not show the current allocation of the IP addresses.

To find out what IP addresses are currently in use in a virtual network, use the steps below:

  1. To get the group, run the following command:
Get-MocGroup -location MocLocation
  1. To get the list of IP addresses that are currently in use, and the list of available or used virtual IP addresses, run the following command:
Get-MocNetworkInterface -Group <groupName> | ConvertTo-Json -depth 10
  1. Use the following command to view the list of virtual IP addresses that are currently in use:
Get-MocLoadBalancer -Group <groupName> | ConvertTo-Json -depth 10

"Cluster IP Address x.x.x.x" fails

A cluster IP address shows as "Failed" during precheck. This precheck tests that all IPv4 addresses and DNS network names are online and reachable. For example, a failed IPv4 or network name can cause this test to fail.

To resolve this, perform the following steps:

  1. Run the following command:

    Get-ClusterGroup -name "Cluster Group" | Get-ClusterResource
    
  2. Ensure that all network names and IP addresses are in an online state.

  3. Run the following two commands:

    Stop-ClusterResource -name "Cluster IP Address x.x.x.x"
    

    and then

    Start-ClusterResource -name "Cluster IP Address x.x.x.x"
    

When you deploy AKS on Azure Local with a misconfigured network, deployment timed out at various points

When you deploy AKS on Azure Local, the deployment may time out at different points of the process depending on where the misconfiguration occurred. You should review the error message to determine the cause and where it occurred.

For example, in the following error, the point at which the misconfiguration occurred is in Get-DownloadSdkRelease -Name "mocstack-stable":

$vnet = New-AksHciNetworkSettingSet-AksHciConfig -vnet $vnetInstall-AksHciVERBOSE: 
Initializing environmentVERBOSE: [AksHci] Importing ConfigurationVERBOSE: 
[AksHci] Importing Configuration Completedpowershell : 
GetRelease - error returned by API call: 
Post "https://msk8s.api.cdp.microsoft.com/api/v1.1/contents/default/namespaces/default/names/mocstack-stable/versions/0.9.7.0/files?action=generateDownloadInfo&ForegroundPriority=True": 
dial tcp 52.184.220.11:443: connectex: 
A connection attempt failed because the connected party did not properly
respond after a period of time, or established connection failed because
connected host has failed to respond.At line:1 char:1+ powershell -command
{ Get-DownloadSdkRelease -Name "mocstack-stable"}

This indicates that the physical Azure Local node can resolve the name of the download URL, msk8s.api.cdp.microsoft.com, but the node can't connect to the target server.

To resolve this issue, you need to determine where the breakdown occurred in the connection flow. Here are some steps to try to resolve the issue from the physical cluster node:

  1. Ping the destination DNS name: ping msk8s.api.cdp.microsoft.com.
  2. If you get a response back and no time-out, then the basic network path is working.
  3. If the connection times out, then there could be a break in the data path. For more information, see check proxy settings. Or, there could be a break in the return path, so you should check the firewall rules.

Applications that are NTP time-dependent trigger hundreds of false alerts

Applications that are NTP time-dependent can trigger hundreds of false alerts when they are out of time sync. If the cluster is running in a proxy environment, your nodepools may not be able to communicate with the default NTP server, time.windows.com, through your proxy or firewall.

Workaround

To work around this issue, update the NTP server config on each nodepool node to synchronize the clocks. Doing so will set the clocks in each of your application pods as well.

Prerequisites

  • An address of an NTP server that is accessible in each nodepool node.
  • Access to the workload cluster kubeconfig file.
  • Access to the management cluster kubeconfig.

Steps

  1. Run the following kubectl command using the workload cluster kubeconfig to get a list of nodes in it:

    kubectl --kubeconfig <PATH TO KUBECONFIG> get nodes -owide
    
  2. Run the following kubectl command to correlate the node names from the previous command to the workload cluster's nodepool nodes. Take note of the relevant IP addresses from the previous command's output.

    kubectl --kubeconfig (Get-KvaConfig).kubeconfig get machine -l cluster.x-k8s.io/cluster-name=<WORKLOAD_CLUSTER_NAME>
    
  3. Using the output from the previous steps, run the following steps for each nodepool node that needs its NTP config updated:

    1. SSH into a nodepool node VM:

      ssh -o StrictHostKeyChecking=no -i (Get-MocConfig).sshPrivateKey clouduser@<NODEPOOL_IP>
      
    2. Verify that the configured NTP server is unreachable:

      sudo timedatectl timesync-status
      

      If the output looks like this, then the NTP server is unreachable and it needs to be changed:

      Server: n/a (time.windows.com)
      Poll interval: 0 (min: 32s; max 34min 8s)
      Packet count: 0
      
    3. To update the NTP server, run the following commands to set the NTP server in the timesync config file to one that is accessible from the nodepool VM:

      # make a backup of the config file
      sudo cp /etc/systemd/timesyncd.conf ~/timesyncd.conf.bak
      
      # update the ntp server
      NTPSERVER="NEW_NTP_SERVER_ADDRESS"
      sudo sed -i -E "s|^(NTP=)(.*)$|\1$NTPSERVER|g" /etc/systemd/timesyncd.conf
      
      # verify that the NTP server is updated correctly - check the value for the field that starts with "NTP="
      sudo cat /etc/systemd/timesyncd.conf 
      
    4. Restart the timesync service:

      sudo systemctl restart systemd-timesyncd.service
      
    5. Verify that the NTP server is accessible:

      sudo timedatectl timesync-status
      
    6. Verify that the clock shows the correct time:

      date
      
  4. Make sure that each nodepool node shows the same time by running the command from step 3.f.

  5. If your application does not update its time automatically, restart its pods.