Troubleshoot Container insights
This article discusses some common issues and troubleshooting steps when using Container insights to monitor your Kubernetes cluster.
Duplicate alerts are being created
You might have enabled Prometheus alert rules without disabling Container insights recommended alerts. See Migrate from Container insights recommended alerts to Prometheus recommended alert rules (preview).
Cluster permissions
If you don't have the required permissions to the cluster, you might see the error message: `You do not have the right cluster permissions which will restrict your access to Container Insights features. Please reach out to your cluster admin to get the right permission.`
Container insights previously allowed users to access the Azure portal experience based on the access permissions of the Log Analytics workspace. It now checks cluster-level permissions to provide access to the Azure portal experience. You might need your cluster admin to assign this permission.
For basic read-only cluster-level access, assign the Monitoring Reader role for the following types of clusters:
- AKS without Kubernetes role-based access control (RBAC) authorization enabled
- AKS enabled with Microsoft Entra SAML-based single sign-on
- AKS enabled with Kubernetes RBAC authorization
- AKS configured with the cluster role binding clusterMonitoringUser
- Azure Arc-enabled Kubernetes clusters
For details on how to assign these roles for AKS, see Assign role permissions to a user or group. To learn more about role assignments, see Access and identity options for Azure Kubernetes Service (AKS).
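For example, here's a minimal Azure CLI sketch of assigning the Monitoring Reader role at the cluster scope. The principal and resource values are placeholders, and the command assumes you have rights to create role assignments on that scope.

```bash
# Placeholder values; replace the assignee and scope with your own
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Monitoring Reader" \
  --scope "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.ContainerService/managedClusters/<clusterName>"
```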
Onboarding and update issues
The following sections describe issues you might encounter when you onboard or update Container insights on your cluster.
Missing subscription registration
If you see the error `Missing Subscription registration`, register the resource provider Microsoft.OperationsManagement in the subscription of your Log Analytics workspace. See Resolve errors for resource provider registration.
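If you prefer the Azure CLI, a registration sketch like the following should work; the subscription ID is a placeholder.

```bash
# Switch to the subscription that contains the Log Analytics workspace (placeholder ID)
az account set -s <workspaceSubscriptionId>

# Register the resource provider
az provider register --namespace Microsoft.OperationsManagement

# Registration can take a few minutes; check its state
az provider show --namespace Microsoft.OperationsManagement --query registrationState -o tsv
```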
Authorization error
When you enable Container insights or update a cluster, you might receive an error like `The client <user's identity> with object id <user's objectId> does not have authorization to perform action Microsoft.Authorization/roleAssignments/write over scope`.
During the onboarding or update process, an attempt is made to assign the Monitoring Metrics Publisher role on the cluster resource. The user initiating the process must have the Microsoft.Authorization/roleAssignments/write permission on the AKS cluster resource scope. Only members of the Owner and User Access Administrator built-in roles are granted this permission. If your security policies require granular-level permissions, see Azure custom roles and assign the permission to the users who require it. You can also assign the Monitoring Metrics Publisher role in the Azure portal by following the guidance at Assign Azure roles by using the Azure portal.
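As a sketch, the role can also be assigned from the Azure CLI, assuming the caller has the Microsoft.Authorization/roleAssignments/write permission on the cluster scope; all values are placeholders.

```bash
az role assignment create \
  --assignee "<servicePrincipalOrUserObjectId>" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.ContainerService/managedClusters/<clusterName>"
```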
Can't upgrade a cluster
If you can't upgrade Container insights on an AKS cluster after it's installed, the Log Analytics workspace where the cluster was sending its data might have been deleted. Disable monitoring for the cluster and then enable Container insights again by using another workspace.
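For an AKS cluster that uses the monitoring add-on, the switch to a new workspace might look like the following sketch; the workspace resource ID is a placeholder.

```bash
# Disable monitoring on the cluster
az aks disable-addons -a monitoring -g <clusterResourceGroup> -n <clusterName>

# Re-enable monitoring against a different Log Analytics workspace (placeholder resource ID)
az aks enable-addons -a monitoring -g <clusterResourceGroup> -n <clusterName> \
  --workspace-resource-id "/subscriptions/<subscriptionId>/resourceGroups/<workspaceResourceGroup>/providers/Microsoft.OperationalInsights/workspaces/<workspaceName>"
```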
Installation of Azure Monitor Containers extension fails
The error `manifests contain a resource that already exists` indicates that resources of the Container insights agent already exist on the Azure Arc-enabled Kubernetes cluster, which means that the agent is already installed. Resolve this issue by cleaning up the existing resources of the Container insights agent and then enabling the Azure Monitor Containers extension.
AKS clusters
Run the following commands and look for the Azure Monitor Agent add-on profile to verify whether the AKS Monitoring Add-on is enabled:
```bash
az account set -s <clusterSubscriptionId>
az aks show -g <clusterResourceGroup> -n <clusterName>
```
If the output includes an Azure Monitor Agent add-on profile config with a Log Analytics workspace resource ID, the AKS Monitoring Add-on is enabled and must be disabled with the following command.
```bash
az aks disable-addons -a monitoring -g <clusterResourceGroup> -n <clusterName>
```
Non-AKS clusters
Run the following command against the cluster to verify whether the `azmon-containers-release-1` Helm chart release exists:
```bash
helm list -A
```
If the output indicates that `azmon-containers-release-1` exists, delete the Helm chart release with the following command:
```bash
helm del azmon-containers-release-1
```
Missing data
It may take up to 15 minutes for data to appear after you enable Container insights on a cluster. If you don't see data after 15 minutes, see the following sections for potential issues and solutions.
Error message retrieving data
The error message `Error retrieving data` might occur if the Log Analytics workspace where the cluster was sending its data has been deleted. If this is the case, disable monitoring for the cluster and enable Container insights again by using another workspace.
Local authentication disabled
Check if the Log Analytics workspace is configured for local authentication with the following CLI command.
```bash
az resource show --ids "/subscriptions/[Your subscription ID]/resourcegroups/[Your resource group]/providers/microsoft.operationalinsights/workspaces/[Your workspace name]"
```
If `disableLocalAuth = true` in the output, run the following command:
```bash
az resource update --ids "/subscriptions/[Your subscription ID]/resourcegroups/[Your resource group]/providers/microsoft.operationalinsights/workspaces/[Your workspace name]" --api-version "2021-06-01" --set properties.features.disableLocalAuth=False
```
Daily cap met
When the daily cap is met for a Log Analytics workspace, it stops collecting data until the reset time. See Log Analytics daily cap.
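One way to check the workspace's cap is with the Azure CLI. The following sketch assumes the workspaceCapping property is returned for your API version; the resource names are placeholders.

```bash
# Show the daily quota (in GB; -1 means no cap) and ingestion status for the workspace
az monitor log-analytics workspace show \
  -g <workspaceResourceGroup> -n <workspaceName> \
  --query workspaceCapping -o json
```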
DCR not deployed with Terraform
If Container insights is enabled by using Terraform and `msi_auth_for_monitoring_enabled` is set to `true`, ensure that data collection rule (DCR) and data collection rule association (DCRA) resources are also deployed to enable log collection. See Enable Container insights.
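To verify, you can list the data collection rule associations on the cluster, as in the following sketch with a placeholder resource ID; an association for ContainerInsightsExtension should be present.

```bash
az monitor data-collection rule association list \
  --resource "/subscriptions/<subscriptionId>/resourceGroups/<rgName>/providers/Microsoft.ContainerService/managedClusters/<clusterName>"
```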
Container insights not reporting any information
Use the following steps if you can't view status information or no results are returned from a log query.
Check the status of the agent with the following command:
```bash
kubectl get ds ama-logs --namespace=kube-system
```
The number of pods should be equal to the number of Linux nodes on the cluster. The output should resemble the following example, which indicates that it was deployed properly:
```
User@aksuser:~$ kubectl get ds ama-logs --namespace=kube-system
NAME       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ama-logs   2         2         2       2            2           <none>          1d
```
If you have Windows Server nodes, check the status of the agent by running the following command:
```bash
kubectl get ds ama-logs-windows --namespace=kube-system
```
The number of pods should be equal to the number of Windows nodes on the cluster. The output should resemble the following example, which indicates that it was deployed properly:
```
User@aksuser:~$ kubectl get ds ama-logs-windows --namespace=kube-system
NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ama-logs-windows   2         2         2       2            2           <none>          1d
```
Check the deployment status by using the following command:
```bash
kubectl get deployment ama-logs-rs --namespace=kube-system
```
The output should resemble the following example, which indicates that it was deployed properly:
```
User@aksuser:~$ kubectl get deployment ama-logs-rs --namespace=kube-system
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
ama-logs-rs   1/1     1            1           24d
```
Check the status of the pods to verify that they're running by using the following command:
```bash
kubectl get pods --namespace=kube-system
```
The output should resemble the following example with a status of `Running` for ama-logs:
```
User@aksuser:~$ kubectl get pods --namespace=kube-system
NAME                                READY   STATUS    RESTARTS   AGE
aks-ssh-139866255-5n7k5             1/1     Running   0          8d
azure-vote-back-4149398501-7skz0    1/1     Running   0          22d
azure-vote-front-3826909965-30n62   1/1     Running   0          22d
ama-logs-484hw                      1/1     Running   0          1d
ama-logs-fkq7g                      1/1     Running   0          1d
ama-logs-windows-6drwq              1/1     Running   0          1d
```
If the pods are in a running state but there's no data in Log Analytics, or data appears to be sent only during a certain part of the day, it might be an indication that the daily cap has been met. When this limit is met each day, data stops ingesting into the Log Analytics workspace until the reset time. For more information, see Log Analytics daily cap.
Metrics aren't being collected
Verify that the Monitoring Metrics Publisher role assignment exists by using the following CLI command:
```bash
az role assignment list --assignee "SP/UserassignedMSI for Azure Monitor Agent" --scope "/subscriptions/<subid>/resourcegroups/<RG>/providers/Microsoft.ContainerService/managedClusters/<clustername>" --role "Monitoring Metrics Publisher"
```
For clusters with MSI, the user-assigned client ID for Azure Monitor Agent changes every time monitoring is enabled or disabled, so the role assignment should exist on the current MSI client ID.
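The following sketch shows one way to look up the current client ID so you can compare it against the role assignment. It assumes the monitoring add-on profile is exposed under the omsagent key; the resource names are placeholders.

```bash
# Get the client ID of the monitoring add-on's user-assigned identity
az aks show -g <clusterResourceGroup> -n <clusterName> \
  --query "addonProfiles.omsagent.identity.clientId" -o tsv

# Confirm the role assignment exists for that client ID on the cluster scope
az role assignment list --assignee <clientIdFromAboveStep> \
  --scope "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.ContainerService/managedClusters/<clusterName>" \
  --role "Monitoring Metrics Publisher"
```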
For clusters with Microsoft Entra pod identity enabled and using MSI:
Verify that the required label kubernetes.azure.com/managedby: aks is present on the Azure Monitor Agent pods by using the following command:
```bash
kubectl get pods --show-labels -n kube-system | grep ama-logs
```
Verify that exceptions are enabled when pod identity is enabled by using one of the supported methods at https://github.com/Azure/aad-pod-identity#1-deploy-aad-pod-identity.
Run the following command to verify:
```bash
kubectl get AzurePodIdentityException -A -o yaml
```
You should receive output similar to the following example:
apiVersion: "aadpodidentity.k8s.io/v1" kind: AzurePodIdentityException metadata: name: mic-exception namespace: default spec: podLabels: app: mic component: mic --- apiVersion: "aadpodidentity.k8s.io/v1" kind: AzurePodIdentityException metadata: name: aks-addon-exception namespace: kube-system spec: podLabels: kubernetes.azure.com/managedby: aks
Performance charts don't show CPU or memory of nodes and containers on a non-Azure cluster
Container insights agent pods use the cAdvisor endpoint on the node agent to gather performance metrics. Verify that the containerized agent on the node is configured to allow either the cAdvisor secure port (10250) or the cAdvisor unsecure port (10255) to be opened on all nodes in the cluster to collect performance metrics. See the prerequisites for hybrid Kubernetes clusters.
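As a quick spot check on a specific node, you can query kubelet's cAdvisor metrics through the API server proxy; the node name is a placeholder, and this assumes your account can access the nodes/proxy subresource.

```bash
# Print the first lines of cAdvisor metrics for one node via the API server proxy
kubectl get --raw "/api/v1/nodes/<nodeName>/proxy/metrics/cadvisor" | head
```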
Image and Name values not populated in the ContainerLog table
For agent version `ciprod12042019` and later, these two properties aren't populated by default for every log line, to minimize the cost incurred on collected log data. You can either enable collection of these properties or modify your queries to include them from other tables.
Modify your queries to include the `Image` and `ImageTag` properties from the `ContainerInventory` table by joining on the `ContainerID` property. You can include the `Name` property (as it previously appeared in the `ContainerLog` table) from the `KubePodInventory` table's `ContainerName` field by joining on the `ContainerID` property.
The following sample query shows how to use joins to retrieve these values.
```kusto
//Set the time window for the query
let startTime = ago(1h);
let endTime = now();
//
//Get the latest Image & ImageTag for every ContainerID
let ContainerInv = ContainerInventory
| where TimeGenerated >= startTime and TimeGenerated < endTime
| summarize arg_max(TimeGenerated, *) by ContainerID, Image, ImageTag
| project-away TimeGenerated
| project ContainerID1=ContainerID, Image1=Image, ImageTag1=ImageTag;
//
//Get the latest Name for every ContainerID
let KubePodInv = KubePodInventory
| where ContainerID != ""
| where TimeGenerated >= startTime
| where TimeGenerated < endTime
| summarize arg_max(TimeGenerated, *) by ContainerID2 = ContainerID, Name1=ContainerName
| project ContainerID2, Name1;
//
//Join the above to get a joined table that has Name, Image, and ImageTag. A left outer join is used in case there are no KubePodInventory records or if they're latent.
let ContainerData = ContainerInv | join kind=leftouter (KubePodInv) on $left.ContainerID1 == $right.ContainerID2;
//
//Join the ContainerLog table with the joined table above, project away redundant fields/columns, and rename columns that were rewritten. A left outer join is used so logs aren't lost even if no container metadata is found for log lines.
ContainerLog
| where TimeGenerated >= startTime and TimeGenerated < endTime
| join kind=leftouter (
    ContainerData
) on $left.ContainerID == $right.ContainerID2
| project-away ContainerID1, ContainerID2, Name, Image, ImageTag
| project-rename Name = Name1, Image=Image1, ImageTag=ImageTag1
```
Warning
Enabling the properties isn't recommended for large clusters that have more than 50 nodes. It generates API server calls from every node in the cluster and also increases data size for every log line collected.
To enable collection of these fields so you don't have to modify your queries, enable the setting `log_collection_settings.enrich_container_logs` in the agent config map as described in the data collection configuration settings.
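A minimal sketch of that change, assuming you start from the configmap template referenced elsewhere in this article; the exact section layout inside the file can differ by agent version.

```bash
# Download the agent configmap template if it isn't already applied to the cluster
curl -L https://raw.githubusercontent.com/microsoft/Docker-Provider/refs/heads/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml -o container-azm-ms-agentconfig.yaml

# In the downloaded file, under the log data collection settings, set:
#   [log_collection_settings.enrich_container_logs]
#     enabled = true
# Then apply the configmap
kubectl apply -f container-azm-ms-agentconfig.yaml
```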
Logs not being collected on Azure Stack HCI cluster
If you registered your cluster and/or configured HCI Insights before November 2023, features that use the Azure Monitor agent on HCI, such as Arc for Servers Insights, VM Insights, Container Insights, Defender for Cloud, or Microsoft Sentinel might not be collecting logs and event data properly. See Repair AMA agent for HCI for steps to reconfigure the agent and HCI Insights.
Missing data on large clusters
If data is missing from any of the following tables, the likely cause is a problem parsing large payloads because of a large number of pods or nodes. This is a known issue in the Ruby plugin that parses the large JSON payload, caused by the default `PODS_CHUNK_SIZE` value of 1000.
There are plans to reduce the default `PODS_CHUNK_SIZE` value to address this issue.
- KubePodInventory
- KubeNodeInventory
- KubeEvents
- KubePVInventory
- KubeServices
Verify whether you've configured a smaller `PODS_CHUNK_SIZE` value on your cluster by using the following commands.
```bash
# Verify that the kube context is set for the right cluster
kubectl cluster-info

# Check whether the configmap is already configured with a smaller PODS_CHUNK_SIZE chunk size
kubectl logs <ama-logs-rs pod name> -n kube-system -c ama-logs | grep PODS_CHUNK_SIZE

# If it's configured, the output will be similar to "Using config map value: PODS_CHUNK_SIZE = 10"
```
If the cluster is already configured with a smaller `PODS_CHUNK_SIZE` value, you need to enable the cluster for large cluster support. If the cluster is using the default `PODS_CHUNK_SIZE=1000`, check whether the cluster has a large number of pods or nodes.
```bash
# Check the total number of pods
kubectl get pods -A -o wide | wc -l

# Check the total number of nodes
kubectl get nodes -o wide | wc -l
```
After confirming that the number of pods and nodes is reasonably high and the cluster is using the default `PODS_CHUNK_SIZE=1000`, use the following commands to configure the configmap.
```bash
# Check if the cluster has the container-azm-ms-agentconfig configmap in the kube-system namespace
kubectl get cm -n kube-system | grep container-azm-ms-agentconfig

# If there is no existing container-azm-ms-agentconfig configmap, the configmap needs to be downloaded and applied
curl -L https://raw.githubusercontent.com/microsoft/Docker-Provider/refs/heads/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml -o container-azm-ms-agentconfig
kubectl apply -f container-azm-ms-agentconfig

# Edit the configmap and uncomment the agent_settings.chunk_config and PODS_CHUNK_SIZE lines under agent-settings: |- in the configmap
kubectl edit cm -n kube-system container-azm-ms-agentconfig -o yaml
```
Agent OOM killed
Daemonset container getting OOM killed
Start by identifying which container is getting OOM killed by using the following commands. This identifies `ama-logs`, `ama-logs-prometheus`, or both.
```bash
# Verify that the kube context is set for the right cluster
kubectl cluster-info

# Get the ama-logs pods and their status
kubectl get pods -n kube-system -o custom-columns=NAME:.metadata.name | grep -E ama-logs-[a-z0-9]{5}

# From the result of the above command, find out which ama-logs pod instance is getting OOM killed
kubectl describe pod <ama-logs-pod> -n kube-system

# Review the output of the above command to find out which ama-logs container is getting OOM killed
```
Check whether there are network errors in the `mdsd.err` log file by using the following commands.
```bash
mkdir log

# For the ama-logs-prometheus container, use -c ama-logs-prometheus instead of -c ama-logs
kubectl cp -c ama-logs kube-system/<ama-logs pod name>:/var/opt/microsoft/linuxmonagent/log log
cd log
cat mdsd.err
```
If the errors indicate that an outbound endpoint is blocked, see Network firewall requirements for monitoring Kubernetes cluster for endpoint requirements.
If the errors indicate a missing data collection endpoint (DCE) or data collection rule (DCR), reenable Container insights by using the guidance at Enable monitoring for Kubernetes clusters.
If there are no errors, the issue might be related to log scale. See High scale logs collection in Container Insights (Preview).
Replicaset container getting OOM killed
Identify how frequently the `ama-logs-rs` pod is getting OOM killed by using the following commands.
```bash
# Verify that the kube context is set for the right cluster
kubectl cluster-info

# Get the ama-logs-rs pod and its status
kubectl get pods -n kube-system -o wide | grep ama-logs-rs

# From the result of the above command, find out whether the ama-logs-rs pod instance is getting OOM killed
kubectl describe pod <ama-logs-rs-pod> -n kube-system

# Review the output of the above command to confirm the OOM kill
```
If `ama-logs-rs` is getting OOM killed, check whether there are network errors by using the following commands.
```bash
mkdir log
kubectl cp -c ama-logs kube-system/<ama-logs-rs pod name>:/var/opt/microsoft/linuxmonagent/log log
cd log
cat mdsd.err
```
If the errors indicate that an outbound endpoint is blocked, see Network firewall requirements for monitoring Kubernetes cluster for endpoint requirements.
If the errors indicate a missing data collection endpoint (DCE) or data collection rule (DCR), reenable Container insights by using the guidance at Enable monitoring for Kubernetes clusters.
If there are no network errors, check whether cluster-level Prometheus scraping is enabled by reviewing the `[prometheus_data_collection_settings.cluster]` settings in the configmap.
```bash
# Check if the cluster has the container-azm-ms-agentconfig configmap in the kube-system namespace
kubectl get cm -n kube-system | grep container-azm-ms-agentconfig

# If there is no existing container-azm-ms-agentconfig configmap, cluster-level Prometheus data collection isn't enabled
```
Check the cluster size in terms of node and pod counts.
```bash
# Count the nodes and pods in the cluster
NodeCount=$(kubectl get nodes | wc -l)
echo "Total number of nodes: ${NodeCount}"

PodCount=$(kubectl get pods -A -o wide | wc -l)
echo "Total number of pods: ${PodCount}"
```
If you determine that the issue is related to the scale of the cluster, the `ama-logs-rs` memory limit needs to be increased. Open a support case with Microsoft to make this request.
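Before opening the case, it can help to capture the current requests and limits on the replica set, for example with the following sketch.

```bash
# Show the current resource requests and limits configured on the ama-logs-rs deployment
kubectl get deployment ama-logs-rs -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'
```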
Latency issues
By default, Container insights collects monitoring data every 60 seconds unless you configure data collection settings or add a transformation. See Log data ingestion time in Azure Monitor for detailed information on latency and expected ingestion times in a Log Analytics workspace.
Check the latencies for the reported table and time window in the Log Analytics workspace associated with the cluster by using the following query.
```kusto
let clusterResourceId = "/subscriptions/<subscriptionId>/resourceGroups/<rgName>/providers/Microsoft.ContainerService/managedClusters/<clusterName>";
let startTime = todatetime('2024-11-20T20:34:11.9117523Z');
let endTime = todatetime('2024-11-21T20:34:11.9117523Z');
KubePodInventory // Update this table name to the one you want to check
| where _ResourceId =~ clusterResourceId
| where TimeGenerated >= startTime and TimeGenerated <= endTime
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated
| extend AgentLatency = _TimeReceived - TimeGenerated
| summarize max(E2EIngestionLatency), max(AgentLatency) by Computer
| project Computer, max_AgentLatency, max_ingestionLatency = (max_E2EIngestionLatency - max_AgentLatency), max_E2EIngestionLatency
```
If you're seeing high agent latencies, check whether you configured a log collection interval different from the default of 60 seconds in the Container insights DCR.
```bash
# Set the subscription ID of the cluster
az account set -s "<subscriptionId>"

# Check if the ContainerInsightsExtension data collection rule association exists
az monitor data-collection rule association list --resource <clusterResourceId>

# Get the data collection rule resource ID associated with ContainerInsightsExtension from the previous step
az monitor data-collection rule show --ids <dataCollectionRuleResourceIdFromAboveStep>

# Check whether there are any data collection settings related to the interval in the output of the previous step
```
Multiline logging issues
The multiline logging feature can be enabled through the configmap and supports the following scenarios:
- Log messages up to 64 KB instead of the default limit of 16 KB.
- Stitching together exception call stack traces for the supported languages .NET, Go, Python, and Java.
Verify that the multiline feature and ContainerLogV2 schema are enabled with the following commands.
```bash
# Get the list of ama-logs pods; these pods should be in the Running state
# If they're not in the Running state, this needs to be investigated
kubectl get po -n kube-system | grep ama-logs

# Exec into any one of the ama-logs daemonset pods and check the environment variables
kubectl exec -it ama-logs-xxxxx -n kube-system -c ama-logs -- bash

# After exec'ing into the container, run this command
env | grep AZMON_MULTILINE

# The result should have environment variables that indicate whether multiline is enabled and for which languages
AZMON_MULTILINE_LANGUAGES=java,go
AZMON_MULTILINE_ENABLED=true

# Check whether the ContainerLogV2 schema is enabled
env | grep AZMON_CONTAINER_LOG_SCHEMA_VERSION

# The output should be v2. If it's not v2, check whether ContainerLogV2 is being enabled through the DCR
AZMON_CONTAINER_LOG_SCHEMA_VERSION=v2
```
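To check the DCR side, a sketch like the following shows the Container insights extension settings in the data collection rule; the resource ID is a placeholder, and the exact property names can vary by API version.

```bash
# Inspect the Container insights data collection rule to see whether ContainerLogV2 is enabled
# in the extension settings and which streams are collected (property paths may vary by API version)
az monitor data-collection rule show --ids <dataCollectionRuleResourceId> \
  --query "dataSources.extensions" -o json
```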