Use this article to help you troubleshoot and resolve storage-related issues in AKS Arc.
Configuring persistent volume claims results in the error: "Unable to initialize agent. Error: mkdir /var/log/agent: permission denied"
This permission denied error indicates that the default storage class may not be suitable for your workloads. It occurs in Linux workloads running on Kubernetes version 1.19.x or later. Following security best practices, many Linux workloads specify the securityContext fsGroup setting for a pod. These workloads fail to start on AKS on Azure Local because the default storage class doesn't specify the fstype (=ext4) parameter, so Kubernetes fails to change the ownership of files and persistent volumes based on the fsGroup requested by the workload.
To resolve this issue, define a custom storage class that you can use to provision PVCs.
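Before defining a custom class, it can help to confirm that the storage class in use really lacks the parameter. The following is a minimal sketch, not an official procedure. The helper name has_ext4_fstype is hypothetical, and the usage example assumes the storage class is named default, which may differ in your cluster. The helper only scans the manifest text for an fstype (or fsType) entry set to ext4.

```shell
#!/bin/sh
# Hypothetical helper: reads a StorageClass manifest (YAML) on stdin and
# succeeds only if an indented fstype/fsType entry is set to ext4.
has_ext4_fstype() {
    grep -Eq '^[[:space:]]+fs[tT]ype:[[:space:]]*"?ext4"?[[:space:]]*$'
}

# Example usage (assumes the storage class is named "default"):
#   kubectl get storageclass default -o yaml | has_ext4_fstype \
#       && echo "fstype is set to ext4" \
#       || echo "fstype is missing; consider a custom storage class"
```

If the check reports that the parameter is missing, a custom storage class that sets it is the likely fix for PVCs used by pods that request an fsGroup.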
Container storage interface pod stuck in a 'ContainerCreating' state
A new Kubernetes workload cluster was created with Kubernetes version 1.16.10 and then updated to 1.16.15. After the update, the csi-msk8scsi-node-9x47m pod was stuck in the ContainerCreating state, and the kube-proxy-qqnkr pod was stuck in the Terminating state, as shown in the following output:
Error: kubectl.exe get nodes
NAME              STATUS     ROLES    AGE     VERSION
moc-lf22jcmu045   Ready      <none>   5h40m   v1.16.15
moc-lqjzhhsuo42   Ready      <none>   5h38m   v1.16.15
moc-lwan4ro72he   NotReady   master   5h44m   v1.16.15
\kubectl.exe get pods -A
NAMESPACE     NAME                      READY   STATUS              RESTARTS   AGE
5h38m
kube-system   csi-msk8scsi-node-9x47m   0/3     ContainerCreating   0          5h44m
kube-system   kube-proxy-qqnkr          1/1     Terminating         0          5h44m
Because kubelet ended up in a bad state and can no longer talk to the API server, the only solution is to restart the kubelet service. After the restart, the cluster goes into a running state.
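To find which node needs the restart, the node list can be filtered for anything not reporting Ready. This is a rough sketch. The helper name not_ready_nodes is hypothetical, and it treats any status other than exactly "Ready" (including, for example, a cordoned node) as a candidate to inspect.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Example usage:
#   kubectl get nodes | not_ready_nodes
#
# Then, on each node listed (for example, over SSH), restart kubelet and
# confirm that the service is active again:
#   sudo systemctl restart kubelet
#   systemctl is-active kubelet
```

With the node list shown earlier in this section, the helper would print moc-lwan4ro72he, the only node in the NotReady state.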
Disk storage filled up from crash dump logs
Crash dump logs can fill up disk storage. The cause is an expired Geneva agent client certificate. Symptoms can include the following:
- Services fail to start.
- Kubernetes pods, deployments, and other resources fail to start due to insufficient resources.
Important
This issue can impact all new Mariner management and target cluster nodes created after April 18, 2023 on releases from April 2022 to March 2023. The issue is fixed in the 2023-05-09 release and later.
This issue can impact any operation that involves allocating disk space or writing new files, so any "insufficient disk space/resources" error is a good hint. To check if this issue is present on a given node, run the following shell command:
clouduser@moc-lwm2oudnskl $ sudo du -h /var/lib/systemd/coredump/
This command reports the storage space consumed by the diagnostic files.
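For a quick summary that also counts the dump files, the check can be wrapped in a small function. This is only a sketch. The helper name coredump_usage is hypothetical, and it defaults to the directory used in the command above. Reading that directory typically requires root.

```shell
#!/bin/sh
# Hypothetical helper: prints the size in KB and the number of files in a
# crash dump directory. Defaults to the path used in this article.
coredump_usage() {
    dir="${1:-/var/lib/systemd/coredump}"
    size_kb=$(du -sk "$dir" 2>/dev/null | awk '{ print $1 }')
    count=$(find "$dir" -type f 2>/dev/null | wc -l | tr -d ' ')
    echo "${size_kb:-0} KB in ${count} files under ${dir}"
}

# Example usage (run as root on the node):
#   coredump_usage
```

A large and growing file count is consistent with the crash loop described in this section.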
Root cause
The expiration of the client certificate used to authenticate the Geneva agent to the service endpoint causes the agent to crash, resulting in a crash dump. The agent's crash/retry loop is about 5 seconds at initial startup, and there is no timeout. This means that a new file (about 330MB) is created on the node's file system every few seconds, which can rapidly consume disk storage.
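To get a feel for how quickly this can exhaust a disk, the figures above can be turned into a rough estimate. The sketch below assumes one dump of about 330 MB roughly every 5 seconds. Both numbers are approximations from this article, and the function name seconds_to_fill is hypothetical.

```shell
#!/bin/sh
# Estimates how many seconds pass before crash dumps consume the given free
# space. Assumptions: each dump is about 330 MB and one is written about
# every 5 seconds (approximate figures; actual timing varies).
seconds_to_fill() {
    free_mb="$1"
    dump_mb="${2:-330}"
    interval_s="${3:-5}"
    # Whole dumps that fit in the free space, times the interval per dump.
    echo $(( free_mb / dump_mb * interval_s ))
}

# 100 GB (102400 MB) free: 310 dumps fit, which takes about 1550 seconds.
seconds_to_fill 102400
```

Under these assumptions, 100 GB of free space would be gone in under half an hour.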
Mitigation
The preferred mitigation is to upgrade to the latest release, version 1.10.18.10425, which has an updated certificate. To do so, first manually upgrade your workload clusters to any supported minor version before you update your Azure Local host.
For more information about AKS Arc releases, and all the latest AKS on Azure Local news, subscribe to the AKS releases page.
If upgrading is not an option, you can turn off the mdsd service. For each Mariner node:
Turn off the Geneva agent with the following shell command:
sudo systemctl disable --now mdsd
Verify that the Geneva agent was successfully disabled:
sudo systemctl status mdsd
Delete accumulated files with the following command:
sudo find /var/lib/systemd/coredump/ -type f -mmin +1 -exec rm -f {} \;
sudo find /run/systemd/propagate -name 'systemd-coredump@*' -delete
sudo journalctl --rotate && sudo journalctl --vacuum-size=500M
Reboot the node:
sudo reboot
Storage pod crashes and the logs say that the `createSubDir` parameter is invalid
An error can occur if you have an SMB or NFS CSI driver installed in your deployment and you upgrade to the May build from an older version. One of the parameters, called createSubDir, is no longer accepted. If this applies to your deployment, follow the instructions below to resolve the storage class failure.
If you experience this error, the storage pod crashes and the logs indicate that the createSubDir parameter is invalid.
Recreate the storage class.
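One possible way to recreate the class is to export the existing manifest, drop the rejected parameter, and apply the result. This is a hedged sketch, not the documented procedure. It assumes that removing createSubDir is sufficient for your driver version, the helper name strip_create_subdir is hypothetical, and my-smb-sc is a placeholder storage class name.

```shell
#!/bin/sh
# Hypothetical helper: reads a StorageClass manifest on stdin and writes it
# to stdout without the createSubDir parameter and without server-assigned
# metadata (resourceVersion, uid, creationTimestamp), so it can be recreated.
strip_create_subdir() {
    sed -E \
        -e '/^[[:space:]]*createSubDir:/d' \
        -e '/^[[:space:]]*(resourceVersion|uid|creationTimestamp):/d'
}

# Example usage (my-smb-sc is a placeholder storage class name):
#   kubectl get storageclass my-smb-sc -o yaml | strip_create_subdir > sc.yaml
#   kubectl delete storageclass my-smb-sc
#   kubectl apply -f sc.yaml
```

Review the generated sc.yaml before deleting the original class, because storage class parameters can't be edited in place after creation.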
When creating a persistent volume, an attempt to mount the volume fails
After deleting a persistent volume or a persistent volume claim in an AKS Arc environment, a new persistent volume is created to map to the same share. However, when attempting to mount the volume, the mount fails, and the pod times out with the error NewSmbGlobalMapping failed.
To work around the failure to mount the new volume, you can SSH into the Windows node and run Remove-SMBGlobalMapping, providing the share that corresponds to the volume. After running this command, attempts to mount the volume should succeed.