@YutongTie-MSFT Thanks for the advice. Running `sudo modprobe nvidia` fails with `modprobe: ERROR: could not insert 'nvidia': Operation not permitted`. This is probably because Secure Boot is enabled and the driver taints the kernel (it is not signed, or something like that).
To be honest, the reason I use a DSVM is for everything to work out of the box. Instead, I am first greeted with `conda: command not found`, and I need to press Ctrl+C to resume bash. And then the Nvidia drivers fail. If I wanted these headaches, I'd get a vanilla VM.
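One way to test the Secure Boot suspicion is sketched below. It assumes the `mokutil` package is available on the image (if it is not, the script only reports "unknown"), and `sb_label` is a hypothetical helper name used here for illustration. The kernel log filter is a guess at useful keywords, not an exhaustive list.

```shell
#!/usr/bin/env bash
# Sketch: check whether Secure Boot could explain the rejected nvidia module.

# Map the text printed by `mokutil --sb-state` to a short label.
# sb_label is a hypothetical helper, not part of mokutil.
sb_label() {
  case "$1" in
    *"SecureBoot enabled"*)  echo "enabled" ;;
    *"SecureBoot disabled"*) echo "disabled" ;;
    *)                       echo "unknown" ;;
  esac
}

if command -v mokutil >/dev/null 2>&1; then
  state="$(sb_label "$(mokutil --sb-state 2>&1)")"
else
  state="unknown"   # assumption: mokutil may not be installed on the image
fi
echo "Secure Boot: ${state}"

# The kernel log often states why a module load was refused.
# Reading it may require root on some systems, so errors are ignored here.
dmesg 2>/dev/null | grep -iE 'lockdown|signature|nvidia' | tail -n 20 || true
```

If this reports "enabled" and the kernel log mentions a signature or lockdown problem, the unsigned-module explanation becomes much more likely.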
Nvidia drivers not working on DSVM
Hi,
I'm trying to set up a VM with CUDA installed and figured I would go with the DSVM image, since according to specification it should work out of the box.
However, when I connect to my VM (NC6s v3) and execute nvidia-smi, I get:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I have tried running apt upgrade and reinstalling the cuda-drivers, to no avail. I also created a new VM this morning and it has exactly the same problem.
If I try installing the "NvidiaGpuDriverLinux" Extension it also fails.
Any advice is appreciated. My next approach would be to start from a clean, non-DSVM image and install the drivers myself.
2 answers
TOMOIAGA Ciprian 6 Reputation points
2024-01-12T18:13:40.6666667+00:00
YutongTie-MSFT 53,926 Reputation points
2023-11-16T19:36:44.41+00:00
Thanks for reaching out to us. There could be several reasons why you're experiencing this issue. Here are a few troubleshooting steps you can follow:
- Check the VM size: Not all VMs in Azure support GPU acceleration. Make sure that you're using a VM size that supports GPUs. NC6s v3 should support GPUs, so this shouldn't be an issue.
- Check the CUDA version: It's possible that the CUDA version installed on your DSVM is not compatible with the GPU on your VM. You can check the CUDA version with the command `nvcc --version`. The CUDA version should be compatible with the NVIDIA driver version.
- Reinstall the NVIDIA driver: You mentioned that you have tried reinstalling the CUDA drivers. You can try reinstalling the NVIDIA drivers as well. Here's how:
  - Uninstall the current driver: `sudo apt-get remove --purge 'nvidia-*'` (the quotes stop the shell from expanding the pattern itself)
  - Update the system: `sudo apt-get update`
  - Install the NVIDIA driver: `sudo apt-get install nvidia-driver-xxx` (replace xxx with the version you want)
- Check the NVIDIA kernel module: Sometimes the NVIDIA kernel module is not loaded correctly, which can cause issues. You can check whether it is loaded with the command `lsmod | grep nvidia`. If it's not loaded, you can load it with the command `sudo modprobe nvidia`.
- Check for any system updates: Sometimes, system updates can cause issues with the NVIDIA drivers. Make sure your system is up to date.
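The checks in the list above can be run in one pass with a sketch like the following. It assumes a Debian/Ubuntu-based image (it uses `dpkg`), skips any tool that is not installed, and only reads state without changing anything. `module_state` is a hypothetical helper name introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: read-only pass over the troubleshooting steps listed above.

# Given text in /proc/modules format, report whether a module named exactly
# "nvidia" is listed. module_state is a hypothetical helper.
module_state() {
  if printf '%s\n' "$1" | awk '$1 == "nvidia" { found = 1 } END { exit !found }'; then
    echo "loaded"
  else
    echo "not loaded"
  fi
}

# 1. Is an NVIDIA GPU visible on the PCI bus at all?
if command -v lspci >/dev/null 2>&1; then
  lspci | grep -i nvidia || echo "No NVIDIA device reported by lspci"
fi

# 2. Is the kernel module loaded? (same information lsmod shows)
echo "nvidia kernel module: $(module_state "$(cat /proc/modules 2>/dev/null)")"

# 3. Which driver packages are installed?
if command -v dpkg >/dev/null 2>&1; then
  dpkg -l 'nvidia-driver-*' 2>/dev/null | grep '^ii' \
    || echo "No nvidia-driver-* package installed"
fi

# 4. CUDA toolkit version, if nvcc is on the PATH.
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version | tail -n 2
fi

# 5. Driver version as seen by nvidia-smi, if it can reach the driver.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv \
    || echo "nvidia-smi could not communicate with the driver"
fi
```

If step 1 shows a device but step 2 reports "not loaded", the problem is on the driver side rather than with the VM size.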
If none of these steps work, then you might want to consider starting from a clean, non-DSVM image and installing the drivers yourself. Make sure to follow the official NVIDIA installation guides to ensure that the drivers are installed correctly.
I hope this helps.
Regards,
Yutong