PyTorch not finding GPU when using Azure ML online endpoint

aot 66 Reputation points
2025-01-23T14:34:11.0533333+00:00

I'm trying to deploy an Azure ML managed online endpoint that will execute my model inference flow using PyTorch-based models. The endpoint is set up to use Standard_DS4_v2 compute, and uses an environment based on one of the slightly older curated acpt-pytorch environments available through Azure ML Studio.
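For reference, my deployment spec looks roughly like this (names and versions are placeholders; the relevant lines are `instance_type` and `environment`):

```yaml
# Hypothetical managed online deployment spec (names/versions are placeholders)
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:my-model:1
environment: azureml:<curated-acpt-pytorch-env>:<version>
code_configuration:
  code: ./src
  scoring_script: score.py
instance_type: Standard_DS4_v2
instance_count: 1
```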

When I try to deploy the endpoint, the deployment fails while initializing my models, with the following error:

ERROR:root:Error initializing model: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver

If I deploy the endpoint with CUDA disabled and score on the CPU instead, the endpoint deploys as expected.
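For context, the CPU fallback in my scoring script is roughly the following (`pick_device` is a hypothetical helper; the real `init()` loads the models onto the returned device):

```python
# Hypothetical device-selection helper used in score.py's init().
# Falls back to CPU when torch is unavailable or no NVIDIA driver/GPU is visible.
def pick_device() -> str:
    try:
        import torch  # assumed to be present in the acpt-pytorch environment
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

# On a host without an NVIDIA driver this returns "cpu" instead of raising.
print(pick_device())
```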

I have no issues training the model, on the same type of compute as mentioned above, with full GPU support. Identical environments are used for training and for endpoint deployment.

Any suggestions as to why my endpoint cannot deploy when I want to use the GPU?

Azure Machine Learning