PyTorch not finding GPU when using Azure ML online endpoint

aot 66

I'm trying to deploy an Azure ML managed online endpoint that will run my model inference flow using PyTorch-based models. The endpoint is set up to use a Standard_DS4_v2 compute cluster and uses an environment based on one of the slightly older curated acpt-pytorch environments available through Azure ML Studio.

When I try to deploy my endpoint, the deployment fails upon initializing my models, claiming that:

ERROR:root:Error initializing model: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver

If I disable CUDA and simply score on the CPU, the endpoint deploys as expected.
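The CPU fallback described here can be sketched roughly as follows. This is a minimal illustration, not the poster's actual scoring code: `pick_device` and `detect_cuda` are hypothetical helper names, and the sketch assumes the scoring script chooses its device once at startup.

```python
import logging


def detect_cuda() -> bool:
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def pick_device(cuda_available: bool, require_gpu: bool = False) -> str:
    """Hypothetical helper: choose "cuda" when available, else fall back to "cpu".

    With require_gpu=True the function raises instead of silently falling back,
    so a missing driver surfaces as a clear deployment error.
    """
    if cuda_available:
        return "cuda"
    if require_gpu:
        raise RuntimeError("GPU required but CUDA is not available")
    logging.warning("CUDA not available; scoring on CPU instead")
    return "cpu"
```

In a scoring script this might be called as `device = pick_device(detect_cuda())` before loading the model, with `require_gpu=True` once GPU scoring is expected to work.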

I have no issues training the model on the same type of compute as mentioned above, with full GPU support. Identical environments are used for training and for endpoint deployment.

Any suggestions as to why my endpoint cannot deploy when I want to use the GPU?

Pavankumar Purilla 3,080 Reputation points Microsoft Vendor

2025-01-23T20:22:55.9366667+00:00
Hi aot,
Greetings & Welcome to Microsoft Q&A forum! Thanks for posting your query!
It sounds like the issue might be related to the NVIDIA driver not being installed on the compute instance that your online endpoint is running on.
Here are a few things you can try to resolve the issue:

Check that the compute instance has a GPU and that it is properly configured. You can do this by logging into the compute instance and running the nvidia-smi command. This should show you information about the GPU(s) installed on the machine.

Check that the NVIDIA driver is installed on the compute instance. You can do this by running the nvidia-smi command with the --query-gpu=driver_version option. This should show you the version of the NVIDIA driver installed on the machine.

If the NVIDIA driver is installed but the issue persists, you can try updating the driver to the latest version. You can find instructions for updating the NVIDIA driver on the NVIDIA website.

If none of the above steps resolve the issue, you can try creating a new compute instance with a different GPU and see if the issue persists.
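Where an interactive login is not practical, the same `nvidia-smi` driver query can be run from inside the process itself, for example during scoring-script startup. The sketch below assumes `nvidia-smi` is on the `PATH` whenever a driver is exposed to the container; `query_driver_versions` is a hypothetical helper name.

```python
import shutil
import subprocess


def query_driver_versions(timeout_s: float = 10.0):
    """Return the driver versions reported by nvidia-smi (one per GPU).

    Returns None when nvidia-smi is missing, fails, or times out, which
    usually means no NVIDIA driver is visible to this process.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # One line per GPU; drop empty lines and surrounding whitespace.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Logging the return value at startup gives a quick signal: `None` suggests the driver is not visible at all, while a list of versions suggests the problem lies elsewhere.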

Please refer to the following: NVIDIA GPU Driver Extension for Windows.
I hope this helps! Let us know if you have any further questions.
Pavankumar Purilla 3,080 Reputation points Microsoft Vendor

2025-01-24T18:39:04.4166667+00:00

Hi aot,
We haven’t heard back from you on the last response and are just checking in to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, respond with more details and we will try to help.
Pavankumar Purilla 3,080 Reputation points Microsoft Vendor

2025-01-29T00:25:28.8566667+00:00

Hi aot,
If you are using a compute cluster instead of a compute instance, you will not be able to log in to the cluster to check whether the NVIDIA driver is installed.

However, the compute cluster should have the necessary drivers and software installed to support GPU acceleration. If you are able to run training on the same type of compute cluster without any issues, it's possible that the issue is related to the environment or configuration of your online endpoint.

Here are a few things you can try to resolve the issue:

Check that the environment used for your online endpoint includes the necessary dependencies for GPU acceleration, including the NVIDIA driver and CUDA toolkit. You can do this by reviewing the environment file used for your online endpoint.

Try creating a new environment specifically for your online endpoint that includes the necessary dependencies for GPU acceleration. You can do this by creating a new conda environment or Docker image that includes the necessary dependencies.

Check that the configuration of your online endpoint is set up to use the GPU. You can do this by reviewing the configuration file used for your online endpoint.

If none of the above steps resolve the issue, you can try creating a new compute cluster with a different GPU and see if the issue persists.
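One way to act on the environment and configuration checks above is to log what PyTorch itself sees when the deployment starts. The following is a sketch under assumptions: `cuda_report` is a hypothetical helper, and it would be called from the scoring script's startup code so the output lands in the deployment logs.

```python
import logging
import os


def cuda_report() -> dict:
    """Collect basic facts about GPU visibility from inside the container."""
    report = {
        # Environment variables that can hide GPUs from the process.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report
    report["torch"] = torch.__version__
    # None here means a CPU-only PyTorch build was installed.
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logging.info("CUDA report: %s", cuda_report())
```

If `torch_cuda_build` is set but `cuda_available` is false, the environment is probably fine and the node or container is not exposing a GPU; if `torch_cuda_build` is `None`, the environment itself is CPU-only.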

I hope this helps you resolve the issue. If you have any further questions, please let us know.

1 answer

aot 66 Reputation points

2025-01-28T08:37:27.5033333+00:00

@Pavankumar Purilla Thank you for your reply and suggestions.

I worded my initial question incorrectly: I'm not using a compute instance, but rather a compute cluster. To my understanding, the compute cluster is spun up on demand when the endpoint is deployed, so I have no real option to log in beforehand and check whether an NVIDIA driver is installed on the cluster?

According to your own documentation, there should be both a GPU and full CUDA support:

https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nc-family#ncast4_v3-series

It's the exact same type of compute cluster I use for my pipeline training, where I do not face this issue. It is only for the endpoint deployment that this is happening.
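Since the question names Standard_DS4_v2 while the linked documentation covers the NCasT4_v3 series, it may be worth confirming which VM size the deployment actually requests. The sketch below is only a naming heuristic, assuming that Azure GPU-accelerated sizes belong to the NC, ND, NG, and NV families; `looks_like_gpu_sku` is a hypothetical helper, not part of any Azure SDK.

```python
import re

# Assumption: GPU-accelerated Azure VM sizes are named Standard_NC*, Standard_ND*,
# Standard_NG*, or Standard_NV*. This is a naming heuristic, not an API lookup.
_GPU_FAMILY = re.compile(r"^Standard_N[CDGV]", re.IGNORECASE)


def looks_like_gpu_sku(vm_size: str) -> bool:
    """Return True if the VM size name matches a known GPU family prefix."""
    return bool(_GPU_FAMILY.match(vm_size.strip()))


if __name__ == "__main__":
    for size in ("Standard_NC4as_T4_v3", "Standard_DS4_v2"):
        print(size, looks_like_gpu_sku(size))
```

Under this heuristic, an NCasT4_v3 size such as Standard_NC4as_T4_v3 matches, while Standard_DS4_v2 does not, so the instance type in the deployment definition is worth comparing against the one used for training.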