PyTorch not finding GPU when using Azure ML online endpoint

aot 66 Reputation points
2025-01-23T14:34:11.0533333+00:00

I'm trying to deploy an Azure ML managed online endpoint that will run my model inference flow using PyTorch-based models. The endpoint is set up to use a Standard_DS4_v2 compute cluster and uses an environment based on one of the slightly older curated acpt-pytorch environments available through Azure ML Studio.

When I try to deploy my endpoint, the deployment fails while initializing my models, with the following error:

ERROR:root:Error initializing model: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver

If I deploy the endpoint with CUDA disabled and simply score on the CPU, the endpoint deploys as expected.
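For reference, the CPU fallback described above can be sketched roughly as follows. This is a minimal illustration rather than the actual scoring script: the helper name `pick_device` and the `prefer_gpu` flag are hypothetical, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
import logging


def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu".

    Hypothetical helper for illustration; not part of Azure ML or PyTorch.
    """
    if not prefer_gpu:
        # CUDA explicitly disabled: always score on the CPU.
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        logging.warning("PyTorch is not installed; falling back to CPU")
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    logging.warning("No usable CUDA device or driver found; falling back to CPU")
    return "cpu"
```

A scoring script could then load the model with something like `model.to(pick_device())`, so a missing driver would produce a warning in the logs instead of a failed deployment.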

I have no issues training the model on the same type of compute as mentioned above, with full GPU support. Identical environments are used for training and for endpoint deployment.
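One way to compare what the training job and the endpoint container actually see is to log a small GPU report from both. The sketch below is an assumption about how that could be done, not an official diagnostic: `gpu_report` is a hypothetical helper, and it only uses `nvidia-smi -L` (if the binary is on the PATH) plus standard `torch.cuda` queries.

```python
import logging
import shutil
import subprocess


def gpu_report() -> dict:
    """Collect what the current container can see of the NVIDIA stack.

    Hypothetical helper for illustration; not part of Azure ML or PyTorch.
    """
    report = {
        "nvidia_smi_path": shutil.which("nvidia-smi"),  # None if the tool is absent
        "nvidia_smi_output": None,
        "torch_version": None,
        "torch_cuda_build": None,  # CUDA version PyTorch was built against
        "cuda_available": False,
        "device_count": 0,
    }
    if report["nvidia_smi_path"]:
        try:
            # "nvidia-smi -L" lists the GPUs visible to the driver.
            result = subprocess.run(
                [report["nvidia_smi_path"], "-L"],
                capture_output=True, text=True, timeout=30,
            )
            report["nvidia_smi_output"] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi_output"] = f"failed to run: {exc}"
    try:
        import torch
    except ImportError:
        return report  # PyTorch missing: report only the driver-level findings
    report["torch_version"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logging.info("GPU report: %s", gpu_report())
```

Calling `gpu_report()` at the start of the scoring script's `init()` and once in a training step would put both results in the respective logs, where they can be compared side by side.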

Any suggestions as to why my endpoint cannot deploy when I want to use the GPU?

Azure Machine Learning

1 answer

  1. aot 66 Reputation points
    2025-01-28T08:37:27.5033333+00:00

    @Pavankumar Purilla Thank you for your reply and suggestions.

    I worded my initial question incorrectly: I'm not using a compute instance, but rather a compute cluster. As I understand it, the compute cluster is spun up on demand when the endpoint is deployed, so I have no real option to log in beforehand and check whether an NVIDIA driver is installed on the cluster.

    According to your own documentation, there should be both a GPU and full CUDA support:

    https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nc-family#ncast4_v3-series

    It's the exact same type of compute cluster I use for my pipeline training, where I do not face this issue. It is only for the endpoint deployment that this is happening.

