Prepare GPUs for Azure Stack HCI (preview)
Applies to: Azure Stack HCI, version 23H2
This article describes how to prepare graphics processing units (GPUs) on Azure Stack HCI for computation-intensive workloads, such as machine learning and deep learning, running on Arc virtual machines (VMs) and AKS enabled by Azure Arc.
Important
This feature is currently in PREVIEW. See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
Attaching GPUs on Azure Stack HCI
You can attach your GPUs in one of two ways for Azure Stack HCI:
Discrete Device Assignment (DDA) - allows you to dedicate a physical GPU to your workload. In a DDA deployment, virtualized workloads run on the native driver and typically have full access to the GPU's functionality. DDA offers the highest level of app compatibility and potential performance.
GPU Partitioning (GPU-P) - allows you to share a GPU with multiple workloads by splitting the GPU into dedicated fractional partitions.
Consider the following functionality and support differences between the two options of using your GPUs:
Description | Discrete Device Assignment | GPU Partitioning |
---|---|---|
GPU resource model | Entire device | Equally partitioned device |
VM density | Low (one GPU to one VM) | High (one GPU to many VMs) |
App compatibility | All GPU capabilities provided by vendor (DX 12, OpenGL, CUDA) | All GPU capabilities provided by vendor (DX 12, OpenGL, CUDA) |
GPU VRAM | Up to VRAM supported by the GPU | Up to VRAM supported by the GPU per partition |
GPU driver in guest | GPU vendor driver (NVIDIA) | GPU vendor driver (NVIDIA) |
Supported GPU models
To see the full list of supported solutions and GPUs available, see Azure Stack HCI Solutions and select GPU support in the left menu for options.
NVIDIA supports their workloads separately with their virtual GPU software. For more information, see Microsoft Azure Stack HCI - Supported NVIDIA GPUs and Validated Server Platforms.
For AKS workloads, see GPUs for AKS for Arc.
The following GPU models are supported using both DDA and GPU-P for Arc VM workloads:
- NVIDIA A2
- NVIDIA A16
These additional GPU models are supported using GPU-P (only) for Arc VM workloads:
- NVIDIA A10
- NVIDIA A40
- NVIDIA L4
- NVIDIA L40
- NVIDIA L40S
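The two lists above can be encoded as a small lookup table, which is handy when scripting pre-deployment checks. This is a minimal sketch: the `$SupportedModes` table and the `Get-SupportedGpuMode` helper are hypothetical names introduced here, and the data covers Arc VM workloads only, exactly as listed above.

```powershell
# Support matrix copied from the lists above (Arc VM workloads only).
$SupportedModes = @{
    'NVIDIA A2'   = @('DDA', 'GPU-P')
    'NVIDIA A16'  = @('DDA', 'GPU-P')
    'NVIDIA A10'  = @('GPU-P')
    'NVIDIA A40'  = @('GPU-P')
    'NVIDIA L4'   = @('GPU-P')
    'NVIDIA L40'  = @('GPU-P')
    'NVIDIA L40S' = @('GPU-P')
}

# Hypothetical helper: returns the attachment modes listed for a model,
# or nothing when the model does not appear in the lists above.
function Get-SupportedGpuMode {
    param([Parameter(Mandatory)][string]$Model)
    if ($SupportedModes.ContainsKey($Model)) { return $SupportedModes[$Model] }
    return @()
}

Get-SupportedGpuMode -Model 'NVIDIA A10'   # GPU-P
```

An empty result means only that the model is absent from this article's lists; check the catalog for the authoritative answer.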
Host requirements
Your Azure Stack HCI host must meet the following requirements:
Your system must support an Azure Stack HCI solution with GPU support. To browse your options, see the Azure Stack HCI Catalog.
You have access to an Azure Stack HCI, version 23H2 cluster.
You must create a homogeneous configuration for GPUs across all the servers in your cluster. A homogeneous configuration consists of installing the same make and model of GPU.
For GPU-P, ensure that virtualization support and SR-IOV are enabled in the BIOS of each server in the cluster. Contact your system vendor if you're unable to identify the correct setting in your BIOS.
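The homogeneity requirement can be checked with a short script once you have an inventory of GPU models per server. The sketch below assumes you have already collected that inventory into a hashtable (for example, by running `Get-PnpDevice` on each server); `Test-GpuHomogeneity` and the server names are hypothetical, and the check compares only the set of GPU models, as the requirement above describes.

```powershell
# Hypothetical helper: takes a hashtable of server name -> GPU friendly names
# and reports whether every server has the same set of GPU models.
function Test-GpuHomogeneity {
    param([Parameter(Mandatory)][hashtable]$GpusByServer)
    $signatures = foreach ($server in $GpusByServer.Keys) {
        # Sort and de-duplicate so that enumeration order does not matter.
        (@($GpusByServer[$server]) | Sort-Object -Unique) -join ';'
    }
    return (@($signatures | Select-Object -Unique).Count -eq 1)
}

# Example inventory; the server names are placeholders.
$inventory = @{
    'node1' = @('NVIDIA A2', 'NVIDIA A2')
    'node2' = @('NVIDIA A2', 'NVIDIA A2')
}
Test-GpuHomogeneity -GpusByServer $inventory   # True
```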
Prepare GPU drivers on each host
The process for preparing and installing GPU drivers for each host server differs somewhat between DDA and GPU-P. Follow the applicable process for your situation.
Find GPUs on each host
First, ensure that no GPU driver is installed on any host server. If a host driver is installed, uninstall it and restart the server.
After you uninstall the host driver, or if no driver was installed, run the following command in PowerShell as an administrator:
Get-PnpDevice -Status Error | fl FriendlyName, InstanceId
You should see the GPU devices appear in an error state as 3D Video Controller, as shown in the example output that lists the friendly name and instance ID of the GPU:
[ASRR1N26R02U46A]: PS C:\Users\HCIDeploymentUser\Documents> Get-PnpDevice -Status Error
Status Class FriendlyName
------ ----- ------------
Error SD Host Controller
Error 3D Video Controller
Error 3D Video Controller
Error USB Unknown USB Device (Device Descriptor Request Failed)
[ASRR1N26R02U46A]: PS C:\Users\HCIDeploymentUser\Documents> Get-PnpDevice -Status Error | fl InstanceId
InstanceId : PCI\VEN_8086&DEV_18DB&SUBSYS_7208086REV_11\3&11583659&0&E0
InstanceId : PCI\VEN_10DE&DEV_25B6&SUBSYS_157E10DE&REV_A1\4&23AD3A43&0&0010
InstanceId : PCI\VEN_10DE&DEV_25B6&SUBSYS_157E10DE&REV_A1\4&17F8422A&0&0010
InstanceId : USB\VID_0000&PID_0002\S&E492A46&0&2
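On a host with several devices in an error state, it can help to filter the list down to NVIDIA devices. The sketch below relies on the PCI vendor ID 10DE, which identifies NVIDIA and appears in the sample instance IDs above; `Select-NvidiaInstanceId` is a hypothetical helper name, not part of any module.

```powershell
# Hypothetical helper: keeps only PCI instance IDs whose vendor ID is 10DE (NVIDIA).
function Select-NvidiaInstanceId {
    param([string[]]$InstanceId)
    $InstanceId | Where-Object { $_ -like 'PCI\VEN_10DE&*' }
}

# Query the live host only where the Get-PnpDevice cmdlet is available.
if (Get-Command Get-PnpDevice -ErrorAction SilentlyContinue) {
    $errorIds = (Get-PnpDevice -Status Error -ErrorAction SilentlyContinue).InstanceId
    Select-NvidiaInstanceId -InstanceId $errorIds
}
```

The filtered instance IDs are the values to use in the DDA steps that follow.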
Using DDA
Follow this process if using DDA:
1. Disable and dismount GPUs from the host
For DDA, when you uninstall the host driver or set up a new Azure Stack HCI cluster, the physical GPU goes into an error state. You must dismount all the GPU devices to continue. You can use Device Manager or PowerShell to disable and dismount each GPU using the InstanceId obtained in the prior step.
$id1 = "GPU_instance_ID"
Disable-PnpDevice -InstanceId $id1 -Confirm:$false
Dismount-VMHostAssignableDevice -InstancePath $id1 -Force
Confirm the GPUs were correctly dismounted from the host. The GPUs will now be in an Unknown state:
Get-PnpDevice -Status Unknown | fl FriendlyName, InstanceId
Repeat this process for each server in the Azure Stack HCI cluster to prepare the GPUs.
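When a server has more than one GPU, the disable and dismount commands can be wrapped in a loop. The following is a sketch only: `Dismount-HostGpu` is a hypothetical wrapper around the two cmdlets used above, and the instance IDs are placeholders. It declares `SupportsShouldProcess` so you can preview the actions with `-WhatIf` before making any change.

```powershell
# Hypothetical wrapper around the commands above; supports -WhatIf for a dry run.
function Dismount-HostGpu {
    [CmdletBinding(SupportsShouldProcess)]
    param([Parameter(Mandatory)][string[]]$InstanceId)
    foreach ($id in $InstanceId) {
        # ShouldProcess returns $false under -WhatIf, so nothing is changed.
        if ($PSCmdlet.ShouldProcess($id, 'Disable and dismount GPU')) {
            Disable-PnpDevice -InstanceId $id -Confirm:$false
            Dismount-VMHostAssignableDevice -InstancePath $id -Force
        }
    }
}

# Placeholders: replace with the instance IDs found on this host.
$gpuIds = @('GPU_instance_ID_1', 'GPU_instance_ID_2')
Dismount-HostGpu -InstanceId $gpuIds -WhatIf
```

Remove `-WhatIf` once the previewed list of devices looks correct.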
2. Download and install the mitigation driver
The software might include components developed and owned by NVIDIA Corporation or its licensors. The use of these components is governed by the NVIDIA end user license agreement.
See the NVIDIA documentation to download the applicable NVIDIA mitigation driver. After downloading the driver, expand the archive and install the mitigation driver on each host server. Use the following PowerShell script to download the mitigation driver and extract it:
Invoke-WebRequest -Uri "https://docs.nvidia.com/datacenter/tesla/gpu-passthrough/nvidia_azure_stack_inf_v2022.10.13_public.zip" -OutFile "nvidia_azure_stack_inf_v2022.10.13_public.zip"
mkdir nvidia-mitigation-driver
Expand-Archive .\nvidia_azure_stack_inf_v2022.10.13_public.zip .\nvidia-mitigation-driver
Once the mitigation driver files are extracted, find the version for the correct model of your GPU and install it. For example, if you were installing an NVIDIA A2 mitigation driver, run the following:
pnputil /add-driver nvidia_azure_stack_A2_base.inf /install /force
To confirm the installation of these drivers, run:
pnputil /enum-devices
Alternatively, run pnputil /scan-devices to scan for hardware changes.
You should be able to see the correctly identified GPUs in Get-PnpDevice:
Get-PnpDevice -Class Display | fl FriendlyName, InstanceId
Repeat the above steps for each host in your Azure Stack HCI cluster.
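To locate the right INF file inside the extracted archive, you can search by model. This sketch assumes that the other INF files follow the same naming pattern as the A2 example above (the model name surrounded by underscores), which the source does not confirm; `Find-MitigationInf` is a hypothetical helper.

```powershell
# Hypothetical helper: lists INF files under a folder whose base name
# contains the model token between underscores (for example, "_A2_").
function Find-MitigationInf {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$ModelToken   # for example 'A2'
    )
    Get-ChildItem -Path $Path -Recurse -Filter '*.inf' |
        Where-Object { $_.BaseName -match "_$([regex]::Escape($ModelToken))_" }
}

# Search the folder created by the extraction step, if it exists.
if (Test-Path .\nvidia-mitigation-driver) {
    Find-MitigationInf -Path .\nvidia-mitigation-driver -ModelToken 'A2' |
        Select-Object -ExpandProperty FullName
}
```

Pass the resulting path to pnputil /add-driver as shown earlier.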
Using GPU-P
Follow this process if using GPU-P:
Download and install the host driver
GPU-P requires drivers on the host level that differ from DDA. For NVIDIA GPUs, you will need an NVIDIA vGPU software graphics driver on each host and on each VM that will use GPU-P. For more information, see the latest version of NVIDIA vGPU Documentation and details on licensing at Client Licensing User Guide.
After identifying the GPUs as 3D Video Controller on your host server, download the host vGPU driver. Through your NVIDIA GRID license, you should be able to obtain the proper host driver .zip file.
You will need to obtain and move the following folder to your host server: \vGPU_<Your_vGPU_version>_GA_Azure_Stack_HCI_Host_Drivers
Navigate to \vGPU_<Your_vGPU_version>_GA_Azure_Stack_HCI_Host_Drivers\Display.Driver and install the driver.
pnputil /add-driver .\nvgridswhci.inf /install /force
To confirm the installation of these drivers, run:
pnputil /enum-devices
You should be able to see the correctly identified GPUs in Get-PnpDevice:
Get-PnpDevice -Class Display | fl FriendlyName, InstanceId
You can also run the NVIDIA System Management Interface, nvidia-smi, to list the GPUs on the host server as follows:
nvidia-smi
If the driver is correctly installed, you will see an output similar to the following sample:
Wed Nov 30 15:22:36 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 527.27 Driver Version: 527.27 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A2 WDDM | 00000000:65:00.0 Off | 0 |
| 0% 24C P8 5W / 60W | 15192MiB / 15356MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A2 WDDM | 00000000:66:00.0 Off | 0 |
| 0% 24C P8 5W / 60W | 15192MiB / 15356MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
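For scripted checks, the table above is awkward to parse. nvidia-smi also offers a CSV query mode through its `--query-gpu` and `--format` options. The sketch below uses that mode and converts each line into an object; `ConvertFrom-NvidiaSmiCsv` is a hypothetical helper, and the three queried fields are an arbitrary choice for illustration.

```powershell
# Hypothetical helper: turns "index, name, memory.total" CSV lines into objects.
function ConvertFrom-NvidiaSmiCsv {
    param([string[]]$Line)
    foreach ($l in $Line) {
        if ([string]::IsNullOrWhiteSpace($l)) { continue }
        $fields = @($l.Split(',') | ForEach-Object { $_.Trim() })
        [pscustomobject]@{
            Index       = [int]$fields[0]
            Name        = $fields[1]
            MemoryTotal = $fields[2]
        }
    }
}

# Query the live host only where nvidia-smi is on the path.
if (Get-Command nvidia-smi -ErrorAction SilentlyContinue) {
    # Arguments are quoted so that PowerShell does not split them at the commas.
    $smiArgs = @('--query-gpu=index,name,memory.total', '--format=csv,noheader')
    ConvertFrom-NvidiaSmiCsv -Line (& nvidia-smi @smiArgs)
}
```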
Configure GPU partition count
Follow these steps to configure the GPU partition count in PowerShell:
Note
When using PowerShell, you must manually ensure the GPU configuration is homogeneous across all servers in your Azure Stack HCI cluster.
1. Connect to the server whose GPU partition count you want to configure.
2. Run the Get-VMHostPartitionableGpu command and refer to the Name and ValidPartitionCounts values.
3. Run the following command to configure the partition count. Replace GPU-name with the Name value and partition-count with one of the supported counts from the ValidPartitionCounts value:
Set-VMHostPartitionableGpu -Name "<GPU-name>" -PartitionCount <partition-count>
For example, the following command configures the partition count to 4:
PS C:\Users> Set-VMHostPartitionableGpu -Name "\\?\PCI#VEN_10DE&DEV_25B6&SUBSYS_157E10DE&REV_A1#4&18416dc3&0&0000#{064092b3-625e-43bf-9eb5-dc845897dd59}" -PartitionCount 4
4. Run the following command again to verify that the partition count is set to 4:
Get-VMHostPartitionableGpu | FL Name,ValidPartitionCounts,PartitionCount
Here's a sample output:
PS C:\Users> Get-VMHostPartitionableGpu | FL Name,ValidPartitionCounts,PartitionCount

Name                 : \\?\PCI#VEN_10DE&DEV_25B6&SUBSYS_157E10DE&REV_A1#4&18416dc3&0&0000#{064092b3-625e-43bf-9eb5-dc845897dd59}
ValidPartitionCounts : {16, 8, 4, 2...}
PartitionCount       : 4

Name                 : \\?\PCI#VEN_10DE&DEV_25B6&SUBSYS_157E10DE&REV_A1#4&5906f5e&0&0010#{064092b3-625e-43bf-9eb5-dc845897dd59}
ValidPartitionCounts : {16, 8, 4, 2...}
PartitionCount       : 4
To keep the configuration homogeneous, repeat the partition count configuration steps on each server in your Azure Stack HCI cluster.
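On servers with several GPUs, the partition count steps can be applied to every partitionable GPU in one pass. This is a sketch under assumptions: `Set-AllGpuPartitionCount` is a hypothetical wrapper around the two cmdlets used above, and it skips any GPU whose ValidPartitionCounts value does not include the requested count.

```powershell
# Hypothetical helper: sets the same partition count on every partitionable GPU
# on this host, skipping GPUs that do not list the count as valid.
function Set-AllGpuPartitionCount {
    [CmdletBinding(SupportsShouldProcess)]
    param([Parameter(Mandatory)][uint16]$PartitionCount)
    foreach ($gpu in @(Get-VMHostPartitionableGpu)) {
        if ($gpu.ValidPartitionCounts -notcontains $PartitionCount) {
            Write-Warning "$($gpu.Name) does not support $PartitionCount partitions."
            continue
        }
        if ($PSCmdlet.ShouldProcess($gpu.Name, "Set partition count to $PartitionCount")) {
            Set-VMHostPartitionableGpu -Name $gpu.Name -PartitionCount $PartitionCount
        }
    }
}

# Preview the change on hosts where the Hyper-V cmdlets are available.
if (Get-Command Get-VMHostPartitionableGpu -ErrorAction SilentlyContinue) {
    Set-AllGpuPartitionCount -PartitionCount 4 -WhatIf
}
```

Run the same call without `-WhatIf` on each server to keep the cluster configuration consistent.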
Guest requirements
GPU management is supported for the following Arc VM workloads:
Generation 2 VMs
A supported 64-bit OS, as detailed in the latest NVIDIA vGPU Supported Products documentation