Azure Machine Learning job does not utilize the provisioned GPU (Standard_NC6s_v3) when submitted via the Python SDK

Dinesh Selvam 0 Reputation points
2025-03-11T10:13:18.94+00:00

I am fine-tuning the Phi-3-small-instruct-128K model using the Azure Machine Learning Python SDK. The training job runs successfully, with no errors or exceptions. However, the GPU provisioned for the job is not being utilized during training.

Below are the details of the setup:

Model Name: Phi-3-small-instruct-128K

Environment: azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/29

GPU Provisioned: Standard_NC6s_v3

Despite the absence of errors, the lack of GPU utilization is hindering the overall performance of the fine-tuning process. I need assistance in identifying the cause of this issue and implementing a solution to ensure that the job utilizes the GPU effectively.
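To narrow down the cause, a check along the following lines could be run at the start of train.py (the script itself is not shown in this post, so the placement is an assumption). It is a minimal sketch that only reports whether the Python process can see a CUDA device; the function name `describe_gpu_visibility` is made up for illustration.

```python
import os


def describe_gpu_visibility():
    """Collect basic facts about whether this process can see a CUDA device."""
    info = {
        # An empty string here hides all GPUs from CUDA applications.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
        "device_names": [],
    }
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing more to query.
        return info

    info["torch_installed"] = True
    # None means a CPU-only PyTorch build is installed.
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_names"] = [
            torch.cuda.get_device_name(i) for i in range(info["device_count"])
        ]
    return info


if __name__ == "__main__":
    for key, value in describe_gpu_visibility().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False in the job logs, the problem would lie in the environment or compute target. If it is True, the training code itself may not be moving the model and batches to the GPU.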


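from azure.ai.ml import Input, Output, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.entities import ManagedIdentityConfiguration

# Assumptions: ml_client is an authenticated MLClient created earlier, and
# task_name is a Python string defined earlier (neither is shown here).
job = command(
    code="./",  # Path to your training script folder
    # Plain strings rather than f-strings, so "${{...}}" reaches Azure ML unchanged.
    # Each fragment ends with a space so the arguments do not run together.
    command=(
        "python train.py --train_data ${{inputs.train_data}} "
        "--output_dir ${{outputs.model_output}} "
        "--use_qlora --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 "
        "--task_name ${{inputs.task_name}}"
    ),
    environment="azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/29",
    compute="gpu-cluster",  # Replace with your compute target
    inputs={
        "train_data": Input(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/ftdatain/paths/valid_chat_completion_data.jsonl",
            mode=InputOutputModes.RO_MOUNT,
        ),
        "task_name": task_name,  # Assumption: passed as a literal string input
    },
    outputs={
        "model_output": Output(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/models/paths/models",
            mode=InputOutputModes.RW_MOUNT,
        ),
    },
    identity=ManagedIdentityConfiguration(),
)

# Submit the job
ml_client.jobs.create_or_update(job)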
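
One pitfall worth ruling out when the command string is assembled in Python: adjacent string literals are concatenated with no separator, and an f-string turns `{{` and `}}` into single braces. The sketch below only builds and prints two strings to show the difference; it does not call Azure ML, and the variable names `with_f` and `plain` are illustrative.

```python
# f-string fragments: "{{" / "}}" are escapes for single literal braces,
# and the first fragment has no trailing space.
with_f = (
    f"python train.py --train_data ${{inputs.train_data}}"
    f"--output_dir ${{outputs.model_output}}"
)

# Plain string fragments: the double braces are kept as written,
# and each fragment ends with a space before the next argument.
plain = (
    "python train.py --train_data ${{inputs.train_data}} "
    "--output_dir ${{outputs.model_output}}"
)

print(with_f)
print(plain)
```

Printing the string before submitting the job makes it easy to confirm that the placeholders and argument boundaries look as intended.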
Azure Machine Learning