Azure Machine Learning job submitted via the Python SDK is not utilizing the provisioned GPU (Standard_NC6s_v3)
I am currently working on fine-tuning a model called Phi-3-small-instruct-128K using the Azure Machine Learning Python SDK. The training job is running successfully, and there are no errors or exceptions thrown. However, the issue I'm facing is that the GPU resources that were provisioned for the job are not being utilized during the training process.
Below are the details of the setup:
Model Name: Phi-3-small-instruct-128K
Environment: azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/29
GPU Provisioned: Standard_NC6s_v3
Despite the absence of errors, the lack of GPU utilization is hindering the overall performance of the fine-tuning process. I need assistance in identifying the cause of this issue and implementing a solution to ensure that the job utilizes the GPU effectively.
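For anyone hitting the same symptom: a common cause is a training script that never moves the model and batches onto the CUDA device, so the job silently trains on CPU. Below is a minimal sketch of the check to add near the top of train.py, assuming a PyTorch script (the acpt-pytorch-2.2-cuda12.1 environment ships PyTorch); the `nn.Linear` model here is just a stand-in for the real fine-tuning model:

```python
import torch
import torch.nn as nn

# Pick the GPU if the CUDA runtime sees one; otherwise fall back to CPU.
# If this prints "CUDA available: False" inside the job, the environment
# or node is the problem, not the script.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"CUDA available: {torch.cuda.is_available()}, using device: {device}")

# Every module and every batch must be moved explicitly; a model left on
# the CPU trains on the CPU even on a Standard_NC6s_v3 node.
model = nn.Linear(16, 4).to(device)
batch = torch.randn(8, 16).to(device)
out = model(batch)
print(out.shape)  # torch.Size([8, 4])
```

Checking the job's user logs for the `CUDA available: ...` line quickly tells you whether the script ever saw the GPU.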
from azure.ai.ml import Input, Output, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.entities import ManagedIdentityConfiguration

job = command(
    code="./",  # Path to your training script folder
    # The ${{...}} placeholders must reach AzureML unescaped, so these
    # segments must NOT be f-strings (an f-string collapses {{ to {).
    # Only the segment interpolating the local task_name variable keeps
    # the f prefix. Note also the space after ${{inputs.train_data}},
    # which was missing and glued two arguments together.
    command=(
        "python train.py --train_data ${{inputs.train_data}} "
        "--output_dir ${{outputs.model_output}} "
        f"--use_qlora --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 "
        f"--task_name {task_name}"  # task_name: Python variable defined earlier
    ),
    environment="azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/29",
    compute="gpu-cluster",  # Replace with your compute target
    inputs={
        "train_data": Input(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/ftdatain/paths/valid_chat_completion_data.jsonl",
            mode=InputOutputModes.RO_MOUNT,
        )
    },
    outputs={
        "model_output": Output(
            # The job writes a directory of model files, so URI_FOLDER is
            # the appropriate type (URI_FILE expects a single file).
            type=AssetTypes.URI_FOLDER,
            path="azureml://datastores/models/paths/models",
            mode=InputOutputModes.RW_MOUNT,
        ),
    },
    identity=ManagedIdentityConfiguration(),
)

# Submit the job
ml_client.jobs.create_or_update(job)
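One subtle issue in the command construction as originally posted: the f-string prefix changes what AzureML receives. In an f-string, `{{` escapes to `{`, so the `${{inputs.train_data}}` placeholder the service expects arrives as `${inputs.train_data}` and is never substituted. A small self-contained demonstration:

```python
# A plain string keeps the double braces AzureML substitutes at runtime;
# an f-string collapses them, producing a placeholder the service ignores.
azureml_style = "python train.py --train_data ${{inputs.train_data}}"
fstring_style = f"python train.py --train_data ${{inputs.train_data}}"

print(azureml_style)  # python train.py --train_data ${{inputs.train_data}}
print(fstring_style)  # python train.py --train_data ${inputs.train_data}
```

If the placeholder is mangled, the script's `--train_data` argument is the literal text `${inputs.train_data}` rather than a mounted path, which can make a run appear "successful" while doing no useful work on the GPU.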