Azure Machine Learning job submitted via the Python SDK is not utilizing the provisioned GPU (Standard_NC6s_v3)
I am currently working on fine-tuning a model called Phi-3-small-instruct-128K using the Azure Machine Learning Python SDK. The training job is running successfully, and there are no errors or exceptions thrown. However, the issue I'm facing is that the GPU resources that were provisioned for the job are not being utilized during the training process.
Below are the details of the setup:
Model Name: Phi-3-small-instruct-128K
Environment: azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/29
GPU Provisioned: Standard_NC6s_v3
Despite the absence of errors, the lack of GPU utilization is hindering the overall performance of the fine-tuning process. I need assistance in identifying the cause of this issue and implementing a solution to ensure that the job utilizes the GPU effectively.
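For anyone hitting the same symptom: a common cause is a training script that never moves the model and batches onto the CUDA device, so the job silently trains on CPU. Below is a minimal sketch of the check to add near the top of train.py, assuming a PyTorch script (the acpt-pytorch-2.2-cuda12.1 environment ships PyTorch); the `nn.Linear` model here is just a stand-in for the real fine-tuning model:

```python
import torch
import torch.nn as nn

# Pick the GPU if the CUDA runtime sees one; otherwise fall back to CPU.
# If this prints "CUDA available: False" inside the job, the environment
# or node is the problem, not the script.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"CUDA available: {torch.cuda.is_available()}, using device: {device}")

# Every module and every batch must be moved explicitly; a model left on
# the CPU trains on the CPU even on a Standard_NC6s_v3 node.
model = nn.Linear(16, 4).to(device)
batch = torch.randn(8, 16).to(device)
out = model(batch)
print(out.shape)  # torch.Size([8, 4])
```

Checking the job's user logs for the `CUDA available: ...` line quickly tells you whether the script ever saw the GPU.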
from azure.ai.ml import Input, Output, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.entities import ManagedIdentityConfiguration

job = command(
    code="./",  # Path to your training script folder
    # The ${{...}} placeholders must reach AzureML unescaped, so these
    # segments must NOT be f-strings (an f-string collapses {{ to {).
    # Only the segment interpolating the local task_name variable keeps
    # the f prefix. Note also the space after ${{inputs.train_data}},
    # which was missing and glued two arguments together.
    command=(
        "python train.py --train_data ${{inputs.train_data}} "
        "--output_dir ${{outputs.model_output}} "
        f"--use_qlora --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 "
        f"--task_name {task_name}"  # task_name: Python variable defined earlier
    ),
    environment="azureml://registries/azureml/environments/acpt-pytorch-2.2-cuda12.1/versions/29",
    compute="gpu-cluster",  # Replace with your compute target
    inputs={
        "train_data": Input(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/ftdatain/paths/valid_chat_completion_data.jsonl",
            mode=InputOutputModes.RO_MOUNT,
        )
    },
    outputs={
        "model_output": Output(
            # The job writes a directory of model files, so URI_FOLDER is
            # the appropriate type (URI_FILE expects a single file).
            type=AssetTypes.URI_FOLDER,
            path="azureml://datastores/models/paths/models",
            mode=InputOutputModes.RW_MOUNT,
        ),
    },
    identity=ManagedIdentityConfiguration(),
)

# Submit the job
ml_client.jobs.create_or_update(job)
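One subtle issue in the command construction as originally posted: the f-string prefix changes what AzureML receives. In an f-string, `{{` escapes to `{`, so the `${{inputs.train_data}}` placeholder the service expects arrives as `${inputs.train_data}` and is never substituted. A small self-contained demonstration:

```python
# A plain string keeps the double braces AzureML substitutes at runtime;
# an f-string collapses them, producing a placeholder the service ignores.
azureml_style = "python train.py --train_data ${{inputs.train_data}}"
fstring_style = f"python train.py --train_data ${{inputs.train_data}}"

print(azureml_style)  # python train.py --train_data ${{inputs.train_data}}
print(fstring_style)  # python train.py --train_data ${inputs.train_data}
```

If the placeholder is mangled, the script's `--train_data` argument is the literal text `${inputs.train_data}` rather than a mounted path, which can make a run appear "successful" while doing no useful work on the GPU.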