Trained YOLOv8 model on a compute cluster and all metrics are flat
Hi,
I have been training a YOLOv8 model on a compute cluster. The training completed successfully after 50 epochs, but all the metrics are flat, as if the model did not learn anything. The confusion matrix also shows no sign of learning, as if no data was used during training. I don't know what went wrong.
from azure.ai.ml import Input, command

# Run the training
job = command(
    inputs=dict(
        # Dataset registered as a uri_folder data asset
        training_data=Input(
            type="uri_folder",
            path="azureml:plandataset:2",
        ),
        # Registered model whose weights are used as the starting point
        model_to_train=Input(
            type="custom_model",
            path="azureml:yolov8m:2",
        ),
    ),
    code="/home/azureuser/cloudfiles/code/Users/model_training/training-code",
    # Point data.yaml at the mounted dataset, then start training
    command="""
    sed -i "s|path:.*$|path: ${{ inputs.training_data }}|" data.yaml &&
    yolo task=detect train data=data.yaml model=${{ inputs.model_to_train }} epochs=50 batch=4 amp=True project=train-environment name=experiment
    """,
    environment="azureml:train-environment:2",
    compute="mel-compute",
    display_name="train-environment",
    experiment_name="train-environment",
)
ml_client.create_or_update(job)
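To reason about what the sed step does to data.yaml before the yolo command runs, here is a minimal Python sketch of the same substitution. The function name and the mount path in the usage note are hypothetical; the real value of ${{ inputs.training_data }} is only known inside the job.

```python
import re


def rewrite_path_line(yaml_text: str, new_root: str) -> str:
    """Mimic `sed -i "s|path:.*$|path: NEW|"` applied to data.yaml."""
    out = []
    for line in yaml_text.splitlines():
        # Like sed without the g flag: replace the first match on each line,
        # from "path:" to the end of the line (a trailing comment is dropped).
        out.append(re.sub(r"path:.*$", lambda m: "path: " + new_root, line, count=1))
    return "\n".join(out)
```

For example, rewrite_path_line(text, "/mnt/example/INPUT_training_data") turns the line "path: ../datasets/dataset # dataset root dir" into "path: /mnt/example/INPUT_training_data" and leaves the train, val and test lines untouched. Printing the rewritten data.yaml inside the job (for instance with cat data.yaml before the yolo call) would show whether the path really points at the mounted dataset.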
Dataset folder on my local machine
And here's how I uploaded it to the ML workspace:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Create AzureML dataset
my_data = Data(
    path="dataset",
    type=AssetTypes.URI_FOLDER,
    description="Plans dataset",
    name="plandataset",
)
ml_client.data.create_or_update(my_data)
data.yaml
path: ../datasets/dataset # dataset root dir
train: images/train # train images (relative to 'path') 128 images
val: images/val # val images (relative to 'path') 128 images
test: images/test # test images (optional)
nc: 22
# I am not posting the classes name because of confidentiality
The dataset path in the data asset looks like this, with train, test and valid folders for both images and labels
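One way to rule out a dataset layout problem is to count, for each split, how many images have a matching label file and whether every class id is below nc. The sketch below assumes the usual YOLO detection layout (images/<split> next to labels/<split>, one .txt per image with rows of "class_id x_center y_center width height"). The function name, the extension list and the report keys are hypothetical.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed image extensions


def check_split(root, split, nc):
    """Count images, missing/empty label files and bad class ids for one split."""
    image_dir = Path(root) / "images" / split
    label_dir = Path(root) / "labels" / split
    report = {"images": 0, "missing_labels": 0, "empty_labels": 0, "bad_class_ids": 0}
    if not image_dir.is_dir():
        report["error"] = f"missing folder: {image_dir}"
        return report
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        report["images"] += 1
        label = label_dir / (image.stem + ".txt")
        if not label.is_file():
            report["missing_labels"] += 1
            continue
        rows = [ln.split() for ln in label.read_text().splitlines() if ln.strip()]
        if not rows:
            report["empty_labels"] += 1
        for row in rows:
            # First field of each row is the class id; it must be in [0, nc).
            if not row[0].isdigit() or int(row[0]) >= nc:
                report["bad_class_ids"] += 1
    return report
```

Running check_split(root, split, 22) for "train", "val" and "test" against the mounted folder would show whether any split is empty or unlabeled. In particular, data.yaml refers to images/val while the folders are described as valid, so if the folder is really named valid, the val entry would not resolve and the report would contain an "error" key.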
Here's the confusion matrix after the training completed, showing that the model did not learn anything
args.yaml file
task: detect
mode: train
model: /mnt/azureml/cr/j/213a1becbd584dc98dbd30862504a442/cap/data-capability/wd/INPUT_model_to_train/best.pt
data: data.yaml
epochs: 50
patience: 50
batch: 4
imgsz: 1824
save: true
save_period: -1
cache: false
device: null
workers: 8
project: train-environment
name: experiment
exist_ok: false
pretrained: true
optimizer: auto
verbose: true
seed: 0
deterministic: true
single_cls: false
rect: false
cos_lr: false
close_mosaic: 10
resume: false
amp: true
fraction: 1.0
profile: false
freeze: null
overlap_mask: true
mask_ratio: 4
dropout: 0.0
val: true
split: val
save_json: false
save_hybrid: false
conf: null
iou: 0.7
max_det: 300
half: false
dnn: false
plots: true
source: null
show: false
save_txt: false
save_conf: false
save_crop: false
show_labels: true
show_conf: true
vid_stride: 1
stream_buffer: false
line_width: null
visualize: false
augment: false
agnostic_nms: false
classes: null
retina_masks: false
boxes: true
format: torchscript
keras: false
optimize: false
int8: false
dynamic: false
simplify: false
opset: null
workspace: 4
nms: false
lr0: 0.01
lrf: 0.01
momentum: 0.937
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
warmup_bias_lr: 0.1
box: 7.5
cls: 0.5
dfl: 1.5
pose: 12.0
kobj: 1.0
label_smoothing: 0.0
nbs: 64
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 0.0
translate: 0.1
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
mosaic: 1.0
mixup: 0.0
copy_paste: 0.0
cfg: null
tracker: botsort.yaml
save_dir: train-environment/experiment
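To put a number on "flat", the per-epoch results file can be scanned for columns that never change. This is a minimal sketch, assuming the run wrote a results.csv with one row per epoch into save_dir. The function name is hypothetical, and the check makes no assumption about column names since those vary between versions.

```python
import csv
import math


def flat_columns(csv_path, tol=1e-9):
    """Return names of numeric columns whose values never change (or are all NaN)."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh, skipinitialspace=True))
    flat = []
    if not rows:
        return flat
    for name in rows[0]:
        try:
            values = [float(r[name]) for r in rows]
        except (TypeError, ValueError):
            continue  # skip non-numeric or incomplete columns
        finite = [v for v in values if math.isfinite(v)]
        # A column is flat if it has no finite values or a negligible range.
        if not finite or max(finite) - min(finite) <= tol:
            flat.append(name.strip())
    return flat
```

Calling flat_columns("train-environment/experiment/results.csv") lists the columns that stayed constant. If the losses are flat or NaN as well as the validation metrics, that points toward the training data or labels not being read, rather than a validation-only problem.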
How can I debug this and find out what went wrong?