Parent job is complete, but child job keeps running indefinitely, doing nothing
I have an Azure ML sweep job that produces the outputs I expect and is marked as completed. However, when I looked more closely (after being pinged about it a few times), I saw that its child jobs are still running and keep running indefinitely. They don't appear to be doing anything that consumes resources, although I could be wrong about that, and I cannot cancel them: when I try, I get the message "Jobs that cannot be canceled (1):" followed by the job's name. I cannot cancel the parent either, since it is already completed. I could delete the parent job, but I'd rather not do that to a successful job.
The job is a parameter sweep over DeepSpeed training of a simple model, for tutorial purposes. Since this is a public forum I'd prefer to be careful with details, but the bulk of the job submission code is shown below. Does anyone have advice on how I should proceed to address this issue?
#########################################
# Job submission.
# Sampling algorithm class used in the sweep below (Azure ML SDK v2).
from azure.ai.ml.sweep import RandomSamplingAlgorithm
# Create command.
command_job = \
    aml_setup.create_command(
        experiment_name=experiment_name,
        inputs=all_inputs,
        outputs=outputs
    )
# Create the job
job = command_job.sweep(
    search_space=search_space,
    compute=aml_setup.compute_target,
    sampling_algorithm=RandomSamplingAlgorithm(seed=0, rule="sobol"),
    primary_metric="val_loss",
    goal="minimize"
)
##########################################################################
# VERY IMPORTANT: AVOID ZOMBIE JOBS!
min_trial_count = 1
for key, value in swept_inputs.items():
    min_trial_count *= len(value)
job.set_limits(max_total_trials=min_trial_count, max_concurrent_trials=min_trial_count, trial_timeout=60*60)
##########################################################################
# Submit the job.
submission = aml_setup.ml_client.jobs.create_or_update(job)
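For context on the limits set above: `min_trial_count` is the product of the number of candidate values per swept input, so the sweep is capped at one trial per combination. Below is a minimal standalone sketch of that computation, assuming `swept_inputs` maps each parameter name to a list of candidate values. The parameter names and values are hypothetical placeholders, not the real search space.

```python
import math

# Hypothetical swept inputs: parameter name -> list of candidate values.
# (Placeholder names and values, for illustration only.)
swept_inputs = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32],
}

# Equivalent to the explicit loop above: multiply the number of
# candidate values across all swept parameters.
min_trial_count = math.prod(len(values) for values in swept_inputs.values())

print(min_trial_count)  # 6
```

With three and two candidate values this gives 6, so both `max_total_trials` and `max_concurrent_trials` would be set to 6 in this example.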