Azure Machine Learning
An Azure machine learning service for building and deploying models.
I have a parameter sweep job that produces the expected outputs, and its parent job runs to completion. However, the child jobs keep running indefinitely. I've tracked them for days, even though the job is simple and the parent is marked completed within a few minutes. The child jobs don't appear to be consuming resources, but I'm not sure. In any case, I now have dozens of zombie jobs running, and I'm getting a few pings. Does anyone have pointers on how to address the problem?
#########################################
# Job submission.
from azure.ai.ml.sweep import RandomSamplingAlgorithm

# Create command.
command_job = aml_setup.create_command(
    experiment_name=experiment_name,
    inputs=all_inputs,
    outputs=outputs,
)

# Create the sweep job from the command.
job = command_job.sweep(
    search_space=search_space,
    compute=aml_setup.compute_target,
    sampling_algorithm=RandomSamplingAlgorithm(seed=0, rule="sobol"),
    primary_metric="val_loss",
    goal="minimize",
)

##########################################################################
# VERY IMPORTANT: AVOID ZOMBIE JOBS!
# Cap the sweep at the number of distinct combinations of swept values.
min_trial_count = 1
for key, value in swept_inputs.items():
    min_trial_count *= len(value)

# trial_timeout is given in seconds (here, one hour per trial).
job.set_limits(
    max_total_trials=min_trial_count,
    max_concurrent_trials=min_trial_count,
    trial_timeout=60 * 60,
)
##########################################################################

# Submit the job.
submission = aml_setup.ml_client.jobs.create_or_update(job)
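
The trial-count calculation in the snippet above can be checked on its own, without touching Azure. This is a minimal sketch, assuming `swept_inputs` is a dict that maps each swept parameter name to a list of discrete values. The helper name `count_grid_trials` and the example parameter names and values are hypothetical and do not come from the original script.

```python
from math import prod


def count_grid_trials(swept_inputs):
    """Return the number of distinct value combinations in swept_inputs.

    Assumes each value is a sized collection (e.g. a list) of discrete choices.
    An empty dict yields 1, matching the original loop's starting value.
    """
    return prod(len(values) for values in swept_inputs.values())


# Hypothetical example values, for illustration only.
swept_inputs = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32],
}

trial_count = count_grid_trials(swept_inputs)
print(trial_count)
```

The result is the value passed as both `max_total_trials` and `max_concurrent_trials`, so printing it before submission shows how many child jobs the sweep is allowed to start.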