Parent job is complete, but child job keeps running indefinitely, doing nothing

PaulodosSantosMendoncaHEHIM-3418 0 Reputation points Microsoft Employee
2025-03-07T17:03:27.3333333+00:00

I have an Azure ML sweep job that produces the outputs I expected and is marked as completed. However, when I dug deeper (and got pinged a few times), I saw that the child jobs are still running and keep running indefinitely. They don't seem to be doing anything that consumes resources, although I could be wrong about that, and I cannot cancel them. When I try, I get the message "Jobs that cannot be canceled (1):" followed by the job's name. I cannot cancel the parent either, since it is already completed. I could delete the parent job, but I'd rather not do that to a successful job.

The job is a parameter sweep over DeepSpeed training of a simple model, for tutorial purposes. Since this is a public forum I'd prefer to be careful with details, but the bulk of the submission code is shown below. Does anyone have any advice on how I should proceed to address this issue?

    #########################################
    # Job submission.
    # Create command.
    command_job = aml_setup.create_command(
        experiment_name=experiment_name,
        inputs=all_inputs,
        outputs=outputs,
    )

    # Create the job
    job = command_job.sweep(
        search_space=search_space,
        compute=aml_setup.compute_target,
        sampling_algorithm=RandomSamplingAlgorithm(seed=0, rule="sobol"),
        primary_metric="val_loss",
        goal="minimize"
    )

    ##########################################################################
    # VERY IMPORTANT: AVOID ZOMBIE JOBS!
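    # Cap the sweep at one trial per combination of swept values.
    min_trial_count = 1
    for values in swept_inputs.values():
        min_trial_count *= len(values)
    job.set_limits(
        max_total_trials=min_trial_count,
        max_concurrent_trials=min_trial_count,
        trial_timeout=60 * 60,  # seconds, i.e. one hour per trial
    )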
    ##########################################################################

    # Submit the job.
    submission = aml_setup.ml_client.jobs.create_or_update(job)
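
For reference, the trial limit set above is the size of the full grid over the swept inputs. Here is a minimal standalone sketch of that calculation, assuming `swept_inputs` maps each parameter name to a list of candidate values. The helper name `count_grid_trials` and the example values are hypothetical and not taken from the actual job.

```python
import math


def count_grid_trials(swept_inputs):
    """Return the number of combinations in the full grid of swept values.

    Assumes swept_inputs is a dict mapping parameter names to lists of
    candidate values (an assumption about the original job's data shape).
    """
    return math.prod(len(values) for values in swept_inputs.values())


# Hypothetical search space, for illustration only.
example_inputs = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32],
}

trial_count = count_grid_trials(example_inputs)
print(trial_count)
```

With no swept inputs, the product is 1, which matches the loop in the submission code starting from `min_trial_count = 1`.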

Azure Machine Learning