Parent job is complete, but child job keeps running indefinitely, doing nothing

PaulodosSantosMendoncaHEHIM-3418 (Microsoft Employee)
2025-03-07T17:09:20.5766667+00:00

I have a parameter sweep job that produces the expected outputs, and its parent job runs to completion. However, the child jobs keep running indefinitely: I have tracked them for days, even though the job is simple and the parent is marked completed within a few minutes. The child jobs don't appear to be consuming resources, but I'm not sure. In any case, I now have dozens of zombie jobs running, and I'm getting a few pings about them. Does anyone have pointers on how to address this?


    #########################################
    # Job submission.
    # Create command.
    command_job = \
        aml_setup.create_command(
            experiment_name=experiment_name,
            inputs=all_inputs,
            outputs=outputs
        )

    # Create the job
    job = command_job.sweep(
        search_space=search_space,
        compute=aml_setup.compute_target,
        sampling_algorithm=RandomSamplingAlgorithm(seed=0, rule="sobol"),
        primary_metric="val_loss",
        goal="minimize"
    )

    ##########################################################################
    # VERY IMPORTANT: AVOID ZOMBIE JOBS!
    min_trial_count = 1
    for value in swept_inputs.values():
        min_trial_count *= len(value)
    job.set_limits(
        max_total_trials=min_trial_count,
        max_concurrent_trials=min_trial_count,
        trial_timeout=60 * 60,  # seconds per trial
    )
    ##########################################################################

    # Submit the job.
    submission = aml_setup.ml_client.jobs.create_or_update(job)
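
For reference, the trial-count computation in the snippet above can be checked in isolation. This is only a minimal sketch: it assumes `swept_inputs` is a dict mapping each swept parameter name to a list of candidate values, and the parameter names and values below are hypothetical, chosen for illustration. The helper name `grid_trial_count` is also hypothetical.

```python
from math import prod


def grid_trial_count(swept_inputs):
    """Return the number of distinct combinations in a discrete search space.

    Equivalent to the loop that multiplies len(value) over all swept inputs.
    An empty dict yields 1, matching the loop's initial value.
    """
    return prod(len(values) for values in swept_inputs.values())


# Hypothetical swept inputs, for illustration only.
swept_inputs = {
    "learning_rate": [1e-3, 1e-2, 1e-1],
    "batch_size": [16, 32],
    "dropout": [0.0, 0.5],
}

trial_count = grid_trial_count(swept_inputs)
print(trial_count)  # 3 * 2 * 2 = 12
```

With these values, both `max_total_trials` and `max_concurrent_trials` would be set to 12, so every combination is allowed to run at once.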
Azure Machine Learning