Uploading Multiple Folders for Remote Job via Azure Machine Learning Python SDK (v2)

Question

I am working with the Python SDK (v2) of Azure Machine Learning.

I want to launch a training script on a serverless compute by using jobs. Typically, I create and launch a job using the Python command shown below.

The problem I’m facing is that my workspace has the source code, the experiment script, and the utility functions organized in separate folders. However, the command function in the SDK allows uploading only a single folder. I prefer not to restructure my entire codebase to fit AzureML's requirements. Is there a way to upload multiple folders?

from azure.ai.ml import command, Input

job = command(
    inputs=dict(...),
    code='path/to/code/folder',
    command='python main.py --train-data ${{inputs.train_data}} --test-data ${{inputs.test_data}}',
    environment="...",
    experiment_name='...'
)

ml_client.create_or_update(job)

Accepted Answer

Hi Diego STUCCHI,
I hope you are doing great!
Thank you for your response and for editing the original answer. On behalf of Sina Salam , I'm posting the updated answer here in case you'd like to accept it.

The fact is that the issue arises because the command function in the Azure Machine Learning Python SDK (v2) supports only one folder using the code argument, and you have separate folders for source code, experiment script, and utility functions. Unfortunately, Azure Machine Learning SDK v2 does not currently support uploading multiple folders directly through the command interface. I wanted to avoid any overhead at the same time, but one practical solution is to consolidate these directories by creating a ZIP file containing all your folders. You can then upload this ZIP as the code folder, and Azure ML will automatically unzip it on the compute instance. This avoids altering or restructuring your codebase, and in the below is how you can modify your code:

First, create a ZIP archive of your directories using bash command: zip -r code_archive.zip path/to/source path/to/experiment path/to/utils

Then, use the following Python script:

   from azure.ai.ml import command, Input
   job = command(
       inputs=dict(
           train_data=Input(type="uri_file", path="path/to/train_data"),
           test_data=Input(type="uri_file", path="path/to/test_data"),
       ),
       code="path/to/code_archive.zip",  # Use the ZIP file as the code path
       command='python path/to/experiment/main.py --train-data ${{inputs.train_data}} --test-data ${{inputs.test_data}}',
       environment="azureml:",
       experiment_name='my_experiment'
   )
   ml_client.create_or_update(job)

My thought is to preserves your codebase structure and works within the current SDK constraints.

Check more details in the Documentation - https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-azureml-sdk and Azure ML Job Submission Best Practices - https://learn.microsoft.com/en-us/azure/machine-learning/how-to-submit-jobs

Hope this helps. Do let us know if you have any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful.

Answer

Hello Diego STUCCHI,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you would like to upload multiple folders for Remote Job via Azure Machine Learning Python SDK (v2).

When working with the Azure Machine Learning Python SDK (v2) to upload multiple folders for a remote job, there are several methods to consider. Each method has its pros and cons, and it’s important to choose the one that best fits your needs.

Methods to Upload Multiple Folders

Using a ZIP File,
Using a Custom Docker Image,
Using AzureML Data Assets,
Using the code_paths Parameter.

Best/Optimal Approach is code_paths Parameter.

The code_paths parameter in the AzureML SDK v2 supports multiple folders directly. This method avoids the hassle of compressing files or building custom containers. This is how you can implement it:

from azure.ai.ml import command, Input
job = command(
    inputs=dict(
        train_data=Input(type="uri_file", path="path/to/train_data"),
        test_data=Input(type="uri_file", path="path/to/test_data"),
    ),
    code_paths=[
        "path/to/source",
        "path/to/experiment",
        "path/to/utils"
    ],
    command='python path/to/experiment/main.py --train-data ${{inputs.train_data}} --test-data ${{inputs.test_data}}',
    environment="azureml:",
    experiment_name='my_experiment'
)
ml_client.create_or_update(job)

Using a ZIP File.

Should there be any issues with the code_paths parameter, another practical solution is to consolidate your directories into a ZIP file. This method works within the current SDK constraints and preserves your codebase structure. This is how you can do it:

Use the following bash command to create a ZIP file containing all your folders:

zip -r code_archive.zip path/to/source path/to/experiment path/to/utils

Modify Your Python Script: Use the ZIP file as the code path in your script:

from azure.ai.ml import command, Input
job = command(
    inputs=dict(
        train_data=Input(type="uri_file", path="path/to/train_data"),
        test_data=Input(type="uri_file", path="path/to/test_data"),
    ),
    code="path/to/code_archive.zip",  # Use the ZIP file as the code path
    command='python path/to/experiment/main.py --train-data ${{inputs.train_data}} --test-data ${{inputs.test_data}}',
    environment="azureml:",
    experiment_name='my_experiment'
)
ml_client.create_or_update(job)

Check more details in the Documentation - https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-azureml-sdk and Azure ML Job Submission Best Practices - https://learn.microsoft.com/en-us/azure/machine-learning/how-to-submit-jobs

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.

Share via

Uploading Multiple Folders for Remote Job via Azure Machine Learning Python SDK (v2)

1 additional answer

Your answer