Use parallel jobs in pipelines

09/26/2024

Applies to: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

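This article explains how to use CLI v2 and Python SDK v2 to run parallel jobs in Azure Machine Learning pipelines. Parallel jobs speed up job execution by distributing repeated tasks across powerful multinode compute clusters.

Machine learning engineers always have scale requirements for their training or inference tasks. For example, when a data scientist provides a single script to train a sales prediction model, machine learning engineers need to apply this training task to each individual data store. The challenges of this scale-out process include long execution times that cause delays, and unexpected issues that require manual intervention to keep the task running.

The core job of Azure Machine Learning parallelization is to split a single serial task into mini-batches and dispatch those mini-batches to multiple computes to run in parallel. Parallel jobs significantly reduce end-to-end execution time and also handle errors automatically. Consider using Azure Machine Learning parallel jobs to train many models on top of partitioned data or to accelerate large-scale batch inference tasks.

For example, in a scenario where you run an object detection model on a large set of images, Azure Machine Learning parallel jobs let you easily distribute the images so that your custom code runs in parallel on a specific compute cluster. Parallelization can significantly reduce time cost. Azure Machine Learning parallel jobs can also simplify and automate your process to make it more efficient.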

Prerequisites

Have an Azure Machine Learning account and workspace.
Understand Azure Machine Learning pipelines.

Azure CLI

Install the Azure CLI and the ml extension. For more information, see "Install, set up, and use the CLI (v2)". The ml extension automatically installs the first time you run an az ml command.
Understand how to create and run Azure Machine Learning pipelines and components with CLI v2.

Create and run a pipeline with a parallel job step

An Azure Machine Learning parallel job can be used only as a step in a pipeline job.

The following examples come from "Run a pipeline job using parallel job in pipeline" in the Azure Machine Learning examples repository.

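Prepare for parallelization

This parallel job step requires preparation. You need an entry script that implements the predefined functions. You also need to set attributes in the parallel job definition that:

Define and bind your input data.
Set the data division method.
Configure your compute resources.
Call the entry script.

The following sections describe how to prepare the parallel job.

Declare the inputs and data division settings

A parallel job requires one major input to be split and processed in parallel. The major input data format can be either tabular data or a list of files.

Different data formats have different input types, input modes, and data division methods. The following table describes the options: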

Data format	Input type	Input mode	Data division method
File list	`mltable` or `uri_folder`	`ro_mount` or `download`	By size (number of files) or by partition
Tabular data	`mltable`	`direct`	By size (estimated physical size) or by partition

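Note

If you use tabular mltable as your major input data, you need to:

Install the mltable library in your environment, as in line 9 of this conda file.
Place an MLTable specification file under your specified path, with the transformations: - read_delimited: section filled out. For examples, see "Create and manage data assets".

You can declare your major input data with the input_data attribute in the parallel job YAML or Python, and bind the data to the defined input of your parallel job by using ${{inputs.<input name>}}. Then define the data division attribute for your major input according to your data division method.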

Data division method	Attribute name	Attribute type	Job example
By size	`mini_batch_size`	string	Iris batch prediction
By partition	`partition_keys`	list of strings	Orange juice sales prediction
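
To make the two data division methods more concrete, the following pure-Python sketch groups a list of files into mini-batches by count, and groups rows into mini-batches by partition key values. It only illustrates the idea and is not how the service is implemented; the helper names `split_by_count` and `split_by_partition` and the sample column names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def split_by_count(items: Sequence[Any], mini_batch_size: int) -> List[List[Any]]:
    """Group items (for example, file paths) into mini-batches of at most mini_batch_size."""
    if mini_batch_size < 1:
        raise ValueError("mini_batch_size must be at least 1")
    return [
        list(items[start:start + mini_batch_size])
        for start in range(0, len(items), mini_batch_size)
    ]


def split_by_partition(
    rows: Sequence[Dict[str, Any]], partition_keys: Sequence[str]
) -> Dict[Tuple[Any, ...], List[Dict[str, Any]]]:
    """Group rows so that rows with the same partition key values share one mini-batch."""
    batches: Dict[Tuple[Any, ...], List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in partition_keys)
        batches[key].append(row)
    return dict(batches)


# Hypothetical sample data: five image files and three sales rows
files = [f"img_{n}.png" for n in range(5)]
file_batches = split_by_count(files, mini_batch_size=2)

sales = [
    {"store": "A", "brand": "x", "qty": 3},
    {"store": "A", "brand": "x", "qty": 5},
    {"store": "B", "brand": "y", "qty": 1},
]
sales_batches = split_by_partition(sales, partition_keys=["store", "brand"])
```

In this sketch the five files form three mini-batches (the last one holds a single file), and the three sales rows form two mini-batches, one per distinct store and brand combination.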

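Configure the compute resources for parallelization

After you define the data division attribute, configure the compute resources for your parallelization by setting the instance_count and max_concurrency_per_instance attributes.

Attribute name	Type	Description	Default value
`instance_count`	integer	The number of nodes to use for the job.	1
`max_concurrency_per_instance`	integer	The number of processors on each node.	For GPU compute: 1. For CPU compute: number of cores.

These attributes work together with your specified compute cluster, as shown in the following diagram:

Diagram showing how distributed data works in a parallel job.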
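
As a rough back-of-the-envelope check, the number of mini-batches that can run at the same time is the product of the two attributes. The following sketch assumes that every node runs `max_concurrency_per_instance` worker processes and that all mini-batches take about the same time; the function names are hypothetical and are not part of the SDK.

```python
import math


def max_parallel_workers(instance_count: int, max_concurrency_per_instance: int) -> int:
    """Upper bound on mini-batches processed at the same time across the cluster."""
    return instance_count * max_concurrency_per_instance


def estimated_rounds(total_mini_batches: int, workers: int) -> int:
    """Number of 'waves' needed if every worker handles one mini-batch per wave."""
    return math.ceil(total_mini_batches / workers)


# Values taken from the YAML example in this article: 2 nodes, 2 processes per node
workers = max_parallel_workers(instance_count=2, max_concurrency_per_instance=2)

# Hypothetical workload of 10 mini-batches
rounds = estimated_rounds(total_mini_batches=10, workers=workers)
```

With two nodes and two processes per node, four mini-batches can be in flight at once, so ten mini-batches need about three waves under these assumptions.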

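Call the entry script

The entry script is a single Python file that implements the following three predefined functions with custom code.

Function name	Required	Description	Input	Return
`init()`	Y	Common preparation before starting to run mini-batches. For example, use this function to load the model into a global object.	--	--
`run(mini_batch)`	Y	Implements the main execution logic for mini-batches.	`mini_batch` is a pandas DataFrame if the input data is tabular data, or a list of file paths if the input data is a directory.	DataFrame, list, or tuple.
`shutdown()`	N	Optional function to do custom cleanup before returning the compute to the pool.	--	--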

Important

To avoid exceptions when parsing arguments in the init() or run(mini_batch) functions, use parse_known_args instead of parse_args. For an entry script with an argument parser, see the iris_score example.
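
The difference between the two argparse methods can be seen in a small standalone sketch. It assumes that the command line handed to the entry script can contain arguments that the script's own parser does not declare; the flag `--some_other_flag` and the helper names are hypothetical.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Parser that declares only the argument this script cares about."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--model", type=str, default=None)
    return parser


def parse_entry_args(argv: List[str]) -> Tuple[argparse.Namespace, List[str]]:
    """Tolerant parsing: undeclared arguments are returned instead of raising."""
    return build_parser().parse_known_args(argv)


def strict_parse_fails(argv: List[str]) -> bool:
    """Strict parsing: parse_args exits the process on undeclared arguments."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return True
    return False


# Hypothetical command line: one declared argument plus one undeclared argument
argv = ["--model", "./iris-model", "--some_other_flag", "5"]
args, unknown = parse_entry_args(argv)
```

Here `args.model` holds the model path and `unknown` collects the undeclared arguments, whereas `strict_parse_fails(argv)` reports that `parse_args` would have terminated the script.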

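Important

The run(mini_batch) function should return either a DataFrame, a list, or a tuple. The parallel job uses the count of that return to measure the successful items under that mini-batch. If all items are processed, the number of items in the mini-batch should equal the number of items returned.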

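The parallel job runs the functions on each processor, as shown in the following diagram:

Diagram showing how the entry script works in a parallel job.

For complete entry scripts, see the entry script examples in the Azure Machine Learning examples repository.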
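
A minimal sketch of an entry script for file-list input might look like the following. It is not one of the repository examples: the placeholder "model" is a plain dictionary, the optional `argv` parameter on `init` exists only so the script can be exercised locally, and the `--model` argument name is an assumption.

```python
import argparse
import os
from typing import List, Optional

# Global object populated once per worker process by init()
model = None


def init(argv: Optional[List[str]] = None) -> None:
    """Runs once before any mini-batch: parse arguments and load shared state."""
    global model
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--model", type=str, default=None)
    # parse_known_args tolerates arguments that this script does not declare
    args, _ = parser.parse_known_args(argv)
    # Placeholder: a real script would load a trained model from args.model here
    model = {"path": args.model, "label": "placeholder"}


def run(mini_batch: List[str]) -> List[str]:
    """Processes one mini-batch of file paths and returns one result per item."""
    results = []
    for file_path in mini_batch:
        results.append(f"{os.path.basename(file_path)}: {model['label']}")
    return results


def shutdown() -> None:
    """Optional cleanup before the compute is returned to the pool."""
    global model
    model = None


if __name__ == "__main__":
    # Local smoke test only; in a parallel job these functions are called for you
    init(["--model", "./iris-model"])
    print(run(["/data/a.png", "/data/b.png"]))
    shutdown()
```

Because `run` returns one entry per input file, the count of returned items matches the mini-batch size, which is how a fully processed mini-batch is recognized.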

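To call the entry script, set the following two attributes in your parallel job definition:

Attribute name	Type	Description
`code`	string	Local path to the source code directory to upload and use for the job.
`entry_script`	string	The Python file that contains the implementation of the predefined parallel functions.

Parallel job step example

The following parallel job step, defined in YAML for the Azure CLI, declares the input type, mode, and data division method, binds the input, configures the compute, and calls the entry script.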

batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data: 
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model: 
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount

  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
      instance_count: 2
  max_concurrency_per_instance: 2

  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60

  task:
    type: run_function
    code: "./script"
    entry_script: iris_prediction.py
    environment:
      name: "prs-env"
      version: 1
      image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
      conda_file: ./environment/environment_parallel.yml

The following Python SDK code declares job_data_path as the input, binds it to the input_data attribute, sets the mini_batch_size data division attribute, and calls the entry script.

# parallel task to process file data
file_batch_inference = parallel_run_function(
    name="file_batch_score",
    display_name="Batch Score with File Dataset",
    description="parallel component for batch score",
    inputs=dict(
        job_data_path=Input(
            type=AssetTypes.MLTABLE,
            description="The data to be split and scored in parallel",
        )
    ),
    outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
    input_data="${{inputs.job_data_path}}",
    instance_count=2,
    max_concurrency_per_instance=1,
    mini_batch_size="1",
    mini_batch_error_threshold=1,
    retry_settings=dict(max_retries=2, timeout=60),
    logging_level="DEBUG",
    task=RunFunction(
        code="./src",
        entry_script="file_batch_inference.py",
        program_arguments="--job_output_path ${{outputs.job_output_path}}",
        environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    ),
)

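Consider automation settings

Azure Machine Learning parallel jobs expose many optional settings that can automatically control the job without manual intervention. The following table describes these settings.

Key	Type	Description	Allowed values	Default value	Set in attribute or program argument
`mini_batch_error_threshold`	integer	Number of failed mini-batches to ignore in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed. A mini-batch is marked as failed if: - The count of return values from `run()` is less than the mini-batch input count. - An exception is caught in custom `run()` code.	`[-1, int.max]`	`-1`, which means to ignore all failed mini-batches	Attribute `mini_batch_error_threshold`
`mini_batch_max_retries`	integer	Number of retries when a mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed per the `mini_batch_error_threshold` calculation.	`[0, int.max]`	`2`	Attribute `retry_settings.max_retries`
`mini_batch_timeout`	integer	Timeout in seconds for executing the custom `run()` function. If execution time is higher than this threshold, the mini-batch is aborted and marked as failed to trigger a retry.	`(0, 259200]`	`60`	Attribute `retry_settings.timeout`
`item_error_threshold`	integer	Threshold of failed items. Failed items are counted by the gap between the number of inputs and the number of returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed.	`[-1, int.max]`	`-1`, which means to ignore all failures during the parallel job	Program argument `--error_threshold`
`allowed_failed_percent`	integer	Similar to `mini_batch_error_threshold`, but uses the percent of failed mini-batches instead of the count.	`[0, 100]`	`100`	Program argument `--allowed_failed_percent`
`overhead_timeout`	integer	Timeout in seconds for initialization of each mini-batch. For example, loading the mini-batch data and passing it to the `run()` function.	`(0, 259200]`	`600`	Program argument `--task_overhead_timeout`
`progress_update_timeout`	integer	Timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout, the parallel job is marked as failed.	`(0, 259200]`	Dynamically calculated by other settings	Program argument `--progress_update_timeout`
`first_task_creation_timeout`	integer	Timeout in seconds for monitoring the time between the start of the job and the run of the first mini-batch.	`(0, 259200]`	`600`	Program argument `--first_task_creation_timeout`
`logging_level`	string	The level of logs to dump to user log files.	`INFO`, `WARNING`, or `DEBUG`	`INFO`	Attribute `logging_level`
`append_row_to`	string	Aggregates all returns from each run of the mini-batch and outputs them into this file. Can refer to one of the outputs of the parallel job by using the expression `${{outputs.<output_name>}}`.			Attribute `task.append_row_to`
`copy_logs_to_parent`	string	Boolean option that indicates whether to copy the job progress, overview, and logs to the parent pipeline job.	`True` or `False`	`False`	Program argument `--copy_logs_to_parent`
`resource_monitor_interval`	integer	Time interval in seconds to dump node resource usage (such as CPU or memory) to the log folder under the "logs/sys/perf" path. Note: Frequent dumping of resource logs slightly slows execution speed. Set this value to `0` to stop dumping resource usage.	`[0, int.max]`	`600`	Program argument `--resource_monitor_interval`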
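
The interplay between the two error thresholds in the table can be sketched in a few lines of Python. This is a simplified reading of the table descriptions, not the service's actual logic: it ignores retries, timeouts, exceptions, and `allowed_failed_percent`, and all function names are hypothetical.

```python
from typing import List, Tuple

# Each entry is (input_count, returned_count) for one mini-batch
BatchCounts = List[Tuple[int, int]]


def count_failed_mini_batches(batches: BatchCounts) -> int:
    """A mini-batch counts as failed when run() returned fewer items than it received."""
    return sum(1 for inputs, returned in batches if returned < inputs)


def count_failed_items(batches: BatchCounts) -> int:
    """Failed items are the gap between input count and returned count, summed over batches."""
    return sum(max(inputs - returned, 0) for inputs, returned in batches)


def exceeds(failed: int, threshold: int) -> bool:
    """-1 means ignore all failures; otherwise the job fails when failed > threshold."""
    return threshold != -1 and failed > threshold


def job_failed(
    batches: BatchCounts,
    mini_batch_error_threshold: int = -1,
    item_error_threshold: int = -1,
) -> bool:
    """Job is marked failed when either threshold is exceeded."""
    return exceeds(
        count_failed_mini_batches(batches), mini_batch_error_threshold
    ) or exceeds(count_failed_items(batches), item_error_threshold)


# Hypothetical run: three mini-batches of 10 items; two of them returned fewer items
batches = [(10, 10), (10, 8), (10, 0)]
failed_with_defaults = job_failed(batches)
failed_with_strict_items = job_failed(
    batches, mini_batch_error_threshold=5, item_error_threshold=5
)
```

In this hypothetical run, two mini-batches and twelve items failed. With the default thresholds of `-1` the job is not marked as failed, while an item threshold of 5 is exceeded by the twelve failed items.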

The following sample code updates these settings, first in YAML for the Azure CLI and then with the Python SDK.

batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data: 
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model: 
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount

  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
      instance_count: 2
  max_concurrency_per_instance: 2

  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60

  task:
    type: run_function
    code: "./script"
    entry_script: iris_prediction.py
    environment:
      name: "prs-env"
      version: 1
      image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
      conda_file: ./environment/environment_parallel.yml
    program_arguments: >-
      --model ${{inputs.score_model}}
      --error_threshold 5
      --allowed_failed_percent 30
      --task_overhead_timeout 1200
      --progress_update_timeout 600
      --first_task_creation_timeout 600
      --copy_logs_to_parent True
      --resource_monitor_interval 20
    append_row_to: ${{outputs.job_output_file}}

# parallel task to process tabular data
tabular_batch_inference = parallel_run_function(
    name="batch_score_with_tabular_input",
    display_name="Batch Score with Tabular Dataset",
    description="parallel component for batch score",
    inputs=dict(
        job_data_path=Input(
            type=AssetTypes.MLTABLE,
            description="The data to be split and scored in parallel",
        ),
        score_model=Input(
            type=AssetTypes.URI_FOLDER, description="The model for batch score."
        ),
    ),
    outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
    input_data="${{inputs.job_data_path}}",
    instance_count=2,
    max_concurrency_per_instance=2,
    mini_batch_size="100",
    mini_batch_error_threshold=5,
    logging_level="DEBUG",
    retry_settings=dict(max_retries=2, timeout=60),
    task=RunFunction(
        code="./src",
        entry_script="tabular_batch_inference.py",
        environment=Environment(
            image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
            conda_file="./src/environment_parallel.yml",
        ),
        program_arguments="--model ${{inputs.score_model}} "
        "--job_output_path ${{outputs.job_output_path}} "
        "--error_threshold 5 "
        "--allowed_failed_percent 30 "
        "--task_overhead_timeout 1200 "
        "--progress_update_timeout 600 "
        "--first_task_creation_timeout 600 "
        "--copy_logs_to_parent True "
        "--resource_monitor_interval 20 ",
        append_row_to="${{outputs.job_output_path}}",
    ),
)

Create a pipeline with a parallel job step

The following example, in YAML for the Azure CLI, shows the complete pipeline job with the parallel job step inline.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: iris-batch-prediction-using-parallel
description: The hello world pipeline job with inline parallel job
tags:
  tag: tagvalue
  owner: sdkteam

settings:
  default_compute: azureml:cpu-cluster

jobs:
  batch_prediction:
    type: parallel
    compute: azureml:cpu-cluster
    inputs:
      input_data: 
        type: mltable
        path: ./neural-iris-mltable
        mode: direct
      score_model: 
        type: uri_folder
        path: ./iris-model
        mode: download
    outputs:
      job_output_file:
        type: uri_file
        mode: rw_mount

    input_data: ${{inputs.input_data}}
    mini_batch_size: "10kb"
    resources:
        instance_count: 2
    max_concurrency_per_instance: 2

    logging_level: "DEBUG"
    mini_batch_error_threshold: 5
    retry_settings:
      max_retries: 2
      timeout: 60

    task:
      type: run_function
      code: "./script"
      entry_script: iris_prediction.py
      environment:
        name: "prs-env"
        version: 1
        image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
        conda_file: ./environment/environment_parallel.yml
      program_arguments: >-
        --model ${{inputs.score_model}}
        --error_threshold 5
        --allowed_failed_percent 30
        --task_overhead_timeout 1200
        --progress_update_timeout 600
        --first_task_creation_timeout 600
        --copy_logs_to_parent True
        --resource_monitor_interval 20
      append_row_to: ${{outputs.job_output_file}}

With the Python SDK, first import the required libraries, initialize ml_client with the proper credentials, and create or get your computes.

# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Environment
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.parallel import parallel_run_function, RunFunction

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cpu_compute_target = "cpu-cluster"
print(ml_client.compute.get(cpu_compute_target))
gpu_compute_target = "gpu-cluster"
print(ml_client.compute.get(gpu_compute_target))

Then, implement the parallel job by completing the parallel_run_function.

# parallel task to process tabular data
tabular_batch_inference = parallel_run_function(
    name="batch_score_with_tabular_input",
    display_name="Batch Score with Tabular Dataset",
    description="parallel component for batch score",
    inputs=dict(
        job_data_path=Input(
            type=AssetTypes.MLTABLE,
            description="The data to be split and scored in parallel",
        ),
        score_model=Input(
            type=AssetTypes.URI_FOLDER, description="The model for batch score."
        ),
    ),
    outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
    input_data="${{inputs.job_data_path}}",
    instance_count=2,
    max_concurrency_per_instance=2,
    mini_batch_size="100",
    mini_batch_error_threshold=5,
    logging_level="DEBUG",
    retry_settings=dict(max_retries=2, timeout=60),
    task=RunFunction(
        code="./src",
        entry_script="tabular_batch_inference.py",
        environment=Environment(
            image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
            conda_file="./src/environment_parallel.yml",
        ),
        program_arguments="--model ${{inputs.score_model}} "
        "--job_output_path ${{outputs.job_output_path}} "
        "--error_threshold 5 "
        "--allowed_failed_percent 30 "
        "--task_overhead_timeout 1200 "
        "--progress_update_timeout 600 "
        "--first_task_creation_timeout 600 "
        "--copy_logs_to_parent True "
        "--resource_monitor_interval 20 ",
        append_row_to="${{outputs.job_output_path}}",
    ),
)

Finally, use your parallel job as a step in your pipeline and bind its inputs and outputs with other steps.

@pipeline()
def parallel_in_pipeline(pipeline_job_data_path, pipeline_score_model):

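    # prepare_data is a data-preparation component assumed to be defined elsewhere;
    # its definition is not shown in this article.
    prepare_file_tabular_data = prepare_data(input_data=pipeline_job_data_path)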
    # output of file & tabular data should be type MLTable
    prepare_file_tabular_data.outputs.file_output_data.type = AssetTypes.MLTABLE
    prepare_file_tabular_data.outputs.tabular_output_data.type = AssetTypes.MLTABLE

    batch_inference_with_file_data = file_batch_inference(
        job_data_path=prepare_file_tabular_data.outputs.file_output_data
    )
    # use eval_mount mode to handle file data
    batch_inference_with_file_data.inputs.job_data_path.mode = (
        InputOutputModes.EVAL_MOUNT
    )
    batch_inference_with_file_data.outputs.job_output_path.type = AssetTypes.MLTABLE

    batch_inference_with_tabular_data = tabular_batch_inference(
        job_data_path=prepare_file_tabular_data.outputs.tabular_output_data,
        score_model=pipeline_score_model,
    )
    # use direct mode to handle tabular data
    batch_inference_with_tabular_data.inputs.job_data_path.mode = (
        InputOutputModes.DIRECT
    )

    return {
        "pipeline_job_out_file": batch_inference_with_file_data.outputs.job_output_path,
        "pipeline_job_out_tabular": batch_inference_with_tabular_data.outputs.job_output_path,
    }


pipeline_job_data_path = Input(
    path="./dataset/", type=AssetTypes.MLTABLE, mode=InputOutputModes.RO_MOUNT
)
pipeline_score_model = Input(
    path="./model/", type=AssetTypes.URI_FOLDER, mode=InputOutputModes.DOWNLOAD
)
# create a pipeline
pipeline_job = parallel_in_pipeline(
    pipeline_job_data_path=pipeline_job_data_path,
    pipeline_score_model=pipeline_score_model,
)
pipeline_job.outputs.pipeline_job_out_tabular.type = AssetTypes.URI_FILE

# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"

Submit the pipeline job

With the Azure CLI, submit your pipeline job with the parallel step by using the az ml job create command.

az ml job create --file pipeline.yml

With the Python SDK, submit your pipeline job with the parallel step by using the jobs.create_or_update function of ml_client.

pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
pipeline_job

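Check the parallel step in studio UI

After you submit a pipeline job, the SDK or CLI widget gives you a web URL link to the pipeline graph in the Azure Machine Learning studio UI.

To view parallel job results, double-click the parallel step in the pipeline graph, select the Settings tab in the details panel, expand Run settings, and then expand the Parallel section.

To debug a parallel job failure, select the Outputs + logs tab, expand the logs folder, and check job_result.txt to understand why the parallel job failed. For information about the logging structure of parallel jobs, see readme.txt in the same folder.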
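
If you prefer to inspect the logs outside the studio UI, a small helper can locate job_result.txt in a local copy of the job's logs. The sketch below assumes you have already downloaded the logs to a local folder (for example, with the SDK's `ml_client.jobs.download`); the folder layout and the helper names are assumptions.

```python
from pathlib import Path
from typing import List, Union


def find_job_results(download_dir: Union[str, Path]) -> List[Path]:
    """Return every job_result.txt found anywhere under download_dir, sorted by path."""
    return sorted(Path(download_dir).rglob("job_result.txt"))


def print_job_results(download_dir: Union[str, Path]) -> None:
    """Print the content of each job_result.txt to help diagnose a failed parallel job."""
    results = find_job_results(download_dir)
    if not results:
        print(f"No job_result.txt found under {download_dir}")
        return
    for path in results:
        print(f"--- {path} ---")
        print(path.read_text(encoding="utf-8", errors="replace"))
```

For example, `print_job_results("./downloaded_logs")` would print each job_result.txt under that hypothetical folder, or a short message if none is found.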
