Upgrade data management to SDK v2

09/01/2024

In V1, an Azure Machine Learning dataset can be either a FileDataset or a TabularDataset. In V2, an Azure Machine Learning data asset can be a uri_folder, uri_file, or mltable. Conceptually, a FileDataset maps to uri_folder and uri_file, and a TabularDataset maps to mltable.

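URIs (uri_folder, uri_file) - A Uniform Resource Identifier is a reference to a storage location on your local computer or in the cloud, which makes it easy to access data in your jobs.
MLTable - A way to abstract the schema definition of tabular data, so that consumers of the data can more easily materialize the table into a Pandas/Dask/Spark dataframe.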
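
The conceptual mapping between v1 dataset types and v2 data asset types can be written down as a small lookup table, which may be handy when auditing v1 code before a migration. This is a minimal sketch: the `V1_TO_V2_TYPES` dictionary and the `candidate_v2_types` helper are hypothetical names introduced here for illustration and are not part of either SDK.

```python
# Hypothetical helper (not part of azureml-core or azure-ai-ml):
# conceptual mapping from SDK v1 dataset types to SDK v2 data asset types.
V1_TO_V2_TYPES = {
    "FileDataset": ["uri_folder", "uri_file"],
    "TabularDataset": ["mltable"],
}


def candidate_v2_types(v1_type: str) -> list:
    """Return the v2 asset types that a v1 dataset type conceptually maps to."""
    try:
        return list(V1_TO_V2_TYPES[v1_type])
    except KeyError:
        raise ValueError(f"Unknown v1 dataset type: {v1_type!r}") from None


if __name__ == "__main__":
    for v1_type in V1_TO_V2_TYPES:
        print(v1_type, "->", ", ".join(candidate_v2_types(v1_type)))
```

Whether a given FileDataset becomes a uri_folder or a uri_file depends on whether it points at a folder or at a single file, so the helper returns both candidates rather than choosing one.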

This article compares data scenarios in SDK v1 and SDK v2.

Create a `filedataset`/uri type of data asset

SDK v1 - Create a FileDataset

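from azureml.core import Workspace, Datastore, Dataset

# get the existing workspace and retrieve one of its datastores by name
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, 'your datastore name')

# create a FileDataset pointing to files in 'animals' folder and its subfolders recursively
datastore_paths = [(datastore, 'animals')]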
animal_ds = Dataset.File.from_files(path=datastore_paths)

# create a FileDataset from image and label files behind public web urls
web_paths = ['https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
             'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz']
mnist_ds = Dataset.File.from_files(path=web_paths)

SDK v2

Create a URI_FOLDER type data asset

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Supported paths include:
# local: './<path>'
# blob:  'https://<account_name>.blob.core.windows.net/<container_name>/<path>'
# ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/'
# Datastore: 'azureml://datastores/<data_store_name>/paths/<path>'

my_path = '<path>'

my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FOLDER,
    description="<description>",
    name="<name>",
    version='<version>'
)

# ml_client is assumed to be an authenticated azure.ai.ml.MLClient for the target workspace
ml_client.data.create_or_update(my_data)

Create a URI_FILE type data asset

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Supported paths include:
# local: './<path>/<file>'
# blob:  'https://<account_name>.blob.core.windows.net/<container_name>/<path>/<file>'
# ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>'
# Datastore: 'azureml://datastores/<data_store_name>/paths/<path>/<file>'
my_path = '<path>'

my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FILE,
    description="<description>",
    name="<name>",
    version="<version>"
)

ml_client.data.create_or_update(my_data)
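
The "Supported paths" comments in the two snippets above list four path formats. As a rough illustration of how they differ, the sketch below classifies a path string by its prefix. `classify_path` is a hypothetical helper, not an SDK function; it covers only the four formats listed in the comments and does not check that the location exists.

```python
# Hypothetical helper: classify a path string by the formats listed in the
# "Supported paths" comments. It only inspects the prefix of the string.
def classify_path(path: str) -> str:
    if path.startswith("azureml://datastores/"):
        return "datastore"
    if path.startswith("abfss://"):
        return "adls_gen2"
    if path.startswith("https://") and ".blob.core.windows.net/" in path:
        return "blob"
    if "://" in path:
        # Any other scheme is outside the four formats this sketch covers.
        raise ValueError(f"Unrecognized path scheme: {path!r}")
    return "local"


if __name__ == "__main__":
    for example in (
        "./data/animals",
        "azureml://datastores/mystore/paths/animals",
    ):
        print(example, "->", classify_path(example))
```

A check like this can run before constructing `Data(...)` to catch a mistyped scheme early. The account, container, and datastore names in the examples are placeholders.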

Create a tabular dataset/data asset

SDK v1

from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()

# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

SDK v2 - Create an mltable data asset via yaml definition

type: mltable

paths:
  - pattern: ./*.txt
transformations:
  - read_delimited:
      delimiter: ','
      encoding: ascii
      header: all_files_same_headers

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# my_path must point to folder containing MLTable artifact (MLTable file + data)
# Supported paths include:
# local: './<path>'
# blob:  'https://<account_name>.blob.core.windows.net/<container_name>/<path>'
# ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/'
# Datastore: 'azureml://datastores/<data_store_name>/paths/<path>'

my_path = '<path>'

my_data = Data(
    path=my_path,
    type=AssetTypes.MLTABLE,
    description="<description>",
    name="<name>",
    version='<version>'
)

ml_client.data.create_or_update(my_data)
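
Because `my_path` must point to a folder that contains the MLTable file, it can help to see how such a folder might be prepared. The sketch below writes a definition mirroring the yaml in this section (with the delimiter quoted) to a file named `MLTable` inside a target folder. `write_mltable_artifact` and the `./weather` folder name are hypothetical; the data files matched by the `pattern` entry are assumed to be placed in the same folder separately.

```python
from pathlib import Path

# MLTable definition mirroring the yaml in this section.
MLTABLE_YAML = """\
type: mltable

paths:
  - pattern: ./*.txt
transformations:
  - read_delimited:
      delimiter: ','
      encoding: ascii
      header: all_files_same_headers
"""


def write_mltable_artifact(folder: str) -> Path:
    """Write the definition to <folder>/MLTable and return the file path."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    mltable_file = target_dir / "MLTable"  # the file name has no extension
    mltable_file.write_text(MLTABLE_YAML, encoding="utf-8")
    return mltable_file


if __name__ == "__main__":
    created = write_mltable_artifact("./weather")  # hypothetical folder name
    print("Use this folder as my_path:", created.parent)
```

The folder returned by `created.parent` is what would be passed as `my_path` when creating the `AssetTypes.MLTABLE` data asset.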

Using data in an experiment/job

SDK v1

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_titanic.py',
                      # pass dataset as an input with friendly name 'titanic'
                      arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                      compute_target=compute_target,
                      environment=myenv)

# Submit the run configuration for your training run
run = experiment.submit(src)
run.wait_for_completion(show_output=True)

SDK v2

from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes

# Possible Asset Types for Data:
# AssetTypes.URI_FILE
# AssetTypes.URI_FOLDER
# AssetTypes.MLTABLE

# Possible Paths for Data:
# Blob: https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<folder>/<file>
# Data Asset: azureml:<my_data>:<version>

my_job_inputs = {
    "raw_data": Input(type=AssetTypes.URI_FOLDER, path="<path>")
}

my_job_outputs = {
    "prep_data": Output(type=AssetTypes.URI_FOLDER, path="<path>")
}

job = command(
    code="./src",  # local path where the code is stored
    command="python process_data.py --raw_data ${{inputs.raw_data}} --prep_data ${{outputs.prep_data}}",
    inputs=my_job_inputs,
    outputs=my_job_outputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)

# submit the command
returned_job = ml_client.create_or_update(job)
# get a URL for the status of the job
returned_job.services["Studio"].endpoint
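
The command above calls `process_data.py` with `--raw_data` and `--prep_data`, but the script itself is not shown in this article. The following is a hypothetical sketch of what such a script might look like, assuming both arguments resolve to folder paths on the compute at run time. The copy step is only a placeholder for real processing logic.

```python
# process_data.py (hypothetical contents; the article does not show this script)
import argparse
import shutil
from pathlib import Path


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--raw_data", required=True, help="input folder (uri_folder)")
    parser.add_argument("--prep_data", required=True, help="output folder (uri_folder)")
    args = parser.parse_args(argv)

    raw_dir = Path(args.raw_data)
    prep_dir = Path(args.prep_data)
    prep_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder "processing": copy every file from the input folder to the
    # output folder, preserving relative paths.
    copied = 0
    for src in raw_dir.rglob("*"):
        if src.is_file():
            dst = prep_dir / src.relative_to(raw_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied += 1
    print(f"Copied {copied} file(s) from {raw_dir} to {prep_dir}")
    return copied


if __name__ == "__main__":
    main()
```

The script can be tried locally before submitting the job, for example with `python process_data.py --raw_data ./in --prep_data ./out`, where `./in` and `./out` are placeholder folders.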

Mapping of key functionality in SDK v1 and SDK v2

Functionality in SDK v1	Rough mapping in SDK v2
Method/API in SDK v1	Method/API in SDK v2

Next steps

For more information, see the documentation here.
