PipelineData Class
Represents intermediate data in an Azure Machine Learning pipeline.
Data used in a pipeline can be produced by one step and consumed in another step by providing a PipelineData object as an output of one step and an input of one or more subsequent steps.
Note: if you are using PipelineData, make sure the directory you write to exists.
For example, suppose a pipeline step has an output port named output_folder and you want to write data to a relative path inside that folder. The following Python snippet ensures the directory exists before the file is created:
import os
# args.output_folder is the path passed to the script for the "output_folder" output port
os.makedirs(os.path.join(args.output_folder, 'relative_path'), exist_ok=True)
f = open(os.path.join(args.output_folder, 'relative_path', 'file_name'), 'w+')
PipelineData uses DataReference underneath, which is no longer the recommended approach for data access and delivery; use OutputFileDatasetConfig instead. For a sample, see Pipeline using OutputFileDatasetConfig.
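For illustration, a minimal sketch of the recommended replacement, assuming an existing workspace ws; the output name and destination path shown here are hypothetical:
from azureml.data import OutputFileDatasetConfig

datastore = ws.get_default_datastore()
# The step's output is written to the datastore and can be consumed by later steps as a dataset.
processed_data = OutputFileDatasetConfig(name="processed_data",
                                         destination=(datastore, "outputs/processed_data"))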
Initialize PipelineData.
- Inheritance: builtins.object → PipelineData
Constructor
PipelineData(name, datastore=None, output_name=None, output_mode='mount', output_path_on_compute=None, output_overwrite=None, data_type=None, is_directory=None, pipeline_output_name=None, training_output=None)
Parameters
| Name | Description |
|---|---|
| name (Required) | The name of the PipelineData object, which can contain only letters, digits, and underscores. PipelineData names are used to identify the outputs of a step. After a pipeline run has completed, you can use the step name with an output name to access a particular output. Names should be unique within a single step in a pipeline. |
| datastore | The Datastore the PipelineData will reside on. If unspecified, the default datastore is used. Default value: None |
| output_name | The name of the output; if None, name is used. Can contain only letters, digits, and underscores. Default value: None |
| output_mode | Specifies whether the producing step will use the "upload" or "mount" method to access the data. Default value: mount |
| output_path_on_compute | For "upload" output mode, the path to which the module writes this output. Default value: None |
| output_overwrite | For "upload" output mode, indicates whether to overwrite existing data. Default value: None |
| data_type | Optional. Data type can be used to specify the expected type of the output and to detail how consuming steps should use the data. It can be any user-defined string. Default value: None |
| is_directory | Specifies whether the data is a directory or a single file. This is only used to determine the data type used by the Azure ML backend when the data_type parameter is not provided. Default value: None |
| pipeline_output_name | If provided, this output will be available by using PipelineRun.get_pipeline_output(). Pipeline output names must be unique in the pipeline. Default value: None |
| training_output | Defines output for training result. This is needed only for specific trainings which result in different kinds of outputs, such as Metrics and Model. For example, AutoMLStep results in metrics and model. You can also define a specific training iteration or metric used to get the best model. For HyperDriveStep, you can also define the specific model files to be included in the output. Default value: None |
Remarks
PipelineData represents the output data a step will produce when it is run. Use PipelineData when creating steps to describe the files or directories which will be generated by the step. These data outputs will be added to the specified Datastore and can be retrieved and viewed later.
For example, the following pipeline step produces one output, named "model":
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
datastore = ws.get_default_datastore()
step_output = PipelineData("model", datastore=datastore)
step = PythonScriptStep(script_name="train.py",
                        arguments=["--model", step_output],
                        outputs=[step_output],
                        compute_target=aml_compute,
                        source_directory=source_directory)
In this case, the train.py script will write the model it produces to the location which is provided to the script through the --model argument.
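For illustration, a minimal sketch of what train.py might look like; the argparse handling shown here is an assumption, and only the --model argument comes from the step definition above:
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, help="output location resolved from the PipelineData object")
args = parser.parse_args()

# Create the output directory and write the trained model into it.
os.makedirs(args.model, exist_ok=True)
with open(os.path.join(args.model, "model.pkl"), "wb") as f:
    f.write(b"serialized model bytes")  # placeholder for real model serialization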
PipelineData objects are also used when constructing Pipelines to describe step dependencies. To specify that a step requires the output of another step as input, use a PipelineData object in the constructor of both steps.
For example, the pipeline train step depends on the process_step_output output of the pipeline process step:
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
datastore = ws.get_default_datastore()
process_step_output = PipelineData("processed_data", datastore=datastore)
process_step = PythonScriptStep(script_name="process.py",
                                arguments=["--data_for_train", process_step_output],
                                outputs=[process_step_output],
                                compute_target=aml_compute,
                                source_directory=process_directory)
train_step = PythonScriptStep(script_name="train.py",
                              arguments=["--data_for_train", process_step_output],
                              inputs=[process_step_output],
                              compute_target=aml_compute,
                              source_directory=train_directory)
pipeline = Pipeline(workspace=ws, steps=[process_step, train_step])
This will create a Pipeline with two steps. The process step will be executed first, then after it has completed, the train step will be executed. Azure ML will provide the output produced by the process step to the train step.
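As a usage note, the assembled pipeline can then be submitted as an experiment run; this sketch assumes the workspace ws from the snippet above, and the experiment name is a hypothetical example:
from azureml.core import Experiment

pipeline_run = Experiment(workspace=ws, name="pipeline-demo").submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)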
See this page for further examples of using PipelineData to construct a Pipeline: https://aka.ms/pl-data-dep
For supported compute types, PipelineData can also be used to specify how the data will be produced and consumed by the run. There are two supported methods:
- Mount (default): The input or output data is mounted to local storage on the compute node, and an environment variable is set which points to the path of this data ($AZUREML_DATAREFERENCE_name). For convenience, you can pass the PipelineData object in as one of the arguments to your script, for example using the arguments parameter of PythonScriptStep, and the object will resolve to the path to the data. For outputs, your compute script should create a file or directory at this output path. To see the value of the environment variable used when you pass in the PipelineData object as an argument, use the get_env_variable_name method.
- Upload: Specify an output_path_on_compute corresponding to a file or directory name that your script will generate. (Environment variables are not used in this case.)
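For illustration, a minimal sketch of an output configured for upload mode; the datastore variable is assumed to come from the earlier examples, and the output path is a hypothetical example:
upload_output = PipelineData("results",
                             datastore=datastore,
                             output_mode="upload",
                             output_path_on_compute="outputs/results.csv")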
Methods
| Name | Description |
|---|---|
| as_dataset | Promote the intermediate output into a Dataset. The dataset will exist after the step has executed. Note that the output must be promoted to a dataset in order for the subsequent input to be consumed as a dataset; if as_dataset is called only on the input and not on the output, it is a no-op and the input will not be consumed as a dataset. See the code example in the as_dataset section below for correct usage. |
| as_download | Consume the PipelineData as download. |
| as_input | Create an InputPortBinding and specify an input name (but use default mode). |
| as_mount | Consume the PipelineData as mount. |
| create_input_binding | Create input binding. |
| get_env_variable_name | Return the name of the environment variable for this PipelineData. |
as_dataset
Promote the intermediate output into a Dataset.
This dataset will exist after the step has executed. Note that the output must be promoted to a dataset in order for the subsequent input to be consumed as a dataset. If as_dataset is called only on the input and not on the output, it is a no-op and the input will not be consumed as a dataset. The code example below shows correct usage of as_dataset:
# as_dataset is called here and is passed to both the output and input of the next step.
pipeline_data = PipelineData('output').as_dataset()
step1 = PythonScriptStep(..., outputs=[pipeline_data])
step2 = PythonScriptStep(..., inputs=[pipeline_data])
as_dataset()
Returns
| Type | Description |
|---|---|
|  | The intermediate output as a Dataset. |
as_download
Consume the PipelineData as download.
as_download(input_name=None, path_on_compute=None, overwrite=None)
Parameters
| Name | Description |
|---|---|
| input_name | Use to specify a name for this input. Default value: None |
| path_on_compute | The path on the compute to download to. Default value: None |
| overwrite | Use to indicate whether to overwrite existing data. Default value: None |
Returns
| Type | Description |
|---|---|
| InputPortBinding | The InputPortBinding with this PipelineData as the source. |
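For illustration, a hypothetical sketch of consuming an output as a download in a downstream step; step_output, aml_compute, and source_directory are assumed from the earlier Remarks examples, and the input name and path are illustrative:
downloaded_input = step_output.as_download(input_name="training_data", path_on_compute="/tmp/training_data")
train_step = PythonScriptStep(script_name="train.py",
                              arguments=["--data_for_train", downloaded_input],
                              inputs=[downloaded_input],
                              compute_target=aml_compute,
                              source_directory=source_directory)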
as_input
Create an InputPortBinding and specify an input name (but use default mode).
as_input(input_name)
Parameters
| Name | Description |
|---|---|
| input_name (Required) | Use to specify a name for this input. |
Returns
| Type | Description |
|---|---|
| InputPortBinding | The InputPortBinding with this PipelineData as the source. |
as_mount
Consume the PipelineData as mount.
as_mount(input_name=None)
Parameters
| Name | Description |
|---|---|
| input_name | Use to specify a name for this input. Default value: None |
Returns
| Type | Description |
|---|---|
| InputPortBinding | The InputPortBinding with this PipelineData as the source. |
create_input_binding
Create input binding.
create_input_binding(input_name=None, mode=None, path_on_compute=None, overwrite=None)
Parameters
| Name | Description |
|---|---|
| input_name | The name of the input. Default value: None |
| mode | The mode to access the PipelineData ("mount" or "download"). Default value: None |
| path_on_compute | For "download" mode, the path on the compute where the data will reside. Default value: None |
| overwrite | For "download" mode, whether to overwrite existing data. Default value: None |
Returns
| Type | Description |
|---|---|
| InputPortBinding | The InputPortBinding with this PipelineData as the source. |
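As an illustrative sketch, assuming the process_step_output object from the Remarks section (the parameter values here are hypothetical); the returned InputPortBinding can then be passed to a step's inputs list:
binding = process_step_output.create_input_binding(input_name="train_data",
                                                   mode="download",
                                                   path_on_compute="/tmp/train_data",
                                                   overwrite=True)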
get_env_variable_name
Return the name of the environment variable for this PipelineData.
get_env_variable_name()
Returns
| Type | Description |
|---|---|
| str | The environment variable name. |
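For illustration, a small sketch (the output name is hypothetical); the returned name corresponds to the $AZUREML_DATAREFERENCE_name environment variable described in the Remarks section:
step_output = PipelineData("model", datastore=datastore)
print(step_output.get_env_variable_name())  # e.g. AZUREML_DATAREFERENCE_model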
Attributes
data_type
datastore
Datastore the PipelineData will reside on.
Returns
| Type | Description |
|---|---|
| Datastore | The Datastore object. |