Use automated ML in an Azure Machine Learning pipeline in Python
APPLIES TO: Python SDK azureml v1
Azure Machine Learning's automated ML capability helps you discover high-performing models without reimplementing every possible approach. Combined with Azure Machine Learning pipelines, it lets you create deployable workflows that quickly discover the algorithm that works best for your data. This article shows you how to efficiently join a data preparation step to an automated ML step, which also puts you on the road to MLOps and model lifecycle operationalization with pipelines.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.
An Azure Machine Learning workspace. See Create workspace resources.
Familiarity with Azure's automated machine learning and machine learning pipelines facilities and SDK.
Review automated ML's central classes
Automated ML in a pipeline is represented by an AutoMLStep object. The AutoMLStep class is a subclass of PipelineStep. A graph of PipelineStep objects defines a Pipeline.
There are several subclasses of PipelineStep. In addition to the AutoMLStep, this article will show a PythonScriptStep for data preparation and another for registering the model.
The preferred way to initially move data into an ML pipeline is with Dataset objects. To move data between steps and possibly save data output from runs, the preferred way is with OutputFileDatasetConfig and OutputTabularDatasetConfig objects. To be used with AutoMLStep, the PipelineData object must be transformed into a PipelineOutputTabularDataset object. For more information, see Input and output data from ML pipelines.
The AutoMLStep is configured via an AutoMLConfig object. AutoMLConfig is a flexible class, as discussed in Configure automated ML experiments in Python.
A Pipeline runs in an Experiment. The pipeline Run has, for each step, a child StepRun. The outputs of the automated ML StepRun are the training metrics and highest-performing model.
To make things concrete, this article creates a simple pipeline for a classification task. The task is predicting Titanic survival, but we won't be discussing the data or task except in passing.
Get started
Retrieve initial dataset
Often, an ML workflow starts with pre-existing baseline data. This is a good scenario for a registered dataset. Datasets are visible across the workspace, support versioning, and can be interactively explored. There are many ways to create and populate a dataset, as discussed in Create Azure Machine Learning datasets. Since we'll be using the Python SDK to create our pipeline, use the SDK to download baseline data and register it with the name 'titanic_ds'.
from azureml.core import Workspace, Dataset
ws = Workspace.from_config()
if 'titanic_ds' not in ws.datasets.keys():
    # create a TabularDataset from Titanic training data
    web_paths = ['https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
                 'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv']
    titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)
    titanic_ds.register(workspace=ws,
                        name='titanic_ds',
                        description='Titanic baseline data',
                        create_new_version=True)
titanic_ds = Dataset.get_by_name(ws, 'titanic_ds')
The code first logs in to the Azure Machine Learning workspace defined in config.json (for an explanation, see Create a workspace configuration file). If there isn't already a dataset named 'titanic_ds' registered, then it creates one. The code downloads CSV data from the Web, uses it to instantiate a TabularDataset, and then registers the dataset with the workspace. Finally, the function Dataset.get_by_name() assigns the Dataset to titanic_ds.
Configure your storage and compute target
Additional resources that the pipeline will need are storage and, generally, Azure Machine Learning compute resources.
from azureml.core import Datastore
from azureml.core.compute import AmlCompute, ComputeTarget
datastore = ws.get_default_datastore()
compute_name = 'cpu-cluster'
if compute_name not in ws.compute_targets:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                                min_nodes=0,
                                                                max_nodes=1)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)
    # Show the result
    print(compute_target.get_status().serialize())
compute_target = ws.compute_targets[compute_name]
The intermediate data between the data preparation and the automated ML step can be stored in the workspace's default datastore, so we don't need to do more than call get_default_datastore() on the Workspace object.
After that, the code checks if the Azure Machine Learning compute target 'cpu-cluster' already exists. If not, we specify that we want a small CPU-based compute target. If you plan to use automated ML's deep learning features (for instance, text featurization with DNN support) you should choose a compute with strong GPU support, as described in GPU optimized virtual machine sizes.
The code blocks until the target is provisioned and then prints some details of the just-created compute target. Finally, the named compute target is retrieved from the workspace and assigned to compute_target.
Configure the training run
The runtime context is set by creating and configuring a RunConfiguration object. Here we set the compute target.
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
aml_run_config = RunConfiguration()
# Use just-specified compute target ("cpu-cluster")
aml_run_config.target = compute_target
# Specify CondaDependencies obj, add necessary packages
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas', 'scikit-learn'],
    pip_packages=['azureml-sdk[automl]', 'pyarrow'])
Prepare data for automated machine learning
Write the data preparation code
The baseline Titanic dataset consists of mixed numerical and text data, with some values missing. To prepare it for automated machine learning, the data preparation pipeline step will:
- Fill missing data with either random data or a category corresponding to "Unknown"
- Transform categorical data to integers
- Drop columns that we don't intend to use
- Split the data into training and testing sets
- Write the transformed data to the OutputFileDatasetConfig output paths
%%writefile dataprep.py
from azureml.core import Run
import pandas as pd
import numpy as np
import argparse
import os  # needed for os.path.join when writing the output file
RANDOM_SEED=42
def prepare_age(df):
    # Fill in missing Age values from distribution of present Age values
    mean = df["Age"].mean()
    std = df["Age"].std()
    is_null = df["Age"].isnull().sum()
    # compute enough (== is_null().sum()) random numbers between the mean, std
    rand_age = np.random.randint(mean - std, mean + std, size = is_null)
    # fill NaN values in Age column with random values generated
    age_slice = df["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    df["Age"] = age_slice
    df["Age"] = df["Age"].astype(int)
    # Quantize age into 5 classes
    df['Age_Group'] = pd.qcut(df['Age'],5, labels=False)
    df.drop(['Age'], axis=1, inplace=True)
    return df
def prepare_fare(df):
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    df['Fare'] = df['Fare'].fillna(0)
    df['Fare_Group'] = pd.qcut(df['Fare'],5,labels=False)
    df.drop(['Fare'], axis=1, inplace=True)
    return df
def prepare_genders(df):
    genders = {"male": 0, "female": 1, "unknown": 2}
    df['Sex'] = df['Sex'].map(genders)
    df['Sex'] = df['Sex'].fillna(2)
    df['Sex'] = df['Sex'].astype(int)
    return df
def prepare_embarked(df):
    # Treat empty strings and missing values as the "U" (unknown) port
    df['Embarked'] = df['Embarked'].replace('', 'U')
    df['Embarked'] = df['Embarked'].fillna('U')
    ports = {"S": 0, "C": 1, "Q": 2, "U": 3}
    df['Embarked'] = df['Embarked'].map(ports)
    return df
parser = argparse.ArgumentParser()
parser.add_argument('--output_path', dest='output_path', required=True)
args = parser.parse_args()
titanic_ds = Run.get_context().input_datasets['titanic_ds']
df = titanic_ds.to_pandas_dataframe().drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
df = prepare_embarked(prepare_genders(prepare_fare(prepare_age(df))))
df.to_csv(os.path.join(args.output_path,"prepped_data.csv"))
print(f"Wrote prepped data to {args.output_path}/prepped_data.csv")
The above code snippet is a complete, but minimal, example of data preparation for the Titanic data. The snippet starts with a Jupyter "magic command" to output the code to a file. If you aren't using a Jupyter notebook, remove that line and create the file manually.
The various prepare_ functions in the above snippet modify the relevant column in the input dataset. These functions work on the data once it has been changed into a Pandas DataFrame object. In each case, missing data is either filled with representative random data or categorical data indicating "Unknown." Text-based categorical data is mapped to integers. No-longer-needed columns are overwritten or dropped.
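The fill-then-map pattern used for the categorical columns can be illustrated without pandas. The following is a minimal sketch in plain Python; the helper name encode_with_unknown is hypothetical and isn't part of dataprep.py. Unlike the pandas version, this sketch also sends unrecognized labels to the "Unknown" code, which is an assumption made here for simplicity.

```python
def encode_with_unknown(values, mapping, unknown_code):
    """Map each category label to an integer code.

    Missing (None), empty, or unrecognized labels receive unknown_code.
    """
    encoded = []
    for value in values:
        if value is None or value == "":
            encoded.append(unknown_code)
        else:
            encoded.append(mapping.get(value, unknown_code))
    return encoded


# Port codes as used in the article; 3 stands for "Unknown"
ports = {"S": 0, "C": 1, "Q": 2}
embarked_codes = encode_with_unknown(["S", "", None, "Q", "C"], ports, unknown_code=3)
print(embarked_codes)
```

Keeping a dedicated integer for "Unknown" means that no rows have to be dropped just because a category is missing.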
After the code defines the data preparation functions, the code parses the input argument, which is the path to which we want to write our data. (These values will be determined by OutputFileDatasetConfig objects that will be discussed in the next step.) The code retrieves the registered 'titanic_ds' Dataset, converts it to a Pandas DataFrame, and calls the various data preparation functions.
Since the output_path is a directory, the call to to_csv() specifies the filename prepped_data.csv.
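The directory-plus-filename pattern can be tried locally before running it in a pipeline. The sketch below uses only the standard library; write_prepped_rows is a hypothetical helper, a temporary directory stands in for the location the pipeline would supply, and creating the directory with os.makedirs is a defensive assumption rather than something the article's script does.

```python
import argparse
import csv
import os
import tempfile


def write_prepped_rows(output_dir, rows, filename="prepped_data.csv"):
    """Write rows (a list of dicts) as CSV inside output_dir and return the file path."""
    # Create the directory if the runtime has not already done so
    os.makedirs(output_dir, exist_ok=True)
    file_path = os.path.join(output_dir, filename)
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return file_path


parser = argparse.ArgumentParser()
parser.add_argument("--output_path", required=True)

# Simulate the command line that a pipeline step would build
with tempfile.TemporaryDirectory() as tmp_dir:
    args = parser.parse_args(["--output_path", os.path.join(tmp_dir, "prepped")])
    sample_rows = [{"Survived": 0, "Sex": 0}, {"Survived": 1, "Sex": 1}]
    written_path = write_prepped_rows(args.output_path, sample_rows)
    written_name = os.path.basename(written_path)
    with open(written_path, newline="") as f:
        row_count = sum(1 for _ in csv.DictReader(f))

print(written_name, row_count)
```

Passing an explicit argument list to parse_args() is a convenient way to exercise the script's argument handling without a real command line.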
Write the data preparation pipeline step (PythonScriptStep)
The data preparation code described above must be associated with a PythonScriptStep object to be used with a pipeline. The path to which the CSV output is written is generated by an OutputFileDatasetConfig object. The resources prepared earlier, such as the ComputeTarget, the RunConfiguration, and the 'titanic_ds' Dataset, are used to complete the specification.
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep
prepped_data_path = OutputFileDatasetConfig(name="output_path")
dataprep_step = PythonScriptStep(
    name="dataprep",
    script_name="dataprep.py",
    compute_target=compute_target,
    runconfig=aml_run_config,
    arguments=["--output_path", prepped_data_path],
    inputs=[titanic_ds.as_named_input('titanic_ds')],
    allow_reuse=True
)
The prepped_data_path object is of type OutputFileDatasetConfig, which points to a directory. Notice that it's specified in the arguments parameter. If you review the previous step, you'll see that within the data preparation code, the value of the argument '--output_path' is the directory path at which the CSV file was written.
Train with AutoMLStep
Configuring an automated ML pipeline step is done with the AutoMLConfig class. This flexible class is described in Configure automated ML experiments in Python. Data input and output are the only aspects of configuration that require special attention in an ML pipeline. Input and output for AutoMLConfig in pipelines are discussed in detail below. Beyond data, an advantage of ML pipelines is the ability to use different compute targets for different steps. You might choose to use a more powerful ComputeTarget only for the automated ML process. Doing so is as straightforward as assigning a more powerful RunConfiguration to the AutoMLConfig object's run_configuration parameter.
Send data to AutoMLStep
In an ML pipeline, the input data must be a Dataset object. The highest-performing way is to provide the input data in the form of OutputTabularDatasetConfig objects. You create an object of that type by calling read_delimited_files() on an OutputFileDatasetConfig, such as the prepped_data_path object.
# type(prepped_data) == OutputTabularDatasetConfig
prepped_data = prepped_data_path.read_delimited_files()
Another option is to use Dataset objects registered in the workspace:
prepped_data = Dataset.get_by_name(ws, 'Data_prepared')
Comparing the two techniques:
Technique | Benefits and drawbacks |
---|---|
OutputTabularDatasetConfig | Higher performance; natural route from OutputFileDatasetConfig; data isn't persisted after pipeline run |
Registered Dataset | Lower performance; can be generated in many ways; data persists and is visible throughout workspace; see the notebook showing the registered Dataset technique |
Specify automated ML outputs
The outputs of the AutoMLStep are the final metric scores of the highest-performing model and that model itself. To use these outputs in further pipeline steps, prepare PipelineData objects to receive them.
from azureml.pipeline.core import TrainingOutput, PipelineData
metrics_data = PipelineData(name='metrics_data',
                            datastore=datastore,
                            pipeline_output_name='metrics_output',
                            training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='best_model_data',
                          datastore=datastore,
                          pipeline_output_name='model_output',
                          training_output=TrainingOutput(type='Model'))
The snippet above creates the two PipelineData objects for the metrics and model output. Each is named, assigned to the default datastore retrieved earlier, and associated with the particular type of TrainingOutput from the AutoMLStep. Because we assign pipeline_output_name on these PipelineData objects, their values will be available not just from the individual pipeline step, but from the pipeline as a whole, as will be discussed below in the section "Examine pipeline results."
Configure and create the automated ML pipeline step
Once the inputs and outputs are defined, it's time to create the AutoMLConfig and AutoMLStep. The details of the configuration will depend on your task, as described in Configure automated ML experiments in Python. For the Titanic survival classification task, the following snippet demonstrates a simple configuration.
from azureml.train.automl import AutoMLConfig
from azureml.pipeline.steps import AutoMLStep
# Change iterations to a reasonable number (50) to get better accuracy
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 2,
    "experiment_timeout_hours" : 0.25,
    "primary_metric" : 'AUC_weighted'
}
automl_config = AutoMLConfig(task = 'classification',
                             path = '.',
                             debug_log = 'automated_ml_errors.log',
                             compute_target = compute_target,
                             run_configuration = aml_run_config,
                             featurization = 'auto',
                             training_data = prepped_data,
                             label_column_name = 'Survived',
                             **automl_settings)
train_step = AutoMLStep(name='AutoML_Classification',
                        automl_config=automl_config,
                        passthru_automl_config=False,
                        outputs=[metrics_data,model_data],
                        enable_default_model_output=False,
                        enable_default_metrics_output=False,
                        allow_reuse=True)
The snippet shows an idiom commonly used with AutoMLConfig. Arguments that are more fluid (hyperparameter-ish) are specified in a separate dictionary while the values less likely to change are specified directly in the AutoMLConfig constructor. In this case, the automl_settings specify a brief run: the run will stop after only 2 iterations or 15 minutes, whichever comes first.
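To make the dictionary-unpacking idiom and the timeout arithmetic concrete, the sketch below uses a stand-in function in place of the AutoMLConfig constructor. The function describe_run_limits is hypothetical and isn't an SDK call; it only shows how the settings arrive as keyword arguments and how 0.25 hours becomes 15 minutes.

```python
def describe_run_limits(task, iterations, experiment_timeout_hours,
                        iteration_timeout_minutes, **other_settings):
    """Stand-in for a constructor that accepts settings as keyword arguments."""
    return {
        "task": task,
        "max_iterations": iterations,
        # 0.25 hours * 60 = 15 minutes for the whole experiment
        "experiment_timeout_minutes": experiment_timeout_hours * 60,
        "iteration_timeout_minutes": iteration_timeout_minutes,
        "other": other_settings,
    }


automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 2,
    "experiment_timeout_hours": 0.25,
    "primary_metric": "AUC_weighted",
}

# ** unpacks the dictionary into keyword arguments
limits = describe_run_limits(task="classification", **automl_settings)
print(limits["experiment_timeout_minutes"])
```

Keeping the frequently tuned values in one dictionary makes it easy to swap in a longer-running configuration without touching the rest of the call.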
The automl_settings dictionary is passed to the AutoMLConfig constructor as kwargs. The other parameters aren't complex:
- task is set to classification for this example. Other valid values are regression and forecasting
- path and debug_log describe the path to the project and a local file to which debug information will be written
- compute_target is the previously defined compute_target that, in this example, is an inexpensive CPU-based machine. If you're using AutoML's Deep Learning facilities, you would want to change the compute target to be GPU-based
- featurization is set to auto. More details can be found in the Data Featurization section of the automated ML configuration document
- label_column_name indicates which column we are interested in predicting
- training_data is set to the OutputTabularDatasetConfig objects made from the outputs of the data preparation step
The AutoMLStep itself takes the AutoMLConfig and has, as outputs, the PipelineData objects created to hold the metrics and model data.
Important
You must set enable_default_model_output and enable_default_metrics_output to True only if you are using AutoMLStepRun.
In this example, the automated ML process will perform cross-validations on the training_data. You can control the number of cross-validations with the n_cross_validations argument. If you've already split your training data as part of your data preparation steps, you can set validation_data to its own Dataset.
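If you prefer to hold out validation data yourself, the split can happen in the data preparation script. The following is a minimal sketch in plain Python; split_rows is a hypothetical helper, and the 80/20 ratio and the seed value are illustrative assumptions, not recommendations from the SDK.

```python
import random


def split_rows(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, validation) lists."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle repeatable across runs
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


rows = [{"id": i} for i in range(10)]
train_rows, validation_rows = split_rows(rows)
print(len(train_rows), len(validation_rows))
```

Fixing the seed keeps the split stable between pipeline runs, which makes metric comparisons across runs more meaningful.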
You might occasionally see the use of X for data features and y for data labels. This technique is deprecated and you should use training_data for input.
Register the model generated by automated ML
The last step in a simple ML pipeline is registering the created model. By adding the model to the workspace's model registry, it will be available in the portal and can be versioned. To register the model, write another PythonScriptStep that takes the model_data output of the AutoMLStep.
Write the code to register the model
A model is registered in a Workspace. You're probably familiar with using Workspace.from_config() to log on to your workspace on your local machine, but there's another way to get the workspace from within a running ML pipeline. The Run.get_context() retrieves the active Run. This run object provides access to many important objects, including the Workspace used here.
%%writefile register_model.py
from azureml.core.model import Model
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)
parser.add_argument("--model_path", required=True)
args = parser.parse_args()
print(f"model_name : {args.model_name}")
print(f"model_path: {args.model_path}")
run = Run.get_context()
# Use the local config.json when the script runs outside a submitted run
ws = Workspace.from_config() if isinstance(run, _OfflineRun) else run.experiment.workspace
model = Model.register(workspace=ws,
                       model_path=args.model_path,
                       model_name=args.model_name)
print("Registered version {0} of model {1}".format(model.version, model.name))
Write the PythonScriptStep code
Warning
If you are using the Azure Machine Learning SDK v1, and your workspace is configured for network isolation (VNet), you may receive an error when running this step. For more information, see HyperdriveStep and AutoMLStep fail with network isolation.
The model-registering PythonScriptStep uses a PipelineParameter for one of its arguments. Pipeline parameters are arguments to pipelines that can be easily set at run-submission time. Once declared, they're passed as normal arguments.
from azureml.pipeline.core.graph import PipelineParameter
# The model name with which to register the trained model in the workspace.
model_name = PipelineParameter("model_name", default_value="TitanicSurvivalInitial")
register_step = PythonScriptStep(script_name="register_model.py",
                                 name="register_model",
                                 allow_reuse=False,
                                 arguments=["--model_name", model_name, "--model_path", model_data],
                                 inputs=[model_data],
                                 compute_target=compute_target,
                                 runconfig=aml_run_config)
Create and run your automated ML pipeline
Creating and running a pipeline that contains an AutoMLStep is no different than a normal pipeline.
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
pipeline = Pipeline(ws, [dataprep_step, train_step, register_step])
experiment = Experiment(workspace=ws, name='titanic_automl')
run = experiment.submit(pipeline, show_output=True)
run.wait_for_completion()
The code above combines the data preparation, automated ML, and model-registering steps into a Pipeline object. It then creates an Experiment object. The Experiment constructor will retrieve the named experiment if it exists or create it if necessary. It submits the Pipeline to the Experiment, creating a Run object that will asynchronously run the pipeline. The wait_for_completion() function blocks until the run completes.
Examine pipeline results
Once the run completes, you can retrieve PipelineData objects that have been assigned a pipeline_output_name. You can download the results and load them for further processing.
metrics_output_port = run.get_pipeline_output('metrics_output')
model_output_port = run.get_pipeline_output('model_output')
metrics_output_port.download('.', show_progress=True)
model_output_port.download('.', show_progress=True)
Downloaded files are written to the subdirectory azureml/{run.id}/. The metrics file is JSON-formatted and can be converted into a Pandas dataframe for examination.
For local processing, you might need to install relevant packages, such as Pandas, Pickle, the Azure Machine Learning SDK, and so forth. For this example, it's likely that the best model found by automated ML will depend on XGBoost.
!pip install xgboost==0.90
import pandas as pd
import json
# metrics_output_port was retrieved with run.get_pipeline_output('metrics_output')
metrics_filename = metrics_output_port._path_on_datastore
# metrics_filename = path to downloaded file
with open(metrics_filename) as f:
    metrics_output_result = f.read()
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df
The code snippet above shows the metrics file being loaded from its location on the Azure datastore. You can also load it from the downloaded file, as shown in the comment. Once you've deserialized it and converted it to a Pandas DataFrame, you can see detailed metrics for each of the iterations of the automated ML step.
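Once the metrics are deserialized, picking out the best iteration is a small dictionary operation. The sketch below uses only the standard library and a hypothetical sample in place of the downloaded file. The layout shown (child run ID, then metric name, then a single-element list) and the run IDs and scores are assumptions for illustration; inspect your own file before relying on this shape.

```python
import json

# Hypothetical sample standing in for the downloaded metrics file
sample_metrics_json = """
{
  "run_0": {"AUC_weighted": [0.84], "accuracy": [0.79]},
  "run_1": {"AUC_weighted": [0.87], "accuracy": [0.81]}
}
"""


def best_iteration(metrics, metric_name):
    """Return (run_id, score) for the entry with the highest value of metric_name."""
    scores = {run_id: values[metric_name][0]
              for run_id, values in metrics.items()
              if metric_name in values}
    best_run_id = max(scores, key=scores.get)
    return best_run_id, scores[best_run_id]


metrics = json.loads(sample_metrics_json)
best_run_id, best_score = best_iteration(metrics, "AUC_weighted")
print(best_run_id, best_score)
```

Using the same metric name that was passed as primary_metric keeps this comparison consistent with how the automated ML step ranks its iterations.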
The model file can be deserialized into a Model object that you can use for inferencing, further metrics analysis, and so forth.
import pickle
# model_output_port was retrieved with run.get_pipeline_output('model_output')
model_filename = model_output_port._path_on_datastore
# model_filename = path to downloaded file
with open(model_filename, "rb") as f:
    best_model = pickle.load(f)
# ... inferencing code not shown ...
For more information on loading and working with existing models, see Use an existing model with Azure Machine Learning.
Download the results of an automated ML run
If you've been following along with the article, you'll have an instantiated run object. But you can also retrieve completed Run objects from the Workspace by way of an Experiment object.
The workspace contains a complete record of all your experiments and runs. You can either use the portal to find and download the outputs of experiments or use code. To access the records from a historic run, use Azure Machine Learning to find the ID of the run in which you are interested. With that ID, you can choose the specific run by way of the Workspace and Experiment.
# Retrieved from Azure Machine Learning web UI
run_id = 'aaaaaaaa-bbbb-cccc-dddd-0123456789AB'
experiment = ws.experiments['titanic_automl']
run = next(run for run in experiment.get_runs() if run.id == run_id)
You would have to change the strings in the above code to the specifics of your historical run. The snippet above assumes that you've assigned ws to the relevant Workspace with the normal from_config(). The experiment of interest is directly retrieved and then the code finds the Run of interest by matching the run.id value.
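Note that next() raises StopIteration when no run matches the ID. A slightly more defensive variant passes a default value, as in the following minimal sketch. The find_run helper is hypothetical, and simple stand-in objects with an id attribute replace real Run objects so that the example runs without a workspace.

```python
from types import SimpleNamespace


def find_run(runs, run_id):
    """Return the first run whose id matches run_id, or None if there is no match."""
    return next((run for run in runs if run.id == run_id), None)


# Stand-in objects; real code would pass the runs returned by the experiment
fake_runs = [SimpleNamespace(id="run-a"), SimpleNamespace(id="run-b")]

found = find_run(fake_runs, "run-b")
missing = find_run(fake_runs, "run-z")
print(found.id, missing)
```

Returning None lets the calling code report a mistyped run ID clearly instead of failing with an unhandled exception.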
Once you have a Run object, you can download the metrics and model.
automl_run = next(r for r in run.get_children() if r.name == 'AutoML_Classification')
outputs = automl_run.get_outputs()
metrics = outputs['default_metrics_AutoML_Classification']
model = outputs['default_model_AutoML_Classification']
metrics.get_port_data_reference().download('.')
model.get_port_data_reference().download('.')
Each Run object contains StepRun objects that contain information about the individual pipeline step run. The run is searched for the StepRun object for the AutoMLStep. The metrics and model are retrieved using their default names, which are available even if you don't pass PipelineData objects to the outputs parameter of the AutoMLStep.
Finally, the actual metrics and model are downloaded to your local machine, as was discussed in the "Examine pipeline results" section above.
Next Steps
- Run this Jupyter notebook showing a complete example of automated ML in a pipeline that uses regression to predict taxi fares
- Create automated ML experiments without writing code
- Explore a variety of Jupyter notebooks demonstrating automated ML
- Read about integrating your pipeline into End-to-end MLOps or investigate the MLOps GitHub repository