How to use the ML.NET Automated Machine Learning (AutoML) API

In this article, you learn how to use the ML.NET Automated Machine Learning (AutoML) API.

Samples for the AutoML API can be found in the dotnet/machinelearning-samples repo.

Installation

To use the AutoML API, install the Microsoft.ML.AutoML NuGet package in the .NET project where you want to use it.
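
For example, you can add the package from the project directory by using the .NET CLI:

dotnet add package Microsoft.ML.AutoML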

Note

This guide uses version 0.20.0 and later of the Microsoft.ML.AutoML NuGet package. Although samples and code from earlier versions still work, it is highly recommended you use the APIs introduced in this version for new projects.

For more information on installing NuGet packages, see the guides on installing and using a package in Visual Studio or with the .NET CLI.

Quick Start

AutoML provides several defaults for quickly training machine learning models. In this section you'll learn how to:

  • Load your data
  • Define your pipeline
  • Configure your experiment
  • Run your experiment
  • Use the best model to make predictions

Define your problem

The problem in this example is to predict the fare of a taxi trip. The data is stored in a comma-separated file called taxi-fare-train.csv that looks like the following:

vendor_id  rate_code  passenger_count  trip_time_in_secs  trip_distance  payment_type  fare_amount
CMT        1          1                1271               3.8            CRD           17.5
CMT        1          1                474                1.5            CRD           8
CMT        1          1                637                1.4            CRD           8.5

Load your data

Start by initializing your MLContext. MLContext is the starting point for all ML.NET operations. Initializing MLContext creates a new ML.NET environment that can be shared across the model-creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.

Then, to load your data, use the InferColumns method.

// Initialize MLContext
MLContext ctx = new MLContext();

// Define data path
var dataPath = Path.GetFullPath(@"..\..\..\..\Data\taxi-fare-train.csv");

// Infer column information
ColumnInferenceResults columnInference =
    ctx.Auto().InferColumns(dataPath, labelColumnName: "fare_amount", groupColumns: false);

InferColumns loads a few rows from the dataset. It then inspects the data and tries to guess or infer the data type for each of the columns based on their content.

The default behavior is to group columns of the same type into feature vectors or arrays containing the elements for each of the individual columns. Setting groupColumns to false overrides that default behavior and only performs column inference without grouping columns. Keeping columns separate lets you apply different data transformations to individual columns during preprocessing, rather than to the column grouping as a whole.

The result of InferColumns is a ColumnInferenceResults object that contains the options needed to create a TextLoader as well as column information.

For the sample dataset in taxi-fare-train.csv, column information might look like the following:
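
Column             Inferred type  Purpose
vendor_id          String         Categorical feature
rate_code          Single         Numerical feature
passenger_count    Single         Numerical feature
trip_time_in_secs  Single         Numerical feature
trip_distance      Single         Numerical feature
payment_type       String         Categorical feature
fare_amount        Single         Label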

Once you have your column information, use the TextLoader.Options defined by the ColumnInferenceResults to create a TextLoader to load your data into an IDataView.

// Create text loader
TextLoader loader = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions);

// Load data into IDataView
IDataView data = loader.Load(dataPath);

It's often good practice to split your data into train and validation sets. Use TrainTestSplit to create an 80% training and 20% validation split of your dataset.

TrainTestData trainValidationData = ctx.Data.TrainTestSplit(data, testFraction: 0.2);

Define your pipeline

Your pipeline defines the data processing steps and the machine learning algorithm to use for training your model.

SweepablePipeline pipeline =
    ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
        .Append(ctx.Auto().Regression(labelColumnName: columnInference.ColumnInformation.LabelColumnName));

A SweepablePipeline is a collection of SweepableEstimator objects. A SweepableEstimator is an ML.NET Estimator with a SearchSpace.

The Featurizer is a convenience API that builds a sweepable pipeline of data processing sweepable estimators based on the column information you provide. Instead of building a pipeline from scratch, Featurizer automates the data preprocessing step. For more information on supported transforms by ML.NET, see the data transformations guide.

The Featurizer output is a single column containing a numerical feature vector representing the transformed data for each of the columns. This feature vector is then used as input for the algorithms used to train a machine learning model.

If you want finer control over your data preprocessing, you can create a pipeline with each of the individual preprocessing steps. For more information, see the prepare data for building a model guide.
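
For instance, a rough sketch of a manually defined preprocessing pipeline for the taxi dataset might look like the following. The column names come from the sample above, and the choice of transforms here is only illustrative:

// One-hot encode the string columns, then concatenate all features into a single "Features" vector
var manualPipeline =
    ctx.Transforms.Categorical.OneHotEncoding(new[]
    {
        new InputOutputColumnPair("vendor_id"),
        new InputOutputColumnPair("payment_type")
    })
    .Append(ctx.Transforms.Concatenate("Features",
        "vendor_id", "payment_type", "rate_code", "passenger_count",
        "trip_time_in_secs", "trip_distance"))
    .Append(ctx.Auto().Regression(labelColumnName: "fare_amount"));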

Tip

Use Featurizer with ColumnInferenceResults to maximize the utility of AutoML.

For training, AutoML provides a sweepable pipeline with default trainers and search space configurations for binary classification, multiclass classification, and regression tasks.

For the taxi fare prediction problem, since the goal is to predict a numerical value, use Regression. For more information on choosing a task, see Machine learning tasks in ML.NET.

Configure your experiment

First, create an AutoML experiment. An AutoMLExperiment is a collection of TrialResult objects.

AutoMLExperiment experiment = ctx.Auto().CreateExperiment();

Once your experiment is created, use the extension methods it provides to configure different settings.

experiment
    .SetPipeline(pipeline)
    .SetRegressionMetric(RegressionMetric.RSquared, labelColumn: columnInference.ColumnInformation.LabelColumnName)
    .SetTrainingTimeInSeconds(60)
    .SetDataset(trainValidationData);

In this example, you:

  • Set the sweepable pipeline to run during the experiment by calling SetPipeline.
  • Choose RSquared as the metric to optimize during training by calling SetRegressionMetric. For more information on evaluation metrics, see the evaluate your ML.NET model with metrics guide.
  • Set 60 seconds as the amount of time you want to train for by calling SetTrainingTimeInSeconds. A good heuristic to determine how long to train for is the size of your data. Typically, larger datasets require longer training time. For more information, see training time guidance.
  • Provide the training and validation datasets to use by calling SetDataset.

Once your experiment is defined, you'll want some way to track its progress. The quickest way to track progress is by subscribing to the Log event from MLContext.

// Log experiment trials
ctx.Log += (_, e) => {
    if (e.Source.Equals("AutoMLExperiment"))
    {
        Console.WriteLine(e.RawMessage);
    }
};

Run your experiment

Now that you've defined your experiment, use the RunAsync method to start your experiment.

TrialResult experimentResults = await experiment.RunAsync();

Once the time to train expires, the result is a TrialResult for the best model found during training.

At this point, you can save your model or use it for making predictions. For more information on how to use an ML.NET model, see the guides on saving and loading trained models and on making predictions with a trained model.
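
For example, a minimal sketch that saves the best model from the quick-start experiment to disk might look like the following. The file name is only an illustration:

// Get the best model found during the experiment
var bestModel = experimentResults.Model;

// Save the model along with the input data schema
ctx.Model.Save(bestModel, data.Schema, "taxi-fare-model.zip");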

Modify column inference results

Because InferColumns only loads a subset of your data, edge cases that fall outside of the sampled rows might not be caught, and the wrong data types can be set for your columns. You can update the properties of ColumnInformation to account for cases where the column inference results aren't correct.

For example, in the taxi fare dataset, the data in the rate_code column is a number. However, that numerical value represents a category. By default, calling InferColumns will place rate_code in the NumericColumnNames property instead of CategoricalColumnNames. Because these properties are .NET collections, you can use standard operations to add and remove items from them.

You can do the following to update the ColumnInformation for rate_code.

columnInference.ColumnInformation.NumericColumnNames.Remove("rate_code");
columnInference.ColumnInformation.CategoricalColumnNames.Add("rate_code");

Exclude trainers

By default, AutoML tries multiple trainers as part of the training process to see which one works best for your data. However, throughout the training process you might discover there are some trainers that use up too many compute resources or don't provide good evaluation metrics. You have the option to exclude trainers from the training process. Which trainers are used depends on the task. For a list of supported trainers in ML.NET, see the Machine learning tasks in ML.NET guide.

For example, in the taxi fare regression scenario, to exclude the LightGBM algorithm, set the useLgbm parameter to false.

ctx.Auto().Regression(labelColumnName: columnInference.ColumnInformation.LabelColumnName, useLgbm: false)

The process for excluding trainers in other tasks like binary and multiclass classification works the same way.
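
For example, a binary classification scenario might exclude LightGBM with the same kind of flag. This is a sketch that assumes the binary classification overload exposes a matching useLgbm parameter:

ctx.Auto().BinaryClassification(labelColumnName: "Label", useLgbm: false)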

Customize a sweepable estimator

When you want more granular customization of the estimator options included in your sweepable pipeline, you need to:

  1. Initialize a search space
  2. Use the search space to define a custom factory
  3. Create a sweepable estimator
  4. Add your sweepable estimator to your sweepable pipeline

AutoML provides a set of preconfigured search spaces for trainers used in binary classification, multiclass classification, and regression tasks.

In this example, the search space used is for the SdcaRegressionTrainer. Initialize it by using SdcaOption.

var sdcaSearchSpace = new SearchSpace<SdcaOption>();

Then, use the search space to define a custom factory method to create the SdcaRegressionTrainer. In this example, the values of L1Regularization and L2Regularization are both set to something other than the default. For L1Regularization, the value is determined by the tuner during each trial. L2Regularization is fixed across trials at the hard-coded value. During each trial, the custom factory's output is an SdcaRegressionTrainer with the configured hyperparameters.

// Use the search space to define a custom factory to create an SdcaRegressionTrainer
var sdcaFactory = (MLContext ctx, SdcaOption param) =>
{
    var sdcaOption = new SdcaRegressionTrainer.Options();
    sdcaOption.L1Regularization = param.L1Regularization;
    sdcaOption.L2Regularization = 0.02f;

    sdcaOption.LabelColumnName = columnInference.ColumnInformation.LabelColumnName;

    return ctx.Regression.Trainers.Sdca(sdcaOption);
};

A sweepable estimator is the combination of an estimator and a search space. Now that you've defined a search space and used it to create a custom factory method for generating trainers, use the CreateSweepableEstimator method to create a new sweepable estimator.

// Define Sdca sweepable estimator (SdcaRegressionTrainer + SdcaOption search space)
var sdcaSweepableEstimator = ctx.Auto().CreateSweepableEstimator(sdcaFactory, sdcaSearchSpace);

To use your sweepable estimator in your experiment, add it to your sweepable pipeline.

SweepablePipeline pipeline =
    ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
        .Append(sdcaSweepableEstimator);

Because sweepable pipelines are a collection of sweepable estimators, you can configure and customize as many of these sweepable estimators as you need.

Customize your search space

There are scenarios where you want to go beyond customizing the sweepable estimators used in your experiment and want to control the search space range. You can do so by accessing the search space properties using keys. In this case, the L1Regularization parameter is a float. Therefore, to customize the search range, use UniformSingleOption.

sdcaSearchSpace["L1Regularization"] = new UniformSingleOption(min: 0.01f, max: 2.0f, logBase: false, defaultValue: 0.01f);

Depending on the data type of the hyperparameter you want to set, you can choose from options such as UniformSingleOption (float), UniformDoubleOption (double), UniformIntOption (integer), and ChoiceOption (discrete values).

Search spaces can contain nested search spaces as well.

var searchSpace = new SearchSpace();
searchSpace["SingleOption"] = new UniformSingleOption(min: -10f, max: 10f, defaultValue: 0f);
var nestedSearchSpace = new SearchSpace();
nestedSearchSpace["IntOption"] = new UniformIntOption(min: -10, max: 10, defaultValue: 0);
searchSpace["Nest"] = nestedSearchSpace;

Another option for customizing search ranges is by extending them. For example, SdcaOption only provides the L1Regularization and L2Regularization parameters. However, SdcaRegressionTrainer has more parameters you can set such as BiasLearningRate.

To extend the search space, create a new class, such as SdcaExtendedOption, that inherits from SdcaOption.

public class SdcaExtendedOption : SdcaOption
{
    [Range(0.10f, 1f, 0.01f)]
    public float BiasLearningRate { get; set; }
}

To specify the search space range, use RangeAttribute, which is equivalent to Microsoft.ML.SearchSpace.Option.

Then, anywhere you use your search space, reference the SdcaExtendedOption instead of SdcaOption.

For example, when you initialize your search space, you can do so as follows:

var sdcaSearchSpace = new SearchSpace<SdcaExtendedOption>();
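
For example, a factory based on the earlier SDCA example can consume the extended option and pass the new hyperparameter through to the trainer. The following sketch reuses the variables from that example:

// Use the extended search space to tune BiasLearningRate in addition to L1Regularization
var sdcaFactory = (MLContext ctx, SdcaExtendedOption param) =>
{
    var sdcaOption = new SdcaRegressionTrainer.Options();
    sdcaOption.L1Regularization = param.L1Regularization;
    sdcaOption.L2Regularization = 0.02f;
    sdcaOption.BiasLearningRate = param.BiasLearningRate;
    sdcaOption.LabelColumnName = columnInference.ColumnInformation.LabelColumnName;

    return ctx.Regression.Trainers.Sdca(sdcaOption);
};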

Create your own trial runner

By default, AutoML supports binary classification, multiclass classification, and regression. However, ML.NET supports many more scenarios such as:

  • Recommendation
  • Forecasting
  • Ranking
  • Image classification
  • Text classification
  • Sentence similarity

For scenarios that don't have preconfigured search spaces and sweepable estimators, you can create your own and use a trial runner to enable AutoML for that scenario.

For example, given restaurant review data that looks like the following:

Wow... Loved this place.    1
Crust is not good.          0

You want to use the TextClassificationTrainer trainer to analyze sentiment where 0 is negative and 1 is positive. However, there is no ctx.Auto().TextClassification() configuration.

To use AutoML with the text classification trainer, you'll have to:

  1. Create your own search space.

    // Define TextClassification search space
    public class TCOption
    {
        [Range(64, 128, 32)]
        public int BatchSize { get; set; }
    }
    

    In this case, AutoML will search for different configurations of the BatchSize hyperparameter.

  2. Create a sweepable estimator and add it to your pipeline.

    // Initialize search space
    var tcSearchSpace = new SearchSpace<TCOption>();
    
    // Create factory for Text Classification trainer
    var tcFactory = (MLContext ctx, TCOption param) =>
    {
        return ctx.MulticlassClassification.Trainers.TextClassification(
            sentence1ColumnName: textColumnName,
            batchSize:param.BatchSize);
    };
    
    // Create text classification sweepable estimator
    var tcEstimator =
        ctx.Auto().CreateSweepableEstimator(tcFactory, tcSearchSpace);
    
    // Define text classification pipeline
    var pipeline =
        ctx.Transforms.Conversion.MapValueToKey(columnInference.ColumnInformation.LabelColumnName)
            .Append(tcEstimator);
    

    In this example, the TCOption search space and a custom TextClassificationTrainer factory are used to create a sweepable estimator.

  3. Create a custom trial runner

    To create a custom trial runner, implement ITrialRunner:

    public class TCRunner : ITrialRunner
    {
        private readonly MLContext _context;
        private readonly TrainTestData _data;
        private readonly IDataView _trainDataset;
        private readonly IDataView _evaluateDataset;
        private readonly SweepablePipeline _pipeline;
        private readonly string _labelColumnName;
        private readonly MulticlassClassificationMetric _metric;
    
        public TCRunner(
            MLContext context,
            TrainTestData data,
            SweepablePipeline pipeline,
            string labelColumnName = "Label",
            MulticlassClassificationMetric metric = MulticlassClassificationMetric.MicroAccuracy)
        {
            _context = context;
            _data = data;
            _trainDataset = data.TrainSet;
            _evaluateDataset = data.TestSet;
            _labelColumnName = labelColumnName;
            _pipeline = pipeline;
            _metric = metric;
        }
    
        public void Dispose()
        {
            return;
        }
    
        // Run trial asynchronously
        public Task<TrialResult> RunAsync(TrialSettings settings, CancellationToken ct)
        {
            try
            {
                return Task.Run(() => Run(settings));
            }
            catch (Exception ex) when (ct.IsCancellationRequested)
            {
                throw new OperationCanceledException(ex.Message, ex.InnerException);
            }
            catch (Exception)
            {
                throw;
            }
        }
    
        // Helper function to define trial run logic
        private TrialResult Run(TrialSettings settings)
        {
            try
            {
                // Initialize stop watch to measure time
                var stopWatch = new Stopwatch();
                stopWatch.Start();
    
                // Get pipeline parameters
                var parameter = settings.Parameter["_pipeline_"];
    
                // Use parameters to build pipeline
                var pipeline = _pipeline.BuildFromOption(_context, parameter);
    
                // Train model
                var model = pipeline.Fit(_trainDataset);
    
                // Evaluate the model
                var predictions = model.Transform(_evaluateDataset);
    
                // Get metrics
                var evaluationMetrics = _context.MulticlassClassification.Evaluate(predictions, labelColumnName: _labelColumnName);
                var chosenMetric = GetMetric(evaluationMetrics);
    
                return new TrialResult()
                {
                    Metric = chosenMetric,
                    Model = model,
                    TrialSettings = settings,
                    DurationInMilliseconds = stopWatch.ElapsedMilliseconds
                };
            }
            catch (Exception)
            {
                return new TrialResult()
                {
                    Metric = double.MinValue,
                    Model = null,
                    TrialSettings = settings,
                    DurationInMilliseconds = 0,
                };
            }
        }
    
        // Helper function to choose metric used by experiment
        private double GetMetric(MulticlassClassificationMetrics metric)
        {
            return _metric switch
            {
                MulticlassClassificationMetric.MacroAccuracy => metric.MacroAccuracy,
                MulticlassClassificationMetric.MicroAccuracy => metric.MicroAccuracy,
                MulticlassClassificationMetric.LogLoss => metric.LogLoss,
                MulticlassClassificationMetric.LogLossReduction => metric.LogLossReduction,
                MulticlassClassificationMetric.TopKAccuracy => metric.TopKAccuracy,
                _ => throw new NotImplementedException(),
            };
        }
    }
    

    The TCRunner implementation in this example:

    • Extracts the hyperparameters chosen for that trial
    • Uses the hyperparameters to create an ML.NET pipeline
    • Uses the ML.NET pipeline to train a model
    • Evaluates the model
    • Returns a TrialResult object with the information for that trial
  4. Initialize your custom trial runner

    var tcRunner = new TCRunner(context: ctx, data: trainValidationData, pipeline: pipeline);
    
  5. Create and configure your experiment. Use the SetTrialRunner extension method to add your custom trial runner to your experiment.

    AutoMLExperiment experiment = ctx.Auto().CreateExperiment();
    
    // Configure AutoML experiment
    experiment
        .SetPipeline(pipeline)
        .SetMulticlassClassificationMetric(MulticlassClassificationMetric.MicroAccuracy, labelColumn: columnInference.ColumnInformation.LabelColumnName)
        .SetTrainingTimeInSeconds(120)
        .SetDataset(trainValidationData)
        .SetTrialRunner(tcRunner);
    
  6. Run your experiment

    var tcCts = new CancellationTokenSource();
    TrialResult textClassificationExperimentResults = await experiment.RunAsync(tcCts.Token);
    

Choose a different tuner

AutoML supports various tuning algorithms to iterate through the search space in search of the optimal hyperparameters. By default, it uses the Eci Cost Frugal tuner. Using experiment extension methods, you can choose another tuner that best fits your scenario.

Each tuner has a corresponding experiment extension method that you call to select it.

For example, to use the grid search tuner, your code might look like the following:

experiment.SetGridSearchTuner();

Configure experiment monitoring

The quickest way to monitor the progress of an experiment is to subscribe to the Log event from MLContext. However, the Log event emits a raw dump of the logs generated by AutoML during each trial. Because of the large amount of unformatted information, it's difficult to follow.

For a more controlled monitoring experience, implement a class with the IMonitor interface.

public class AutoMLMonitor : IMonitor
{
    private readonly SweepablePipeline _pipeline;

    public AutoMLMonitor(SweepablePipeline pipeline)
    {
        _pipeline = pipeline;
    }

    public void ReportBestTrial(TrialResult result)
    {
        return;
    }

    public void ReportCompletedTrial(TrialResult result)
    {
        var trialId = result.TrialSettings.TrialId;
        var timeToTrain = result.DurationInMilliseconds;
        var pipeline = _pipeline.ToString(result.TrialSettings.Parameter);
        Console.WriteLine($"Trial {trialId} finished training in {timeToTrain}ms with pipeline {pipeline}");
    }

    public void ReportFailTrial(TrialSettings settings, Exception exception = null)
    {
        if (exception.Message.Contains("Operation was canceled."))
        {
            Console.WriteLine($"{settings.TrialId} cancelled. Time budget exceeded.");
        }
        Console.WriteLine($"{settings.TrialId} failed with exception {exception.Message}");
    }

    public void ReportRunningTrial(TrialSettings setting)
    {
        return;
    }
}

The IMonitor interface has four lifecycle events: ReportBestTrial, ReportCompletedTrial, ReportFailTrial, and ReportRunningTrial.

Tip

Although it's not required, include your SweepablePipeline in your monitor so you can inspect the pipeline that was generated for a trial using the Parameter property of the TrialSettings.

In this example, only the ReportCompletedTrial and ReportFailTrial lifecycle events are implemented.

Once you've implemented your monitor, set it as part of your experiment configuration using SetMonitor.

var monitor = new AutoMLMonitor(pipeline);
experiment.SetMonitor(monitor);

Then, run your experiment:

var cts = new CancellationTokenSource();
TrialResult experimentResults = await experiment.RunAsync(cts.Token);

When you run the experiment with this implementation, the output should look similar to the following:

Trial 0 finished training in 5835ms with pipeline ReplaceMissingValues=>OneHotEncoding=>Concatenate=>FastForestRegression
Trial 1 finished training in 15080ms with pipeline ReplaceMissingValues=>OneHotEncoding=>Concatenate=>SdcaRegression
Trial 2 finished training in 3941ms with pipeline ReplaceMissingValues=>OneHotHashEncoding=>Concatenate=>FastTreeRegression

Persist trials

By default, AutoML only stores the TrialResult for the best model. However, if you want to persist each of the trials, you can do so from within your monitor.

Inside your monitor:

  1. Define a property for your completed trials and a method for accessing them.

    private readonly List<TrialResult> _completedTrials;
    
    public IEnumerable<TrialResult> GetCompletedTrials() => _completedTrials;
    
  2. Initialize it in your constructor

    public AutoMLMonitor(SweepablePipeline pipeline)
    {
        //...
        _completedTrials = new List<TrialResult>();
        //...
    }
    
  3. Append each trial result inside your ReportCompletedTrial lifecycle method.

    public void ReportCompletedTrial(TrialResult result)
    {
        //...
        _completedTrials.Add(result);
    }
    
  4. When training completes, you can access all the completed trials by calling GetCompletedTrials.

    var completedTrials = monitor.GetCompletedTrials();
    

At this point, you can perform additional processing on the collection of completed trials. For example, you can choose a model other than the one selected by AutoML, log trial results to a database, or rebuild the pipeline from any of the completed trials.
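
For example, the following sketch selects the completed trial with the highest metric and rebuilds the concrete pipeline that trial used. It assumes the completedTrials, pipeline, and ctx variables from the earlier examples:

// Pick the completed trial with the highest metric
var bestCompletedTrial = completedTrials.OrderByDescending(t => t.Metric).First();

// Rebuild the concrete pipeline that trial used from its sampled parameters
var rebuiltPipeline = pipeline.BuildFromOption(ctx, bestCompletedTrial.TrialSettings.Parameter["_pipeline_"]);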

Cancel experiments

When you run experiments asynchronously, make sure to cleanly terminate the process. To do so, use a CancellationToken.

Warning

Cancelling an experiment will not save any of the intermediary outputs. Set a checkpoint to save intermediary outputs.

var cts = new CancellationTokenSource();
TrialResult experimentResults = await experiment.RunAsync(cts.Token);
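
For example, to cancel automatically after a fixed wall-clock budget, schedule the token to cancel before awaiting RunAsync:

// Request cancellation if the experiment is still running after two minutes (an arbitrary budget for illustration)
cts.CancelAfter(TimeSpan.FromMinutes(2));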

Set checkpoints

Checkpoints provide a way for you to save intermediary outputs from the training process in the event of an early termination or error. To set a checkpoint, use the SetCheckpoint extension method and provide a directory to store the intermediary outputs.

var checkpointPath = Path.Join(Directory.GetCurrentDirectory(), "automl");
experiment.SetCheckpoint(checkpointPath);

Determine feature importance

As machine learning is introduced into more aspects of everyday life such as healthcare, it's of utmost importance to understand why a machine learning model makes the decisions it does. Permutation Feature Importance (PFI) is a technique used to explain classification, ranking, and regression models. At a high level, the way it works is by randomly shuffling data one feature at a time for the entire dataset and calculating how much the performance metric of interest decreases. The larger the change, the more important that feature is. For more information on PFI, see interpret model predictions using Permutation Feature Importance.

Note

Calculating PFI can be a time-consuming operation. How much time it takes to calculate is proportional to the number of feature columns you have. The more features, the longer PFI takes to run.

To determine feature importance using AutoML:

  1. Get the best model.

    var bestModel = experimentResults.Model;
    
  2. Apply the model to your dataset.

    var transformedData = bestModel.Transform(trainValidationData.TrainSet);
    
  3. Calculate feature importance using PermutationFeatureImportance.

    In this case, the task is regression but the same concept applies to other tasks like ranking and classification.

    var pfiResults =
        ctx.Regression.PermutationFeatureImportance(bestModel, transformedData, permutationCount: 3);
    
  4. Order feature importance by changes to evaluation metrics.

    var featureImportance =
        pfiResults.Select(x => Tuple.Create(x.Key, x.Value.Regression.RSquared))
            .OrderByDescending(x => x.Item2);
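
For example, you can then print the ordered results to see which features have the largest effect on the metric:

foreach (var feature in featureImportance)
{
    Console.WriteLine($"{feature.Item1}: {feature.Item2}");
}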