Hyperopt concepts
Note
The open-source version of Hyperopt is no longer being maintained.
Hyperopt will be removed in the next major DBR ML version. Azure Databricks recommends using either Optuna for single-node optimization or RayTune for a similar experience to the deprecated Hyperopt distributed hyperparameter tuning functionality. Learn more about using RayTune on Azure Databricks.
This article describes some of the concepts you need to know to use distributed Hyperopt.
In this section:
- fmin()
- The SparkTrials class
- SparkTrials and MLflow
For examples illustrating how to use Hyperopt in Azure Databricks, see Hyperopt.
fmin()
You use fmin() to execute a Hyperopt run. The arguments for fmin() are shown in the table; see the Hyperopt documentation for more information. For examples of how to use each argument, see the example notebooks.
| Argument name | Description |
| --- | --- |
| fn | Objective function. Hyperopt calls this function with values generated from the hyperparameter space provided in the space argument. This function can return the loss as a scalar value or in a dictionary (see the Hyperopt docs for details). This function typically contains code for model training and loss calculation. |
| space | Defines the hyperparameter space to search. Hyperopt provides great flexibility in how this space is defined. You can choose a categorical option such as algorithm, or a probabilistic distribution for numeric values such as uniform and log. |
| algo | Hyperopt search algorithm to use to search the hyperparameter space. The most commonly used are hyperopt.rand.suggest for Random Search and hyperopt.tpe.suggest for TPE. |
| max_evals | Number of hyperparameter settings to try (the number of models to fit). |
| max_queue_len | Number of hyperparameter settings Hyperopt should generate ahead of time. Because the Hyperopt TPE generation algorithm can take some time, it can be helpful to increase this beyond the default value of 1, but generally no larger than the SparkTrials setting parallelism. |
| trials | A Trials or SparkTrials object. Use SparkTrials when you call single-machine algorithms such as scikit-learn methods in the objective function. Use Trials when you call distributed training algorithms such as MLlib methods or Horovod in the objective function. |
| early_stop_fn | An optional early stopping function to determine if fmin should stop before max_evals is reached. Default is None. The input signature of the function is Trials, *args and the output signature is bool, *args. The output boolean indicates whether or not to stop. *args is any state, where the output of a call to early_stop_fn serves as input to the next call. Trials can be a SparkTrials object. When using SparkTrials, the early stopping function is not guaranteed to run after every trial; instead, it is polled. See the Hyperopt documentation for an example of an early stopping function. |
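The following minimal sketch shows how these arguments fit together for a single-machine scikit-learn model. The dataset, model, and search space here are illustrative assumptions, not part of this article.

```python
from hyperopt import fmin, tpe, hp, STATUS_OK, SparkTrials
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(params):
    # fn: train a single-machine model and return the loss in a dictionary.
    model = SVC(C=params["C"], gamma=params["gamma"])
    accuracy = cross_val_score(model, X, y, cv=3).mean()
    return {"loss": -accuracy, "status": STATUS_OK}

# space: a log-uniform distribution for each numeric hyperparameter.
search_space = {
    "C": hp.loguniform("C", -4, 4),
    "gamma": hp.loguniform("gamma", -4, 4),
}

best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,      # TPE search algorithm
    max_evals=20,          # number of models to fit
    trials=SparkTrials(),  # distribute trials across Spark workers
)
```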
The SparkTrials class
SparkTrials is an API developed by Databricks that allows you to distribute a Hyperopt run without making other changes to your Hyperopt code. SparkTrials accelerates single-machine tuning by distributing trials to Spark workers.
Note
SparkTrials is designed to parallelize computations for single-machine ML models such as scikit-learn. For models created with distributed ML algorithms such as MLlib or Horovod, do not use SparkTrials. In this case the model building process is automatically parallelized on the cluster and you should use the default Hyperopt class Trials.
This section describes how to configure the arguments you pass to SparkTrials and implementation aspects of SparkTrials.
Arguments
SparkTrials takes two optional arguments:

- parallelism: Maximum number of trials to evaluate concurrently. A higher number lets you scale-out testing of more hyperparameter settings. Because Hyperopt proposes new trials based on past results, there is a trade-off between parallelism and adaptivity. For a fixed max_evals, greater parallelism speeds up calculations, but lower parallelism may lead to better results since each iteration has access to more past results.

  Default: Number of Spark executors available. Maximum: 128. If the value is greater than the number of concurrent tasks allowed by the cluster configuration, SparkTrials reduces parallelism to this value.

- timeout: Maximum number of seconds an fmin() call can take. When this number is exceeded, all runs are terminated and fmin() exits. Information about completed runs is saved.
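As a brief sketch, both arguments are passed when you construct the SparkTrials object; the specific values below are illustrative assumptions.

```python
from hyperopt import SparkTrials

spark_trials = SparkTrials(
    parallelism=8,  # evaluate up to 8 trials concurrently
    timeout=3600,   # terminate fmin() after one hour; completed runs are saved
)
# Pass this object to fmin() through the trials argument.
```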
Implementation
When defining the objective function fn passed to fmin(), and when selecting a cluster setup, it is helpful to understand how SparkTrials distributes tuning tasks.
In Hyperopt, a trial generally corresponds to fitting one model on one setting of hyperparameters. Hyperopt iteratively generates trials, evaluates them, and repeats.
With SparkTrials, the driver node of your cluster generates new trials, and worker nodes evaluate those trials. Each trial is generated with a Spark job which has one task, and is evaluated in the task on a worker machine. If your cluster is set up to run multiple tasks per worker, then multiple trials may be evaluated at once on that worker.
SparkTrials and MLflow
Databricks Runtime ML supports logging to MLflow from workers. You can add custom logging code in the objective function you pass to Hyperopt.
SparkTrials logs tuning results as nested MLflow runs as follows:
- Main or parent run: The call to fmin() is logged as the main run. If there is an active run, SparkTrials logs to this active run and does not end the run when fmin() returns. If there is no active run, SparkTrials creates a new run, logs to it, and ends the run before fmin() returns.
- Child runs: Each hyperparameter setting tested (a “trial”) is logged as a child run under the main run. MLflow log records from workers are also stored under the corresponding child runs.
When calling fmin(), Databricks recommends active MLflow run management; that is, wrap the call to fmin() inside a with mlflow.start_run(): statement. This ensures that each fmin() call is logged to a separate MLflow main run, and makes it easier to log extra tags, parameters, or metrics to that run.
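A minimal sketch of this pattern, reusing the objective function and search space assumed in the earlier example:

```python
import mlflow
from hyperopt import fmin, tpe, SparkTrials

with mlflow.start_run():
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=20,
        trials=SparkTrials(parallelism=8),
    )
    # Extra logging goes to the same main run as the fmin() call.
    mlflow.log_params(best_params)
```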
Note
When you call fmin() multiple times within the same active MLflow run, MLflow logs those calls to the same main run. To resolve name conflicts for logged parameters and tags, MLflow appends a UUID to names with conflicts.
When logging from workers, you do not need to manage runs explicitly in the objective function. Call mlflow.log_param("param_from_worker", x) in the objective function to log a parameter to the child run. You can log parameters, metrics, tags, and artifacts in the objective function.
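For example, a worker-side objective function might look like the following sketch; train_and_evaluate is a hypothetical helper standing in for your model training code.

```python
import mlflow
from hyperopt import STATUS_OK

def objective(params):
    loss = train_and_evaluate(params)  # hypothetical training helper
    # No explicit run management is needed; these calls log to the child run
    # that SparkTrials creates for this trial.
    mlflow.log_param("param_from_worker", params)
    mlflow.log_metric("loss_from_worker", loss)
    return {"loss": loss, "status": STATUS_OK}
```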