Hyperopt best practices and troubleshooting
Note
The open-source version of Hyperopt is no longer being maintained.
Hyperopt will be removed in the next major DBR ML version. Azure Databricks recommends using either Optuna for single-node optimization or RayTune for a similar experience to the deprecated Hyperopt distributed hyperparameter tuning functionality. Learn more about using RayTune on Azure Databricks.
Best practices
- Bayesian approaches can be much more efficient than grid search and random search. Hence, with the Hyperopt Tree of Parzen Estimators (TPE) algorithm, you can explore more hyperparameters and larger ranges. Using domain knowledge to restrict the search domain can optimize tuning and produce better results.
- When you use
hp.choice()
, Hyperopt returns the index of the choice list. Therefore the parameter logged in MLflow is also the index. Usehyperopt.space_eval()
to retrieve the parameter values. - For models with long training times, start experimenting with small datasets and many hyperparameters. Use MLflow to identify the best performing models and determine which hyperparameters can be fixed. In this way, you can reduce the parameter space as you prepare to tune at scale.
- Take advantage of Hyperopt support for conditional dimensions and hyperparameters. For example, when you evaluate multiple flavors of gradient descent, instead of limiting the hyperparameter space to just the common hyperparameters, you can have Hyperopt include conditional hyperparameters—the ones that are only appropriate for a subset of the flavors. For more information about using conditional parameters, see Defining a search space.
- When using
SparkTrials
, configure parallelism appropriately for CPU-only versus GPU-enabled clusters. In Azure Databricks, CPU and GPU clusters use different numbers of executor threads per worker node. CPU clusters use multiple executor threads per node. GPU clusters use only one executor thread per node to avoid conflicts among multiple Spark tasks trying to use the same GPU. While this is generally optimal for libraries written for GPUs, it means that maximum parallelism is reduced on GPU clusters, so be aware of how many GPUs each trial can use when selecting GPU instance types. See GPU-enabled Clusters for details. - Do not use
SparkTrials
on autoscaling clusters. Hyperopt selects the parallelism value when execution begins. If the cluster later autoscales, Hyperopt will not be able to take advantage of the new cluster size.
Troubleshooting
- A reported loss of NaN (not a number) usually means the objective function passed to
fmin()
returned NaN. This does not affect other runs and you can safely ignore it. To prevent this result, try adjusting the hyperparameter space or modifying the objective function. - Because Hyperopt uses stochastic search algorithms, the loss usually does not decrease monotonically with each run. However, these methods often find the best hyperparameters more quickly than other methods.
- Both Hyperopt and Spark incur overhead that can dominate the trial duration for short trial runs (low tens of seconds). The speedup you observe may be small or even zero.
Example notebook: Best practices for datasets of different sizes
SparkTrials
runs the trials on Spark worker nodes. This notebook provides guidelines on how to move datasets of different orders of magnitude to worker nodes when using SparkTrials
.