Hyperparameter tuning (preview)

Article
02/07/2025

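Hyperparameter tuning is the process of finding the best values for the parameters that a machine learning model does not learn during training, but that the user sets before training starts. These parameters are commonly called hyperparameters. Examples include the learning rate, the number of hidden layers in a neural network, the regularization strength, and the batch size.

The performance of a machine learning model can be highly sensitive to the choice of hyperparameters, and the best set of hyperparameters can vary with the specific problem and dataset. Hyperparameter tuning is therefore a critical step in the machine learning pipeline, because it can significantly affect the model's accuracy and generalization performance.

In Fabric, data scientists can use FLAML, a lightweight Python library for efficient automation of machine learning and AI operations, for their hyperparameter tuning needs. Within Fabric notebooks, users can call flaml.tune for efficient hyperparameter tuning.

Important

This feature is currently in preview.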

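Tuning workflow

There are three essential steps to use flaml.tune for a basic tuning task:

Specify the tuning objective with respect to the hyperparameters.
Specify a search space for the hyperparameters.
Specify tuning constraints, including constraints on the resource budget for the tuning, constraints on the configurations, and/or constraints on one or more particular metrics.

Tuning objective

The first step is to specify the tuning objective. To do this, first specify your evaluation procedure for the hyperparameters in a user-defined function, evaluation_function. The function takes a hyperparameter configuration as input. It can return a metric value as a scalar, or return a dictionary of metric names and metric values.

In the example below, we define an evaluation function, evaluate_config, with respect to two hyperparameters named x and y.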

import time

def evaluate_config(config: dict):
    """evaluate a hyperparameter configuration"""
    score = (config["x"] - 85000) ** 2 - config["x"] / config["y"]


    faked_evaluation_cost = config["x"] / 100000
    time.sleep(faked_evaluation_cost)
    # we can return a single float as a score on the input config:
    # return score
    # or, we can return a dictionary that maps metric name to metric value:
    return {"score": score, "evaluation_cost": faked_evaluation_cost, "constraint_metric": config["x"] * config["y"]}
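Before handing an evaluation function to a tuner, it can help to call it by hand on one configuration and look at both return forms. The following is a minimal sketch, not part of FLAML: it reuses the objective above without the simulated sleep, and the names evaluate_config_scalar and normalize_result are hypothetical helpers introduced only for illustration.

```python
def evaluate_config(config: dict) -> dict:
    """Dictionary form: same objective as above, without the simulated sleep."""
    score = (config["x"] - 85000) ** 2 - config["x"] / config["y"]
    return {"score": score, "constraint_metric": config["x"] * config["y"]}


def evaluate_config_scalar(config: dict) -> float:
    """Scalar form: return only the score."""
    return evaluate_config(config)["score"]


def normalize_result(result, metric: str = "score") -> dict:
    """Hypothetical helper: wrap a scalar return value as {metric: value}."""
    if isinstance(result, dict):
        return result
    return {metric: float(result)}


# Evaluate one hand-picked configuration in both forms.
config = {"x": 85000, "y": 2}
as_dict = normalize_result(evaluate_config(config))
as_scalar = normalize_result(evaluate_config_scalar(config))
print(as_dict)
print(as_scalar)
```

Both forms carry the same score. The dictionary form is useful when additional metrics, such as constraint_metric, need to be reported alongside it.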

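Search space

Next, we specify the search space of the hyperparameters. In the search space, you specify the valid values for each hyperparameter and how those values are sampled (for example, from a uniform distribution or a log-uniform distribution). In the example below, we provide the search space for the hyperparameters x and y. The valid values for both are integers from 1 up to 100,000, with the upper bound excluded for tune.randint and tune.lograndint. As the code shows, x is sampled uniformly in log space with tune.lograndint, and y is sampled uniformly with tune.randint.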

from flaml import tune

# construct a search space for the hyperparameters x and y.
config_search_space = {
    "x": tune.lograndint(lower=1, upper=100000),
    "y": tune.randint(lower=1, upper=100000)
}

# provide the search space to tune.run through the config argument
# (the other arguments are elided here):
# tune.run(..., config=config_search_space, ...)
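To make the difference between uniform and log-uniform integer sampling concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration of the two sampling schemes, not FLAML's implementation, and the function names sample_randint and sample_lograndint are assumptions made for this example.

```python
import math
import random


def sample_randint(lower: int, upper: int, rng: random.Random) -> int:
    """Uniform integer in [lower, upper)."""
    return rng.randrange(lower, upper)


def sample_lograndint(lower: int, upper: int, rng: random.Random) -> int:
    """Integer in [lower, upper), drawn uniformly in log space."""
    value = math.exp(rng.uniform(math.log(lower), math.log(upper)))
    # Clamp to guard against floating-point rounding at the boundaries.
    return max(lower, min(int(value), upper - 1))


rng = random.Random(0)
n = 10_000
uniform_samples = [sample_randint(1, 100_000, rng) for _ in range(n)]
log_samples = [sample_lograndint(1, 100_000, rng) for _ in range(n)]

# Compare how often each scheme produces a value below 1,000.
share_uniform = sum(v < 1_000 for v in uniform_samples) / n
share_log = sum(v < 1_000 for v in log_samples) / n
print(f"uniform: {share_uniform:.3f}, log-uniform: {share_log:.3f}")
```

With uniform sampling, only about 1% of the draws are expected to fall below 1,000, while log-uniform sampling puts roughly 60% of the draws there. Log-uniform sampling therefore suits hyperparameters whose useful values span several orders of magnitude.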

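With FLAML, users can customize the domain of a particular hyperparameter. This lets users specify the type and the valid range from which the parameter is sampled. FLAML supports the following hyperparameter types: float, integer, and categorical. For commonly used domains, see the examples below: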

config = {
    # Sample a float uniformly between -5.0 and -1.0
    "uniform": tune.uniform(-5, -1),

    # Sample a float uniformly between 3.2 and 5.4,
    # rounding to increments of 0.2
    "quniform": tune.quniform(3.2, 5.4, 0.2),

    # Sample a float uniformly between 0.0001 and 0.01, while
    # sampling in log space
    "loguniform": tune.loguniform(1e-4, 1e-2),

    # Sample a float uniformly between 0.0001 and 0.1, while
    # sampling in log space and rounding to increments of 0.00005
    "qloguniform": tune.qloguniform(1e-4, 1e-1, 5e-5),

    # Sample a random float from a normal distribution with
    # mean=10 and sd=2
    "randn": tune.randn(10, 2),

    # Sample a random float from a normal distribution with
    # mean=10 and sd=2, rounding to increments of 0.2
    "qrandn": tune.qrandn(10, 2, 0.2),

    # Sample an integer uniformly between -9 (inclusive) and 15 (exclusive)
    "randint": tune.randint(-9, 15),

    # Sample an integer uniformly between -21 (inclusive) and 12 (inclusive (!))
    # rounding to increments of 3 (includes 12)
    "qrandint": tune.qrandint(-21, 12, 3),

    # Sample an integer uniformly between 1 (inclusive) and 10 (exclusive),
    # while sampling in log space
    "lograndint": tune.lograndint(1, 10),

    # Sample an integer uniformly between 2 (inclusive) and 10 (inclusive (!)),
    # while sampling in log space and rounding to increments of 2
    "qlograndint": tune.qlograndint(2, 10, 2),

    # Sample an option uniformly from the specified choices
    "choice": tune.choice(["a", "b", "c"]),
}

To learn more about how to customize domains within your search space, see the FLAML documentation on customizing search spaces.
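The quantized domains in the listing above round each sampled value to a multiple of an increment. The sketch below shows one plausible way to do that rounding in plain Python, assuming nearest-multiple rounding. It is not taken from FLAML's source, and the names quantize and sample_quniform are hypothetical.

```python
import random


def quantize(value: float, q: float) -> float:
    """Round value to the nearest multiple of the increment q."""
    return round(value / q) * q


def sample_quniform(lower: float, upper: float, q: float, rng: random.Random) -> float:
    """Sample uniformly in [lower, upper], then round to increments of q."""
    return quantize(rng.uniform(lower, upper), q)


rng = random.Random(0)
# Mirrors the "quniform" entry above: floats in [3.2, 5.4] in steps of 0.2.
samples = [sample_quniform(3.2, 5.4, 0.2, rng) for _ in range(1_000)]
distinct = sorted({round(s, 10) for s in samples})
print(distinct)
```

Because the rounding happens after sampling, both ends of the range can be returned. This matches the "inclusive (!)" remarks on the quantized integer domains in the listing.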

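Tuning constraints

The last step is to specify the constraints of the tuning task. A notable property of flaml.tune is that it can complete the tuning process within a required resource constraint. To do this, a user can provide a resource constraint as wall-clock time (in seconds) through the time_budget_s argument, or as a number of trials through the num_samples argument.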

# Set a resource constraint of 60 seconds wall-clock time for the tuning.
flaml.tune.run(..., time_budget_s=60, ...)

# Set a resource constraint of 100 trials for the tuning.
flaml.tune.run(..., num_samples=100, ...)

# Use at most 60 seconds and at most 100 trials for the tuning.
flaml.tune.run(..., time_budget_s=60, num_samples=100, ...)
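To illustrate how a wall-clock budget and a trial budget can combine, the following sketch runs a plain random search that stops as soon as either budget is exhausted. It is a simplified stand-in written for this explanation, not FLAML's search algorithm. The function run_random_search and its helpers are hypothetical, and only the parameter names time_budget_s and num_samples are borrowed from the text above.

```python
import random
import time


def run_random_search(evaluate, sample_config, time_budget_s=None, num_samples=-1, seed=0):
    """Minimize evaluate(config) until the time or the trial budget runs out."""
    if time_budget_s is None and num_samples < 0:
        raise ValueError("Set time_budget_s, num_samples, or both.")
    rng = random.Random(seed)
    start = time.monotonic()
    best_config, best_score, trials = None, float("inf"), 0
    while True:
        # num_samples < 0 means that the number of trials is not limited.
        if num_samples >= 0 and trials >= num_samples:
            break
        if time_budget_s is not None and time.monotonic() - start >= time_budget_s:
            break
        config = sample_config(rng)
        score = evaluate(config)
        trials += 1
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score, trials


def objective(config):
    return (config["x"] - 85000) ** 2 - config["x"] / config["y"]


def sample_config(rng):
    return {"x": rng.randrange(1, 100_000), "y": rng.randrange(1, 100_000)}


# At most 60 seconds and at most 100 trials, whichever comes first.
best_config, best_score, trials = run_random_search(
    objective, sample_config, time_budget_s=60, num_samples=100
)
print(trials, best_config, best_score)
```

Because each evaluation here is nearly instantaneous, the trial budget is reached long before the time budget. With an expensive evaluation function, the time budget would usually be the one that ends the search.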

To learn more about additional configuration constraints, see the FLAML documentation on advanced tuning options.
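Besides resource budgets, the workflow mentions constraints on particular metrics, and the evaluation function shown earlier reports a constraint_metric for that purpose. The sketch below shows the idea in plain Python by filtering finished trials against a constraint before picking the best one. The (name, comparison, threshold) tuple format, the threshold of 1000, and the helper names are assumptions made for illustration. The FLAML documentation is the reference for the actual constraint arguments.

```python
import operator

# Assumed, illustrative constraint format: (metric name, comparison, threshold).
OPERATORS = {"<=": operator.le, ">=": operator.ge}


def satisfies(result: dict, constraints) -> bool:
    """Return True if the trial result meets every constraint."""
    return all(
        OPERATORS[op](result[name], threshold) for name, op, threshold in constraints
    )


def best_feasible(results, constraints, metric="score", mode="min"):
    """Pick the best trial among those that satisfy all constraints."""
    feasible = [r for r in results if satisfies(r, constraints)]
    if not feasible:
        return None
    pick = min if mode == "min" else max
    return pick(feasible, key=lambda r: r[metric])


# Made-up trial results in the dictionary form returned by an evaluation function.
results = [
    {"score": 5.0, "constraint_metric": 2000},
    {"score": 9.0, "constraint_metric": 800},
    {"score": 7.0, "constraint_metric": 500},
]
best = best_feasible(results, [("constraint_metric", "<=", 1000)])
print(best)
```

The trial with the lowest score is rejected because it violates the constraint, so the best feasible trial is the one with score 7.0.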

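Putting it together

Once we define our tuning criteria, we can execute the tuning trial. To track the results of the trial, we can use MLflow autologging to capture the metrics and parameters of each run. This code captures the entire hyperparameter tuning trial and highlights each hyperparameter combination that FLAML explored.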

import mlflow
mlflow.set_experiment("flaml_tune_experiment")
mlflow.autolog(exclusive=False)

with mlflow.start_run(nested=True, run_name="Child Run: "):
    analysis = tune.run(
        evaluate_config,  # the function to evaluate a config
        config=config_search_space,  # the search space defined
        metric="score",
        mode="min",  # the optimization mode, "min" or "max"
        num_samples=-1,  # the maximal number of configs to try, -1 means infinite
        time_budget_s=10,  # the time budget in seconds
    )

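Note

When MLflow autologging is enabled, metrics, parameters, and models should be logged automatically as MLflow runs. However, this varies by framework. Metrics and parameters for specific models might not be logged. For example, no metrics are logged for XGBoost, LightGBM, Spark, and SynapseML models. To learn which metrics and parameters are captured for each framework, see the MLflow autologging documentation.

Parallel tuning with Apache Spark

flaml.tune supports tuning both Apache Spark learners and single-node learners. In addition, when tuning single-node learners (for example, Scikit-Learn learners), you can parallelize the tuning to speed it up by setting use_spark = True. On Spark clusters, FLAML launches one trial per executor by default. You can also customize the number of concurrent trials with the n_concurrent_trials argument.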


analysis = tune.run(
    evaluate_config,  # the function to evaluate a config
    config=config_search_space,  # the search space defined
    metric="score",
    mode="min",  # the optimization mode, "min" or "max"
    num_samples=-1,  # the maximal number of configs to try, -1 means infinite
    time_budget_s=10,  # the time budget in seconds
    use_spark=True,
)
print(analysis.best_trial.last_result)  # the best trial's result
print(analysis.best_config)  # the best config

To learn more about how to parallelize your tuning trials, see the FLAML documentation on parallel Spark jobs.
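The benefit of running trials concurrently can be illustrated without a Spark cluster. The sketch below evaluates several configurations at the same time with a thread pool from the Python standard library. It is an analogy for concurrent trials, not how FLAML schedules work on Spark. The function evaluate_concurrently is hypothetical, and its n_concurrent_trials parameter only borrows the name used in the text.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def evaluate_config(config: dict) -> float:
    """Same objective as earlier, with a short fixed sleep as the simulated cost."""
    time.sleep(0.05)
    return (config["x"] - 85000) ** 2 - config["x"] / config["y"]


def evaluate_concurrently(configs, n_concurrent_trials: int):
    """Evaluate the configs with at most n_concurrent_trials running at once."""
    with ThreadPoolExecutor(max_workers=n_concurrent_trials) as pool:
        return list(pool.map(evaluate_config, configs))


configs = [{"x": x, "y": 10} for x in (1000, 50000, 85000, 99000)]
scores = evaluate_concurrently(configs, n_concurrent_trials=4)

# Pick the configuration with the lowest score (mode="min").
best_score, best_config = min(zip(scores, configs), key=lambda pair: pair[0])
print(best_config, best_score)
```

With four workers, the four simulated evaluations overlap instead of running back to back, which is the same reason concurrent trials shorten the tuning time on a cluster.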

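Visualize results

The flaml.visualization module provides utility functions for plotting the optimization process with Plotly. With Plotly, users can interactively explore their AutoML experiment results. To use these plotting functions, provide your optimized flaml.AutoML or flaml.tune.tune.ExperimentAnalysis object as input.

You can use the following functions within your notebook:

plot_optimization_history: Plot the optimization history of all trials in the experiment.
plot_feature_importance: Plot the importance of each feature in the dataset.
plot_parallel_coordinate: Plot the high-dimensional parameter relationships in the experiment.
plot_contour: Plot the parameter relationship as a contour plot in the experiment.
plot_edf: Plot the objective value EDF (empirical distribution function) of the experiment.
plot_timeline: Plot the timeline of the experiment.
plot_slice: Plot the parameter relationship as a slice plot in the experiment.
plot_param_importance: Plot the hyperparameter importance of the experiment.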
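Two of the plots listed above are simple transformations of the per-trial objective values. The sketch below computes the data behind an optimization history (the best value so far) and an empirical distribution function in plain Python. It does not call flaml.visualization. The helper names and the scores are made up for illustration.

```python
from itertools import accumulate


def best_so_far(scores, mode="min"):
    """Running best objective value, in trial order."""
    return list(accumulate(scores, min if mode == "min" else max))


def empirical_distribution(scores):
    """Pairs (value, fraction of trials with an objective <= value)."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


# Made-up objective values from four trials, in the order they were run.
scores = [3.0, 1.0, 2.0, 4.0]
history = best_so_far(scores)
edf = empirical_distribution(scores)
print(history)
print(edf)
```

A history curve that flattens early suggests that later trials stopped improving the objective. An EDF that rises steeply at low values suggests that many trials reached good objective values.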
