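Progressive rollout of MLflow models to online endpoints

Article
2024-09-03

This article shows how to progressively update and deploy MLflow models to online endpoints without causing service disruption. You use blue-green deployment, also known as a safe rollout strategy, to introduce a new version of a web service to production. With this strategy, you can roll out the new version to a small subset of users or requests before rolling it out completely.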

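About this example

Online endpoints have the concepts of endpoint and deployment. An endpoint represents the API that customers use to consume the model, whereas a deployment is a specific implementation of that API. This distinction lets you decouple the API from the implementation and change the underlying implementation without affecting the consumer. This example uses these concepts to update the deployed model in an endpoint without introducing service disruption.

The model to deploy is based on the UCI Heart Disease Data Set. The database contains 76 attributes, but this example uses a subset of 14 of them. The model tries to predict the presence of heart disease in a patient. The prediction is an integer: 0 (no presence) or 1 (presence). The model was trained with an XGBoost classifier, and all the required preprocessing is packaged as a scikit-learn pipeline, which makes this model an end-to-end pipeline that goes from raw data to predictions.

The information in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy or paste files, clone the repository, and then change directories to sdk/using-mlflow/deploy.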

Follow along in Jupyter Notebooks

You can follow along with this example in the following notebook. In the cloned repository, open the notebook mlflow_sdk_online_endpoints_progresive.ipynb.

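Prerequisites

Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.
Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the Owner or Contributor role for the Azure Machine Learning workspace, or a custom role that allows Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more information, see Manage access to an Azure Machine Learning workspace.

Additionally, you need to:

Install the Azure CLI and the ml extension to the Azure CLI. For more information, see Install, set up, and use the CLI (v2).

Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for MLflow, azureml-mlflow.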
```
pip install mlflow azureml-mlflow
```
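If you aren't running in Azure Machine Learning compute, configure the MLflow tracking URI or the MLflow registry URI to point to the workspace you're working on. Learn how to configure MLflow for Azure Machine Learning.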

Connect to your workspace

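First, connect to the Azure Machine Learning workspace where you'll work. Where a step differs by tool, the snippets in this article appear in the same order: Azure CLI, then the Azure Machine Learning SDK for Python, then the MLflow SDK.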

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

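The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section, you connect to the workspace in which you perform deployment tasks.

Import the required libraries: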

from azure.ai.ml import MLClient, Input
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

Configure workspace details and get a handle to the workspace:

subscription_id = "<subscription>"
resource_group = "<resource-group>"
workspace = "<workspace>"

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

Import the required libraries for the MLflow SDK:

import json
import mlflow
import requests
import pandas as pd
from mlflow.deployments import get_deploy_client

Configure the MLflow client and the deployment client:

mlflow_client = mlflow.MlflowClient()
deployment_client = get_deploy_client(mlflow.get_tracking_uri())

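Register the model in the registry

Ensure your model is registered in the Azure Machine Learning registry. Deployment of unregistered models isn't supported in Azure Machine Learning. You can register a new model by using the MLflow SDK: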

MODEL_NAME='heart-classifier'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path "model"

model_name = 'heart-classifier'
model_local_path = "model"

model = ml_client.models.create_or_update(
    Model(name=model_name, path=model_local_path, type=AssetTypes.MLFLOW_MODEL)
)

model_name = 'heart-classifier'
model_local_path = "model"

registered_model = mlflow_client.create_model_version(
    name=model_name, source=f"file://{model_local_path}"
)
version = registered_model.version

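Create an online endpoint

Online endpoints are endpoints that are used for online (real-time) inferencing. Online endpoints contain deployments that are ready to receive data from clients and can send responses back in real time.

This example takes advantage of that capability by deploying multiple versions of the same model under the same endpoint. The new deployment receives 0% of the traffic at first. Once you're sure that the new model works correctly, you progressively move traffic from one deployment to the other.

Endpoints require a name, which needs to be unique in the same region. Make sure to create a name that doesn't already exist: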

ENDPOINT_SUFFIX=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w ${1:-5} | head -n 1)
ENDPOINT_NAME="heart-classifier-$ENDPOINT_SUFFIX"

import random
import string

# Creating a unique endpoint name by including a random suffix
allowed_chars = string.ascii_lowercase + string.digits
endpoint_suffix = "".join(random.choice(allowed_chars) for x in range(5))
endpoint_name = "heart-classifier-" + endpoint_suffix

print(f"Endpoint name: {endpoint_name}")

Configure the endpoint

endpoint.yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: heart-classifier-edp
auth_mode: key

endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="An endpoint to serve predictions of the UCI heart disease problem",
    auth_mode="key",
)

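You can configure the properties of this endpoint by using a configuration file. The following example sets the authentication mode of the endpoint to "key":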

endpoint_config = {
    "auth_mode": "key",
    "identity": {
        "type": "system_assigned"
    }
}

Write this configuration to a JSON file:

endpoint_config_path = "endpoint_config.json"
with open(endpoint_config_path, "w") as outfile:
    outfile.write(json.dumps(endpoint_config))

Create the endpoint:

az ml online-endpoint create -n $ENDPOINT_NAME -f endpoint.yml

ml_client.online_endpoints.begin_create_or_update(endpoint).result()

endpoint = deployment_client.create_endpoint(
    name=endpoint_name,
    config={"endpoint-config-file": endpoint_config_path},
)

Get the authentication secret for the endpoint.
```
ENDPOINT_SECRET_KEY=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME | jq -r ".primaryKey")
```
```
# The endpoint uses key authentication, so read the primary key.
endpoint_secret_key = ml_client.online_endpoints.get_keys(
    name=endpoint_name
).primary_key
```
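This functionality isn't currently available in the MLflow SDK. Go to Azure Machine Learning studio, navigate to the endpoint, and retrieve the secret key from there.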

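Create a blue deployment

So far, the endpoint is empty. There are no deployments on it. Create the first deployment by deploying the same model that you registered earlier. This deployment is named "default" and represents the "blue deployment".

Configure the deployment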

blue-deployment.yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: default
endpoint_name: heart-classifier-edp
model: azureml:heart-classifier@latest
instance_type: Standard_DS2_v2
instance_count: 1

blue_deployment_name = "default"

Configure the hardware requirements of the deployment:

blue_deployment = ManagedOnlineDeployment(
    name=blue_deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    instance_type="Standard_DS2_v2",
    instance_count=1,
)

If your endpoint doesn't have egress connectivity, use model packaging (preview) by including the argument with_package=True:

blue_deployment = ManagedOnlineDeployment(
    name=blue_deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    instance_type="Standard_DS2_v2",
    instance_count=1,
    with_package=True,
)

blue_deployment_name = "default"

To configure the hardware requirements of the deployment, create a JSON file with the desired configuration:

deploy_config = {
    "instance_type": "Standard_DS2_v2",
    "instance_count": 1,
}

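Note

The full specification of this configuration is available in the managed online deployment schema (v2).

Write the configuration to a file: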

deployment_config_path = "deployment_config.json"
with open(deployment_config_path, "w") as outfile:
    outfile.write(json.dumps(deploy_config))

Create the deployment

az ml online-deployment create --endpoint-name $ENDPOINT_NAME -f blue-deployment.yml --all-traffic

If your endpoint doesn't have egress connectivity, use model packaging (preview) by including the flag --with-package:

az ml online-deployment create --with-package --endpoint-name $ENDPOINT_NAME -f blue-deployment.yml --all-traffic

Tip

The create command sets the flag --all-traffic, which assigns all the traffic to the new deployment.

ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

blue_deployment = deployment_client.create_deployment(
    name=blue_deployment_name,
    endpoint=endpoint_name,
    model_uri=f"models:/{model_name}/{version}",
    config={"deploy-config-file": deployment_config_path},
)

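Assign all the traffic to the deployment

So far, the endpoint has one deployment, but none of its traffic is assigned to it. Assign the traffic now.
This step isn't required in the Azure CLI, because --all-traffic was used during creation.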
```
endpoint.traffic = { blue_deployment_name: 100 }
```
```
traffic_config = {"traffic": {blue_deployment_name: 100}}
```
Write the configuration to a file:
```
traffic_config_path = "traffic_config.json"
with open(traffic_config_path, "w") as outfile:
    outfile.write(json.dumps(traffic_config))
```
Update the endpoint configuration:
This step isn't required in the Azure CLI, because --all-traffic was used during creation.
```
ml_client.begin_create_or_update(endpoint).result()
```
```
deployment_client.update_endpoint(
    endpoint=endpoint_name,
    config={"endpoint-config-file": traffic_config_path},
)
```
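
Before sending a traffic split to the service, it can help to check it locally. The following is a minimal sketch, not part of the Azure Machine Learning or MLflow SDKs: `validate_traffic` is a hypothetical helper, and it assumes that the service accepts a split only when the percentages are whole numbers that add up to 0 or 100.

```python
from __future__ import annotations


def validate_traffic(traffic: dict[str, int]) -> dict[str, int]:
    """Return the traffic split unchanged if it looks valid, else raise ValueError."""
    for name, percent in traffic.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(
                f"Traffic for {name!r} must be an integer from 0 to 100, got {percent!r}"
            )
    total = sum(traffic.values())
    # Assumption: a split is accepted only when it totals 0 or 100.
    if total not in (0, 100):
        raise ValueError(f"Traffic percentages add up to {total}; expected 0 or 100")
    return traffic


# Same shape as the traffic configuration used in this section.
traffic_config = {"traffic": validate_traffic({"default": 100})}
print(traffic_config)
```

A split that passes the check can then be assigned to `endpoint.traffic` or written to the traffic configuration file as shown in this section.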

Create a sample input to test the deployment

sample.json

{
    "input_data": {
        "columns": [
            "age",
            "sex",
            "cp",
            "trestbps",
            "chol",
            "fbs",
            "restecg",
            "thalach",
            "exang",
            "oldpeak",
            "slope",
            "ca",
            "thal"
        ],
        "data": [
            [ 48, 0, 3, 130, 275, 0, 0, 139, 0, 0.2, 1, 0, "normal" ]
        ]
    }
}
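
If you prefer to generate this request body in code without pandas, the structure is simple enough to assemble by hand. The sketch below uses only the standard library; `build_request` is a hypothetical helper, and the column names and row are the ones from the sample above.

```python
import json

columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]
rows = [[48, 0, 3, 130, 275, 0, 0, 139, 0, 0.2, 1, 0, "normal"]]


def build_request(columns, rows):
    """Wrap columns and rows in the "input_data" envelope shown in the sample."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"Expected {len(columns)} values per row, got {len(row)}")
    return {"input_data": {"columns": list(columns), "data": [list(r) for r in rows]}}


payload = build_request(columns, rows)
request_body = json.dumps(payload, indent=4)
print(request_body)
```

Saving `request_body` to a file named sample.json produces a request file equivalent to the sample above.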

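The following code samples 5 observations from the training dataset, removes the target column (because the model predicts it), and creates a request in the file sample.json that can be used with the model deployment.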

samples = (
    pd.read_csv("data/heart.csv")
    .sample(n=5)
    .drop(columns=["target"])
    .reset_index(drop=True)
)

with open("sample.json", "w") as f:
    f.write(
        json.dumps(
            {"input_data": json.loads(samples.to_json(orient="split", index=False))}
        )
    )

The following code samples 5 observations from the training dataset, removes the target column (because the model predicts it), and creates a request.

samples = (
    pd.read_csv("data/heart.csv")
    .sample(n=5)
    .drop(columns=["target"])
    .reset_index(drop=True)
)

Test the deployment

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file sample.json

ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    request_file="sample.json",
)

deployment_client.predict(
    endpoint=endpoint_name, 
    df=samples
)

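Create a green deployment under the endpoint

Imagine that the development team creates a new version of the model that's ready for production. You can first try this model out, and once you're confident in it, update the endpoint to route traffic to it.

Register a new model version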
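
One simple way to build that confidence, once both deployments can be invoked with the same sample, is to compare their predictions. The following is a minimal sketch rather than a recommended acceptance test: `agreement_rate` is a hypothetical helper, and the two prediction lists are made-up values used only for illustration.

```python
def agreement_rate(blue_predictions, green_predictions):
    """Return the fraction of positions where both deployments predict the same class."""
    if len(blue_predictions) != len(green_predictions):
        raise ValueError("Both deployments must be scored on the same inputs")
    if not blue_predictions:
        raise ValueError("At least one prediction is required")
    matches = sum(b == g for b, g in zip(blue_predictions, green_predictions))
    return matches / len(blue_predictions)


# Made-up predictions (0 = no presence, 1 = presence) for the same five inputs.
blue = [0, 1, 1, 0, 1]
green = [0, 1, 0, 0, 1]

rate = agreement_rate(blue, green)
print(f"Deployments agree on {rate:.0%} of the sample")
```

A low agreement rate isn't necessarily a problem, because the new model may simply be better, but it's a signal to look at the differing cases before routing live traffic.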

MODEL_NAME='heart-classifier'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path "model"

Get the version number of the new model:

VERSION=$(az ml model show -n heart-classifier --label latest | jq -r ".version")

model_name = 'heart-classifier'
model_local_path = "model"

model = ml_client.models.create_or_update(
    Model(name=model_name, path=model_local_path, type=AssetTypes.MLFLOW_MODEL)
)
version = model.version

model_name = 'heart-classifier'
model_local_path = "model"

registered_model = mlflow_client.create_model_version(
    name=model_name, source=f"file://{model_local_path}"
)
version = registered_model.version

Configure a new deployment

green-deployment.yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: xgboost-model
endpoint_name: heart-classifier-edp
model: azureml:heart-classifier@latest
instance_type: Standard_DS2_v2
instance_count: 1

Name the deployment as follows:

GREEN_DEPLOYMENT_NAME="xgboost-model-$VERSION"

green_deployment_name = f"xgboost-model-{version}"

Configure the hardware requirements of the deployment:

green_deployment = ManagedOnlineDeployment(
    name=green_deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    instance_type="Standard_DS2_v2",
    instance_count=1,
)

If your endpoint doesn't have egress connectivity, use model packaging (preview) by including the argument with_package=True:

green_deployment = ManagedOnlineDeployment(
    name=green_deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    instance_type="Standard_DS2_v2",
    instance_count=1,
    with_package=True,
)

green_deployment_name = f"xgboost-model-{version}"

To configure the hardware requirements of the deployment, create a JSON file with the desired configuration:

deploy_config = {
    "instance_type": "Standard_DS2_v2",
    "instance_count": 1,
}

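Tip

This example uses the same hardware configuration indicated in the deployment-config-file. However, there's no requirement to use the same configuration. You can configure different hardware for different models depending on the requirements.

Write the configuration to a file: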

deployment_config_path = "deployment_config.json"
with open(deployment_config_path, "w") as outfile:
    outfile.write(json.dumps(deploy_config))
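
Because each deployment can use its own hardware, it may be convenient to keep one configuration file per deployment instead of reusing deployment_config.json. The sketch below shows one way to do that. `write_deploy_config` is a hypothetical helper, and the deployment name "xgboost-model-2", the SKU Standard_DS3_v2, and the instance count of 2 are illustrative assumptions rather than recommendations.

```python
import json
import tempfile
from pathlib import Path


def write_deploy_config(directory, deployment_name, instance_type, instance_count=1):
    """Write a per-deployment hardware configuration file and return its path."""
    config = {"instance_type": instance_type, "instance_count": instance_count}
    path = Path(directory) / f"{deployment_name}_config.json"
    path.write_text(json.dumps(config))
    return path


# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as workdir:
    blue_path = write_deploy_config(workdir, "default", "Standard_DS2_v2")
    green_path = write_deploy_config(
        workdir, "xgboost-model-2", "Standard_DS3_v2", instance_count=2
    )
    print(blue_path.name, json.loads(blue_path.read_text()))
    print(green_path.name, json.loads(green_path.read_text()))
```

The returned path can be passed as the "deploy-config-file" value when you create the corresponding deployment.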

Create a new deployment

az ml online-deployment create -n $GREEN_DEPLOYMENT_NAME --endpoint-name $ENDPOINT_NAME -f green-deployment.yml

If your endpoint doesn't have egress connectivity, use model packaging (preview) by including the flag --with-package:

az ml online-deployment create --with-package -n $GREEN_DEPLOYMENT_NAME --endpoint-name $ENDPOINT_NAME -f green-deployment.yml

ml_client.online_deployments.begin_create_or_update(green_deployment).result()

new_deployment = deployment_client.create_deployment(
    name=green_deployment_name,
    endpoint=endpoint_name,
    model_uri=f"models:/{model_name}/{version}",
    config={"deploy-config-file": deployment_config_path},
)

Test the deployment without changing traffic

az ml online-endpoint invoke --name $ENDPOINT_NAME --deployment-name $GREEN_DEPLOYMENT_NAME --request-file sample.json

ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name=green_deployment_name,
    request_file="sample.json",
)

deployment_client.predict(
    endpoint=endpoint_name, 
    deployment_name=green_deployment_name, 
    df=samples
)

Tip

Notice how the name of the deployment to invoke is now specified.
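
The same targeting works when you call the endpoint over plain HTTP with the key retrieved earlier. The following sketch only builds the request and doesn't send it. It assumes key authentication (the key goes in a Bearer authorization header) and uses the azureml-model-deployment header to select a deployment. The URL, key, and deployment name "xgboost-model-2" are placeholders, and `build_scoring_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, key, payload, deployment_name=None):
    """Build (but don't send) a scoring request for an online endpoint."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    if deployment_name is not None:
        # Routes the call to one deployment regardless of the traffic split.
        headers["azureml-model-deployment"] = deployment_name
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


request = build_scoring_request(
    scoring_uri="https://example.invalid/score",  # placeholder for the endpoint's scoring URI
    key="<endpoint-key>",                         # placeholder for the endpoint secret key
    payload={"input_data": {"columns": ["age"], "data": [[48]]}},  # truncated sample
    deployment_name="xgboost-model-2",            # placeholder deployment name
)
print(request.get_method(), request.full_url)
print(sorted(name for name, _ in request.header_items()))
# To send the request against a real endpoint: urllib.request.urlopen(request)
```

Omitting `deployment_name` leaves the header out, so the endpoint routes the call according to its configured traffic.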

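Update the traffic progressively

Once you're confident in the new deployment, you can update the traffic to route part of it to the new deployment. Traffic is configured at the endpoint level:

Configure the traffic:

This step isn't required in the Azure CLI.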

endpoint.traffic = {blue_deployment_name: 90, green_deployment_name: 10}

traffic_config = {"traffic": {blue_deployment_name: 90, green_deployment_name: 10}}

Write the configuration to a file:

traffic_config_path = "traffic_config.json"
with open(traffic_config_path, "w") as outfile:
    outfile.write(json.dumps(traffic_config))

Update the endpoint:

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "default=90 $GREEN_DEPLOYMENT_NAME=10"

ml_client.begin_create_or_update(endpoint).result()

deployment_client.update_endpoint(
    endpoint=endpoint_name,
    config={"endpoint-config-file": traffic_config_path},
)
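
The 90/10 split is only the first step of a progressive rollout. A rollout plan can be sketched as a list of traffic splits that you apply one at a time, checking the new deployment between steps. In this sketch, `ramp_schedule` is a hypothetical helper, and the step percentages and the deployment name "xgboost-model-2" are illustrative assumptions.

```python
def ramp_schedule(blue, green, steps=(10, 25, 50, 100)):
    """Return traffic splits that move load from blue to green step by step."""
    steps = list(steps)
    increasing = all(a < b for a, b in zip(steps, steps[1:]))
    in_range = all(0 < s <= 100 for s in steps)
    if not (increasing and in_range):
        raise ValueError("steps must be strictly increasing percentages in (0, 100]")
    return [{blue: 100 - s, green: s} for s in steps]


for traffic in ramp_schedule("default", "xgboost-model-2"):
    # At each step: assign this split to the endpoint, update the endpoint,
    # and watch the new deployment before moving on.
    print(traffic)
```

To roll back at any point, apply the previous split again, or return all traffic to the blue deployment.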

If you decide to switch all the traffic to the new deployment, update the traffic accordingly:

This step isn't required in the Azure CLI.

endpoint.traffic = {blue_deployment_name: 0, green_deployment_name: 100}

traffic_config = {"traffic": {blue_deployment_name: 0, green_deployment_name: 100}}

Write the configuration to a file:

traffic_config_path = "traffic_config.json"
with open(traffic_config_path, "w") as outfile:
    outfile.write(json.dumps(traffic_config))

Update the endpoint:

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "default=0 $GREEN_DEPLOYMENT_NAME=100"

ml_client.begin_create_or_update(endpoint).result()

deployment_client.update_endpoint(
    endpoint=endpoint_name,
    config={"endpoint-config-file": traffic_config_path},
)

Because the old deployment no longer receives any traffic, you can safely delete it:

az ml online-deployment delete --endpoint-name $ENDPOINT_NAME --name default

ml_client.online_deployments.begin_delete(
    name=blue_deployment_name, 
    endpoint_name=endpoint_name
)

deployment_client.delete_deployment(
    blue_deployment_name, 
    endpoint=endpoint_name
)

Tip

At this point, the former "blue deployment" has been deleted, and the new "green deployment" has taken its place.

Clean up resources
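
In an automated rollout, it may be worth guarding the delete step with an explicit check that the deployment really receives no traffic. The sketch below is one way to express that check. `safe_to_delete` is a hypothetical helper, and the traffic dictionary mirrors the final split used in this article with "xgboost-model-2" as a placeholder deployment name.

```python
def safe_to_delete(traffic, deployment_name):
    """Return True when the deployment receives no traffic in the given split."""
    # A deployment that's missing from the split is treated as receiving 0%.
    return traffic.get(deployment_name, 0) == 0


traffic = {"default": 0, "xgboost-model-2": 100}

for name in traffic:
    verdict = "can be deleted" if safe_to_delete(traffic, name) else "still serves traffic"
    print(f"{name}: {verdict}")
```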

az ml online-endpoint delete --name $ENDPOINT_NAME --yes

ml_client.online_endpoints.begin_delete(name=endpoint_name)

deployment_client.delete_endpoint(endpoint_name)

Important

Deleting the endpoint also deletes all the deployments under it.
