Optimized large language model (LLM) serving

Article
12/30/2024

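Important

This feature is in Public Preview.

Important

The code examples in this guide use deprecated APIs. Databricks recommends using the provisioned throughput experience for optimized LLM inference. See Migrate optimized LLM serving endpoints to provisioned throughput.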

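This article demonstrates how to enable optimizations for large language models (LLMs) on Mosaic AI Model Serving.

Compared to traditional serving approaches, optimized LLM serving improves both throughput and latency by a factor of 3-5. The following table summarizes the supported LLM families and their variants.

Databricks recommends installing foundation models using Databricks Marketplace. Search for a model family, then select Get access on the model page and provide your login credentials to install the model to Unity Catalog.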

Model family	Install from Marketplace
Llama 2	Llama 2 Models
MPT
Mistral	Mistral models

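Requirements

Optimized LLM serving is supported as part of the Public Preview of GPU deployments.
Your model must be logged using MLflow 2.4 and above or Databricks Runtime 13.2 ML and above.
When deploying models, you must match the model's parameter size with the appropriate compute size. For models with 50 billion or more parameters, contact your Azure Databricks account team to request access to the necessary GPUs.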

Model parameter size	Recommended compute size	Workload type
7 billion	1xA100	`GPU_LARGE`
13 billion	1xA100	`GPU_LARGE`
30-34 billion	1xA100	`GPU_LARGE`
70 billion	2xA100	`GPU_LARGE_2`
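
The table above can be encoded as a small lookup so that a deployment script picks the workload type from the model size. The following is a minimal sketch, not part of any Databricks API: the `RECOMMENDED_COMPUTE` mapping and the `recommended_compute` helper are hypothetical names, and sizes that the table does not list raise an error instead of being interpolated.

```python
# Mapping transcribed from the compute-size table above.
# Keys are (low, high) parameter-count ranges in billions, inclusive.
RECOMMENDED_COMPUTE = {
    (7, 7): ("1xA100", "GPU_LARGE"),
    (13, 13): ("1xA100", "GPU_LARGE"),
    (30, 34): ("1xA100", "GPU_LARGE"),
    (70, 70): ("2xA100", "GPU_LARGE_2"),
}


def recommended_compute(params_billions):
    """Return (compute size, workload type) for a model size listed in the table.

    Hypothetical helper: unlisted sizes raise ValueError because the table
    gives no guidance for them.
    """
    for (low, high), compute in RECOMMENDED_COMPUTE.items():
        if low <= params_billions <= high:
            return compute
    raise ValueError(
        f"No recommendation is listed for {params_billions} billion parameters"
    )


compute_size, workload_type = recommended_compute(13)
print(compute_size, workload_type)
```

The returned workload type can be passed as the `workload_type` value when the serving endpoint is created.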

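Log your large language model

First, log your model with the MLflow transformers flavor and specify the task field in the MLflow metadata with metadata = {"task": "llm/v1/completions"}. This specifies the API signature used for the model serving endpoint.

Optimized LLM serving is compatible with the route types supported by Azure Databricks AI Gateway; currently, that is llm/v1/completions. If you want to serve a model family or task type that is not supported, contact your Azure Databricks account team.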

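import mlflow
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model weights (in bfloat16) and the tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b-instruct", torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")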
with mlflow.start_run():
    components = {
        "model": model,
        "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        artifact_path="model",
        transformers_model=components,
        input_example=["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Apache Spark?\n\n### Response:\n"],
        metadata={"task": "llm/v1/completions"},
        registered_model_name='mpt'
    )
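
Because the task field determines the endpoint's API signature, it can help to check the metadata dictionary before logging. The following is a minimal sketch under that assumption: `SUPPORTED_TASKS` and `check_task_metadata` are hypothetical names, not MLflow functions, and the set contains only the route type named in this article.

```python
# Route type named in this article as currently supported.
SUPPORTED_TASKS = {"llm/v1/completions"}


def check_task_metadata(metadata):
    """Return the task string if it is a supported route type, else raise ValueError."""
    task = metadata.get("task")
    if task is None:
        raise ValueError('metadata must contain a "task" field')
    if task not in SUPPORTED_TASKS:
        raise ValueError(
            f"unsupported task {task!r}; expected one of {sorted(SUPPORTED_TASKS)}"
        )
    return task


metadata = {"task": "llm/v1/completions"}
print(check_task_metadata(metadata))
```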

After your model is logged, you can register it in Unity Catalog as follows, replacing CATALOG.SCHEMA.MODEL_NAME with the three-level name of the model.

mlflow.set_registry_uri("databricks-uc")

registered_model_name = "CATALOG.SCHEMA.MODEL_NAME"
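
The three-level name is the catalog, schema, and model name joined by periods. The following is a minimal sketch of assembling it; `uc_model_name` is a hypothetical helper, and the values `main`, `default`, and `mpt` are placeholders rather than names taken from this article. The check only guarantees that the joined string has exactly three levels.

```python
def uc_model_name(catalog, schema, model):
    """Join catalog, schema, and model into a three-level name.

    Rejects empty parts and parts that contain a period, since either would
    produce a name that does not have exactly three levels.
    """
    parts = (catalog, schema, model)
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return ".".join(parts)


# Placeholder values for illustration only.
registered_model_name = uc_model_name("main", "default", "mpt")
print(registered_model_name)
```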

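Create your model serving endpoint

Next, create your model serving endpoint. If your model is supported by optimized LLM serving, Azure Databricks automatically creates an optimized model serving endpoint when you try to serve it.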

import requests
import json

# Set the name of the MLflow endpoint
endpoint_name = "llama2-3b-chat"

# Name of the registered MLflow model
model_name = "ml.llm-catalog.llama-13b"

# Version of the registered MLflow model to serve
model_version = 3

# Specify the type of compute (CPU, GPU_SMALL, GPU_LARGE, etc.)
workload_type = "GPU_LARGE"

# Specify the scale-out size of compute (Small, Medium, Large, etc.)
workload_size = "Small"

# Specify Scale to Zero (only supported for CPU endpoints)
scale_to_zero = False

# Get the API endpoint and token for the current notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

# send the POST request to create the serving endpoint

data = {
    "name": endpoint_name,
    "config": {
        "served_models": [
            {
                "model_name": model_name,
                "model_version": model_version,
                "workload_size": workload_size,
                "scale_to_zero_enabled": scale_to_zero,
                "workload_type": workload_type,
            }
        ]
    },
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers
)

print(json.dumps(response.json(), indent=4))
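
When several endpoints are created from the same script, the request body above can be wrapped in a function so it can be inspected before it is sent. The following is a minimal sketch: `build_endpoint_request` is a hypothetical helper, it only reproduces the fields already used in the request above, and its default values are the ones from that example.

```python
import json


def build_endpoint_request(endpoint_name, model_name, model_version,
                           workload_type="GPU_LARGE", workload_size="Small",
                           scale_to_zero=False):
    """Build the JSON body for the serving-endpoint creation request shown above."""
    return {
        "name": endpoint_name,
        "config": {
            "served_models": [
                {
                    "model_name": model_name,
                    "model_version": model_version,
                    "workload_size": workload_size,
                    "scale_to_zero_enabled": scale_to_zero,
                    "workload_type": workload_type,
                }
            ]
        },
    }


# Same values as the example above.
body = build_endpoint_request("llama2-3b-chat", "ml.llm-catalog.llama-13b", 3)
print(json.dumps(body, indent=4))
```

The resulting dictionary can be passed as the `json` argument of `requests.post`.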

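Input and output schema format

An optimized LLM serving endpoint has input and output schemas that Azure Databricks controls. Four different formats are supported.

dataframe_split is a JSON-serialized Pandas DataFrame in the split orientation.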

{
  "dataframe_split": {
    "columns": ["prompt"],
    "index": [0],
    "data": [
      [
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
      ]
    ]
  },
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}

dataframe_records is a JSON-serialized Pandas DataFrame in the records orientation.

{
  "dataframe_records": [
    {
      "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
    }
  ],
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}

instances

{
  "instances": [
    {
      "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
    }
  ],
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}

inputs

{
  "inputs": {
    "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
  },
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}
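
The four formats carry the same prompts and parameters in different envelopes. The following is a minimal sketch that builds each one from a list of prompts; `build_request` is a hypothetical helper, and for the inputs format it assumes that a list of prompts is accepted under the prompt key.

```python
def build_request(prompts, fmt="inputs", **params):
    """Wrap prompts in one of the four request formats described above."""
    prompts = list(prompts)
    if fmt == "dataframe_split":
        payload = {
            "dataframe_split": {
                "columns": ["prompt"],
                "index": list(range(len(prompts))),
                "data": [[p] for p in prompts],
            }
        }
    elif fmt == "dataframe_records":
        payload = {"dataframe_records": [{"prompt": p} for p in prompts]}
    elif fmt == "instances":
        payload = {"instances": [{"prompt": p} for p in prompts]}
    elif fmt == "inputs":
        # Assumption: a list of prompts is accepted under "prompt".
        payload = {"inputs": {"prompt": prompts}}
    else:
        raise ValueError(f"unknown format: {fmt!r}")
    if params:
        payload["params"] = params
    return payload


request = build_request(
    ["What is Apache Spark?"], fmt="dataframe_split", temperature=0.5, max_tokens=100
)
print(request)
```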

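Query your endpoint

After your endpoint is ready, you can query it by making an API request. Depending on the size and complexity of the model, it can take 30 minutes or more for the endpoint to become ready.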


data = {
    "inputs": {
        "prompt": [
            "Hello, I'm a language model,"
        ]
    },
    "params": {
        "max_tokens": 100,
        "temperature": 0.0
    }
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/serving-endpoints/{endpoint_name}/invocations", json=data, headers=headers
)

print(json.dumps(response.json()))
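
Because an endpoint can take 30 minutes or more to become ready, a script may want to poll before it sends the first query. The following is a minimal sketch: `wait_until_ready` and `make_fetch_state` are hypothetical helpers, and `make_fetch_state` assumes that a GET request to `/api/2.0/serving-endpoints/{endpoint_name}` returns JSON with a `state.ready` field whose value is `READY` once the endpoint can serve traffic.

```python
import time


def wait_until_ready(fetch_state, timeout_s=3600, poll_interval_s=30, sleep=time.sleep):
    """Poll fetch_state() until it returns "READY" or timeout_s has been waited.

    fetch_state is a zero-argument callable that returns the readiness string.
    Returns True if the endpoint became ready, False on timeout.
    """
    waited = 0
    while True:
        if fetch_state() == "READY":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s


def make_fetch_state(api_root, api_token, endpoint_name):
    """Build a fetch_state callable. The response shape is an assumption (see above)."""
    import requests  # imported lazily so wait_until_ready has no hard dependency

    def fetch_state():
        response = requests.get(
            url=f"{api_root}/api/2.0/serving-endpoints/{endpoint_name}",
            headers={"Authorization": f"Bearer {api_token}"},
        )
        response.raise_for_status()
        return response.json().get("state", {}).get("ready")

    return fetch_state
```

In a notebook this could be called as `wait_until_ready(make_fetch_state(API_ROOT, API_TOKEN, endpoint_name))` before sending the query above.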

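Limitations

- Because models served on GPU have increased installation requirements, container image creation for GPU serving takes longer than image creation for CPU serving.
  - Model size also impacts image creation. For example, models with 30 billion parameters or more can take at least an hour to build.
  - Databricks reuses the same container the next time the same version of the model is deployed, so subsequent deployments take less time.
- Autoscaling for GPU serving takes longer than for CPU serving, because models served on GPU compute have a longer setup time. Databricks recommends over-provisioning to avoid requests timing out.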
