Deploy language models in batch endpoints
APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
Batch Endpoints can be used to deploy expensive models, like language models, over text data. In this tutorial, you learn how to deploy a model that can perform text summarization of long sequences of text using a model from HuggingFace. It also shows how to do inference optimization using HuggingFace optimum
and accelerate
libraries.
About this sample
The model we are going to work with was built using the popular library transformers from HuggingFace along with a pre-trained model from Facebook with the BART architecture. It was introduced in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation. This model has the following constraints, which are important to keep in mind for deployment:
- It can work with sequences up to 1024 tokens.
- It is trained for summarization of text in English.
- We are going to use Torch as a backend.
The example in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy or paste YAML and other files, use the following commands to clone the repository and go to the folder for your coding language:
git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli
The files for this example are in:
cd endpoints/batch/deploy-models/huggingface-text-summarization
Follow along in Jupyter Notebooks
You can follow along this sample in a Jupyter Notebook. In the cloned repository, open the notebook: text-summarization-batch.ipynb.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
An Azure Machine Learning workspace. To create a workspace, see Manage Azure Machine Learning workspaces.
The following permissions in the Azure Machine Learning workspace:
- For creating or managing batch endpoints and deployments: Use an Owner, Contributor, or custom role that has been assigned the
Microsoft.MachineLearningServices/workspaces/batchEndpoints/*
permissions. - For creating Azure Resource Manager deployments in the workspace resource group: Use an Owner, Contributor, or custom role that has been assigned the
Microsoft.Resources/deployments/write
permission in the resource group where the workspace is deployed.
- For creating or managing batch endpoints and deployments: Use an Owner, Contributor, or custom role that has been assigned the
The Azure Machine Learning CLI or the Azure Machine Learning SDK for Python:
Run the following command to install the Azure CLI and the
ml
extension for Azure Machine Learning:az extension add -n ml
Pipeline component deployments for batch endpoints are introduced in version 2.7 of the
ml
extension for the Azure CLI. Use theaz extension update --name ml
command to get the latest version.
Connect to your workspace
The workspace is the top-level resource for Azure Machine Learning. It provides a centralized place to work with all artifacts you create when you use Azure Machine Learning. In this section, you connect to the workspace where you perform your deployment tasks.
In the following command, enter your subscription ID, workspace name, resource group name, and location:
az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>
Registering the model
Due to the size of the model, it hasn't been included in this repository. Instead, you can download a copy from the HuggingFace model's hub. You need the packages transformers
and torch
installed in the environment you are using.
%pip install transformers torch
Use the following code to download the model to a folder model
:
from transformers import pipeline
model = pipeline("summarization", model="facebook/bart-large-cnn")
model_local_path = 'model'
summarizer.save_pretrained(model_local_path)
We can now register this model in the Azure Machine Learning registry:
MODEL_NAME='bart-text-summarization'
az ml model create --name $MODEL_NAME --path "model"
Creating the endpoint
We are going to create a batch endpoint named text-summarization-batch
where to deploy the HuggingFace model to run text summarization on text files in English.
Decide on the name of the endpoint. The name of the endpoint ends-up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name
mybatchendpoint
inwestus2
.Configure your batch endpoint
The following YAML file defines a batch endpoint:
endpoint.yml
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json name: text-summarization-batch description: A batch endpoint for summarizing text using a HuggingFace transformer model. auth_mode: aad_token
Create the endpoint:
Creating the deployment
Let's create the deployment that hosts the model:
We need to create a scoring script that can read the CSV files provided by the batch deployment and return the scores of the model with the summary. The following script performs these actions:
- Indicates an
init
function that detects the hardware configuration (CPU vs GPU) and loads the model accordingly. Both the model and the tokenizer are loaded in global variables. We are not using apipeline
object from HuggingFace to account for the limitation in the sequence lenghs of the model we are currently using. - Notice that we are doing performing model optimizations to improve the performance using
optimum
andaccelerate
libraries. If the model or hardware doesn't support it, we will run the deployment without such optimizations. - Indicates a
run
function that is executed for each mini-batch the batch deployment provides. - The
run
function read the entire batch using thedatasets
library. The text we need to summarize is on the columntext
. - The
run
method iterates over each of the rows of the text and run the prediction. Since this is a very expensive model, running the prediction over entire files will result in an out-of-memory exception. Notice that the model is not execute with thepipeline
object fromtransformers
. This is done to account for long sequences of text and the limitation of 1024 tokens in the underlying model we are using. - It returns the summary of the provided text.
code/batch_driver.py
import os import time import torch import subprocess import mlflow from pprint import pprint from transformers import AutoTokenizer, BartForConditionalGeneration from optimum.bettertransformer import BetterTransformer from datasets import load_dataset def init(): global model global tokenizer global device cuda_available = torch.cuda.is_available() device = "cuda" if cuda_available else "cpu" if cuda_available: print(f"[INFO] CUDA version: {torch.version.cuda}") print(f"[INFO] ID of current CUDA device: {torch.cuda.current_device()}") print("[INFO] nvidia-smi output:") pprint( subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE).stdout.decode( "utf-8" ) ) else: print( "[WARN] CUDA acceleration is not available. This model takes hours to run on medium size data." ) # AZUREML_MODEL_DIR is an environment variable created during deployment model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model") # load the tokenizer tokenizer = AutoTokenizer.from_pretrained( model_path, truncation=True, max_length=1024 ) # Load the model try: model = BartForConditionalGeneration.from_pretrained( model_path, device_map="auto" ) except Exception as e: print( f"[ERROR] Error happened when loading the model on GPU or the default device. Error: {e}" ) print("[INFO] Trying on CPU.") model = BartForConditionalGeneration.from_pretrained(model_path) device = "cpu" # Optimize the model if device != "cpu": try: model = BetterTransformer.transform(model, keep_original_model=False) print("[INFO] BetterTransformer loaded.") except Exception as e: print( f"[ERROR] Error when converting to BetterTransformer. An unoptimized version of the model will be used.\n\t> {e}" ) mlflow.log_param("device", device) mlflow.log_param("model", type(model).__name__) def run(mini_batch): resultList = [] print(f"[INFO] Reading new mini-batch of {len(mini_batch)} file(s).") ds = load_dataset("csv", data_files={"score": mini_batch}) start_time = time.perf_counter() for idx, text in enumerate(ds["score"]["text"]): # perform inference inputs = tokenizer.batch_encode_plus( [text], truncation=True, padding=True, max_length=1024, return_tensors="pt" ) input_ids = inputs["input_ids"].to(device) summary_ids = model.generate( input_ids, max_length=130, min_length=30, do_sample=False ) summaries = tokenizer.batch_decode( summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False ) # Get results: resultList.append(summaries[0]) rps = idx / (time.perf_counter() - start_time + 00000.1) print("Rows per second:", rps) mlflow.log_metric("rows_per_second", rps) return resultList
Tip
Although files are provided in mini-batches by the deployment, this scoring script processes one row at a time. This is a common pattern when dealing with expensive models (like transformers) as trying to load the entire batch and send it to the model at once may result in high-memory pressure on the batch executor (OOM exeptions).
- Indicates an
We need to indicate over which environment we are going to run the deployment. In our case, our model runs on
Torch
and it requires the librariestransformers
,accelerate
, andoptimum
from HuggingFace. Azure Machine Learning already has an environment with Torch and GPU support available. We are just going to add a couple of dependencies in aconda.yaml
file.environment/torch200-conda.yaml
name: huggingface-env channels: - conda-forge dependencies: - python=3.8.5 - pip - pip: - torch==2.0 - transformers - accelerate - optimum - datasets - mlflow - azureml-mlflow - azureml-core - azureml-dataset-runtime[fuse]
We can use the conda file mentioned before as follows:
The environment definition is included in the deployment file.
deployment.yml
compute: azureml:gpu-cluster environment: name: torch200-transformers-gpu image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest
Important
The environment
torch200-transformers-gpu
we've created requires a CUDA 11.8 compatible hardware device to run Torch 2.0 and Ubuntu 20.04. If your GPU device doesn't support this version of CUDA, you can check the alternativetorch113-conda.yaml
conda environment (also available on the repository), which runs Torch 1.3 over Ubuntu 18.04 with CUDA 10.1. However, acceleration using theoptimum
andaccelerate
libraries won't be supported on this configuration.Each deployment runs on compute clusters. They support both Azure Machine Learning Compute clusters (AmlCompute) or Kubernetes clusters. In this example, our model can benefit from GPU acceleration, which is why we use a GPU cluster.
az ml compute create -n gpu-cluster --type amlcompute --size STANDARD_NV6 --min-instances 0 --max-instances 2
Note
You are not charged for compute at this point as the cluster remains at 0 nodes until a batch endpoint is invoked and a batch scoring job is submitted. Learn more about manage and optimize cost for AmlCompute.
Now, let's create the deployment.
To create a new deployment under the created endpoint, create a
YAML
configuration like the following. You can check the full batch endpoint YAML schema for extra properties.deployment.yml
$schema: https://azuremlschemas.azureedge.net/latest/modelBatchDeployment.schema.json endpoint_name: text-summarization-batch name: text-summarization-optimum description: A text summarization deployment implemented with HuggingFace and BART architecture with GPU optimization using Optimum. type: model model: azureml:bart-text-summarization@latest compute: azureml:gpu-cluster environment: name: torch200-transformers-gpu image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest conda_file: environment/torch200-conda.yaml code_configuration: code: code scoring_script: batch_driver.py resources: instance_count: 2 settings: max_concurrency_per_instance: 1 mini_batch_size: 1 output_action: append_row output_file_name: predictions.csv retry_settings: max_retries: 1 timeout: 3000 error_threshold: -1 logging_level: info
Then, create the deployment with the following command:
az ml batch-deployment create --file deployment.yml --endpoint-name $ENDPOINT_NAME --set-default
Important
You will notice in this deployment a high value in
timeout
in the parameterretry_settings
. The reason for it is due to the nature of the model we are running. This is a very expensive model and inference on a single row may take up to 60 seconds. Thetimeout
parameters controls how much time the Batch Deployment should wait for the scoring script to finish processing each mini-batch. Since our model runs predictions row by row, processing a long file may take time. Also notice that the number of files per batch is set to 1 (mini_batch_size=1
). This is again related to the nature of the work we are doing. Processing one file at a time per batch is expensive enough to justify it. You will notice this being a pattern in NLP processing.Although you can invoke a specific deployment inside of an endpoint, you usually want to invoke the endpoint itself and let the endpoint decide which deployment to use. Such deployment is named the "default" deployment. This gives you the possibility of changing the default deployment and hence changing the model serving the deployment without changing the contract with the user invoking the endpoint. Use the following instruction to update the default deployment:
At this point, our batch endpoint is ready to be used.
Testing out the deployment
For testing our endpoint, we are going to use a sample of the dataset BillSum: A Corpus for Automatic Summarization of US Legislation. This sample is included in the repository in the folder data
. Notice that the format of the data is CSV and the content to be summarized is under the column text
, as expected by the model.
Let's invoke the endpoint:
JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input data --input-type uri_folder --query name -o tsv)
Note
The utility
jq
may not be installed on every installation. You can get instructions in this link.Tip
Notice that by indicating a local path as an input, the data is uploaded to Azure Machine Learning default's storage account.
A batch job is started as soon as the command returns. You can monitor the status of the job until it finishes:
Once the deployment is finished, we can download the predictions:
Considerations when deploying models that process text
As mentioned in some of the notes along this tutorial, processing text may have some peculiarities that require specific configuration for batch deployments. Take the following consideration when designing the batch deployment:
- Some NLP models may be very expensive in terms of memory and compute time. If this is the case, consider decreasing the number of files included on each mini-batch. In the example above, the number was taken to the minimum, 1 file per batch. While this may not be your case, take into consideration how many files your model can score at each time. Have in mind that the relationship between the size of the input and the memory footprint of your model may not be linear for deep learning models.
- If your model can't even handle one file at a time (like in this example), consider reading the input data in rows/chunks. Implement batching at the row level if you need to achieve higher throughput or hardware utilization.
- Set the
timeout
value of your deployment accordly to how expensive your model is and how much data you expect to process. Remember that thetimeout
indicates the time the batch deployment would wait for your scoring script to run for a given batch. If your batch have many files or files with many rows, this impacts the right value of this parameter.
Considerations for MLflow models that process text
The same considerations mentioned above apply to MLflow models. However, since you are not required to provide a scoring script for your MLflow model deployment, some of the recommendations mentioned may require a different approach.
- MLflow models in Batch Endpoints support reading tabular data as input data, which may contain long sequences of text. See File's types support for details about which file types are supported.
- Batch deployments calls your MLflow model's predict function with the content of an entire file in as Pandas dataframe. If your input data contains many rows, chances are that running a complex model (like the one presented in this tutorial) results in an out-of-memory exception. If this is your case, you can consider:
- Customize how your model runs predictions and implement batching. To learn how to customize MLflow model's inference, see Logging custom models.
- Author a scoring script and load your model using
mlflow.<flavor>.load_model()
. See Using MLflow models with a scoring script for details.