Perform batch LLM inference using AI Functions
Important
This feature is in Public Preview.
This article describes how to perform batch inference using AI Functions.
You can run batch inference using task-specific AI functions or the general-purpose function ai_query. The examples in this article focus on the flexibility of ai_query and how to use it in batch inference pipelines and workflows.
There are two main ways to use ai_query for batch inference:
- Batch inference using ai_query and Databricks-hosted foundation models: When you use this method, Databricks configures a model serving endpoint that scales automatically based on the workload. See which pre-provisioned LLMs are supported.
- Batch inference using ai_query and a model serving endpoint you configure yourself: This method is required for batch inference workflows that use foundation models hosted outside of Databricks, fine-tuned foundation models, or traditional ML models. After deployment, the endpoint can be used directly with ai_query. See Batch inference using custom models or fine-tuned foundation models.
Requirements
- A workspace in a Foundation Model APIs supported region.
- Query permission on the Delta table in Unity Catalog that contains the data you want to use.
- Set pipelines.channel in the table properties to 'preview' to use ai_query(). See Requirements for an example query, and the sketch below for setting this property.
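To set this table property, run an ALTER TABLE statement on the Delta table that contains your input data. A minimal sketch, assuming a hypothetical table name:
ALTER TABLE uc_catalog.schema.table
SET TBLPROPERTIES ('pipelines.channel' = 'preview');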
Batch LLM inference using ai_query and Databricks-hosted foundation models
When you use a Databricks-hosted and pre-provisioned foundation model for batch inference, Databricks configures a provisioned throughput endpoint on your behalf that scales automatically based on the workload.
To use this method for batch inference, specify the following in your request:
- The pre-provisioned LLM you want to use in ai_query. Select from supported pre-provisioned LLMs.
- The Unity Catalog input table and output table.
- The model prompt and any model parameters.
SELECT text, ai_query(
"databricks-meta-llama-3-1-8b-instruct",
"Summarize the given text comprehensively, covering key points and main ideas concisely while retaining relevant details and examples. Ensure clarity and accuracy without unnecessary repetition or omissions: " || text
) AS summary
FROM uc_catalog.schema.table;
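The previous example passes only a prompt. ai_query also accepts optional model parameters, such as temperature and max_tokens, to control generation. The following is a minimal sketch, assuming the modelParameters argument is available in your Databricks Runtime version:
SELECT text, ai_query(
    "databricks-meta-llama-3-1-8b-instruct",
    "Summarize the given text in one concise paragraph: " || text,
    modelParameters => named_struct('max_tokens', 256, 'temperature', 0.2)
  ) AS summary
FROM uc_catalog.schema.table;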
Deploy batch inference pipelines
This section shows how you can integrate AI Functions into other Databricks data and AI products to build complete batch inference pipelines. These pipelines can perform end-to-end workflows that include ingestion, preprocessing, inference, and post-processing. Pipelines can be authored in SQL or Python and deployed as:
- Scheduled workflows using Databricks workflows
- Streaming inference workflows using Structured Streaming
Automate batch inference jobs using Databricks workflows
Schedule batch inference jobs and automate AI pipelines.
SQL
SELECT
*,
ai_query('databricks-meta-llama-3-3-70b-instruct', request => concat("You are an opinion mining service. Given a piece of text, output an array of json results that extracts key user opinions, a classification, and a Positive, Negative, Neutral, or Mixed sentiment about that subject.
AVAILABLE CLASSIFICATIONS
Quality, Service, Design, Safety, Efficiency, Usability, Price
Examples below:
DOCUMENT
I got soup. It really did take only 20 minutes to make some pretty good soup. The noises it makes when it's blending are somewhat terrifying, but it gives a little beep to warn you before it does that. It made three or four large servings of soup. It's a single layer of steel, so the outside gets pretty hot. It can be hard to unplug the lid without knocking the blender against the side, which is not a nice sound. The soup was good and the recipes it comes with look delicious, but I'm not sure I'll use it often. 20 minutes of scary noises from the kitchen when I already need comfort food is not ideal for me. But if you aren't sensitive to loud sounds it does exactly what it says it does.
RESULT
[
{'Classification': 'Efficiency', 'Comment': 'only 20 minutes','Sentiment': 'Positive'},
{'Classification': 'Quality','Comment': 'pretty good soup','Sentiment': 'Positive'},
{'Classification': 'Usability', 'Comment': 'noises it makes when it's blending are somewhat terrifying', 'Sentiment': 'Negative'},
{'Classification': 'Safety','Comment': 'outside gets pretty hot','Sentiment': 'Negative'},
{'Classification': 'Design','Comment': 'Hard to unplug the lid without knocking the blender against the side, which is not a nice sound', 'Sentiment': 'Negative'}
]
DOCUMENT
", REVIEW_TEXT, '\n\nRESULT\n')) as result
FROM catalog.schema.product_reviews
LIMIT 10
Python
import json
from pyspark.sql.functions import expr
# Define the opinion mining prompt as a multi-line string.
opinion_prompt = """You are an opinion mining service. Given a piece of text, output an array of json results that extracts key user opinions, a classification, and a Positive, Negative, Neutral, or Mixed sentiment about that subject.
AVAILABLE CLASSIFICATIONS
Quality, Service, Design, Safety, Efficiency, Usability, Price
Examples below:
DOCUMENT
I got soup. It really did take only 20 minutes to make some pretty good soup. The noises it makes when it's blending are somewhat terrifying, but it gives a little beep to warn you before it does that. It made three or four large servings of soup. It's a single layer of steel, so the outside gets pretty hot. It can be hard to unplug the lid without knocking the blender against the side, which is not a nice sound. The soup was good and the recipes it comes with look delicious, but I'm not sure I'll use it often. 20 minutes of scary noises from the kitchen when I already need comfort food is not ideal for me. But if you aren't sensitive to loud sounds it does exactly what it says it does.
RESULT
[
{'Classification': 'Efficiency', 'Comment': 'only 20 minutes','Sentiment': 'Positive'},
{'Classification': 'Quality','Comment': 'pretty good soup','Sentiment': 'Positive'},
{'Classification': 'Usability', 'Comment': 'noises it makes when it's blending are somewhat terrifying', 'Sentiment': 'Negative'},
{'Classification': 'Safety','Comment': 'outside gets pretty hot','Sentiment': 'Negative'},
{'Classification': 'Design','Comment': 'Hard to unplug the lid without knocking the blender against the side, which is not a nice sound', 'Sentiment': 'Negative'}
]
DOCUMENT
"""
# Escape the prompt so it can be safely embedded in the SQL expression.
escaped_prompt = json.dumps(opinion_prompt)
# Read the source table and limit to 10 rows.
df = spark.table("catalog.schema.product_reviews").limit(10)
# Apply the LLM inference to each row, concatenating the prompt, the review text, and the tail string.
result_df = df.withColumn(
"result",
expr(f"ai_query('databricks-meta-llama-3-3-70b-instruct', request => concat({escaped_prompt}, REVIEW_TEXT, '\\n\\nRESULT\\n'))")
)
# Display the result DataFrame.
display(result_df)
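When you schedule either of these examples as a Databricks job, you typically persist the results to a Unity Catalog output table rather than displaying them. A minimal SQL sketch, assuming a hypothetical output table name and a simplified prompt (substitute the opinion mining prompt above as needed):
CREATE OR REPLACE TABLE catalog.schema.product_review_opinions AS
SELECT
    *,
    ai_query(
      'databricks-meta-llama-3-3-70b-instruct',
      request => concat('Summarize the key opinions in this review: ', REVIEW_TEXT)
    ) AS result
FROM catalog.schema.product_reviews;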
AI Functions using Structured Streaming
Apply AI inference in near real-time or micro-batch scenarios using ai_query and Structured Streaming.
Step 1. Read your static Delta table
Read your static Delta table as if it were a stream.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Spark processes all existing rows exactly once in the first micro-batch.
df = spark.table("enterprise.docs") # Replace with your table name containing enterprise documents
# Repartition so the streaming read below is spread across multiple files and tasks.
df.repartition(50).write.format("delta").mode("overwrite").saveAsTable("enterprise.docs")
# Read the Delta table as a stream, capping the amount of data processed per micro-batch.
df_stream = spark.readStream.format("delta").option("maxBytesPerTrigger", "50K").table("enterprise.docs")
# Define the prompt outside the SQL expression.
prompt = (
"You are provided with an enterprise document. Summarize the key points in a concise paragraph. "
"Do not include extra commentary or suggestions. Document: "
)
Step 2. Apply ai_query
Spark processes this only once for static data unless new rows arrive in the table.
df_transformed = df_stream.select(
"document_text",
F.expr(f"""
ai_query(
'llama_3_8b',
CONCAT('{prompt}', document_text)
)
""").alias("summary")
)
Step 3. Write the summarized output
Write the summarized output to another Delta table.
# Time-based triggers apply, but only the first trigger processes all existing static data.
query = df_transformed.writeStream \
.format("delta") \
.option("checkpointLocation", "/tmp/checkpoints/_docs_summary") \
.outputMode("append") \
.toTable("enterprise.docs_summary")
query.awaitTermination()
Batch inference using custom models or fine-tuned foundation models
The notebook examples in this section demonstrate batch inference workloads that use custom or fine-tuned foundation models to process multiple inputs. The examples require an existing model serving endpoint that uses Foundation Model APIs provisioned throughput.
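After the endpoint is deployed, you call it from ai_query the same way as a Databricks-hosted model, typically supplying the expected return type for a custom endpoint. A minimal sketch, assuming a hypothetical endpoint name; check the ai_query reference for the exact signature supported by your runtime:
SELECT text, ai_query(
    "my-custom-llm-endpoint",
    "Summarize the following document: " || text,
    "STRING"
  ) AS summary
FROM uc_catalog.schema.table;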
LLM batch inference using a custom foundation model
The following example notebook creates a provisioned throughput endpoint and runs batch LLM inference using Python and the Meta Llama 3.1 70B model. It also has guidance on benchmarking your batch inference workload and creating a provisioned throughput model serving endpoint.
LLM batch inference with a custom foundation model and a provisioned throughput endpoint notebook
LLM batch inference using an embeddings model
The following example notebook creates a provisioned throughput endpoint and runs batch LLM inference using Python and your choice of either the GTE Large (English) or BGE Large (English) embeddings model.
LLM batch inference embeddings with a provisioned throughput endpoint notebook
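As a rough sketch of the pattern the notebook implements, you can call an embeddings endpoint from ai_query and store the vector output. The endpoint name below is hypothetical, and the return type must match your endpoint's response format:
SELECT text, ai_query(
    "my-gte-large-en-endpoint",
    text,
    "ARRAY<FLOAT>"
  ) AS embedding
FROM uc_catalog.schema.table;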
Batch inference and structured data extraction
The following example notebook demonstrates how to perform basic structured data extraction using ai_query to transform raw, unstructured data into organized, usable information through automated extraction techniques. This notebook also shows how to use Mosaic AI Agent Evaluation to evaluate the accuracy using ground truth data.
Batch inference and structured data extraction notebook
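As an illustration of the general approach, not the notebook's exact prompt or schema, a structured extraction query can instruct the model to return JSON containing only the fields you need. The table and column names below are hypothetical:
SELECT
    raw_text,
    ai_query(
      "databricks-meta-llama-3-3-70b-instruct",
      "Extract the company name, invoice date, and total amount from the following text. Respond only with a JSON object with the keys company, invoice_date, and total_amount: " || raw_text
    ) AS extracted_json
FROM uc_catalog.schema.raw_documents;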
Batch inference using BERT for named entity recognition
The following notebook shows a traditional ML model batch inference example using BERT for named entity recognition.