Query serving endpoints for custom models
In this article, learn how to format scoring requests for your served model, and how to send those requests to the model serving endpoint. The guidance is relevant to serving custom models, which Databricks defines as traditional ML models or customized Python models packaged in the MLflow format. They can be registered either in Unity Catalog or in the workspace model registry. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models. See Model serving with Azure Databricks for more information about this functionality and supported model categories.
For query requests for generative AI and LLM workloads, see Query generative AI models.
Requirements
- A model serving endpoint.
- For the MLflow Deployments SDK, MLflow 2.9 or above is required.
- A scoring request in an accepted format.
- To send a scoring request through the REST API or MLflow Deployments SDK, you must have a Databricks API token.
Important
As a security best practice, Databricks recommends that you use machine-to-machine OAuth tokens for authentication in production scenarios.
For testing and development, Databricks recommends using a personal access token belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
Querying methods and examples
Mosaic AI Model Serving provides the following options for sending scoring requests to served models:
Method | Details |
---|---|
Serving UI | Select Query endpoint from the Serving endpoint page in your Databricks workspace. Insert JSON format model input data and click Send Request. If the model has an input example logged, use Show Example to load it. |
REST API | Call and query the model using the REST API. See POST /serving-endpoints/{name}/invocations for details. For scoring requests to endpoints serving multiple models, see Query individual models behind an endpoint. |
MLflow Deployments SDK | Use MLflow Deployments SDK’s predict() function to query the model. |
SQL function | Invoke model inference directly from SQL using the ai_query SQL function. See Query a served model with ai_query. |
Pandas DataFrame scoring example
The following example assumes a MODEL_VERSION_URI like https://<databricks-instance>/model/iris-classifier/Production/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN.
See Supported scoring formats.
REST API
Score a model accepting dataframe split input format.
curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"dataframe_split": {
    "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
    "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
  }
}'
Score a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API documentation.
curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
-H 'Content-Type: application/json' \
-d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'
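The curl calls above can also be assembled from Python. The following is a minimal sketch using only the standard library; the helper name build_scoring_request is hypothetical, and the URI and token are placeholders rather than working values. It mirrors curl's `-u token:<value>` option by sending HTTP Basic authentication with the literal user name `token`, and it only builds the request without sending it.

```python
import base64
import json
import urllib.request


def build_scoring_request(model_uri: str, api_token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl calls above."""
    # curl's `-u token:<value>` sends HTTP Basic auth with the literal user "token".
    credentials = base64.b64encode(f"token:{api_token}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=model_uri,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


# Placeholder values (assumptions); replace them with your own endpoint URI and token.
request = build_scoring_request(
    "https://<databricks-instance>/model/iris-classifier/Production/invocations",
    "<DATABRICKS_API_TOKEN>",
    {"inputs": [[5.1, 3.5, 1.4, 0.2]]},
)
```

To send the request, pass it to urllib.request.urlopen(request) and decode the JSON body of the response.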
MLflow Deployments SDK
Important
The following example uses the predict() API from the MLflow Deployments SDK.
import os

import mlflow.deployments

# Set the workspace URL and API token before creating the client.
os.environ["DATABRICKS_HOST"] = "https://<workspace_host>.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "dapi-your-databricks-token"

client = mlflow.deployments.get_deploy_client("databricks")

# Send two rows in the dataframe_split format to the endpoint.
response = client.predict(
    endpoint="test-model-endpoint",
    inputs={
        "dataframe_split": {
            "index": [0, 1],
            "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
            "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
        }
    },
)
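As a rough illustration of how the predict() call above might be wrapped for reuse, the sketch below takes any client object that exposes predict(endpoint=..., inputs=...) and returns the model output. The names score_rows and FakeClient are hypothetical; FakeClient is a stand-in so that the sketch runs without a workspace. It assumes the response is dict-like and wraps the output in a predictions key.

```python
def score_rows(client, endpoint, columns, rows):
    """Send rows in dataframe_split format and return the model's predictions."""
    response = client.predict(
        endpoint=endpoint,
        inputs={"dataframe_split": {"columns": columns, "data": rows}},
    )
    # Assumption: the response is dict-like and wraps the output in "predictions".
    return response["predictions"]


class FakeClient:
    """Hypothetical stand-in for the deployments client; makes no network calls."""

    def predict(self, endpoint, inputs):
        # Return one dummy prediction per input row.
        return {"predictions": [0 for _ in inputs["dataframe_split"]["data"]]}


predictions = score_rows(
    FakeClient(),
    "test-model-endpoint",
    ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
    [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
)
```

With a real workspace, the object returned by mlflow.deployments.get_deploy_client("databricks") would take the place of FakeClient.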
SQL
Important
The following example uses the built-in SQL function, ai_query. This function is in Public Preview and the definition might change. See Query a served model with ai_query.
The following example queries the model behind the sentiment-analysis endpoint with the text dataset and specifies the return type of the request.
SELECT text, ai_query(
"sentiment-analysis",
text,
returnType => "STRUCT<label:STRING, score:DOUBLE>"
) AS predict
FROM
catalog.schema.customer_reviews
Power BI
You can score a dataset in Power BI Desktop using the following steps:
1. Open the dataset you want to score.
2. Go to Transform Data.
3. Right-click in the left panel and select Create New Query.
4. Go to View > Advanced Editor.
5. Replace the query body with the code snippet below, after filling in an appropriate DATABRICKS_API_TOKEN and MODEL_VERSION_URI.

   (dataset as table ) as table =>
   let
     call_predict = (dataset as table ) as list =>
     let
       apiToken = DATABRICKS_API_TOKEN,
       modelUri = MODEL_VERSION_URI,
       responseList = Json.Document(Web.Contents(modelUri,
         [
           Headers = [
             #"Content-Type" = "application/json",
             #"Authorization" = Text.Format("Bearer #{0}", {apiToken})
           ],
           Content = {"dataframe_records": Json.FromValue(dataset)}
         ]
       ))
     in
       responseList,
     predictionList = List.Combine(List.Transform(Table.Split(dataset, 256), (x) => call_predict(x))),
     predictionsTable = Table.FromList(predictionList, (x) => {x}, {"Prediction"}),
     datasetWithPrediction = Table.Join(
       Table.AddIndexColumn(predictionsTable, "index"), "index",
       Table.AddIndexColumn(dataset, "index"), "index")
   in
     datasetWithPrediction

6. Name the query with your desired model name.
7. Open the advanced query editor for your dataset and apply the model function.
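The Power BI query splits the dataset into chunks of 256 rows, scores each chunk separately, and joins the combined predictions back to the rows by position. The same batching pattern is sketched below in Python for illustration only. The names score_in_batches and fake_predict_batch are hypothetical, and fake_predict_batch stands in for a real call to the serving endpoint.

```python
def score_in_batches(rows, predict_batch, batch_size=256):
    """Score rows in fixed-size batches and attach one prediction to each row."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # predict_batch must return one prediction per row, in the same order.
        predictions.extend(predict_batch(batch))
    if len(predictions) != len(rows):
        raise ValueError("expected exactly one prediction per row")
    # Join predictions back to the input rows by position.
    return [{**row, "Prediction": p} for row, p in zip(rows, predictions)]


def fake_predict_batch(batch):
    """Hypothetical stand-in for an endpoint call; doubles the x value of each row."""
    return [row["x"] * 2 for row in batch]


rows = [{"x": i} for i in range(600)]
scored = score_in_batches(rows, fake_predict_batch)
```

Batching keeps each request payload small, at the cost of more round trips to the endpoint.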
Tensor input example
The following example scores a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API docs. This example assumes a MODEL_VERSION_URI like https://<databricks-instance>/model/iris-classifier/Production/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN.
curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
-H 'Content-Type: application/json' \
-d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'
Supported scoring formats
For custom models, Model Serving supports scoring requests in Pandas DataFrame or tensor input format.
Pandas DataFrame
Requests should be sent by constructing a JSON-serialized Pandas DataFrame with one of the supported keys and a JSON object corresponding to the input format.
- (Recommended) The dataframe_split format is a JSON-serialized Pandas DataFrame in the split orientation.

  {
    "dataframe_split": {
      "index": [0, 1],
      "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
      "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
    }
  }

- The dataframe_records format is a JSON-serialized Pandas DataFrame in the records orientation.

  Note
  This format does not guarantee the preservation of column ordering, and the split format is preferred over the records format.

  {
    "dataframe_records": [
      {
        "sepal length (cm)": 5.1,
        "sepal width (cm)": 3.5,
        "petal length (cm)": 1.4,
        "petal width (cm)": 0.2
      },
      {
        "sepal length (cm)": 4.9,
        "sepal width (cm)": 3,
        "petal length (cm)": 1.4,
        "petal width (cm)": 0.2
      },
      {
        "sepal length (cm)": 4.7,
        "sepal width (cm)": 3.2,
        "petal length (cm)": 1.3,
        "petal width (cm)": 0.2
      }
    ]
  }
The response from the endpoint contains the output from your model, serialized with JSON, wrapped in a predictions key.
{
"predictions": [0,1,1,1,0]
}
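Because the records orientation does not guarantee column ordering, it can help to convert record-style rows into the split orientation with an explicit column list before sending them. The following is a minimal sketch in plain Python; the helper name records_to_split is hypothetical and not part of any Databricks or MLflow API.

```python
def records_to_split(records, columns):
    """Convert record-style rows into a dataframe_split payload with a fixed column order."""
    return {
        "dataframe_split": {
            "index": list(range(len(records))),
            "columns": list(columns),
            # Read each value by column name so the output order follows `columns`.
            "data": [[record[column] for column in columns] for record in records],
        }
    }


columns = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]
records = [
    {"sepal width (cm)": 3.5, "sepal length (cm)": 5.1, "petal width (cm)": 0.2, "petal length (cm)": 1.4},
    {"sepal length (cm)": 4.9, "sepal width (cm)": 3.0, "petal length (cm)": 1.4, "petal width (cm)": 0.2},
]
payload = records_to_split(records, columns)
```

The first record lists its keys in a different order on purpose: the resulting data rows still follow the order given in columns.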
Tensor input
When your model expects tensors, like a TensorFlow or PyTorch model, there are two supported format options for sending requests: instances and inputs.
If you have multiple named tensors per row, then you have to have one of each tensor for every row.
- instances is a tensors-based format that accepts tensors in row format. Use this format if all the input tensors have the same 0-th dimension. Conceptually, each tensor in the instances list could be joined with the other tensors of the same name in the rest of the list to construct the full input tensor for the model, which would only be possible if all of the tensors have the same 0-th dimension.

  {"instances": [ 1, 2, 3 ]}

  The following example shows how to specify multiple named tensors.

  {
    "instances": [
      { "t1": "a", "t2": [1, 2, 3, 4, 5], "t3": [[1, 2], [3, 4], [5, 6]] },
      { "t1": "b", "t2": [6, 7, 8, 9, 10], "t3": [[7, 8], [9, 10], [11, 12]] }
    ]
  }

- inputs sends queries with tensors in columnar format. This format can represent requests in which named tensors have different numbers of instances (for example, three instances of t2 but a different number for t1 and t3), which is not possible in the instances format.

  {
    "inputs": {
      "t1": ["a", "b"],
      "t2": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
      "t3": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
    }
  }
The response from the endpoint is in the following format.
{
"predictions": [0,1,1,1,0]
}
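When every row provides the same named tensors, the row-oriented instances format and the columnar inputs format carry the same information. The sketch below shows one way to convert from the former to the latter in plain Python; the helper name instances_to_inputs is hypothetical.

```python
def instances_to_inputs(instances):
    """Convert row-format named tensors ("instances") to columnar format ("inputs")."""
    if not instances:
        return {"inputs": {}}
    names = list(instances[0].keys())
    for instance in instances:
        # The row format requires one of each named tensor in every row.
        if set(instance) != set(names):
            raise ValueError("every instance must provide the same named tensors")
    # Stack the values of each named tensor along a new leading (0-th) dimension.
    return {"inputs": {name: [instance[name] for instance in instances] for name in names}}


row_request = {
    "instances": [
        {"t1": "a", "t2": [1, 2, 3, 4, 5], "t3": [[1, 2], [3, 4], [5, 6]]},
        {"t1": "b", "t2": [6, 7, 8, 9, 10], "t3": [[7, 8], [9, 10], [11, 12]]},
    ]
}
columnar_request = instances_to_inputs(row_request["instances"])
```

The reverse conversion is only possible when all named tensors in an inputs request share the same 0-th dimension.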
Notebook example
See the following notebook for an example of how to test your Model Serving endpoint with a Python model: