Muokkaa

Jaa


Predicted outputs (preview)

Predicted outputs can improve model response latency for chat completions calls where minimal changes are needed to a larger body of text. If you're asking the model to provide a response where a large portion of the expected response is already known, predicted outputs can significantly reduce the latency of this request. This capability is particularly well-suited for coding scenarios, including autocomplete, error detection, and real-time editing, where speed and responsiveness are critical for developers and end-users. Rather than have the model regenerate all the text from scratch, you can indicate to the model that most of the response is already known by passing the known text to the prediction parameter.

Model support

  • gpt-4o-mini version: 2024-07-18
  • gpt-4o version: 2024-08-06
  • gpt-4o version: 2024-11-20

API support

  • 2025-01-01-preview

Unsupported features

Predicted outputs is currently text-only. These features can't be used in conjunction with the prediction parameter and predicted outputs.

  • Tools/Function calling
  • audio models/inputs and outputs
  • n values higher than 1
  • logprobs
  • presence_penalty values greater than 0
  • frequency_penalty values greater than 0
  • max_completion_tokens

Note

The predicted outputs feature is currently unavailable for models in the South East Asia region.

Getting started

To demonstrate the basics of predicted outputs, we'll start by asking a model to refactor the code from the common programming FizzBuzz problem to replace the instance of FizzBuzz with MSFTBuzz. We'll pass our example code to the model in two places. First as part of a user message in the messages array/list, and a second time as part of the content of the new prediction parameter.

You might need to upgrade your OpenAI client library to access the prediction parameter.

pip install openai --upgrade
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), 
  azure_ad_token_provider=token_provider,
  api_version="2025-01-01-preview"
)

code = """
for number in range(1, 101):
    if number % 3 == 0 and number % 5 == 0:
        print("FizzBuzz")
    elif number % 3 == 0:
        print("Fizz")
    elif number % 5 == 0:
        print("Buzz")
    else:
        print(number)
"""

instructions = """
Replace string `FizzBuzz` with `MSFTBuzz`. Respond only 
with code, and with no markdown formatting.
"""


completion = client.chat.completions.create(
    model="gpt-4o-mini", # replace with your unique model deployment name
    messages=[
        {
            "role": "user",
            "content": instructions
        },
        {
            "role": "user",
            "content": code
        }
    ],
    prediction={
        "type": "content",
        "content": code
    }
)

print(completion.model_dump_json(indent=2))

Output

{
  "id": "chatcmpl-AskZk3P5QGmefqobDw4Ougo6jLxSP",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "for number in range(1, 101):\n    if number % 3 == 0 and number % 5 == 0:\n        print(\"MSFTBuzz\")\n    elif number % 3 == 0:\n        print(\"Fizz\")\n    elif number % 5 == 0:\n        print(\"Buzz\")\n    else:\n        print(number)",
        "refusal": null,
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "protected_material_code": {
          "filtered": false,
          "detected": false
        },
        "protected_material_text": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ],
  "created": 1737612112,
  "model": "gpt-4o-mini-2024-07-18",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": "fp_5154047bf2",
  "usage": {
    "completion_tokens": 77,
    "prompt_tokens": 124,
    "total_tokens": 201,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 6,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 4
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  },
  "prompt_filter_results": [
    {
      "prompt_index": 0,
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ]
}

Notice in the output the new response parameters for accepted_prediction_tokens and rejected_prediction_tokens:

  "usage": {
    "completion_tokens": 77,
    "prompt_tokens": 124,
    "total_tokens": 201,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 6,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 4
    }

The accepted_prediction_tokens help reduce model response latency, but any rejected_prediction_tokens have the same cost implication as additional output tokens generated by the model. For this reason, while predicted outputs can improve model response times, it can result in greater costs. You'll need to evaluate and balance the increased model performance against the potential increases in cost.

It's also important to understand, that using predictive outputs doesn't guarantee a reduction in latency. A large request with a greater percentage of rejected prediction tokens than accepted prediction tokens could result in an increase in model response latency, rather than a decrease.

Note

Unlike prompt caching which only works when a set minimum number of initial tokens at the beginning of a request are identical, predicted outputs isn't constrained by token location. Even if your response text contains new output that will be returned prior to the predicted output, accepted_prediction_tokens can still occur.

Streaming

Predicted outputs performance boost is often most obvious if you're returning your responses with streaming enabled.

You might need to upgrade your OpenAI client library to access the prediction parameter.

pip install openai --upgrade
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), 
  azure_ad_token_provider=token_provider,
  api_version="2025-01-01-preview"
)

code = """
for number in range(1, 101):
    if number % 3 == 0 and number % 5 == 0:
        print("FizzBuzz")
    elif number % 3 == 0:
        print("Fizz")
    elif number % 5 == 0:
        print("Buzz")
    else:
        print(number)
"""

instructions = """
Replace string `FizzBuzz` with `MSFTBuzz`. Respond only 
with code, and with no markdown formatting.
"""


completion = client.chat.completions.create(
    model="gpt-4o-mini", # replace with your unique model deployment name
    messages=[
        {
            "role": "user",
            "content": instructions
        },
        {
            "role": "user",
            "content": code
        }
    ],
    prediction={
        "type": "content",
        "content": code
    },
    stream=True
)

for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end='',)