你当前正在访问 Microsoft Azure Global Edition 技术文档网站。如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站，请访问 https://docs.azure.cn。

实时 API（预览版）参考

项目
12/18/2024

注意

此功能目前处于公开预览状态。此预览版未提供服务级别协议，不建议将其用于生产工作负载。某些功能可能不受支持或者受限。有关详细信息，请参阅 Microsoft Azure 预览版补充使用条款。

实时 API 是基于 WebSocket 的 API，可让你与 Azure OpenAI 服务实时交互。

实时 API（通过 /realtime）构建在 WebSocket API 之上，以方便最终用户和模型之间进行完全异步的流式通信。设备详细信息（如捕获和呈现音频数据）不在实时 API 的范围内。它应在一个用于管理与最终用户的连接和模型终结点连接的受信任中间服务的环境中使用。请勿直接从不受信任的最终用户设备使用它。

提示

若要开始使用实时 API，请参阅快速入门和操作指南。

连接

实时 API 需要受支持区域中的现有 Azure OpenAI 资源终结点。该 API 是通过与 Azure OpenAI 资源的 /realtime 终结点之间的安全 WebSocket 连接进行访问的。

可以通过连接以下内容来构造完整的请求 URI：

安全 WebSocket (wss://) 协议
你的 Azure OpenAI 资源终结点主机名，例如 my-aoai-resource.openai.azure.com
openai/realtime API 路径
受支持 API 版本的 api-version 查询字符串参数，例如 2024-10-01-preview
带有 gpt-4o-realtime-preview 模型部署的名称的 deployment 查询字符串参数

以下示例是一个结构良好的 /realtime 请求 URI：

wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview-1001

身份验证

若要进行身份验证：

Microsoft Entra（推荐）：对于启用了托管标识的 Azure OpenAI 服务资源，通过 /realtime API 使用基于令牌的身份验证。将 Bearer 令牌与 Authorization 标头配合使用来应用一个检索到的身份验证令牌。
API 密钥：可通过以下两种方式之一提供 api-key：
- 对预握手连接使用 api-key 连接标头。此选项在浏览器环境中不可用。
- 对请求 URI 使用 api-key 查询字符串参数。使用 https/wss 时，查询字符串参数是加密的。

客户端事件

可以从客户端发送到服务器的客户端事件有九种：

事件	说明
RealtimeClientEventConversationItemCreate	将项添加到对话时发送此客户端事件。
RealtimeClientEventConversationItemDelete	当你想要从对话历史记录中删除任何项时发送此客户端事件。
RealtimeClientEventConversationItemTruncate	当你想要截断先前助手消息的音频时发送此客户端事件。
RealtimeClientEventInputAudioBufferAppend	发送此客户端事件可将音频字节追加到输入音频缓冲区。
RealtimeClientEventInputAudioBufferClear	发送此客户端事件可清除缓冲区中的音频字节。
RealtimeClientEventInputAudioBufferCommit	发送此客户端事件可将音频字节提交到用户消息。
RealtimeClientEventResponseCancel	发送此客户端事件可取消正在进行的响应。
RealtimeClientEventResponseCreate	发送此客户端事件可触发响应生成。
RealtimeClientEventSessionUpdate	发送此客户端事件可更新会话的默认配置。

RealtimeClientEventConversationItemCreate

客户端 conversation.item.create 事件用于向对话的上下文添加新项，包括消息、函数调用和函数调用响应。此事件可用于填充对话历史记录以及在中途添加新项。目前，此事件无法填充助手音频消息。

如果成功，服务器将使用 conversation.item.created 事件进行响应，否则将发送 error 事件。

事件结构

{
  "type": "conversation.item.create",
  "previous_item_id": "<previous_item_id>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.create`。
previous_item_id	string	前一项（新项将插入到其后）的 ID。如果未设置，新项将追加到对话的末尾。如果已设置，则允许在对话中插入项。如果找不到该 ID，则会返回错误，并且不会添加此项。
item	RealtimeConversationRequestItem	要添加到对话中的项。

RealtimeClientEventConversationItemDelete

客户端 conversation.item.delete 事件用于从对话历史记录中删除项。

服务器使用 conversation.item.deleted 事件进行响应，除非对话历史记录中不存在该项（在这种情况下，服务器将以一个错误进行响应）。

事件结构

{
  "type": "conversation.item.delete",
  "item_id": "<item_id>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.delete`。
item_id	string	要删除的项的 ID。

RealtimeClientEventConversationItemTruncate

客户端 conversation.item.truncate 事件用于截断先前助手消息的音频。服务器生成音频的速度比实时速度快，因此当用户进行中断以截断已发送到客户端但尚未播放的音频时，此事件非常有用。服务器对音频的理解与客户端的播放是同步的。

截断音频会删除服务器端文本脚本，以确保上下文中不存在用户不知道的文本。

如果客户端事件成功，服务器将使用 conversation.item.truncated 事件进行响应。

事件结构

{
  "type": "conversation.item.truncate",
  "item_id": "<item_id>",
  "content_index": 0,
  "audio_end_ms": 0
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.truncate`。
item_id	string	要截断的助手消息项的 ID。只有助手消息项可以截断。
content_index	integer	要截断的内容部分的索引。将此属性设置为“0”。
audio_end_ms	integer	音频被截断的非独占持续时间（以毫秒为单位）。如果 audio_end_ms 大于实际音频持续时间，服务器将用一条错误进行响应。

RealtimeClientEventInputAudioBufferAppend

客户端 input_audio_buffer.append 事件用于将音频字节追加到输入音频缓冲区。音频缓冲区是你可以写入并稍后提交的临时存储。

在服务器 VAD（语音活动检测）模式下，音频缓冲区用于检测语音，服务器决定何时提交。禁用服务器 VAD 后，客户端可以选择每个事件中放置多少音频量，最多放置 15 MiB。例如，从客户端流式处理较小的数据块可以让 VAD 响应更迅速。

与进行其他客户端事件不同，服务器不会向客户端 input_audio_buffer.append 事件发送确认响应。

事件结构

{
  "type": "input_audio_buffer.append",
  "audio": "<audio>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `input_audio_buffer.append`。
audio	string	Base64 编码的音频字节。此值必须采用会话配置中 `input_audio_format` 字段指定的格式。

RealtimeClientEventInputAudioBufferClear

客户端 input_audio_buffer.clear 事件用于清除缓冲区中的音频字节。

服务器使用 input_audio_buffer.cleared 事件进行响应。

事件结构

{
  "type": "input_audio_buffer.clear"
}

属性

字段	类型	说明
type	string	事件类型必须是 `input_audio_buffer.clear`。

RealtimeClientEventInputAudioBufferCommit

客户端 input_audio_buffer.commit 事件用于提交用户输入音频缓冲区，从而在对话中创建新的用户消息项。如果 input_audio_transcription 为会话配置了音频，则系统会转录音频。

处于服务器 VAD 模式时，客户端不需要发送此事件，服务器会自动提交音频缓冲区。如果没有服务器 VAD，客户端必须提交音频缓冲区才能创建用户消息项。如果输入音频缓冲区为空，则此客户端事件将生成错误。

提交输入音频缓冲区不会从模型创建响应。

服务器使用 input_audio_buffer.committed 事件进行响应。

事件结构

{
  "type": "input_audio_buffer.commit"
}

属性

字段	类型	说明
type	string	事件类型必须是 `input_audio_buffer.commit`。

RealtimeClientEventResponseCancel

客户端 response.cancel 事件用于取消正在进行的响应。

服务器将使用 response.cancelled 事件进行响应；如果没有任何响应可供取消，服务器将以一个错误进行响应。

事件结构

{
  "type": "response.cancel"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.cancel`。

RealtimeClientEventResponseCreate

客户端 response.create 事件用于指示服务器通过模型推理创建响应。在服务器 VAD 模式下配置会话时，服务器会自动创建响应。

响应至少包含一个 item，可以包含两个（在这种情况下，第二个是函数调用）。这些项将追加到对话历史记录中。

服务器使用 response.created 事件、一个或多个项和内容事件（如 conversation.item.created 和 response.content_part.added）进行响应，最后用一个 response.done 事件指示响应已完成。

注意

客户端 response.create 事件包括推理配置（如 instructions 和 temperature）。这些字段仅可替代此响应的会话配置。

事件结构

{
  "type": "response.create"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.create`。
response	RealtimeResponseOptions	响应选项。

RealtimeClientEventSessionUpdate

客户端 session.update 事件用于更新会话的默认配置。客户端可以随时发送此事件来更新会话配置，并且，除了语音之外，任何字段都可以随时更新。

仅更新存在的字段。若要清除字段（例如 instructions），请传递一个空字符串。

服务器使用一个包含完整有效配置的 session.updated 事件进行响应。

事件结构

{
  "type": "session.update"
}

属性

字段	类型	说明
type	string	事件类型必须是 `session.update`。
会话	RealtimeRequestSession	会话配置。

服务器事件

可以从服务器接收的服务器事件有 28 种：

事件	说明
RealtimeServerEventConversationCreated	创建对话时的服务器事件。在创建会话后立即发出。
RealtimeServerEventConversationItemCreated	创建对话项时的服务器事件。
RealtimeServerEventConversationItemDeleted	删除对话中的项时的服务器事件。
RealtimeServerEventConversationItemInputAudioTranscriptionCompleted	启用了输入音频听录并且听录成功时的服务器事件。
RealtimeServerEventConversationItemInputAudioTranscriptionFailed	配置了输入音频听录并且用户消息的听录请求失败时的服务器事件。
RealtimeServerEventConversationItemTruncated	客户端截断先前的助手音频消息项时的服务器事件。
RealtimeServerEventError	发生错误时的服务器事件。
RealtimeServerEventInputAudioBufferCleared	客户端清除输入音频缓冲区时的服务器事件。
RealtimeServerEventInputAudioBufferCommitted	当输入音频缓冲区由客户端提交或在服务器 VAD 模式下自动提交时的服务器事件。
RealtimeServerEventInputAudioBufferSpeechStarted	检测到语音时服务器轮次检测模式下的服务器事件。
RealtimeServerEventInputAudioBufferSpeechStopped	语音停止时服务器轮次检测模式下的服务器事件。
RealtimeServerEventRateLimitsUpdated	在每个“response.done”事件后发出，以指示已更新的速率限制。
RealtimeServerEventResponseAudioDelta	更新模型生成的音频时的服务器事件。
RealtimeServerEventResponseAudioDone	完成模型生成的音频时的服务器事件。当响应被中断、不完整或取消时也会发出。
RealtimeServerEventResponseAudioTranscriptDelta	更新模型生成的音频输出听录时的服务器事件。
RealtimeServerEventResponseAudioTranscriptDone	模型生成的音频输出听录完成流式处理时的服务器事件。当响应被中断、不完整或取消时也会发出。
RealtimeServerEventResponseContentPartAdded	在响应生成期间将新的内容部分添加到助手消息项时的服务器事件。
RealtimeServerEventResponseContentPartDone	当内容部分在助手消息项中完成流式处理时的服务器事件。当响应被中断、不完整或取消时也会发出。
RealtimeServerEventResponseCreated	创建新的响应时的服务器事件。响应创建的第一个事件，其中响应处于初始状态“in_progress”。
RealtimeServerEventResponseDone	响应完成流式处理时的服务器事件。始终发出，无论最终状态如何。
RealtimeServerEventResponseFunctionCallArgumentsDelta	更新模型生成的函数调用参数时的服务器事件。
RealtimeServerEventResponseFunctionCallArgumentsDone	模型生成的函数调用参数完成流式处理时的服务器事件。当响应被中断、不完整或取消时也会发出。
RealtimeServerEventResponseOutputItemAdded	新的输出项添加到响应时的服务器事件。
RealtimeServerEventResponseOutputItemDone	输出项完成流式处理时的服务器事件。当响应被中断、不完整或取消时也会发出。
RealtimeServerEventResponseTextDelta	更新模型生成的文本时的服务器事件。
RealtimeServerEventResponseTextDone	完成模型生成的文本时的服务器事件。当响应被中断、不完整或取消时也会发出。
RealtimeServerEventSessionCreated	创建会话时的服务器事件。
RealtimeServerEventSessionUpdated	更新会话时的服务器事件。

RealtimeServerEventConversationCreated

在创建会话后会立即返回服务器 conversation.created 事件。每个会话创建一个对话。

事件结构

{
  "type": "conversation.created",
  "conversation": {
    "id": "<id>",
    "object": "<object>"
  }
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.created`。
聊天	object	对话资源。

对话属性

字段	类型	描述
id	string	对话的唯一 ID。
object	string	对象类型必须为 `realtime.conversation`。

RealtimeServerEventConversationItemCreated

创建对话项时，将返回服务器 conversation.item.created 事件。有几种情况会产生此事件：

服务器正在生成响应，如果成功，则生成一个或两个项，其类型为 message（角色 assistant）或 function_call。
输入音频缓冲区由客户端或服务器（在 server_vad 模式下）提交。服务器获取输入音频缓冲区的内容并将其添加到新的用户消息项中。
客户端发送了 conversation.item.create 事件，以向对话添加新项。

事件结构

{
  "type": "conversation.item.created",
  "previous_item_id": "<previous_item_id>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.created`。
previous_item_id	string	对话上下文中前一项的 ID 使客户端可以了解对话的顺序。
item	RealtimeConversationResponseItem	已创建的项。

RealtimeServerEventConversationItemDeleted

客户端使用 conversation.item.delete 事件删除对话中的项时，系统会返回服务器 conversation.item.deleted 事件。此事件用于将服务器对对话历史记录的理解与客户端的视图进行同步。

事件结构

{
  "type": "conversation.item.deleted",
  "item_id": "<item_id>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.deleted`。
item_id	string	已删除的项的 ID。

RealtimeServerEventConversationItemInputAudioTranscriptionCompleted

服务器 conversation.item.input_audio_transcription.completed 事件是写入音频缓冲区的语音听录的结果。

当输入音频缓冲区由客户端或服务器（在 server_vad 模式下）提交时，听录开始。听录与响应创建异步运行，因此该事件可以发生在响应事件之前或之后。

实时 API 模型本身接受音频，因此输入听录是在单独的语音识别模型（目前始终为 whisper-1）上运行的单独进程。因此，脚本可能与模型的解释有所不同，应将其视为粗略指南。

事件结构

{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "<item_id>",
  "content_index": 0,
  "transcript": "<transcript>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.input_audio_transcription.completed`。
item_id	string	包含音频的用户消息项的 ID。
content_index	integer	包含音频的内容部分的索引。
脚本	string	听录的文本。

RealtimeServerEventConversationItemInputAudioTranscriptionFailed

配置了输入音频听录并且用户消息的听录请求失败时，系统会返回服务器 conversation.item.input_audio_transcription.failed 事件。此事件是与其他 error 事件分开的，以便客户端能够识别相关项。

事件结构

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.input_audio_transcription.failed`。
item_id	string	用户消息项的 ID。
content_index	integer	包含音频的内容部分的索引。
error	object	听录错误的详细信息。请参阅下一个表中的嵌套属性。

错误属性

字段	类型	说明
type	string	错误的类型。
code	string	错误代码（如果有）。
message	string	用户可读的错误消息。
param	string	与错误相关的参数（如果有）。

RealtimeServerEventConversationItemTruncated

客户端使用 conversation.item.truncate 事件截断先前的助手音频消息项时，系统会返回服务器 conversation.item.truncated 事件。此事件用于将服务器对对话历史记录的理解与客户端的播放进行同步。

此事件会截断音频并删除服务器端文本脚本，以确保上下文中不存在用户不知道的文本。

事件结构

{
  "type": "conversation.item.truncated",
  "item_id": "<item_id>",
  "content_index": 0,
  "audio_end_ms": 0
}

属性

字段	类型	说明
type	string	事件类型必须是 `conversation.item.truncated`。
item_id	string	已截断的助手消息项的 ID。
content_index	integer	已截断的内容部分的索引。
audio_end_ms	integer	音频被截断的持续时间（以毫秒为单位）。

RealtimeServerEventError

发生错误时，系统会返回服务器 error 事件（可能是客户端问题，也可能是服务器问题）。大多数错误都是可恢复的，并且会话将保持打开状态。

事件结构

{
  "type": "error",
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>",
    "event_id": "<event_id>"
  }
}

属性

字段	类型	说明
type	string	事件类型必须是 `error`。
error	object	错误的详细信息。请参阅下一个表中的嵌套属性。

错误属性

字段	类型	说明
type	string	错误的类型。例如，“invalid_request_error”和“server_error”是错误类型。
code	string	错误代码（如果有）。
message	string	用户可读的错误消息。
param	string	与错误相关的参数（如果有）。
event_id	string	导致错误的客户端事件的 ID（如果适用）。

RealtimeServerEventInputAudioBufferCleared

客户端使用 input_audio_buffer.clear 事件清除输入音频缓冲区时，系统会返回服务器 input_audio_buffer.cleared 事件。

事件结构

{
  "type": "input_audio_buffer.cleared"
}

属性

字段	类型	说明
type	string	事件类型必须是 `input_audio_buffer.cleared`。

RealtimeServerEventInputAudioBufferCommitted

当输入音频缓冲区由客户端提交或在服务器 VAD 模式下自动提交时，系统会返回服务器 input_audio_buffer.committed 事件。 item_id 属性是创建的用户消息项的 ID。因此，conversation.item.created 事件也会发送到客户端。

事件结构

{
  "type": "input_audio_buffer.committed",
  "previous_item_id": "<previous_item_id>",
  "item_id": "<item_id>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `input_audio_buffer.committed`。
previous_item_id	string	前一项（新项将插入到其后）的 ID。
item_id	string	创建的用户消息项的 ID。

RealtimeServerEventInputAudioBufferSpeechStarted

在音频缓冲区中检测到语音时，系统会以 server_vad 模式返回服务器 input_audio_buffer.speech_started 事件。每当音频添加到缓冲区时，此事件都可能发生（除非已检测到语音）。

注意

客户端可能希望使用此事件来中断音频播放或向用户提供视觉反馈。

当语音停止时，客户端应会期望收到 input_audio_buffer.speech_stopped 事件。 item_id 属性是语音停止时创建的用户消息项的 ID。除非客户端在 VAD 激活期间手动提交音频缓冲区，否则 item_id 也包含在 input_audio_buffer.speech_stopped 事件中。

事件结构

{
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 0,
  "item_id": "<item_id>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `input_audio_buffer.speech_started`。
audio_start_ms	integer	首次检测到语音时，从在会话期间写入到缓冲区的全部音频的开头算起已经过的毫秒数。此属性对应于发送到模型的音频的开头，因此包括会话中配置的 `prefix_padding_ms`。
item_id	string	语音停止时创建的用户消息项的 ID。

RealtimeServerEventInputAudioBufferSpeechStopped

server_vad 模式下服务器在音频缓冲区中检测到语音结束时，系统会返回服务器 input_audio_buffer.speech_stopped 事件。

服务器还发送一个 conversation.item.created 事件，其中包含从音频缓冲区创建的用户消息项。

事件结构

{
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 0,
  "item_id": "<item_id>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `input_audio_buffer.speech_stopped`。
audio_end_ms	integer	语音停止时，自会话开始起已经过的毫秒数。此属性对应于发送到模型的音频的末尾，因此包括会话中配置的 `min_silence_duration_ms`。
item_id	string	创建的用户消息项的 ID。

RealtimeServerEventRateLimitsUpdated

响应开始时发出的服务器 rate_limits.updated 事件，用于指示已更新的速率限制。

创建响应时，某些词元 (Token) 将保留用于输出词元。此处显示的速率限制反映这种保留，保留随后会在响应完成后相应地得到调整。

事件结构

{
  "type": "rate_limits.updated",
  "rate_limits": [
    {
      "name": "<name>",
      "limit": 0,
      "remaining": 0,
      "reset_seconds": 0
    }
  ]
}

属性

字段	类型	说明
type	string	事件类型必须是 `rate_limits.updated`。
rate_limits	RealtimeServerEventRateLimitsUpdatedRateLimitsItem 数组	速率限制信息列表。

RealtimeServerEventResponseAudioDelta

更新模型生成的音频时，系统将返回服务器 response.audio.delta 事件。

事件结构

{
  "type": "response.audio.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.audio.delta`。
response_id	string	响应的 ID。
item_id	string	项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。
delta	string	Base64 编码的音频数据增量。

RealtimeServerEventResponseAudioDone

模型生成完音频后，系统将返回服务器 response.audio.done 事件。

当响应中断、不完整或取消时，系统也会返回此事件。

事件结构

{
  "type": "response.audio.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.audio.done`。
response_id	string	响应的 ID。
item_id	string	项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。

RealtimeServerEventResponseAudioTranscriptDelta

更新模型生成的音频输出听录时，系统会返回服务器 response.audio_transcript.delta 事件。

事件结构

{
  "type": "response.audio_transcript.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.audio_transcript.delta`。
response_id	string	响应的 ID。
item_id	string	项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。
delta	string	脚本增量。

RealtimeServerEventResponseAudioTranscriptDone

模型生成的音频输出听录完成流式处理时，系统会返回服务器 response.audio_transcript.done 事件。

当响应中断、不完整或取消时，系统也会返回此事件。

事件结构

{
  "type": "response.audio_transcript.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "transcript": "<transcript>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.audio_transcript.done`。
response_id	string	响应的 ID。
item_id	string	项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。
脚本	string	音频的最终脚本。

RealtimeServerEventResponseContentPartAdded

在响应生成期间将新的内容部分添加到助手消息项时，系统会返回服务器 response.content_part.added 事件。

事件结构

{
  "type": "response.content_part.added",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.content_part.added`。
response_id	string	响应的 ID。
item_id	string	内容部分已添加到的项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。
part	RealtimeContentPart	已添加的内容部分。

部分属性

字段	类型	说明
type	RealtimeContentPartType

RealtimeServerEventResponseContentPartDone

当内容部分在助手消息项中完成流式处理时，系统会返回服务器 response.content_part.done 事件。

当响应中断、不完整或取消时，系统也会返回此事件。

事件结构

{
  "type": "response.content_part.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.content_part.done`。
response_id	string	响应的 ID。
item_id	string	项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。
part	RealtimeContentPart	完成的内容部分。

部分属性

字段	类型	说明
type	RealtimeContentPartType

RealtimeServerEventResponseCreated

创建新响应时，系统会返回服务器 response.created 事件。这是响应创建的第一个事件，其中响应处于初始状态 in_progress。

事件结构

{
  "type": "response.created"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.created`。
response	RealtimeResponse	响应对象。

RealtimeServerEventResponseDone

当响应完成流式处理时，系统会返回服务器 response.done 事件。无论最终状态如何，始终发出此事件。 response.done 事件中包含的响应对象包括响应中的所有输出项，但省略原始音频数据。

事件结构

{
  "type": "response.done"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.done`。
response	RealtimeResponse	响应对象。

RealtimeServerEventResponseFunctionCallArgumentsDelta

更新模型生成的函数调用参数时，系统会返回服务器 response.function_call_arguments.delta 事件。

事件结构

{
  "type": "response.function_call_arguments.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "call_id": "<call_id>",
  "delta": "<delta>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.function_call_arguments.delta`。
response_id	string	响应的 ID。
item_id	string	函数调用项的 ID。
output_index	integer	响应中的输出项的索引。
call_id	string	函数调用的 ID。
delta	string	参数增量，采用 JSON 字符串的形式。

RealtimeServerEventResponseFunctionCallArgumentsDone

模型生成的函数调用参数完成流式处理时，系统会返回服务器 response.function_call_arguments.done 事件。

当响应中断、不完整或取消时，系统也会返回此事件。

事件结构

{
  "type": "response.function_call_arguments.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "call_id": "<call_id>",
  "arguments": "<arguments>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.function_call_arguments.done`。
response_id	string	响应的 ID。
item_id	string	函数调用项的 ID。
output_index	integer	响应中的输出项的索引。
call_id	string	函数调用的 ID。
参数	string	最终参数，采用 JSON 字符串的形式。

RealtimeServerEventResponseOutputItemAdded

在响应生成过程中创建新项时，系统会返回服务器 response.output_item.added 事件。

事件结构

{
  "type": "response.output_item.added",
  "response_id": "<response_id>",
  "output_index": 0
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.output_item.added`。
response_id	string	项所属的响应的 ID。
output_index	integer	响应中的输出项的索引。
item	RealtimeConversationResponseItem	已添加的项。

RealtimeServerEventResponseOutputItemDone

当项完成流式处理时，系统会返回服务器 response.output_item.done 事件。

当响应中断、不完整或取消时，系统也会返回此事件。

事件结构

{
  "type": "response.output_item.done",
  "response_id": "<response_id>",
  "output_index": 0
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.output_item.done`。
response_id	string	项所属的响应的 ID。
output_index	integer	响应中的输出项的索引。
item	RealtimeConversationResponseItem	完成流式处理的项。

RealtimeServerEventResponseTextDelta

更新模型生成的文本时，系统会返回服务器 response.text.delta 事件。文本对应于助手消息项的 text 内容部分。

事件结构

{
  "type": "response.text.delta",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "delta": "<delta>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.text.delta`。
response_id	string	响应的 ID。
item_id	string	项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。
delta	string	文本增量。

RealtimeServerEventResponseTextDone

当模型生成的文本完成流式处理时，系统会返回服务器 response.text.done 事件。文本对应于助手消息项的 text 内容部分。

当响应中断、不完整或取消时，系统也会返回此事件。

事件结构

{
  "type": "response.text.done",
  "response_id": "<response_id>",
  "item_id": "<item_id>",
  "output_index": 0,
  "content_index": 0,
  "text": "<text>"
}

属性

字段	类型	说明
type	string	事件类型必须是 `response.text.done`。
response_id	string	响应的 ID。
item_id	string	项的 ID。
output_index	integer	响应中的输出项的索引。
content_index	integer	项内容数组中的内容部分的索引。
text	string	最终文本内容。

RealtimeServerEventSessionCreated

当你建立与实时 API 的新连接时，服务器 session.created 事件是第一个服务器事件。此事件创建并返回具有默认会话配置的新会话。

事件结构

{
  "type": "session.created"
}

属性

字段	类型	说明
type	string	事件类型必须是 `session.created`。
会话	RealtimeResponseSession	会话对象。

RealtimeServerEventSessionUpdated

客户端更新会话时，系统会返回服务器 session.updated 事件。如果出现错误，服务器将改为发送 error 事件。

事件结构

{
  "type": "session.updated"
}

属性

字段	类型	说明
type	string	事件类型必须是 `session.updated`。
会话	RealtimeResponseSession	会话对象。

组件

RealtimeAudioFormat

允许的值：

pcm16
g711_ulaw
g711_alaw

RealtimeAudioInputTranscriptionModel

允许的值：

whisper-1

RealtimeAudioInputTranscriptionSettings

字段	类型	描述
模型	RealtimeAudioInputTranscriptionModel	默认的 `whisper-1` 模型当前是唯一支持音频输入听录的模型。

RealtimeClientEvent

字段	类型	说明
type	RealtimeClientEventType	客户端事件的类型。
event_id	string	事件的唯一 ID。

RealtimeClientEventType

允许的值：

session.update
input_audio_buffer.append
input_audio_buffer.commit
input_audio_buffer.clear
conversation.item.create
conversation.item.delete
conversation.item.truncate
response.create
response.cancel

RealtimeContentPart

字段	类型	说明
type	RealtimeContentPartType	内容部件的类型。

RealtimeContentPartType

允许的值：

input_text
input_audio
text
audio

RealtimeConversationItemBase

要添加到对话中的项。

RealtimeConversationRequestItem

字段	类型	说明
type	RealtimeItemType	项类型。
id	string	项的唯一 ID。

RealtimeConversationResponseItem

字段	类型	说明
object	string	对话响应项。允许的值：`realtime.item`
type	RealtimeItemType	项类型。
id	string	项的唯一 ID。此属性可为 null。

RealtimeFunctionTool

实时终结点使用的函数工具的定义。

字段	类型	说明
type	string	工具的类型。允许的值：`function`
name	string	函数的名称。
description	string	函数的说明。
parameters	object	函数的参数。

RealtimeItemStatus

允许的值：

in_progress
completed
incomplete

RealtimeItemType

允许的值：

message
function_call
function_call_output

RealtimeMessageRole

允许的值：

system
user
assistant

RealtimeRequestAssistantMessageItem

字段	类型	描述
role	string	消息的角色。允许的值：`assistant`
content	RealtimeRequestTextContentPart 数组	消息的内容。

RealtimeRequestAudioContentPart

字段	类型	说明
type	string	内容部件的类型。允许的值：`input_audio`
脚本	string	音频的听录。

RealtimeRequestFunctionCallItem

字段	类型	说明
type	string	项类型。允许的值：`function_call`
name	string	函数调用项的名称。
call_id	string	函数调用项的 ID。
参数	string	函数调用项的参数。
status	RealtimeItemStatus	项状态。

RealtimeRequestFunctionCallOutputItem

字段	类型	说明
type	string	项类型。允许的值：`function_call_output`
call_id	string	函数调用项的 ID。
output	string	函数调用项的输出。

RealtimeRequestMessageItem

字段	类型	说明
type	string	项类型。允许的值：`message`
role	RealtimeMessageRole	消息的角色。
status	RealtimeItemStatus	项状态。

RealtimeRequestMessageReferenceItem

字段	类型	说明
type	string	项类型。允许的值：`message`
id	string	消息项的 ID。

RealtimeRequestSession

字段	类型	描述
modalities	array	会话支持的形式。允许的值：`text`、`audio` 例如，`"modalities": ["text", "audio"]` 是同时启用文本和音频形式的默认设置。若要仅启用文本，请设置 `"modalities": ["text"]`。不能仅启用音频。
instructions	string	指导模型的文本和音频响应的说明（系统消息）。下面是一些示例说明，可帮助对文本和音频响应的内容和格式作出指导： `"instructions": "be succinct"` `"instructions": "act friendly"` `"instructions": "here are examples of good responses"` 下面是一些示例说明，可帮助对音频行为作出指导： `"instructions": "talk quickly"` `"instructions": "inject emotion into your voice"` `"instructions": "laugh frequently"` 虽然模型可能并不总是遵循这些说明，但这些说明提供有关所需行为的指导。
voice	RealtimeVoice	用于会话的模型响应的语音。在会话中为模型的音频响应使用语音后，无法更改语音。
input_audio_format	RealtimeAudioFormat	输入音频的格式。
output_audio_format	RealtimeAudioFormat	输出音频的格式。
input_audio_transcription	RealtimeAudioInputTranscriptionSettings	音频输入听录的设置。此属性可为 null。
turn_detection	RealtimeTurnDetection	会话的轮次检测设置。此属性可为 null。
工具	RealtimeTool 数组	模型可用的会话工具。
tool_choice	RealtimeToolChoice	会话的工具选择。
温度	数字	模型的采样温度。允许的温度值限制为 [0.6， 1.2]。默认值为 0.8。
max_response_output_tokens	整数或“inf”	单个助手响应（包括工具调用）的最大输出词元数。指定介于 1 和 4096 之间的整数来限制输出词元。否则，请将值设置为“inf”以允许最大词元数。例如，若要将输出词元限制为 1000，请设置 `"max_response_output_tokens": 1000`。若要允许最大词元数，请设置 `"max_response_output_tokens": "inf"`。默认为 `"inf"`。

RealtimeRequestSystemMessageItem

字段	类型	描述
role	string	消息的角色。允许的值：`system`
content	RealtimeRequestTextContentPart 数组	消息的内容。

RealtimeRequestTextContentPart

字段	类型	说明
type	string	内容部件的类型。允许的值：`input_text`
text	string	文本内容。

RealtimeRequestUserMessageItem

字段	类型	描述
role	string	消息的角色。允许的值：`user`
content	RealtimeRequestTextContentPart 或 RealtimeRequestAudioContentPart 数组	消息的内容。

RealtimeResponse

字段	类型	说明
object	string	响应对象。允许的值：`realtime.response`
id	string	响应的唯一 ID。
status	RealtimeResponseStatus	响应的状态。默认状态值为 `in_progress`。
status_details	RealtimeResponseStatusDetails	响应状态的详细信息。此属性可为 null。
output	RealtimeConversationResponseItem 数组	响应的输出项。
使用情况	object	响应的使用情况统计信息。每个实时 API 会话都维护会话上下文，并将新项追加到对话中。以前轮次（文本和音频词元）的输出是以后轮次的输入。请参阅后面的嵌套属性。
+ total_tokens	integer	响应中的词元总数，包括输入和输出文本和音频词元。 `usage` 对象的属性。
+ input_tokens	integer	响应中使用的输入词元数，包括文本和音频词元。 `usage` 对象的属性。
+ output_tokens	integer	响应中发送的输出词元数，包括文本词元和音频词元。 `usage` 对象的属性。
+ input_token_details	object	有关响应中使用的输入词元的详细信息。 `usage` 对象的属性。 br> 请参阅后面的嵌套属性。
+ cached_tokens	integer	响应中使用的缓存词元数。 `input_token_details` 对象的属性。
+ text_tokens	integer	响应中使用的文本词元数。 `input_token_details` 对象的属性。
+ audio_tokens	integer	响应中使用的音频词元数。 `input_token_details` 对象的属性。
+ output_token_details	object	有关响应中使用的输出词元的详细信息。 `usage` 对象的属性。请参阅后面的嵌套属性。
+ text_tokens	integer	响应中使用的文本词元数。 `output_token_details` 对象的属性。
+ audio_tokens	integer	响应中使用的音频词元数。 `output_token_details` 对象的属性。

RealtimeResponseAudioContentPart

字段	类型	说明
type	string	内容部件的类型。允许的值：`audio`
脚本	string	音频的听录。此属性可为 null。

RealtimeResponseBase

响应资源。

RealtimeResponseFunctionCallItem

字段	类型	说明
type	string	项类型。允许的值：`function_call`
name	string	函数调用项的名称。
call_id	string	函数调用项的 ID。
参数	string	函数调用项的参数。
status	RealtimeItemStatus	项状态。

RealtimeResponseFunctionCallOutputItem

字段	类型	说明
type	string	项类型。允许的值：`function_call_output`
call_id	string	函数调用项的 ID。
output	string	函数调用项的输出。

RealtimeResponseMessageItem

字段	类型	说明
type	string	项类型。允许的值：`message`
role	RealtimeMessageRole	消息的角色。
content	array	消息的内容。数组项：RealtimeResponseTextContentPart
status	RealtimeItemStatus	项状态。

RealtimeResponseOptions

字段	类型	描述
modalities	array	会话支持的形式。允许的值：`text`、`audio` 例如，`"modalities": ["text", "audio"]` 是同时启用文本和音频形式的默认设置。若要仅启用文本，请设置 `"modalities": ["text"]`。不能仅启用音频。
instructions	string	指导模型的文本和音频响应的说明（系统消息）。下面是一些示例说明，可帮助对文本和音频响应的内容和格式作出指导： `"instructions": "be succinct"` `"instructions": "act friendly"` `"instructions": "here are examples of good responses"` 下面是一些示例说明，可帮助对音频行为作出指导： `"instructions": "talk quickly"` `"instructions": "inject emotion into your voice"` `"instructions": "laugh frequently"` 虽然模型可能并不总是遵循这些说明，但这些说明提供有关所需行为的指导。
voice	RealtimeVoice	用于会话的模型响应的语音。在会话中为模型的音频响应使用语音后，无法更改语音。
output_audio_format	RealtimeAudioFormat	输出音频的格式。
工具	RealtimeTool 数组	模型可用的会话工具。
tool_choice	RealtimeToolChoice	会话的工具选择。
温度	数字	模型的采样温度。允许的温度值限制为 [0.6， 1.2]。默认值为 0.8。
max__output_tokens	整数或“inf”	单个助手响应（包括工具调用）的最大输出词元数。指定介于 1 和 4096 之间的整数来限制输出词元。否则，请将值设置为“inf”以允许最大词元数。例如，若要将输出词元限制为 1000，请设置 `"max_response_output_tokens": 1000`。若要允许最大词元数，请设置 `"max_response_output_tokens": "inf"`。默认为 `"inf"`。

RealtimeResponseSession

字段	类型	说明
object	string	会话对象。允许的值：`realtime.session`
id	string	会话的唯一 ID。
model	string	用于会话的模型。
modalities	array	会话支持的形式。允许的值：`text`、`audio` 例如，`"modalities": ["text", "audio"]` 是同时启用文本和音频形式的默认设置。若要仅启用文本，请设置 `"modalities": ["text"]`。不能仅启用音频。
instructions	string	指导模型的文本和音频响应的说明（系统消息）。下面是一些示例说明，可帮助对文本和音频响应的内容和格式作出指导： `"instructions": "be succinct"` `"instructions": "act friendly"` `"instructions": "here are examples of good responses"` 下面是一些示例说明，可帮助对音频行为作出指导： `"instructions": "talk quickly"` `"instructions": "inject emotion into your voice"` `"instructions": "laugh frequently"` 虽然模型可能并不总是遵循这些说明，但这些说明提供有关所需行为的指导。
voice	RealtimeVoice	用于会话的模型响应的语音。在会话中为模型的音频响应使用语音后，无法更改语音。
input_audio_format	RealtimeAudioFormat	输入音频的格式。
output_audio_format	RealtimeAudioFormat	输出音频的格式。
input_audio_transcription	RealtimeAudioInputTranscriptionSettings	音频输入听录的设置。此属性可为 null。
turn_detection	RealtimeTurnDetection	会话的轮次检测设置。此属性可为 null。
工具	RealtimeTool 数组	模型可用的会话工具。
tool_choice	RealtimeToolChoice	会话的工具选择。
温度	数字	模型的采样温度。允许的温度值限制为 [0.6， 1.2]。默认值为 0.8。
max_response_output_tokens	整数或“inf”	单个助手响应（包括工具调用）的最大输出词元数。指定介于 1 和 4096 之间的整数来限制输出词元。否则，请将值设置为“inf”以允许最大词元数。例如，若要将输出词元限制为 1000，请设置 `"max_response_output_tokens": 1000`。若要允许最大词元数，请设置 `"max_response_output_tokens": "inf"`。

RealtimeResponseStatus

允许的值：

in_progress
completed
cancelled
incomplete
failed

RealtimeResponseStatusDetails

字段	类型	说明
type	RealtimeResponseStatus	响应的状态。

RealtimeResponseTextContentPart

字段	类型	说明
type	string	内容部件的类型。允许的值：`text`
text	string	文本内容。

RealtimeServerEvent

字段	类型	说明
type	RealtimeServerEventType	服务器事件的类型。
event_id	string	事件的唯一 ID。

RealtimeServerEventRateLimitsUpdatedRateLimitsItem

字段	类型	说明
name	string	速率限制属性名称（此项包含其相关信息）。
limit	integer	为此速率限制属性配置的最大限制。
剩余	integer	相对于为此速率限制属性配置的限制而言的可用剩余配额。
reset_seconds	数字	到重置此速率限制属性为止时的剩余时间（以秒为单位）。

RealtimeServerEventType

允许的值：

session.created
session.updated
conversation.created
conversation.item.created
conversation.item.deleted
conversation.item.truncated
response.created
response.done
rate_limits.updated
response.output_item.added
response.output_item.done
response.content_part.added
response.content_part.done
response.audio.delta
response.audio.done
response.audio_transcript.delta
response.audio_transcript.done
response.text.delta
response.text.done
response.function_call_arguments.delta
response.function_call_arguments.done
input_audio_buffer.speech_started
input_audio_buffer.speech_stopped
conversation.item.input_audio_transcription.completed
conversation.item.input_audio_transcription.failed
input_audio_buffer.committed
input_audio_buffer.cleared
error

RealtimeServerVadTurnDetection

字段	类型	说明
type	string	轮次检测的类型。允许的值：`server_vad`
threshold	数字	服务器 VAD 轮次检测的激活阈值。在嘈杂的环境中，可能需要增加阈值以避免误报。在安静的环境中，可能需要降低阈值以避免误报。默认为 `0.5`。可以将阈值设置为介于 `0.0` 和 `1.0` 之间的值。
prefix_padding_ms	string	语音音频（以毫秒为单位）在检测到的语音开始之前要包含的持续时间。默认为 `300`。
silence_duration_ms	string	检测语音结束的静音持续时间（以毫秒为单位）。应尽快检测语音结束时间，但不要太快，以免切掉语音的最后一部分。如果将此值设置为较低的数字，模型将更快地响应，但它可能会切掉语音的最后一部分。如果将此值设置为较高的数字，模型将等待更长时间来检测语音结束，但响应可能需要更长的时间。

RealtimeSessionBase

实时会话对象配置。

RealtimeTool

实时工具定义的基准表示形式。

字段	类型	说明
type	RealtimeToolType	工具的类型。

RealtimeToolChoice

实时 tool_choice 参数的可用表示形式的组合集，包含“auto”等字符串字面量选项和对已定义工具的结构化引用。

RealtimeToolChoiceFunctionObject

实时 tool_choice（用于选择命名函数工具）的表示形式。

字段	类型	说明
type	string	tool_choice 的类型。允许的值：`function`
函数	object	要选择的函数工具。请参阅后面的嵌套属性。
+ name	string	函数工具的名称。 `function` 对象的属性。

RealtimeToolChoiceLiteral

实时终结点的一组可用的模式级字符串字面量 tool_choice 选项。

允许的值：

auto
none
required

RealtimeToolChoiceObject

实时 tool_choice（用于选择命名工具）的基准表示形式。

字段	类型	说明
type	RealtimeToolType	tool_choice 的类型。

RealtimeToolType

实时工具支持的工具类型鉴别器。目前仅支持“function”工具。

允许的值：

function

RealtimeTurnDetection

字段	类型	说明
type	RealtimeTurnDetectionType	轮次检测的类型。允许的值：`server_vad`

RealtimeTurnDetectionType

允许的值：

server_vad

RealtimeVoice

允许的值：

alloy
shimmer
echo

开始使用实时 API 快速入门。
详细了解如何使用实时 API。