GPT-4o Realtime API for speech and audio (Preview)

The Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family and supports low-latency, "speech in, speech out" conversational interactions. It's a good fit for use cases that involve live interaction between a user and a model, such as customer support agents, voice assistants, and real-time translators.

Most users of the Realtime API need to deliver and receive audio from an end user in real time, including applications that use WebRTC or a telephony system. The Realtime API isn't designed to connect directly to end-user devices; it relies on client integrations to terminate end-user audio streams.

Supported models

Currently, only the gpt-4o-realtime-preview model, version 2024-10-01, supports real-time audio.

The gpt-4o-realtime-preview model is available for global deployments in East US 2 and Sweden Central regions.

Important

The system stores your prompts and completions as described in the "Data Use and Access for Abuse Monitoring" section of the service-specific Product Terms for Azure OpenAI Service, except that the Limited Exception does not apply. Abuse monitoring will be turned on for use of the gpt-4o-realtime-preview API even for customers who otherwise are approved for modified abuse monitoring.

API support

Support for the Realtime API was first added in API version 2024-10-01-preview.

Note

For more information about the API and architecture, see the Azure OpenAI GPT-4o real-time audio repository on GitHub.
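As a rough illustration of where the API version fits, the sketch below builds a WebSocket URL for a Realtime session. The /openai/realtime path and the api-version and deployment query parameter names are assumptions based on common Azure OpenAI conventions, and build_realtime_url is a hypothetical helper, so check the GitHub repository for the authoritative format.

```python
from urllib.parse import urlencode, urlparse

# First API version with Realtime support, as stated above.
API_VERSION = "2024-10-01-preview"


def build_realtime_url(endpoint: str, deployment: str,
                       api_version: str = API_VERSION) -> str:
    """Return a wss:// URL for a Realtime session.

    The path and query parameter names are assumptions, not confirmed
    by this article.
    """
    host = urlparse(endpoint).netloc
    if not host:
        raise ValueError(f"endpoint must be a full URL, got {endpoint!r}")
    query = urlencode({"api-version": api_version, "deployment": deployment})
    return f"wss://{host}/openai/realtime?{query}"


# Example with placeholder resource and deployment names.
example_url = build_realtime_url(
    "https://my-azure-openai-resource-from-portal.openai.azure.com",
    "my-realtime-deployment",
)
print(example_url)
```

Keeping the API version in one constant makes it easier to update when a newer preview version becomes available.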

Prerequisites

Deploy a model for real-time audio

Before you can use GPT-4o real-time audio, you need a deployment of the gpt-4o-realtime-preview model in a supported region as described in the supported models section.
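If you script your deployments, a small pre-flight check can catch an unsupported region early. The sketch below is a hypothetical helper, not part of any SDK. It assumes the two regions named in the supported models section map to the Azure short names eastus2 and swedencentral.

```python
# Regions listed in the supported models section, as Azure short names.
SUPPORTED_REGIONS = {"eastus2", "swedencentral"}


def normalize_region(name: str) -> str:
    """Convert a display name such as 'East US 2' to the short form 'eastus2'."""
    return "".join(name.split()).lower()


def is_supported_region(name: str) -> bool:
    """Return True if the region can host a gpt-4o-realtime-preview deployment."""
    return normalize_region(name) in SUPPORTED_REGIONS


for region in ("East US 2", "Sweden Central", "West Europe"):
    print(region, is_supported_region(region))
```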

You can deploy the model from the Azure AI Studio model catalog or from your project in AI Studio. Follow these steps to deploy a gpt-4o-realtime-preview model from the model catalog:

  1. Sign in to AI Studio and go to the Home page.
  2. Select Model catalog from the left sidebar.
  3. Search for and select the gpt-4o-realtime-preview model from the Azure OpenAI collection.
  4. Select Deploy to open the deployment window.
  5. Enter a deployment name and select an Azure OpenAI resource.
  6. Select 2024-10-01 from the Model version dropdown.
  7. Modify other default settings depending on your requirements.
  8. Select Deploy. You land on the deployment details page.

Now that you have a deployment of the gpt-4o-realtime-preview model, you can use the AI Studio Real-time audio playground or Realtime API to interact with it in real time.

Use GPT-4o real-time audio

Tip

Right now, the fastest way to get started developing with the GPT-4o Realtime API is to download the sample code from the Azure OpenAI GPT-4o real-time audio repository on GitHub.

To chat with your deployed gpt-4o-realtime-preview model in the Azure AI Studio Real-time audio playground, follow these steps:

  1. Go to your project in Azure AI Studio.

  2. Select Playgrounds > Real-time audio from the left pane.

  3. Select your deployed gpt-4o-realtime-preview model from the Deployment dropdown.

  4. Select Enable microphone to allow the browser to access your microphone. If you already granted permission, you can skip this step.

    Screenshot of the real-time audio playground with the deployed model selected.

  5. Optionally, edit the contents of the Give the model instructions and context text box. Give the model instructions about how it should behave and any context it should reference when generating a response. You can describe the assistant's personality, tell it what it should and shouldn't answer, and tell it how to format responses.

  6. Optionally, change settings such as threshold, prefix padding, and silence duration.

  7. Select Start listening to start the session. You can speak into the microphone to start a chat.

    Screenshot of the real-time audio playground with the start listening button and microphone access enabled.

  8. You can interrupt the chat at any time by speaking. You can end the chat by selecting the Stop listening button.
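The instructions and the threshold, prefix padding, and silence duration settings from the steps above have counterparts when you call the Realtime API directly. The following is a minimal sketch of a session.update event that carries them. The event and field names (turn_detection, server_vad, prefix_padding_ms, silence_duration_ms) are assumptions about the API's event schema, build_session_update is a hypothetical helper, and the default values are illustrative rather than service defaults.

```python
import json


def build_session_update(instructions: str,
                         threshold: float = 0.5,
                         prefix_padding_ms: int = 300,
                         silence_duration_ms: int = 200) -> str:
    """Serialize a session.update event with voice activity detection settings.

    Field names are assumed; default values are illustrative only.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        raise ValueError("durations must be non-negative")
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            },
        },
    }
    return json.dumps(event)


print(build_session_update("Answer briefly and politely.", threshold=0.6))
```

A client would send the resulting JSON string over the open WebSocket connection before streaming audio.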

The JavaScript web sample demonstrates how to use the GPT-4o Realtime API to interact with the model in real time. The sample code includes a simple web interface that captures audio from the user's microphone and sends it to the model for processing. The model responds with text and audio, which the sample code renders in the web interface.
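To make the "captures audio and sends it to the model" step more concrete, here is a minimal sketch of how captured audio might be packaged for the API. It is written in Python for brevity and is not taken from the sample. It assumes 16-bit mono PCM at 24 kHz and an input_audio_buffer.append event with a base64-encoded audio field; audio_append_events is a hypothetical helper.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 24_000   # assumed input sample rate
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM


def audio_append_events(pcm16: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON events that each carry one base64-encoded audio chunk.

    The event type and field name are assumptions about the API schema.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# One second of silence split into 100 ms chunks.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = list(audio_append_events(one_second))
print(len(events))
```

Small chunks keep latency low, because the service can start processing speech before the user finishes talking.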

You can run the sample code locally on your machine by following these steps. Refer to the repository on GitHub for the most up-to-date instructions.

  1. If you don't have Node.js installed, download and install the LTS version of Node.js.

  2. Clone the repository to your local machine:

    git clone https://github.com/Azure-Samples/aoai-realtime-audio-sdk.git
    
  3. Go to the javascript/samples/web folder in your preferred code editor.

    cd ./javascript/samples
    
  4. Run download-pkg.ps1 or download-pkg.sh to download the required packages.

  5. Go to the web folder from the ./javascript/samples folder.

    cd ./web
    
  6. Run npm install to install package dependencies.

  7. Run npm run dev to start the web server, accepting any firewall permission prompts as needed.

  8. Go to any of the provided URIs from the console output (such as http://localhost:5173/) in a browser.

  9. Enter the following information in the web interface:

    • Endpoint: The resource endpoint of an Azure OpenAI resource. You don't need to append the /realtime path. An example structure might be https://my-azure-openai-resource-from-portal.openai.azure.com.
    • API Key: A corresponding API key for the Azure OpenAI resource.
    • Deployment: The name of the gpt-4o-realtime-preview model that you deployed in the previous section.
    • System Message: Optionally, you can provide a system message such as "You always talk like a friendly pirate."
    • Temperature: Optionally, you can provide a custom temperature.
    • Voice: Optionally, you can select a voice.

  10. Select the Record button to start the session. Accept permissions to use your microphone if prompted.

  11. You should see a << Session Started >> message in the main output. Then you can speak into the microphone to start a chat.

  12. You can interrupt the chat at any time by speaking. You can end the chat by selecting the Stop button.
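If you adapt the sample into your own client, it can help to validate the same fields before opening a session. The sketch below is one hypothetical way to do that, written in Python rather than the sample's own language. SampleSettings and normalize_settings are illustrative names, not part of the sample or any SDK.

```python
from dataclasses import dataclass, replace
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class SampleSettings:
    """Mirrors the fields of the web interface (names are illustrative)."""
    endpoint: str
    api_key: str
    deployment: str
    system_message: Optional[str] = None
    temperature: Optional[float] = None
    voice: Optional[str] = None


def normalize_settings(s: SampleSettings) -> SampleSettings:
    """Check required fields and reduce the endpoint to scheme and host."""
    parsed = urlparse(s.endpoint.strip())
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("Endpoint must look like https://<resource>.openai.azure.com")
    if not s.api_key.strip():
        raise ValueError("API Key is required")
    if not s.deployment.strip():
        raise ValueError("Deployment is required")
    return replace(
        s,
        # The /realtime path isn't needed, so drop any path the user pasted.
        endpoint=f"https://{parsed.netloc}",
        api_key=s.api_key.strip(),
        deployment=s.deployment.strip(),
        # Treat an empty system message as "not provided".
        system_message=s.system_message or None,
    )


settings = normalize_settings(SampleSettings(
    endpoint="https://my-azure-openai-resource-from-portal.openai.azure.com/realtime",
    api_key="<your-api-key>",
    deployment="my-realtime-deployment",
    system_message="You always talk like a friendly pirate.",
))
print(settings.endpoint)
```

Optional fields stay as None when they aren't provided, so a client can leave them out and fall back to the service defaults.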