Real-time models

Real-time (speech-to-speech) models are designed to work natively in the audio modality, bypassing the need to convert the audio stream to text (via a speech-to-text service) and then back from text to an audio stream (via a text-to-speech service). They respond to user input with minimal latency, enabling near-instantaneous interactions, and can understand intonation and emotion. They are, however, less reliable at following complex instructions, and are therefore better suited for relatively simple and straightforward use cases.

AI Agents support the following real-time models:

Provider       Models
OpenAI         gpt-realtime, gpt-realtime-mini, gpt-4o-realtime, gpt-4o-mini-realtime
Azure OpenAI   gpt-realtime, gpt-realtime-mini, gpt-4o-realtime, gpt-4o-mini-realtime
Amazon         nova-sonic, nova-2-sonic
Google         gemini-2.5-flash-native-audio
xAI            grok-voice

You may use pre-deployed models or bring your own API keys.

Configuring real-time models

To configure an agent to use a real-time model:

  1. Navigate to the Agents screen, locate your Agent, and click Edit.

  2. In the General tab:

    1. From the 'Large language model' drop-down, choose the real-time model.

    2. Set 'Max output tokens' to 1,000 or larger.

  3. In the Speech and Telephony tab, check the Enable voice streaming check box.
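As a sketch, the steps above boil down to three values. The field names below are hypothetical (the platform itself is configured through its UI, not code); only the values reflect the documented requirements:

```python
# Hypothetical representation of the agent settings from the steps above.
# Field names are illustrative, not actual platform identifiers.
agent_settings = {
    "large_language_model": "gpt-realtime",  # a real-time model from the table above
    "max_output_tokens": 1000,               # must be 1,000 or larger
    "enable_voice_streaming": True,          # Speech and Telephony tab
}

# The documented minimum for real-time models:
assert agent_settings["max_output_tokens"] >= 1000
```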

The configuration described above makes the real-time model interact with the user via the voice modality: the input audio stream is sent directly to the model, and the model generates an audio stream in response. Consequently, Speech-to-Text and Text-to-Speech services are not used in this interaction mode.
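The difference between the two interaction modes can be sketched as follows. Every function here is a hypothetical stand-in (not a platform or provider API); the sketch only illustrates why the native audio path skips two conversion hops:

```python
# Conceptual sketch of the two interaction modes.
# All functions are hypothetical stand-ins that only model the data flow.

def speech_to_text(audio: bytes) -> str:
    return audio.decode()                  # stand-in for an STT service

def text_to_speech(text: str) -> bytes:
    return text.encode()                   # stand-in for a TTS service

def text_llm(text: str) -> str:
    return f"reply to: {text}"             # stand-in for a regular text LLM

def realtime_model(audio: bytes) -> bytes:
    return b"audio reply to: " + audio     # stand-in for a speech-to-speech model

def pipeline_turn(audio_in: bytes) -> bytes:
    # Classic chain: two extra conversions, each adding latency and
    # discarding intonation and emotion along the way.
    return text_to_speech(text_llm(speech_to_text(audio_in)))

def realtime_turn(audio_in: bytes) -> bytes:
    # Native audio: the input stream goes straight to the model,
    # which generates the output audio stream itself.
    return realtime_model(audio_in)
```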

If you clear the Enable voice streaming check box, the real-time model is used via the text modality. In this interaction mode you need to choose Speech-to-Text and Text-to-Speech services, just as for regular LLM models. Note that this configuration is supported only for OpenAI and Azure OpenAI models; it is not supported for Google and Amazon models, which support only the audio output modality.

Chats always use the text modality, regardless of the state of the Enable voice streaming check box. For Google and Amazon models, chat relies on audio transcription, so responses are very slow.

Customizing the real-time model behavior

Use the following advanced configuration parameters to customize the real-time model behavior:

For example:

{
  "openai_realtime": {
    "voice": "coral"
  }
}
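If you keep such snippets outside the platform (for example, in version control), a quick sanity check before pasting them into the agent configuration can catch typos. This is a hypothetical helper, assuming only the "openai_realtime" and "voice" keys shown in the example above:

```python
import json

def validate_realtime_config(raw: str) -> dict:
    """Parse an advanced-configuration snippet and check its basic shape.

    Hypothetical helper: the key names ("openai_realtime", "voice") follow
    the example above; any other providers or keys are assumptions.
    """
    cfg = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    realtime = cfg.get("openai_realtime", {})
    if not isinstance(realtime, dict):
        raise TypeError('"openai_realtime" must be a JSON object')
    if "voice" in realtime and not isinstance(realtime["voice"], str):
        raise TypeError('"voice" must be a string')
    return cfg

cfg = validate_realtime_config('{"openai_realtime": {"voice": "coral"}}')
```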

Configuring input / output language

Real-time models lack an explicit configuration for the input / output language. Instead, include relevant instructions in your agent’s prompt.

For example:

Always respond to user in German.

Feature parity

Agents that use real-time models benefit from most of the AI Agent platform features, including but not limited to:

The following limitations apply to agents that use real-time models: