Real-time models

Real-time models are designed to process and respond to inputs with minimal latency, enabling near-instantaneous interactions. They support multimodal communication through both voice and text interfaces, creating seamless human-AI interactions.

AI Agents support the following real-time models:

You may use pre-deployed models or bring your own API keys from OpenAI or Microsoft Azure.

Configuring real-time models

To configure an agent to use a real-time model:

  1. Navigate to the Agents screen, locate your Agent, and click Edit.

  2. In the General tab:

    1. From the 'Large language model' drop-down, choose the real-time model.

    2. Set 'Max output tokens' to 1,000 or higher.

  3. In the Speech and Telephony tab, check the Enable voice streaming check box.

With this configuration, the real-time model interacts with the user via the voice modality: the input audio stream is sent directly to the model, and the model generates an audio stream in response. Speech-to-Text and Text-to-Speech services are therefore not used in this interaction mode.

If you clear the Enable voice streaming check box, the real-time model is used via the text modality. In this interaction mode you need to choose Speech-to-Text and Text-to-Speech services – the same as for regular LLM models.

Chats always use the text modality, regardless of the Enable voice streaming check box state.

Customizing the real-time model behavior

Use the openai_realtime advanced configuration parameters to customize the real-time model behavior.

For example:

{
  "openai_realtime": {
    "voice": "coral"
  }
}

The following openai_realtime parameters are supported:

  • voice (enum) – Voice name. The following voices are supported: alloy, ash, ballad, coral, echo, sage, shimmer, verse.

  • input_audio_transcription_language (str) – Language for the input audio transcription model. Use an ISO 639-1 code, for example "en" for English or "fr" for French. Note that input audio transcription is performed by a separate transcription model and has no direct relation to what the real-time model hears or responds to; it is primarily used for logging purposes.

  • input_audio_transcription_prompt (str) – Prompt for the input audio transcription model.

  • input_audio_noise_reduction (enum) – Configuration of input audio noise reduction. Supported values:

    • near_field – for close-talking microphones, such as headphones.

    • far_field – for far-field microphones, such as laptop or conference room microphones.

  • turn_detection (enum) – Configuration of turn detection. Supported values:

    • server_vad – the model detects the start and end of speech based on audio volume and responds at the end of user speech.

    • semantic_vad – the model uses a turn detection model (in conjunction with VAD) to semantically estimate when the user has finished speaking.

  • eagerness (enum) – Eagerness of the model to respond, for semantic VAD. Supported values: low, medium, high, auto.

  • prefix_padding_ms (int) – Amount of audio to include before the VAD-detected speech (in milliseconds), for server VAD.

  • silence_duration_ms (int) – Duration of silence to wait before considering the speech finished (in milliseconds), for server VAD.

  • threshold (float) – Activation threshold for server VAD. Valid range: 0.0 to 1.0.
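
For example, a configuration that combines several of these parameters might look as follows (a sketch: the flat key layout under openai_realtime mirrors the table above, and the chosen values are illustrative):

{
  "openai_realtime": {
    "voice": "sage",
    "input_audio_noise_reduction": "far_field",
    "turn_detection": "semantic_vad",
    "eagerness": "low"
  }
}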

Configuring input / output language

Real-time models lack explicit configuration for the input / output language. Instead, you should include relevant instructions in your agent's prompt.

For example:

Always respond to the user in German.
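
If the conversation should be in German, you can additionally set the input audio transcription language so that logged transcripts match (a sketch using the input_audio_transcription_language parameter described above; "de" is the ISO 639-1 code for German):

{
  "openai_realtime": {
    "input_audio_transcription_language": "de"
  }
}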

Feature parity

Agents that use real-time models benefit from most of the AI Agent platform features, including but not limited to:

The following limitations apply to agents that use real-time models: