Real-time models
Real-time models are designed to process and respond to inputs with minimal latency, enabling near-instantaneous interactions. They support multimodal communication through both voice and text interfaces, creating seamless human-AI interactions.
AI Agents support the following real-time models:
- gpt-4o-realtime
- gpt-4o-mini-realtime
You may use pre-deployed models or bring your own API keys from OpenAI or Microsoft Azure.
Configuring real-time models
To configure an agent to use a real-time model:
- Navigate to the Agents screen, locate your agent, and click Edit.
- In the General tab:
  - From the 'Large language model' drop-down, choose the real-time model.
  - Set 'Max output tokens' to 1,000 or larger.
- In the Speech and Telephony tab, check the Enable voice streaming check box.
With this configuration, the real-time model interacts with the user via the voice modality: the input audio stream is sent directly to the model, and the model generates an audio stream in response. Speech-to-Text and Text-to-Speech services are therefore not used in this interaction mode.
If you clear the Enable voice streaming check box, the real-time model is used via the text modality. In this interaction mode you need to choose Speech-to-Text and Text-to-Speech services, the same as for regular LLM models.
Chats always use the text modality, regardless of the Enable voice streaming check box state.
Customizing the real-time model behavior
Use the openai_realtime advanced configuration parameters to customize the real-time model's behavior.
For example:
{ "openai_realtime": { "voice": "coral" } }
The following openai_realtime parameters are supported:
| Parameter | Type | Description |
|---|---|---|
| voice | enum | Voice name. The following voices are supported: |
| | str | Language for the input audio transcription model. Use an ISO 639-1 code, for example "en" for English or "fr" for French. Note that input audio transcription is performed by a separate transcription model and has no direct relation to what the real-time model hears or responds to; it is primarily used for logging purposes. |
| | str | Prompt for the input audio transcription model. |
| | enum | Configuration of input audio noise reduction. |
| | enum | Configuration of turn detection. |
| | enum | Eagerness of the model to respond, for semantic VAD. Supported values: |
| | int | Amount of audio to include before the VAD-detected speech (in milliseconds), for server VAD. |
| | int | Duration of silence to wait before considering the speech finished (in milliseconds), for server VAD. |
| | float | Activation threshold for server VAD. Valid range: 0.0 to 1.0. |
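For example, you could select a voice and tune server VAD in a single configuration. The turn_detection object and its nested keys (type, threshold, prefix_padding_ms, silence_duration_ms) below are an illustrative sketch based on the OpenAI Realtime API field names, not a confirmed parameter layout; verify the exact names supported by your deployment before using them:
{ "openai_realtime": { "voice": "coral", "turn_detection": { "type": "server_vad", "threshold": 0.6, "prefix_padding_ms": 300, "silence_duration_ms": 500 } } }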
Configuring input / output language
Real-time models have no explicit configuration for the input / output language. Instead, include the relevant instructions in your agent’s prompt.
For example:
Always respond to user in German.
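You can describe the expected input language and the response language in the same way; the wording below is only an illustrative sketch:
The user speaks German. Always respond in German.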
Feature parity
Agents that use real-time models benefit from most of the AI Agent platform features, including but not limited to:
- Documents
- Tools
- Multi-agent topologies
- Post call analysis
- Webhooks
The following limitations apply to agents that use real-time models:
- The RAG for every query document mode is not supported and is implicitly switched to the RAG via doc_search tool mode.
- For multi-agent topologies, the real-time model should be configured for the “main” agent (the one that starts the conversation). This model is used for the complete conversation, and the LLM configuration of sub-agents is ignored.
- Real-time models generate significantly more output tokens than regular models because they output an audio stream. Therefore, you will typically need to increase the Max output tokens parameter in your agent configuration screen to 1,000 or a larger value.
- The following advanced configuration parameters are not supported when real-time models are used:
  - call_transfer_conditions
  - language_detected_pass_question
  - language_detected_ignore_phrases
  - remove_symbols
  - replace_words
  - activity_params
  - session_params