Real-time models
Real-time (speech-to-speech) models are designed to work natively in the audio modality, bypassing the need to convert the audio stream to text (via a speech-to-text service) and then back from text to an audio stream (via a text-to-speech service). They respond to user input with minimal latency, enabling near-instantaneous interactions, and can understand intonation and emotions. They are, however, less reliable at following complex instructions, and are therefore better suited for relatively simple and straightforward use cases.
AI Agents support the following real-time models:
| Provider | Models |
|---|---|
| OpenAI | |
| Azure OpenAI | |
| Amazon | |
| Google | |
| xAI | |
You may use pre-deployed models or bring your own API keys.
Configuring real-time models
To configure an agent to use a real-time model:
1. Navigate to the Agents screen, locate your Agent, and click Edit.
2. In the General tab:
   - From the 'Large language model' drop-down, choose the real-time model.
   - Set 'Max output tokens' to 1,000 or higher.
3. In the Speech and Telephony tab, check the Enable voice streaming check box.
With this configuration, the real-time model interacts with the user via the voice modality. The input audio stream is sent directly to the model, and the model generates an audio stream in response. Consequently, Speech-to-Text and Text-to-Speech services play no role in this interaction mode.
If you clear the Enable voice streaming check box, the real-time model is used via the text modality. In this mode you need to choose Speech-to-Text and Text-to-Speech services, just as for regular LLM models. Note that such a configuration is supported only for OpenAI and Azure OpenAI models, and not for Google and Amazon models, which support only the audio output modality.
Chats always use the text modality, regardless of the Enable voice streaming check box state. For Google and Amazon models, chat relies on audio transcription, so responses are very slow.
Customizing the real-time model behavior
Use the following advanced configuration parameters to customize the real-time model behavior:
- openai_realtime – for OpenAI models
- gemini_audio – for Gemini models
- nova_sonic – for Amazon models
- grok_voice – for xAI models
For example:
```json
{
  "openai_realtime": {
    "voice": "coral"
  }
}
```
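The other provider-specific parameters use the same JSON structure: a top-level key named after the provider parameter, containing the provider's settings. The following sketch is illustrative only – the "voice" field under gemini_audio is an assumption, so check the platform reference for the keys each provider actually supports:

```json
{
  "gemini_audio": {
    "voice": "..."
  }
}
```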
Configuring input / output language
Real-time models lack an explicit configuration for the input / output language. Instead, include the relevant instructions in your agent's prompt.
For example:
Always respond to user in German.
Feature parity
Agents that use real-time models benefit from most AI Agent platform features, including but not limited to:
- Documents
- Tools
- Multi-agent topologies
- Post call analysis
- Webhooks
The following limitations apply to agents that use real-time models:
- The RAG for every query document mode is not supported and is implicitly switched to the RAG via doc_search tool mode.
- In multi-agent topologies, the real-time model should be configured for the "main" agent (the one that starts the conversation). This model is used for the complete conversation, and the LLM configuration of sub-agents is ignored.
- Real-time models generate significantly more output tokens than regular models because they output an audio stream. Therefore, you will typically need to make sure the Max output tokens parameter in your agent configuration screen is set to 1,000 or higher.
- The end_call and transfer_call tools cannot play termination / transfer messages. Real-time models are expected to say the "last message" first and only then call the tool. Consider updating the agent's prompt to enforce this behavior.
- The following advanced configuration parameters are not supported when real-time models are used:
  - call_transfer_conditions
  - language_detected_pass_question
  - language_detected_ignore_phrases
  - remove_symbols
  - replace_words
  - activity_params
  - session_params
- In multi-agent topologies, real-time models from Google and Amazon may miss parts of user speech when control switches from one agent to another. This can be especially troublesome for topologies that use the "consult" orchestration mode. This limitation doesn't apply to OpenAI and Azure OpenAI models.
- Static welcome messages are not supported for Amazon nova-sonic and nova-2-sonic models.
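To address the end_call and transfer_call limitation above, you can add an instruction to the agent's prompt that forces the model to speak before calling the tool. The wording below is only a sketch – adapt it to your agent's prompt style:

```
When the conversation is over, first say a short goodbye message to the user.
Only after you have finished speaking, call the end_call tool.
Never call end_call or transfer_call before saying your final message.
```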