Speech-to-speech models
Speech-to-speech models are designed to work natively in the audio modality, bypassing the need to convert the audio stream to text (via a speech-to-text service) and then back from text to an audio stream (via a text-to-speech service). They respond to user inputs with minimal latency, enabling near-instantaneous interactions, and can understand intonation and emotions. However, they are less reliable at following complex instructions and are therefore better suited for relatively simple and straightforward use cases.
AI Agents support the following speech-to-speech models:
| Provider | Models |
|---|---|
| OpenAI | gpt-realtime, gpt-realtime-mini, gpt-realtime-1.5 |
| Azure OpenAI | gpt-realtime, gpt-realtime-mini, gpt-realtime-1.5 |
| Amazon | nova-sonic, nova-2-sonic |
| Google | gemini-2.5-flash-native-audio, gemini-3.1-flash-live |
| xAI | |
You may use pre-deployed models or bring your own API keys.
Configuring speech-to-speech models
To configure an agent to use a speech-to-speech model:
- Navigate to the Agents screen, locate your agent, and click Edit.
- In the General tab:
  - From the 'Large language model' drop-down, choose the speech-to-speech model (marked with the speech-to-speech icon).
  - Set 'Max output tokens' to 1,000 or larger.
- In the Speech and Telephony tab, select the Enable voice streaming check box.
This configuration makes the speech-to-speech model interact with the user via the voice modality: the input audio stream is sent directly to the model, and the model generates an audio stream in response. Speech-to-Text and Text-to-Speech services are therefore not used in this interaction mode.
If you clear the Enable voice streaming check box, the speech-to-speech model is used via the text modality. In this interaction mode you need to choose Speech-to-Text and Text-to-Speech services, the same as for regular LLM models. Note that this configuration is only supported for the OpenAI and Azure OpenAI models (gpt-realtime, gpt-realtime-mini, gpt-realtime-1.5), and not for the Google, Amazon, and xAI models, which support only the audio output modality.
Chats always use the text modality, regardless of the Enable voice streaming check box state. For Google, Amazon, and xAI models, chat relies on audio transcription, so responses are very slow.
Customizing the speech-to-speech model behavior
Use the following advanced configuration parameters to customize the speech-to-speech model behavior:
- openai_realtime – for OpenAI and Azure OpenAI models
- gemini_audio – for Gemini models
- nova_sonic – for Amazon models
- grok_voice – for xAI models
For example:
{
  "openai_realtime": {
    "voice": "coral"
  }
}
VAD mode for Gemini models
Gemini speech-to-speech models (gemini-2.5-flash-native-audio and gemini-3.1-flash-live) have a known issue in their voice activity detection (VAD) module that may cause the model to “freeze” and not respond to certain user utterances. The problem occurs sporadically and is more prominent for non-English languages.
You can mitigate the problem by disabling the VAD module inside the model and using Silero VAD instead. To do that, add the following advanced configuration parameter to your agent:
{
  "gemini_audio": {
    "vad_mode": "silero"
  }
}
| Parameter | Type | Description |
|---|---|---|
| silero_vad | SileroVad | Configuration of Silero VAD. |
SileroVad

| Parameter | Type | Description |
|---|---|---|
| positive_speech_threshold | float | Threshold above which a frame is considered speech. Default: 0.5 |
| negative_speech_threshold | float | Threshold below which a frame is considered non-speech. Default: 0.35 |
| redemption_ms | int | Duration (msec) of consecutive non-speech before ending a speech segment. Default: 600 |
| pre_speech_pad_ms | int | Amount of pre-speech audio (msec) to prepend when speech starts. Default: 400 |
| min_speech_ms | int | Minimum speech duration (msec) required before confirming speech start. Default: 200 |
| debug | bool | Enable debug data collection for VAD threshold tuning. Default: false |
Example
{
  "silero_vad": {
    "min_speech_ms": 300
  }
}
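You can also combine switching a Gemini agent to Silero VAD with threshold tuning in a single advanced configuration. Following the examples above, both parameters are set at the top level; the values below are purely illustrative, not recommendations:
{
  "gemini_audio": {
    "vad_mode": "silero"
  },
  "silero_vad": {
    "positive_speech_threshold": 0.6,
    "redemption_ms": 800
  }
}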
Debug data collection for VAD threshold tuning
The silero_vad.debug advanced configuration parameter enables collection of per-frame VAD scores and speech boundary events during the call. The collected data can be used to analyze VAD behavior in a specific conversation and fine-tune the configuration thresholds.
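To enable it, set the flag in the agent's advanced configuration, for example:
{
  "silero_vad": {
    "debug": true
  }
}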
When the conversation ends, the accumulated data is emitted as a silero_vad log line. Each entry covers 5 seconds of the call and contains:
- VAD probability scores (2 decimal places) for every 32 msec frame
- S / E markers indicating speech start / end detections
For example,
[{"time": "14:05:00", "data": "0.03 0.05 0.12 0.78 0.91 S 0.88 ... 0.15 E 0.04"}]
For long calls, debug data is recorded for the last 15 minutes of the call.
The first entry is slightly longer than the rest because it contains additional audio frames (roughly 500 msec) that accumulate in a buffer during conversation / VAD model initialization.
Keep in mind that while the Silero VAD model works on 32 msec frames, the silero_vad advanced configuration parameter specifies thresholds in milliseconds. So, for example, the default min_speech_ms value of 200 msec corresponds to 7 frames in the debug data.
Configuring input / output language
Speech-to-speech models lack an explicit configuration for the input / output language. Instead, include the relevant instructions in your agent’s prompt.
For example:
Always respond to user in German.
Feature parity
Agents that use speech-to-speech models benefit from most of the AI Agent platform features, including but not limited to:
- Documents
- Tools
- Multi-agent topologies
- Post call analysis
- Webhooks
The following limitations apply to agents that use speech-to-speech models:
- The RAG for every query document mode is not supported and is implicitly switched to the RAG via doc_search tool mode.
- In multi-agent topologies, the speech-to-speech model should be configured for the “main” agent (the one that starts the conversation). This model is used for the complete conversation, and the LLM configuration in sub-agents is ignored.
- Speech-to-speech models generate significantly more output tokens than textual models because they output an audio stream. Therefore, make sure that the Max output tokens parameter in your agent configuration screen is set to 1,000 or higher.
- The end_call and transfer_call tools cannot play termination / transfer messages. Speech-to-speech models are expected to say the “last message” first and only then call the tool. Consider updating the agent’s prompt to enforce this behavior, as shown in the example after this list.
- The following advanced configuration parameters are not supported when speech-to-speech models are used:
  - call_transfer_conditions
  - language_detected_pass_question
  - language_detected_ignore_phrases
  - remove_symbols
  - replace_words
  - activity_params
  - session_params
- In multi-agent topologies, speech-to-speech models from Google and Amazon may miss parts of user speech when control switches from one agent to another. This may be especially troublesome for topologies that use the “consult” orchestration mode. This limitation doesn’t apply to OpenAI and Azure OpenAI models.
- Static welcome messages are not supported for the Amazon nova-sonic and nova-2-sonic models.
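For example, to make the model speak before calling the end_call or transfer_call tool, you can add an instruction along the following lines to the agent’s prompt (the exact wording is illustrative; adapt it to your use case):
Always say a short goodbye message before calling the end_call tool, and a short transfer message before calling the transfer_call tool.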