Speech-to-speech models

Speech-to-speech models are designed to work natively in the audio modality, bypassing the need to convert the audio stream to text (via a speech-to-text service) and then back from text to audio (via a text-to-speech service). They respond to user input with minimal latency, enabling near-instantaneous interactions, and can understand intonation and emotion. They are, however, less reliable at following complex instructions, and are therefore better suited to relatively simple and straightforward use cases.

AI Agents support the following speech-to-speech models:

OpenAI

  • gpt-realtime

  • gpt-realtime-mini

  • gpt-realtime-1.5

  • gpt-4o-realtime

  • gpt-4o-mini-realtime

Azure OpenAI

  • gpt-realtime

  • gpt-realtime-mini

  • gpt-realtime-1.5

  • gpt-4o-realtime

  • gpt-4o-mini-realtime

Amazon

  • nova-sonic

  • nova-2-sonic

Google

  • gemini-2.5-flash-native-audio

  • gemini-3.1-flash-live

xAI

  • grok-voice

You may use pre-deployed models or bring your own API keys.

Configuring speech-to-speech models

To configure an agent to use a speech-to-speech model:

  1. Navigate to the Agents screen, locate your Agent, and click Edit.

  2. In the General tab:

    1. From the 'Large language model' drop-down, choose the speech-to-speech model.

    2. Set 'Max output tokens' to 1,000 or larger.

  3. In the Speech and Telephony tab, select the Enable voice streaming check box.

With this configuration, the speech-to-speech model interacts with the user via the voice modality: the input audio stream is sent directly to the model, and the model generates an audio stream in response. Speech-to-Text and Text-to-Speech services are therefore not used in this interaction mode.

If you clear the Enable voice streaming check box, the speech-to-speech model is used via the text modality. In this interaction mode you must choose Speech-to-Text and Text-to-Speech services, just as for regular LLM models. Note that this configuration is supported only for the OpenAI and Azure OpenAI models gpt-realtime, gpt-realtime-mini, and gpt-realtime-1.5, and not for the Google, Amazon, and xAI models, which support only the audio output modality.

Chats always use the text modality, regardless of the Enable voice streaming check box state. For Google, Amazon, and xAI models, chat relies on audio transcription, so responses are very slow.

Customizing the speech-to-speech model behavior

Use the following advanced configuration parameters to customize the speech-to-speech model behavior:

For example:

{
  "openai_realtime": {
    "voice": "coral"
  }
}

VAD mode for Gemini models

The Gemini speech-to-speech models, gemini-2.5-flash-native-audio and gemini-3.1-flash-live, have a known issue in their voice activity detection (VAD) module that may cause the model to “freeze” and not respond to certain user utterances. The problem occurs sporadically and is more prominent for non-English languages.

You may mitigate the problem by disabling the VAD module inside the model and using Silero VAD instead. To do that, add the following advanced configuration parameter to your agent:

{
    "gemini_audio": {
        "vad_mode": "silero"
    }
}
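
If you are also tuning the Silero VAD thresholds (see the silero_vad parameters below), both settings can be combined in a single advanced configuration object. A sketch with illustrative values:

{
    "gemini_audio": {
        "vad_mode": "silero"
    },
    "silero_vad": {
        "min_speech_ms": 300
    }
}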

  • silero_vad (SileroVad) – Configuration of Silero VAD.

SileroVad parameters:

  • positive_speech_threshold (float) – Threshold above which a frame is considered speech. Default: 0.5

  • negative_speech_threshold (float) – Threshold below which a frame is considered non-speech. Default: 0.35

  • redemption_ms (int) – Duration (msec) of consecutive non-speech before ending a speech segment. Default: 600

  • pre_speech_pad_ms (int) – Amount of pre-speech audio (msec) to prepend when speech starts. Default: 400

  • min_speech_ms (int) – Minimum speech duration (msec) required before confirming speech start. Default: 200

  • debug (bool) – Enable debug data collection for VAD threshold tuning. Default: false

Example
{
    "silero_vad": {
        "min_speech_ms": 300
    }
}

Debug data collection for VAD threshold tuning

The silero_vad.debug advanced configuration parameter enables collection of per-frame VAD scores and speech-boundary events during the call. The collected data can be used to analyze VAD behavior in a specific conversation and fine-tune the configuration thresholds.

When the conversation ends, accumulated data is emitted as a silero_vad log line. Each entry covers 5 seconds of the call and contains:

For example,

[{"time": "14:05:00", "data": "0.03 0.05 0.12 0.78 0.91 S 0.88 ... 0.15 E 0.04"}]
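
The debug line can be post-processed with a short script. The sketch below assumes, based on the example above, that "data" is a space-separated string of per-frame scores with "S" marking a speech start and "E" a speech end, and that each score covers one 32-msec frame; the exact format may differ:

```python
import json

FRAME_MS = 32  # Silero VAD frame duration, per this document

def speech_segments_ms(log_line: str):
    """Return (start_ms, end_ms) pairs for speech segments in a debug line."""
    segments = []
    frame, start = 0, None
    for entry in json.loads(log_line):
        for token in entry["data"].split():
            if token == "S":
                start = frame                  # speech segment opens here
            elif token == "E":
                if start is not None:
                    segments.append((start, frame))
                start = None
            else:
                frame += 1                     # a numeric per-frame score
    return [(s * FRAME_MS, e * FRAME_MS) for s, e in segments]

# Shortened variant of the example entry above
line = '[{"time": "14:05:00", "data": "0.03 0.05 0.12 0.78 0.91 S 0.88 0.15 E 0.04"}]'
print(speech_segments_ms(line))  # [(160, 224)]
```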

For long calls, debug data is recorded only for the last 15 minutes of the call.

The first entry is slightly longer than the rest, because it contains additional audio frames (roughly 500 msec) that accumulate in an internal buffer during conversation and VAD model initialization.

Keep in mind that while the Silero VAD model works on 32-msec frames, the silero_vad advanced configuration parameter specifies thresholds in milliseconds. For example, the default min_speech_ms value of 200 msec corresponds to 7 frames in the debug data.
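
The millisecond-to-frame conversion can be sketched in a couple of lines, assuming a partial trailing frame still counts (which matches the 200 msec → 7 frames example):

```python
import math

FRAME_MS = 32  # Silero VAD frame duration

def ms_to_frames(ms: int) -> int:
    # A partial trailing frame still counts, so round up.
    return math.ceil(ms / FRAME_MS)

print(ms_to_frames(200))  # default min_speech_ms -> 7 frames
print(ms_to_frames(600))  # default redemption_ms -> 19 frames
```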

Configuring input / output language

Speech-to-speech models lack an explicit configuration for the input / output language. Instead, include the relevant instructions in your agent’s prompt.

For example:

Always respond to user in German.

Feature parity

Agents that use speech-to-speech models benefit from most of the AI Agent platform features, including but not limited to:

The following limitations apply to the agents that use speech-to-speech models: