Speech-to-speech models

Speech-to-speech models are designed to work natively in the audio modality, bypassing the need to convert the audio stream to text (via a speech-to-text service) and then back from text to audio (via a text-to-speech service). They respond to user input with minimal latency, enabling near-instantaneous interactions, and can understand intonation and emotion. They are, however, less reliable at following complex instructions, and are therefore better suited to relatively simple and straightforward use cases.

AI Agents support the following speech-to-speech models:

OpenAI

  • gpt-realtime

  • gpt-realtime-mini

  • gpt-realtime-1.5

  • gpt-4o-realtime

  • gpt-4o-mini-realtime

Azure OpenAI

  • gpt-realtime

  • gpt-realtime-mini

  • gpt-realtime-1.5

  • gpt-4o-realtime

  • gpt-4o-mini-realtime

Amazon

  • nova-sonic

  • nova-2-sonic

Google

  • gemini-2.5-flash-native-audio

  • gemini-3.1-flash-live

xAI

  • grok-voice

You may use pre-deployed models or bring your own API keys.

Configuring speech-to-speech models

To configure an agent to use a speech-to-speech model:

  1. Navigate to the Agents screen, locate your Agent, and click Edit.

  2. In the General tab:

    1. From the 'Large language model' drop-down, choose the speech-to-speech model.

    2. Set 'Max output tokens' to 1,000 or larger.

  3. In the Speech and Telephony tab, select the Enable voice streaming check box.

With this configuration, the speech-to-speech model interacts with the user via the voice modality: the input audio stream is sent directly to the model, and the model generates an audio stream in response. Speech-to-Text and Text-to-Speech services are therefore not used in this interaction mode.

If you clear the Enable voice streaming check box, the speech-to-speech model is used via the text modality. In this interaction mode you must choose Speech-to-Text and Text-to-Speech services, just as for regular LLM models. Note that this configuration is supported only for the OpenAI and Azure OpenAI models gpt-realtime, gpt-realtime-mini, and gpt-realtime-1.5, and not for the Google, Amazon, and xAI models, which support only the audio output modality.

Chats always use the text modality, regardless of the Enable voice streaming check box state. For Google, Amazon, and xAI models, chat relies on audio transcription, so responses are very slow.

Customizing the speech-to-speech model behavior

Use the following advanced configuration parameters to customize the speech-to-speech model behavior:

For example:

{
  "openai_realtime": {
    "voice": "coral"
  }
}

VAD mode for Gemini models

The Gemini speech-to-speech models, gemini-2.5-flash-native-audio and gemini-3.1-flash-live, have a known issue in their voice activity detection (VAD) module that may cause the model to “freeze” and not respond to certain user utterances. The problem occurs sporadically and is more prominent for non-English languages.

You may mitigate the problem by disabling the VAD module inside the model and using Silero VAD instead. To do that, add the following advanced configuration parameter to your agent:

{
    "gemini_audio": {
        "vad_mode": "silero"
    }
}
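
If you are also tuning the Silero VAD thresholds (see the silero_vad parameters below), both settings can be combined in a single advanced configuration object. A sketch with illustrative values:

{
    "gemini_audio": {
        "vad_mode": "silero"
    },
    "silero_vad": {
        "min_speech_ms": 300
    }
}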

  • silero_vad (SileroVad) – Configuration of Silero VAD.

SileroVad parameters:

  • positive_speech_threshold (float) – Threshold above which a frame is considered speech. Default: 0.5

  • negative_speech_threshold (float) – Threshold below which a frame is considered non-speech. Default: 0.35

  • redemption_ms (int) – Duration (msec) of consecutive non-speech before ending a speech segment. Default: 600

  • pre_speech_pad_ms (int) – Amount of pre-speech audio (msec) to prepend when speech starts. Default: 400

  • min_speech_ms (int) – Minimum speech duration (msec) required before confirming speech start. Default: 200

  • debug (bool) – Enable debug data collection for VAD threshold tuning. Default: false

Example
{
    "silero_vad": {
        "min_speech_ms": 300
    }
}

Debug data collection for VAD threshold tuning

The silero_vad.debug advanced configuration parameter enables collection of per-frame VAD scores and speech-boundary events during the call. The collected data can be used to analyze VAD behavior in a specific conversation and fine-tune the configuration thresholds.

When the conversation ends, accumulated data is emitted as a silero_vad log line. Each entry covers 5 seconds of the call and contains:

For example,

[{"time": "14:05:00", "data": "0.03 0.05 0.12 0.78 0.91 S 0.88 ... 0.15 E 0.04"}]
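
The debug line can be post-processed with a short script. The sketch below assumes, based on the example above, that "data" is a space-separated string of per-frame scores with "S" marking a speech start and "E" a speech end, and that each score covers one 32-msec frame; the exact format may differ:

```python
import json

FRAME_MS = 32  # Silero VAD frame duration, per this document

def speech_segments_ms(log_line: str):
    """Return (start_ms, end_ms) pairs for speech segments in a debug line."""
    segments = []
    frame, start = 0, None
    for entry in json.loads(log_line):
        for token in entry["data"].split():
            if token == "S":
                start = frame                  # speech segment opens here
            elif token == "E":
                if start is not None:
                    segments.append((start, frame))
                start = None
            else:
                frame += 1                     # a numeric per-frame score
    return [(s * FRAME_MS, e * FRAME_MS) for s, e in segments]

# Shortened variant of the example entry above
line = '[{"time": "14:05:00", "data": "0.03 0.05 0.12 0.78 0.91 S 0.88 0.15 E 0.04"}]'
print(speech_segments_ms(line))  # [(160, 224)]
```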

For long calls, debug data is recorded only for the last 15 minutes of the call.

The first entry is slightly longer than the rest, because it contains additional audio frames (roughly 500 msec) that accumulate in an internal buffer during conversation and VAD model initialization.

Keep in mind that while the Silero VAD model works on 32-msec frames, the silero_vad advanced configuration parameter specifies thresholds in milliseconds. For example, the default min_speech_ms value of 200 msec corresponds to 7 frames in the debug data.
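
The millisecond-to-frame conversion can be sketched in a couple of lines, assuming a partial trailing frame still counts (which matches the 200 msec → 7 frames example):

```python
import math

FRAME_MS = 32  # Silero VAD frame duration

def ms_to_frames(ms: int) -> int:
    # A partial trailing frame still counts, so round up.
    return math.ceil(ms / FRAME_MS)

print(ms_to_frames(200))  # default min_speech_ms -> 7 frames
print(ms_to_frames(600))  # default redemption_ms -> 19 frames
```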

Configuring input / output language

Speech-to-speech models lack an explicit configuration for the input / output language. Instead, include the relevant instructions in your agent’s prompt.

For example:

Always respond to user in German.

Feature parity

Agents that use speech-to-speech models benefit from most of the AI Agent platform features, including but not limited to:

The following limitations apply to the agents that use speech-to-speech models: