Real-time models

Real-time models are designed to process and respond to inputs with minimal latency, enabling near-instantaneous interactions. They support multimodal communication through both voice and text interfaces, creating seamless human-AI interactions.

AI Agents support the following real-time models:

You may use pre-deployed models or bring your own API keys from OpenAI or Microsoft Azure.

Configuring real-time models

To configure an agent to use a real-time model:

  1. Navigate to the Agents screen, locate your Agent, and click Edit.

  2. In the General tab:

    1. From the 'Large language model' drop-down, choose the real-time model.

    2. Set 'Max output tokens' to 1,000 or higher.

  3. In the Speech and Telephony tab, check the Enable voice streaming check box.

With this configuration, the real-time model interacts with the user via the voice modality: the input audio stream is sent directly to the model, and the model generates an audio stream in response. Speech-to-Text and Text-to-Speech services are therefore not used in this interaction mode.

If you clear the Enable voice streaming check box, the real-time model is used via the text modality. In this interaction mode you need to choose Speech-to-Text and Text-to-Speech services – the same as for regular LLM models.

Chats always use the text modality, regardless of the Enable voice streaming check box state.

Customizing the real-time model behavior

Use the openai_realtime advanced configuration parameters to customize the real-time model behavior.

For example:

{
  "openai_realtime": {
    "voice": "coral"
  }
}

The following openai_realtime parameters are supported:

  • voice (enum) – Voice name. The following voices are supported: alloy, ash, ballad, coral, echo, sage, shimmer, verse.

  • input_audio_transcription_language (str) – Language for the input audio transcription model. Use an ISO 639-1 code, for example "en" for English or "fr" for French. Note that input audio transcription is performed by a separate transcription model and has no direct relation to what the real-time model hears or responds to; it is primarily used for logging purposes.

  • input_audio_transcription_prompt (str) – Prompt for the input audio transcription model.

  • input_audio_noise_reduction (enum) – Configuration of input audio noise reduction. Supported values:

    • near_field – for close-talking microphones, such as headphones.

    • far_field – for far-field microphones, such as laptop or conference room microphones.

  • turn_detection (enum) – Configuration of turn detection. Supported values:

    • server_vad – the model detects the start and end of speech based on audio volume and responds at the end of user speech.

    • semantic_vad – the model uses a turn detection model (in conjunction with VAD) to semantically estimate when the user has finished speaking.

  • eagerness (enum) – Eagerness of the model to respond, for semantic VAD. Supported values: low, medium, high, auto.

  • prefix_padding_ms (int) – Amount of audio to include before the VAD-detected speech (in milliseconds), for server VAD.

  • silence_duration_ms (int) – Duration of silence to wait before considering the speech finished (in milliseconds), for server VAD.

  • threshold (float) – Activation threshold for server VAD. Valid range: 0.0 to 1.0.
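
For example, a configuration that combines several of these parameters might look as follows (a sketch: the flat key layout under openai_realtime mirrors the table above, and the chosen values are illustrative):

{
  "openai_realtime": {
    "voice": "sage",
    "input_audio_noise_reduction": "far_field",
    "turn_detection": "semantic_vad",
    "eagerness": "low"
  }
}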

Configuring input / output language

Real-time models lack explicit configuration for the input / output language. Instead, you should include relevant instructions in your agent's prompt.

For example:

Always respond to the user in German.
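
If the conversation should be in German, you can additionally set the input audio transcription language so that logged transcripts match (a sketch using the input_audio_transcription_language parameter described above; "de" is the ISO 639-1 code for German):

{
  "openai_realtime": {
    "input_audio_transcription_language": "de"
  }
}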

Feature parity

Agents that use real-time models benefit from most of the AI Agent platform features, including but not limited to:

The following limitations apply to agents that use real-time models: