Playing text to the user

A basic capability of VoiceAI Connect is playing the bot's messages to the user as speech.

For bots using a textual channel, this is done by first synthesizing the bot's message using a third-party text-to-speech service, and then playing the synthesized speech to the user. By default, all text-to-speech responses are cached by VoiceAI Connect.

For bots using a voice channel, the speech is synthesized by the bot (or, more typically, the bot framework), and played to the user by VoiceAI Connect. In this case, no caching is done by VoiceAI Connect.

The bot's messages are usually sent as a response to a user utterance (or another event that is sent to the bot). However, several bot frameworks allow sending asynchronous messages that are triggered by the bot's logic (e.g., a timer or a long-running query). VoiceAI Connect supports both methods.

Depending on the text-to-speech provider, Speech Synthesis Markup Language (SSML) can be used in the bot's textual messages to allow more customization of the audio response, such as pauses, pronunciation of acronyms, dates, times, and abbreviations, or prosody. VoiceAI Connect forwards any SSML received from the bot to the text-to-speech service.


How do I use it?

Each bot framework uses a different method for sending textual messages. See the Sending activities page for instructions on how to send activities using your bot framework.

AudioCodes Bot API

Using the message activity.

Example:

{
  "type": "message",
  "text": "Hi."
}
Microsoft Bot Framework

Using the message activity.

Example:

{
  "type": "message",
  "text": "Hi."
}
Dialogflow CX

Using an intent fulfillment with type "Agent says".

Dialogflow ES

Using an intent response with type "Text Response".

Note: VoiceAI Connect only uses the responses of the DEFAULT platform.

Amazon Lex V2

Using the Amazon Lex V2 UI editor.

Text-to-speech caching

By default, VoiceAI Connect caches all text-to-speech responses. This cache prevents repeated calls to the text-to-speech service when the same text is used multiple times.

The cache size can be controlled by the administrator of the VoiceAI Connect installation.

To disable the cache for a specific response from the bot, or for the entire duration of the call, use the following bot configuration parameter:

Parameter: disableTtsCache

Type: Boolean

Description: Defines caching of text-to-speech (audio) results of bot responses.

  • true: Text-to-speech caching is disabled.

  • false: (Default) Text-to-speech caching is enabled.

Note: This parameter is not applicable when using a voice bot channel (i.e., text-to-speech is performed by the bot framework).

The following example shows how your bot can disable caching of a specific response:

AudioCodes Bot API
{
  "type": "message",
  "text": "I have something sensitive to tell you.",
  "activityParams": {
    "disableTtsCache": true
  }
}
Microsoft Bot Framework
{
  "type": "message",
  "text": "I have something sensitive to tell you.",
  "channelData": {
    "activityParams": {
      "disableTtsCache": true
    }
  }
}
Dialogflow CX

Add a Custom Payload fulfillment to disable caching of a single agent response:

{
  "activityParams": {
    "disableTtsCache": true
  }
}
Dialogflow ES

Add a Custom Payload response to disable caching of a single agent response:

{
  "activityParams": {
    "disableTtsCache": true
  }
}
Amazon Lex V2

Add a Custom payload in the message:

{
  "activityParams": {
    "disableTtsCache": true
  }
}

Controlling internal audio buffer size

If there is significant network jitter, increasing the buffer size between the audio providers (text-to-speech or remote URL) and the SIP side can help mitigate the problem.

To increase the audio buffer size, the following bot configuration parameter can be used:

Parameter: playMaxBufferTimeMS

Type: Number

Description: Defines the maximum buffer size used between audio providers (text-to-speech or remote URL) and the SIP side.

Range: 0-5000 milliseconds of audio

Default: 0 (no increased buffer)

This parameter is applicable only to VoiceAI Connect Enterprise Version 3.14 and later.
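As an illustrative sketch, assuming your deployment allows the bot to change this parameter at runtime through an AudioCodes Bot API config event with sessionParams (rather than only through administrator configuration), the bot could request one second of audio buffering like this:

```json
{
  "type": "event",
  "name": "config",
  "sessionParams": {
    "playMaxBufferTimeMS": 1000
  }
}
```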

Handling text-to-speech playback errors

You can define how VoiceAI Connect responds when it receives a text-to-speech playback error from the AudioCodes SBC. Only the administrator can configure these parameters.

Parameter: ignorePlaybackError

Type: Boolean

Description: Defines how VoiceAI Connect responds when it receives a text-to-speech playback error from the SBC.

  • true: VoiceAI Connect ignores playback errors received from the SBC.

  • false: (Default) VoiceAI Connect ends the call and sends the failure code SBCPlaybackError.

This parameter is applicable only to VoiceAI Connect Enterprise Version 3.14 and later.

Parameter: retriesOnPlaybackError

Type: Number

Description: Defines how many times VoiceAI Connect retries sending the prompt when it receives a text-to-speech playback error from the SBC.

The valid value is 0 to 5. The default is 2. A value of 0 means that VoiceAI Connect doesn't attempt to send the prompt again.

This parameter is applicable only to VoiceAI Connect Enterprise Version 3.22 and later.

Using SSML

The bot can send Speech Synthesis Markup Language (SSML) XML elements within its textual responses.

VoiceAI Connect adapts the received text to the format expected by the text-to-speech provider (e.g., adding the <speak> element if needed). The SSML itself is handled by the text-to-speech provider; refer to the provider's documentation for the list of supported features.

When using SSML, all invalid XML characters, such as the ampersand (&), must be properly escaped.
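For example, a bot message that uses SSML to spell out a code, insert a pause, and include an escaped ampersand might look like the following (which SSML elements are supported depends on your text-to-speech provider; the say-as and break elements shown here are standard SSML):

```json
{
  "type": "message",
  "text": "<speak>Your code is <say-as interpret-as=\"characters\">AB12</say-as>.<break time=\"500ms\"/>Rooms &amp; suites are available.</speak>"
}
```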

Configuring playback settings

You can configure various audio settings for the text-to-speech playback to the user.

This section is applicable only to VoiceAI Connect Enterprise Version 3.22 and later.

For Amazon Polly and Azure, VoiceAI Connect Enterprise converts the text to SSML and adds the prosody tag with the values of the parameters below.

If VoiceAI Connect receives an SSML message from the bot, it ignores the parameters below.

For an explanation of playback settings, refer to the documentation of the relevant text-to-speech provider.

The playback settings can be configured by the Administrator or the bot, using the following bot parameters:

Parameter: speakingRateMultiplier

Type: Number

Description: Defines the speaking rate (speed).

The valid value is 0.5 (i.e., half the normal speed) to 2 (i.e., twice as fast). 1 is the normal (native) speed.

Parameter: pitchSemitone

Type: Number

Description: Defines the speaking pitch.

The valid value is -20 to 20. The default is 2. A value of 20 raises the pitch 20 semitones above the original pitch; a value of -20 lowers it 20 semitones below the original pitch.

Note:

  • This parameter is applicable only to the following text-to-speech providers: Azure and Google.

  • For Azure, don't configure both the pitchSemitone and pitchPercentage parameters; configure only one of them.

Parameter: pitchPercentage

Type: Number

Description: Defines the speaking pitch as a percentage.

The valid value is 50 to 150.

The conversion depends on the text-to-speech provider:

  • Amazon Polly: The pitch is calculated as (50 - x), where x is the value of pitchPercentage. For example, if the parameter is configured to 120, the pitch is reduced by 70% (i.e., 50 - 120 = -70).

  • Azure: The pitch is calculated as (100 - x), where x is the value of pitchPercentage. For example, if the parameter is configured to 50, the pitch is increased by 50% (i.e., 100 - 50 = 50).

Note:

  • This parameter is applicable only to the following text-to-speech providers: Amazon Polly and Azure.

  • For Azure, don't configure both the pitchSemitone and pitchPercentage parameters; configure only one of them.

Parameter: volumeDb

Type: Number

Description: Defines the volume gain (in dB) relative to the normal native volume of the specific voice.

The valid value is -96.0 to 16. If unset, or set to 0.0 (dB), the audio plays at the normal native signal amplitude. A value of -6.0 (dB) plays at about half the normal native amplitude; a value of +6.0 (dB) plays at about twice the normal native amplitude.

Note: This parameter is applicable only to the following text-to-speech providers: Amazon Polly and Google.

Parameter: volumePercentage

Type: Number

Description: Defines the volume level of the speaking voice as a percentage.

The valid value is 0 to 200.

The volume is calculated as (100 - x), where x is the value of volumePercentage. For example, if the parameter is configured to 50, the volume is increased by 50% (i.e., 100 - 50 = 50).

Note: This parameter is applicable only to the following text-to-speech providers: Azure.
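As an illustrative sketch, assuming the bot sets a playback parameter per message through activityParams (the same mechanism shown above for disableTtsCache), a response spoken 20% faster than the normal speed could look like this:

```json
{
  "type": "message",
  "text": "Please listen carefully to the following options.",
  "activityParams": {
    "speakingRateMultiplier": 1.2
  }
}
```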