Playing text to the user
A basic functionality of VoiceAI Connect is passing the bot's messages to the user as speech.
For bots using a textual channel, this is done by first performing text-to-speech of the bot's message using a third-party service, and then playing the synthesized speech to the user. By default, all text-to-speech responses are cached by VoiceAI Connect.
For bots using a voice channel, the speech is synthesized by the bot (or, usually, bot framework), and played to the user by VoiceAI Connect. In such cases, no caching is done by VoiceAI Connect.
The bot's messages are usually sent as a response to a user utterance (or another event that is sent to the bot). However, several bot frameworks allow sending asynchronous messages that are triggered by the bot's logic (e.g., a timer or a query that takes time). VoiceAI Connect supports both methods.
Depending on the text-to-speech provider, Speech Synthesis Markup Language (SSML) can be used in the bot textual messages to allow for more customization in the audio response, by providing details on pauses, and audio formatting for acronyms, dates, times, abbreviations, or prosody. VoiceAI Connect forwards to the text-to-speech service any SSML that is received from the bot.
How do I use it?
Each bot framework uses a different method for sending textual messages:
- Using the message activity; see the Sending activities page for instructions on how to send activities using your bot framework. Example:
{ "type": "message", "text": "Hi." }
- Using an intent fulfillment with type "Agent says".
- Using an intent response with type "Text Response". Note: VoiceAI Connect only uses the responses of the DEFAULT platform.
- Using the Amazon Lex V2 UI editor.
Text-to-speech caching
By default, VoiceAI Connect caches all text-to-speech responses. This cache prevents subsequent activations of the text-to-speech service when the same text is used multiple times.
The cache size can be controlled by the administrator of the VoiceAI Connect installation.
To disable the cache for a specific response from the bot or for the whole duration of the call, the following bot configuration parameter can be used:
Parameter | Type | Description
---|---|---
disableTtsCache | Boolean | Defines caching of text-to-speech (audio) results of bot responses. When set to true, caching is disabled for the response. Note: This parameter is not applicable when using a voice bot channel (i.e., text-to-speech is performed by the bot framework).
The following example shows how your bot can disable caching of a specific response:
{ "type": "message", "text": "I have something sensitive to tell you.", "activityParams": { "disableTtsCache": true } }
{ "type": "message", "text": "I have something sensitive to tell you.", "channelData": { "activityParams": { "disableTtsCache": true } } }
Add a Custom Payload fulfillment to disable caching of a single agent response:
{ "activityParams": { "disableTtsCache": true } }
Add a Custom Payload response to disable caching of a single agent response:
{ "activityParams": { "disableTtsCache": true } }
Add a Custom payload in the message:
{ "activityParams":{ "disableTtsCache": true } }
Controlling internal audio buffer size
If there is significant jitter in the network, increasing the buffer size between audio providers (text-to-speech or remote URL) and the SIP side can help mitigate the problem.
To increase the audio buffer size, the following bot configuration parameter can be used:
Parameter | Type | Description
---|---|---
 | Numeric | Defines the maximum buffer size used between audio providers (text-to-speech or remote URL) and the SIP side. Range: 0-5000 milliseconds of audio. Default value: 0 (no increased buffer). Note: This parameter is applicable only to VoiceAI Connect Enterprise Version 3.14 and later.
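As a sketch only, the buffer size would typically be raised for the whole call by setting this parameter at the session level. The parameter name below is a placeholder (it is not given in the table above); substitute the actual name documented for your VoiceAI Connect installation:
{ "type": "event", "name": "config", "sessionParams": { "<bufferSizeParameter>": 1000 } }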
Handling text-to-speech playback error
You can define how VoiceAI Connect responds when it receives a text-to-speech playback error from the AudioCodes SBC. Only the administrator can configure these parameters.
Parameter | Type | Description
---|---|---
 | Boolean | Defines how VoiceAI Connect responds when it receives a text-to-speech playback error from the SBC. Note: This parameter is applicable only to VoiceAI Connect Enterprise Version 3.14 and later.
 | Number | Defines how many times VoiceAI Connect tries to send the prompt when it receives a text-to-speech playback error from the SBC. The valid value is 0 to 5. The default is 2. A value of 0 means that VoiceAI Connect doesn't attempt to send the prompt again. Note: This parameter is applicable only to VoiceAI Connect Enterprise Version 3.22 and later.
Using SSML
The bot can send Speech Synthesis Markup Language (SSML) XML elements within its textual response, in one of the following ways:
- A full SSML document, for example:
<speak> This is <say-as interpret-as="characters">SSML</say-as>. </speak>
- Text with SSML tags, for example:
This is <say-as interpret-as="characters">SSML</say-as>.
VoiceAI Connect adapts the received text to the format expected by the text-to-speech provider (e.g., adding the <speak> element if needed).
The SSML is handled by the text-to-speech provider. Refer to the provider's documentation for a list of supported features.
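For example, with bot frameworks that use the message activity shown earlier, the SSML (with or without the <speak> wrapper) can simply be placed in the text field of the message. A minimal sketch:
{ "type": "message", "text": "This is <say-as interpret-as=\"characters\">SSML</say-as>." }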
Configuring playback settings
You can configure various audio settings for the text-to-speech playback to the user.
For Amazon Polly and Azure, VoiceAI Connect Enterprise converts the text to SSML and adds the 'prosody' tag with the values of the parameters below.
If VoiceAI Connect receives an SSML message from the bot, it ignores these parameters.
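As an illustration only (the exact SSML produced is determined by VoiceAI Connect and the text-to-speech provider), a plain bot message such as "Hi there." played with a slightly slower rate and a raised pitch would be wrapped along these lines before being sent to the provider:
<speak><prosody rate="90%" pitch="+2st">Hi there.</prosody></speak>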
For an explanation of playback settings, refer to the documentation of the relevant text-to-speech provider.
The playback settings can be configured by the Administrator or the bot, using the following bot parameters:
Parameter | Type | Description
---|---|---
 | Number | Defines the speaking rate (speed). The valid value is 0.5 (i.e., half the speed) to 2 (i.e., twice as fast). 1 is the normal (native) speed.
 | Number | Defines the speaking pitch in semitones. The valid value is -20 to 20. The default is 2. A value of 20 increases the pitch by 20 semitones from the original pitch; -20 decreases it by 20 semitones.
 | Number | Defines the speaking pitch by percentage. The valid value is 50 to 150. The conversion depends on the text-to-speech provider.
 | Number | Defines the volume gain (in dB) relative to the normal native volume supported by the specific voice. The valid value range is -96.0 to 16. If unset, or set to a value of 0.0 (dB), the voice plays at the normal native signal amplitude. A value of -6.0 (dB) plays at about half the amplitude of the normal native signal amplitude. A value of +6.0 (dB) plays at about twice the amplitude of the current volume. Note: This parameter is applicable only to the following text-to-speech providers: Amazon Polly and Google.
 | Number | Defines the volume level of the speaking voice as a percentage. The valid value is 0 to 200. The volume is calculated as (100 - x), where x is the value of this parameter. Note: This parameter is applicable only to the following text-to-speech provider: Azure.
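As with the caching example earlier, these playback settings can be applied by the bot per response through activityParams (or per call through sessionParams). The sketch below uses a placeholder parameter name, since the actual names are not shown in the table above; substitute the names documented for your VoiceAI Connect installation:
{ "type": "message", "text": "Please note this sentence is spoken more slowly.", "activityParams": { "<speakingRateParameter>": 0.8 } }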