Receiving user's speech
A basic functionality of VoiceAI Connect is passing the user's speech to the bot.
For bots using a textual channel, this is done by first converting the user's speech to text using a third-party speech-to-text service, and then passing the recognized text to the bot. For bots using a voice channel, the user's audio stream is passed directly to the bot.
The bot typically uses Natural Language Understanding (NLU) to analyze the user's input and respond to it.
In addition to the text or audio that is passed to the bot, VoiceAI Connect passes parameters to the bot which may include information such as the confidence level of the speech-to-text recognition.
How do I use it?
VoiceAI Connect will send the user's speech to the bot by default.
Textual channels
For each speech-to-text recognition, a text-message event is sent to the bot. The exact event structure depends on the bot framework.
The following table lists the additional fields that can be included in the text-message event:
Field | Type | Description
---|---|---
`confidence` | Number | Numeric value representing the confidence level of the recognition.
`recognitionOutput` | Object | Raw recognition output of the speech-to-text engine (vendor-specific).
`recognitions` | Array of Objects | If Continuous ASR mode is enabled, this array contains the separate recognition outputs.
`participant` | String | Indicates the participant identifier on which the speech recognition occurred. Note: The parameter is applicable only to Agent assist calls.
`participantUriUser` | String | URI user-part of the participant. Note: The parameter is applicable only to Agent assist calls.
`words` | Array of Objects | Word-level timing information (in milliseconds) of the speech-to-text recognition: the duration of each word and its offset from the start of the phrase. For example: `{ "words": [ { "word": "hello", "offset": 1500, "duration": 300 }, { "word": "world", "offset": 1900, "duration": 800 } ] }` Note: VoiceAI Connect Enterprise supports this feature from Version 3.12 and later.
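As a rough illustration of the payload shape, here is a TypeScript sketch of these fields; the interface names are invented for this sketch, and only the field names, types, and units come from the table above:

```typescript
// Illustrative typings for the additional recognition fields listed above.
// The interface names are invented; only the field names and types come
// from the documentation table.
interface WordTiming {
  word: string;      // the recognized word
  offset: number;    // offset from the start of the phrase, in milliseconds
  duration: number;  // duration of the word, in milliseconds
}

interface RecognitionFields {
  confidence?: number;          // confidence level of the recognition
  recognitionOutput?: object;   // raw, vendor-specific speech-to-text output
  recognitions?: object[];      // separate outputs when Continuous ASR is enabled
  participant?: string;         // participant identifier (Agent assist calls only)
  participantUriUser?: string;  // URI user-part of the participant (Agent assist only)
  words?: WordTiming[];         // word-level timing (Version 3.12 and later)
}
```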
Text-message events are sent differently for each bot framework:
For bots using the AudioCodes Bot API, the event is sent as a message activity, with the additional fields inside the `parameters` property.
Example:

```json
{
  "type": "message",
  "text": "Hi.",
  "parameters": {
    "confidence": 0.6599681
  }
}
```
For Microsoft Bot Framework bots, the text-message is sent as a message activity. The additional fields are sent in the `channelData` property.
Example:

```json
{
  "type": "message",
  "text": "Hi.",
  "channelData": {
    "confidence": 0.6599681
  }
}
```
For Dialogflow bots, the text-message is sent as text input.
Note: Dialogflow supports a maximum text input length of 256 characters. Therefore, if the input received from the speech-to-text engine is longer than 256 characters, VoiceAI Connect truncates the message before sending it to Dialogflow.
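Conceptually, the limit behaves like the following sketch (an illustration of the documented 256-character cutoff, not VoiceAI Connect's actual implementation):

```typescript
// Illustration only: recognized text longer than Dialogflow's 256-character
// text-input limit is truncated before being forwarded to the bot.
const DIALOGFLOW_MAX_TEXT_INPUT = 256;

function clipForDialogflow(recognizedText: string): string {
  return recognizedText.length > DIALOGFLOW_MAX_TEXT_INPUT
    ? recognizedText.slice(0, DIALOGFLOW_MAX_TEXT_INPUT)
    : recognizedText;
}
```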
Voice channels
If your bot uses a voice channel (e.g., using Dialogflow bots with Live Hub), the audio stream from the user is passed directly to the bot, without using an external speech-to-text service.
Usually, speech-to-text of the user's audio is performed by the bot framework itself, so the bot logic still handles textual messages.
Since voice channels have no per-utterance event, VoiceAI Connect does not pass any additional parameters to the bot in this case.
Disabling speech input
There are cases when you want to prevent activation of the speech-to-text service.
For example, if the bot wants to send two messages to the user with a time gap between them, it may not want to allow speech input after the first message's playback.
To disable speech input, set the following parameter to false:
Parameter | Type | Description
---|---|---
 | Boolean | Enables the activation of the speech-to-text service after the bot message playback is finished (typically, this is the trigger for activating speech-to-text).
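For illustration, with the AudioCodes Bot API the bot can attach per-activity parameters to its replies via `activityParams`. The sketch below shows two consecutive messages where the first one suppresses speech input; `enableSpeechInput` is a placeholder name, so substitute the actual parameter from the table above:

```typescript
// Sketch of bot replies in the AudioCodes Bot API style. The parameter name
// "enableSpeechInput" is a placeholder; substitute the actual parameter from
// the table above.
const firstMessage = {
  type: 'message',
  text: 'Please hold on while I look that up.',
  activityParams: {
    enableSpeechInput: false, // don't open speech recognition after this prompt
  },
};

const secondMessage = {
  type: 'message',
  text: 'Thanks for waiting. How else can I help?',
  activityParams: {
    enableSpeechInput: true, // resume listening after this message plays
  },
};
```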
Enabling word level timing information
You can configure the recognition results received from the speech-to-text provider to include the `words` field, which reports the duration of each word and its offset relative to the start of the phrase in which it was spoken.
VoiceAI Connect Enterprise supports this feature for Azure and Google speech-to-text providers only, from Version 3.12 and later.
How do I use it?
This feature is controlled by the VoiceAI Connect Administrator, or dynamically by the bot during the conversation (bot overrides VoiceAI Connect configuration):
Parameter | Type | Description
---|---|---
 | Boolean | Enables the inclusion of the `words` field in the speech-to-text recognition results sent to the bot.
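For example, a bot could use the `words` array to derive when each word started and ended within the phrase. The following is a minimal sketch using the example data from the field description above (the helper name is invented):

```typescript
interface WordTiming {
  word: string;
  offset: number;   // ms from the start of the phrase
  duration: number; // ms
}

// Returns "word [start-end]" entries, e.g. "hello [1500-1800ms]".
function describeTimeline(words: WordTiming[]): string[] {
  return words.map(w => `${w.word} [${w.offset}-${w.offset + w.duration}ms]`);
}

// Using the example data from the field description:
describeTimeline([
  { word: 'hello', offset: 1500, duration: 300 },
  { word: 'world', offset: 1900, duration: 800 },
]);
// → ["hello [1500-1800ms]", "world [1900-2700ms]"]
```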