Receiving user's speech

A basic function of VoiceAI Connect is to pass the user's speech to the bot.

For bots using a textual channel, the user's speech is first transcribed by a third-party speech-to-text service, and the recognized text is then passed to the bot. For bots using a voice channel, the user's audio stream is passed to the bot.

The bot typically uses Natural Language Understanding (NLU) to analyze the user's input and respond to it.

In addition to the text or audio, VoiceAI Connect passes parameters to the bot, which may include information such as the confidence level of the speech-to-text recognition.


How do I use it?

VoiceAI Connect will send the user's speech to the bot by default.

Textual channels

For each speech-to-text recognition, a text-message event is sent to the bot. The exact event structure depends on the bot framework.

The following additional fields can be included in the text-message event:

confidence (Number)
Numeric value representing the confidence level of the recognition.

recognitionOutput (Object)
Raw recognition output of the speech-to-text engine (vendor-specific).

recognitions (Array of Objects)
If Continuous ASR mode is enabled, this array contains the separate recognition outputs.

participant (String)
The participant identifier on which the speech recognition occurred.
Note: This parameter is applicable only to Agent Assist calls.

participantUriUser (String)
URI user-part of the participant.
Note: This parameter is applicable only to Agent Assist calls.

words (Array of Objects)
Word-level timing information (in milliseconds) for the speech-to-text recognition, giving the duration of each word and its offset from the start of the phrase. For example:

{
  "words": [
    {
      "word": "hello",
      "offset": 1500,
      "duration": 300
    },
    {
      "word": "world",
      "offset": 1900,
      "duration": 800
    }
  ]
}

Note:

  • This field is applicable only to the Azure and Google speech-to-text providers, and to Direct-bot.

  • To enable the inclusion of this field, use the wordLevelTimingInfo parameter.

  • In Continuous ASR mode, this field is nested inside the recognitions section for each sentence separately.

VoiceAI Connect Enterprise supports this feature from Version 3.12 and later.

These additional fields are only sent for textual channels.

Text-message events are sent differently for each bot framework:

AudioCodes Bot API

The event is sent as a message activity, with the data fields inside the parameters property.

Example:

{
  "type": "message",
  "text": "Hi.",
  "parameters": {
    "confidence": 0.6599681
  }
}
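
If Continuous ASR mode is enabled, the recognitions field is carried in the same parameters property, with one entry per recognized sentence (and, when word-level timing is enabled, a words array nested inside each entry). The following is a hypothetical sketch only; the exact fields of each entry (text and confidence are assumptions here) depend on the speech-to-text provider:

{
  "type": "message",
  "text": "I would like to book a flight. To Paris please.",
  "parameters": {
    "confidence": 0.87,
    "recognitions": [
      {
        "text": "I would like to book a flight.",
        "confidence": 0.91
      },
      {
        "text": "To Paris please.",
        "confidence": 0.83
      }
    ]
  }
}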
Microsoft Bot Framework

The text-message is sent as a message activity. The additional fields are sent in the channelData property.

Example:

{
  "type": "message",
  "text": "Hi.",
  "channelData": {
    "confidence": 0.6599681
  }
}
Dialogflow CX

The text-message is sent as text input.

Note: Dialogflow supports a maximum text input length of 256 characters. Therefore, if the input received from the speech-to-text engine is longer than 256 characters, VoiceAI Connect truncates the message before sending it to Dialogflow.

Dialogflow ES

The text-message is sent as text input.

Note: Dialogflow supports a maximum text input length of 256 characters. Therefore, if the input received from the speech-to-text engine is longer than 256 characters, VoiceAI Connect truncates the message before sending it to Dialogflow.

Voice channels

If your bot uses a voice channel (e.g., using Dialogflow bots with Live Hub), the audio stream from the user is passed directly to the bot, without using an external speech-to-text service.

Usually, speech-to-text of the user's audio is performed by the bot framework itself, so the bot logic still handles textual messages.

Because no specific event is sent for each user utterance on voice channels, VoiceAI Connect does not pass any additional parameters to the bot in this case.

Disabling speech input

There are cases where you may want to prevent activation of the speech-to-text service.

For example, the bot may want to send two messages to the user with a time gap between them, and therefore should not allow speech input after the first message's playback.

Make sure not to disable speech input for the whole duration of the call, as this will prevent VoiceAI Connect from receiving any further messages from the user.

To disable speech input, set the following parameter to false:

enableSpeechInput (Boolean)
Enables the activation of the speech-to-text service after the bot message playback is finished (typically, this is the trigger for activating speech-to-text).
  • true (Default)
  • false
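
For example, using the AudioCodes Bot API, the bot could attach the parameter to the first of the two messages. This is a sketch only; it assumes the convention of passing bot-controlled parameters in an activityParams property of the message activity:

{
  "type": "message",
  "text": "One moment while I look that up for you.",
  "activityParams": {
    "enableSpeechInput": false
  }
}

A later message would then set enableSpeechInput back to true so that the user's speech is recognized again.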

Enabling word level timing information

You can configure the recognition received from the speech-to-text provider to include the words field, which provides the duration of each word and its offset relative to the start of the phrase in which it was spoken.

VoiceAI Connect Enterprise supports this feature for Azure and Google speech-to-text providers only, from Version 3.12 and later.

How to use it?

This feature is controlled by the VoiceAI Connect Administrator, or dynamically by the bot during the conversation (the bot's setting overrides the VoiceAI Connect configuration):

wordLevelTimingInfo (Boolean)
Enables the inclusion of the words field in the received recognition.
  • true
  • false (Default)
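
For example, the bot could enable this field dynamically for subsequent recognitions. As above, this is a sketch that assumes the AudioCodes Bot API convention of passing bot-controlled parameters in an activityParams property:

{
  "type": "message",
  "text": "Please say the confirmation code.",
  "activityParams": {
    "wordLevelTimingInfo": true
  }
}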