Receiving speech-to-text hypothesis notification

VoiceAI Connect can be configured by the Administrator (or by the bot) to send the bot all speech-to-text hypothesis information received from the speech-to-text provider. A hypothesis includes the recognized text as well as other related speech-to-text information, such as time offset and confidence level (depending on the speech-to-text provider).

VoiceAI Connect Enterprise supports this feature from Version 2.4 and later.

How do I use it?

Bot configuration

Once this feature is enabled, VoiceAI Connect sends a speechHypothesis event message to the bot framework for each hypothesis.

The following bot parameter should be configured to enable the feature:

See Controlling events sent to bot for more details regarding the parameter.

| Parameter | Type | Description |
|-----------|------|-------------|
| sendEventsToBot | Array of strings | (Optional) If the value "speechHypothesis" is included in the array, VoiceAI Connect sends the speechHypothesis event, containing the speech-to-text hypothesis, to the bot framework. |
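For example, to enable the event, include "speechHypothesis" in the array (a minimal configuration sketch; any other values already in sendEventsToBot are kept alongside it):

{
  "sendEventsToBot": [
    "speechHypothesis"
  ]
}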

Event properties

The following table lists the additional fields that are included in the speechHypothesis event:

| Field | Type | Description |
|-------|------|-------------|
| text | String | The hypothesized text. |
| stability | Number | Indicates the stability of the results obtained so far, where 0.0 indicates complete instability and 1.0 indicates complete stability. |
| confidence | Number | Numeric value representing the confidence level of the recognition. |
| recognitionOutput | Object | Raw recognition output of the speech-to-text engine (vendor specific). Note: From VoiceAI Connect Enterprise Version 3.22, this field also includes the provider (name and type). |
| audioFilename | String | URL of the recorded audio file of the current speech-to-text transcript that is sent to the bot. Included only if you configure VoiceAI Connect to save the speech-to-text audio file, as described in Storing call transcripts and audio recordings. Applicable from VoiceAI Connect Enterprise Version 3.22 and later. |
| recognitions | Array of Objects | If Continuous automatic speech recognition (ASR) mode is enabled, this array contains the separate recognition outputs. The properties depend on the speech-to-text provider. |
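For orientation, the event payload shape can be summarized as a TypeScript interface. This is a non-normative sketch derived from the table above; which optional fields appear depends on the speech-to-text provider, your configuration, and the VoiceAI Connect version:

// Non-normative sketch of the speechHypothesis event fields (see table above).
interface SpeechHypothesisValue {
  text: string;                                   // The hypothesized text
  stability?: number;                             // 0.0 (unstable) to 1.0 (stable)
  confidence?: number;                            // Recognition confidence level
  recognitionOutput?: Record<string, unknown>;    // Raw, vendor-specific engine output
  audioFilename?: string;                         // Only if audio saving is configured (3.22+)
  recognitions?: Array<Record<string, unknown>>;  // Only in Continuous ASR mode
}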

Event format

The event format and syntax depends on the bot framework:

AudioCodes Bot API

The message is sent as an event activity.

The additional fields are sent as the value of the event.

Below is an example of the output when the Azure speech-to-text service was used:

{
  "type": "event",
  "name": "speechHypothesis",
  "value": {
    "text": "how are",
    "recognitionOutput": {
      "id": "6f3af32500f74419990bfe2cfa593e85",
      "duration": 2900000,
      "offset": 9200000,
      "text": "how are",
      "provider": {
        "name": "my_azure",
        "type": "azure"
      },
      "audioFilename": "SpeechMockBot/2024-07/2024-07-23/2024-07-23T09-45-07_53f64de7-c2c8-426d-b8f3-0bfc4cd9a3ad/0001_2024-07-23T09-45-11.532Z_in.wav"
    }
  }
}
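A bot can act on the interim text as it arrives, for example to drive live captions or barge-in logic. Below is a hedged TypeScript sketch of handling the activity on the bot side; the Express webhook wiring and the request envelope are illustrative assumptions, not part of the AudioCodes Bot API specification:

import express from 'express';

const app = express();
app.use(express.json());

// Illustrative route; the actual path depends on how your bot is registered.
app.post('/webhook', (req, res) => {
  // Assumption: activities arrive in the request body, singly or as an array.
  const activities = Array.isArray(req.body.activities) ? req.body.activities : [req.body];
  for (const activity of activities) {
    if (activity.type === 'event' && activity.name === 'speechHypothesis') {
      console.log('interim transcript:', activity.value?.text);
    }
  }
  res.sendStatus(200);
});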
Microsoft Bot Framework

The message is sent as an event activity.

The additional fields are sent as the value of the event.

Below is an example of the output when the Azure speech-to-text service was used:

{
  "type": "event",
  "name": "speechHypothesis",
  "value": {
    "text": "how are",
    "recognitionOutput": {
      "id": "6f3af32500f74419990bfe2cfa593e85",
      "duration": 2900000,
      "offset": 9200000,
      "text": "how are",
      "provider": {
        "name": "my_azure",
        "type": "azure"
      },
      "audioFilename": "SpeechMockBot/2024-07/2024-07-23/2024-07-23T09-45-07_53f64de7-c2c8-426d-b8f3-0bfc4cd9a3ad/0001_2024-07-23T09-45-11.532Z_in.wav"
    }
  }
}
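With the Bot Framework SDK for Node.js, the event surfaces through the standard event-activity handler. A minimal sketch using the botbuilder package:

import { ActivityHandler } from 'botbuilder';

class HypothesisAwareBot extends ActivityHandler {
  constructor() {
    super();
    // speechHypothesis arrives as an event activity; its fields are in activity.value.
    this.onEvent(async (context, next) => {
      if (context.activity.name === 'speechHypothesis') {
        const value = context.activity.value ?? {};
        console.log('interim transcript:', value.text, 'confidence:', value.confidence);
      }
      await next();
    });
  }
}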
Microsoft Copilot Studio

The message is sent as an event activity.

The additional fields are sent as the value of the event.

Example:

{
  "type": "event",
  "name": "speechHypothesis",
  "value": {
    "text": "how are",
    "recognitionOutput": {
      "id": "6f3af32500f74419990bfe2cfa593e85",
      "duration": 2900000,
      "offset": 9200000,
      "text": "how are"
    }
  }
}
Microsoft Copilot Studio (legacy)

The message is sent as an event activity.

The additional fields are sent as the value of the event.

Example:

{
  "type": "event",
  "name": "speechHypothesis",
  "value": {
    "text": "how are",
    "recognitionOutput": {
      "id": "6f3af32500f74419990bfe2cfa593e85",
      "duration": 2900000,
      "offset": 9200000,
      "text": "how are"
    }
  }
}
Dialogflow CX

The notification is sent as a speechHypothesis event, with the fields as the event parameters.

Below is an example when the Google Dialogflow voice channel was used:

{
  "queryInput": {
    "event": {
      "languageCode": "en-US",
      "name": "speechHypothesis",
      "parameters": {
        "text": "how are",
        "stability": 009999999776482582,
        "recognitionOutput": {
          "speechWordInfo": [],
          "messageType": "TRANSCRIPT”",
          "transcript": "how are you",
          "IsFinal": false,
          "confidence": 0,
          "dtmfDigits": 0,
          "stability": 0.009999999776482582,
          "speechEndOffset": {
            "seconds": 3,
            "nanos": 680000000
          }
        }
      }
    }
  }
}

Below is an example when the Google speech-to-text service was used:

{
  "queryInput": {
    "event": {
      "languageCode": "en-US",
      "name": "speechHypothesis",
      "parameters": {
        "text": "ohh you",
        "stability": 0.009999999776482582,
        "recognitionOutput": {
          "results": [
            {
              "alternatives": [
                {
                  "words": [],
                  "transcript": "ohh you",
                  "confidence": 0
                }
              ],
              "isFinal": false,
              "stability": 0.009999999776482582,
              "resultEndTime": {
                "seconds": "2",
                "nanos": 670000000
              },
              "channelTag": 0,
              "languageCode": "en-us"
            }
          ]
        }
      }
    }
  }
}

For Dialogflow CX, the fields are also sent inside the event-speechHypothesis session parameter, and can be accessed using a syntax such as this:

$session.params.event-speechHypothesis.text
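For example, a CX route condition or fulfillment message can reference the interim result directly (a sketch; the surrounding agent design is up to you):

Condition: $session.params.event-speechHypothesis.stability > 0.5
Fulfillment: So far I heard: $session.params.event-speechHypothesis.text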
Dialogflow ES

The notification is sent as a speechHypothesis event, with the fields as the event parameters.

Below is an example when the Google Dialogflow voice channel was used:

{
  "queryInput": {
    "event": {
      "languageCode": "en-US",
      "name": "speechHypothesis",
      "parameters": {
        "text": "how are",
        "stability": 009999999776482582,
        "recognitionOutput": {
          "speechWordInfo": [],
          "messageType": "TRANSCRIPT”",
          "transcript": "how are you",
          "IsFinal": false,
          "confidence": 0,
          "dtmfDigits": 0,
          "stability": 0.009999999776482582,
          "speechEndOffset": {
            "seconds": 3,
            "nanos": 680000000
          }
        }
      }
    }
  }
}

Below is an example when the Google speech-to-text service was used:

{
  "queryInput": {
    "event": {
      "languageCode": "en-US",
      "name": "speechHypothesis",
      "parameters": {
        "text": "ohh you",
        "stability": 0.009999999776482582,
        "recognitionOutput": {
          "results": [
            {
              "alternatives": [
                {
                  "words": [],
                  "transcript": "ohh you",
                  "confidence": 0
                }
              ],
              "isFinal": false,
              "stability": 0.009999999776482582,
              "resultEndTime": {
                "seconds": "2",
                "nanos": 670000000
              },
              "channelTag": 0,
              "languageCode": "en-us"
            }
          ]
        }
      }
    }
  }
}
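Assuming the standard Dialogflow ES behavior of referencing event parameters in responses with the #event-name.parameter-name syntax, the interim text can be accessed as:

#speechHypothesis.text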