Receiving speech-to-text hypothesis notification
The Administrator (or the bot) can enable VoiceAI Connect to send the bot every speech-to-text hypothesis received from the speech-to-text provider. A hypothesis includes the recognized text as well as other related speech-to-text information, such as time offset and confidence level (depending on the speech-to-text provider).
How do I use it?
Bot configuration
Once this feature is enabled, VoiceAI Connect sends the speechHypothesis event message to the bot framework for each hypothesis.
A bot parameter must be configured to enable this feature.
See Controlling events sent to bot for more details regarding the parameter.
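As a hedged illustration only: assuming the sendEventsToBot parameter described in Controlling events sent to bot accepts an array of event names (verify this against your VoiceAI Connect release), the bot-side configuration could look like this:

```typescript
// Hedged sketch: bot parameters asking VoiceAI Connect to forward hypothesis events.
// The sendEventsToBot parameter name and its array-of-event-names shape are assumptions
// based on "Controlling events sent to bot"; confirm them for your VoiceAI Connect version.
const botParams = {
  sendEventsToBot: ["speechHypothesis"],
};

export default botParams;
```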
Event properties
The following table lists the additional fields that are included in the speechHypothesis event:
| Field | Type | Description |
|---|---|---|
| text | String | The hypothesized text. |
| stability | Number | Indicates the volatility of the results obtained so far, where 0.0 indicates complete instability and 1.0 indicates complete stability. |
| confidence | Number | Numeric value representing the confidence level of the recognition. |
| recognitionOutput | Object | Raw recognition output of the speech-to-text engine (vendor specific). Note: From VoiceAI Connect Enterprise Version 3.22, this field also includes audioFilename. |
| audioFilename | String | URL of the recorded audio file of the current speech-to-text transcript that is sent to the bot. Included only if you configure VoiceAI Connect to save the speech-to-text audio file, as described in Storing call transcripts and audio recordings. Applicable from VoiceAI Connect Enterprise Version 3.22 and later. |
| recognitions | Array of Objects | If Continuous automatic speech recognition (ASR) mode is enabled, this array contains the separate recognition outputs. |
The properties depend on the speech-to-text provider.
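For orientation only, the table can be summarized as a TypeScript type. This is a sketch derived from the fields above; the vendor-specific parts are deliberately left loosely typed:

```typescript
// Hedged sketch of the speechHypothesis event payload, typed from the table above.
// recognitionOutput and recognitions are vendor specific, so they are loosely typed.
interface SpeechHypothesisValue {
  text: string;                                   // the hypothesized text
  stability?: number;                             // 0.0 = completely unstable ... 1.0 = completely stable
  confidence?: number;                            // recognition confidence level
  recognitionOutput?: Record<string, unknown>;    // raw engine output; may include audioFilename (3.22+)
  recognitions?: Array<Record<string, unknown>>;  // present when continuous ASR mode is enabled
}
```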
Event format
The event format and syntax depend on the bot framework:
The message is sent as an event activity.
The additional fields are sent as the value of the event.
Below is an example of the output when the Azure speech-to-text service was used:
{ "type": "event", "name": "speechHypothesis", "value": { "text": "how are", "recognitionOutput": { "id": "6f3af32500f74419990bfe2cfa593e85", "duration": 2900000, "offset": 9200000, "text": "how are", "provider": { "name": "my_azure", "type": "azure" }, "audioFilename": "SpeechMockBot/2024-07/2024-07-23/2024-07-23T09-45-07_53f64de7-c2c8-426d-b8f3-0bfc4cd9a3ad/0001_2024-07-23T09-45-11.532Z_in.wav" } } }
The message is sent as an event activity.
The additional fields are sent as the value of the event.
Example:
{ "type": "event", "name": "speechHypothesis", "value": { "text": "how are", "recognitionOutput": { "id": "6f3af32500f74419990bfe2cfa593e85", "duration": 2900000, "offset": 9200000, "text": "how are" } } }
The notification is sent as a speechHypothesis event, with the fields as the event parameters.
Below is an example when the Google Dialogflow voice channel was used:
{ "queryInput": { "event": { "languageCode": "en-US", "name": "speechHypothesis", "parameters": { "text": "how are", "stability": 009999999776482582, "recognitionOutput": { "speechWordInfo": [], "messageType": "TRANSCRIPT”", "transcript": "how are you", "IsFinal": false, "confidence": 0, "dtmfDigits": 0, "stability": 0.009999999776482582, "speechEndOffset": { "seconds": 3, "nanos": 680000000 } } } } } }
Below is an example when the Google speech-to-text service was used:
{ "queryInput": { "event": { "languageCode": "en-US", "name": "speechHypothesis", "parameters": { "text": "ohh you", "stability": 0.009999999776482582, "recognitionOutput": { "results": [ { "alternatives": [ { "words": [], "transcript": "ohh you", "confidence": 0 } ], "isFinal": false, "stability": 0.009999999776482582, "resultEndTime": { "seconds": "2", "nanos": 670000000 }, "channelTag": 0, "languageCode": "en-us" } ] } } } } }
For Dialogflow CX, the fields are also sent inside the event-speechHypothesis session parameter, and can be accessed using a syntax such as this:
$session.params.event-speechHypothesis.text
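As a hedged sketch of the webhook side (assuming a Dialogflow CX webhook that receives sessionInfo.parameters in its request body; the Express route name is illustrative), the same fields could be read like this:

```typescript
// Hedged sketch: reading the event-speechHypothesis session parameter in a
// Dialogflow CX webhook. The "/cx-webhook" route name is illustrative only.
import express from "express";

const app = express();
app.use(express.json());

app.post("/cx-webhook", (req, res) => {
  const params = req.body?.sessionInfo?.parameters ?? {};
  const hypothesis = params["event-speechHypothesis"];
  if (hypothesis?.text) {
    console.log(`Hypothesis so far: ${hypothesis.text}`);
  }
  res.json({}); // return an empty fulfillment response
});

app.listen(8080);
```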