Receiving speech-to-text hypothesis notification
The Administrator (or the bot) can enable VoiceAI Connect to send the bot every speech-to-text hypothesis received from the speech-to-text provider. A hypothesis includes the recognized text as well as other related speech-to-text information, such as time offset and confidence level (depending on the speech-to-text provider).
How do I use it?
Bot configuration
Once this feature is enabled, VoiceAI Connect sends the speechHypothesis event message to the bot framework for each hypothesis.
A bot parameter must be configured to enable this feature.
See Controlling events sent to bot for more details regarding the parameter.
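As a hedged illustration only: assuming the sendEventsToBot parameter described in Controlling events sent to bot accepts an array of event names (verify this against your VoiceAI Connect release), the bot-side configuration could look like this:

```typescript
// Hedged sketch: bot parameters asking VoiceAI Connect to forward hypothesis events.
// The sendEventsToBot parameter name and its array-of-event-names shape are assumptions
// based on "Controlling events sent to bot"; confirm them for your VoiceAI Connect version.
const botParams = {
  sendEventsToBot: ["speechHypothesis"],
};

export default botParams;
```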
Event properties
The following table lists the additional fields that are included in the speechHypothesis event:
| Field | Type | Description |
|---|---|---|
| text | String | The hypothesized text. |
| stability | Number | Indicates the volatility of the results obtained so far, where 0.0 indicates complete instability and 1.0 indicates complete stability. |
| confidence | Number | Numeric value representing the confidence level of the recognition. |
| recognitionOutput | Object | Raw recognition output of the speech-to-text engine (vendor specific). Note: From VoiceAI Connect Enterprise Version 3.22, this field also includes audioFilename. |
| audioFilename | String | URL of the recorded audio file of the current speech-to-text transcript that is sent to the bot. Included only if you configure VoiceAI Connect to save the speech-to-text audio file, as described in Storing call transcripts and audio recordings. Applicable from VoiceAI Connect Enterprise Version 3.22 and later. |
| recognitions | Array of Objects | If Continuous automatic speech recognition (ASR) mode is enabled, this array contains the separate recognition outputs. |
The properties depend on the speech-to-text provider.
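For orientation only, the table can be summarized as a TypeScript type. This is a sketch derived from the fields above; the vendor-specific parts are deliberately left loosely typed:

```typescript
// Hedged sketch of the speechHypothesis event payload, typed from the table above.
// recognitionOutput and recognitions are vendor specific, so they are loosely typed.
interface SpeechHypothesisValue {
  text: string;                                   // the hypothesized text
  stability?: number;                             // 0.0 = completely unstable ... 1.0 = completely stable
  confidence?: number;                            // recognition confidence level
  recognitionOutput?: Record<string, unknown>;    // raw engine output; may include audioFilename (3.22+)
  recognitions?: Array<Record<string, unknown>>;  // present when continuous ASR mode is enabled
}
```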
Event format
The event format and syntax depend on the bot framework:
The message is sent as an event activity.
The additional fields are sent as the value of the event.
Below is an example of the output when the Azure speech-to-text service was used:
{ "type": "event", "name": "speechHypothesis", "value": { "text": "how are", "recognitionOutput": { "id": "6f3af32500f74419990bfe2cfa593e85", "duration": 2900000, "offset": 9200000, "text": "how are", "provider": { "name": "my_azure", "type": "azure" }, "audioFilename": "SpeechMockBot/2024-07/2024-07-23/2024-07-23T09-45-07_53f64de7-c2c8-426d-b8f3-0bfc4cd9a3ad/0001_2024-07-23T09-45-11.532Z_in.wav" } } }
The message is sent as an event activity.
The additional fields are sent as the value of the event.
Example:
{ "type": "event", "name": "speechHypothesis", "value": { "text": "how are", "recognitionOutput": { "id": "6f3af32500f74419990bfe2cfa593e85", "duration": 2900000, "offset": 9200000, "text": "how are" } } }
The notification is sent as a speechHypothesis event, with the fields as the event parameters.
Below is an example when the Google Dialogflow voice channel was used:
{ "queryInput": { "event": { "languageCode": "en-US", "name": "speechHypothesis", "parameters": { "text": "how are", "stability": 009999999776482582, "recognitionOutput": { "speechWordInfo": [], "messageType": "TRANSCRIPT”", "transcript": "how are you", "IsFinal": false, "confidence": 0, "dtmfDigits": 0, "stability": 0.009999999776482582, "speechEndOffset": { "seconds": 3, "nanos": 680000000 } } } } } }
Below is an example when the Google speech-to-text service was used:
{ "queryInput": { "event": { "languageCode": "en-US", "name": "speechHypothesis", "parameters": { "text": "ohh you", "stability": 0.009999999776482582, "recognitionOutput": { "results": [ { "alternatives": [ { "words": [], "transcript": "ohh you", "confidence": 0 } ], "isFinal": false, "stability": 0.009999999776482582, "resultEndTime": { "seconds": "2", "nanos": 670000000 }, "channelTag": 0, "languageCode": "en-us" } ] } } } } }
For Dialogflow CX, the fields are also sent inside the event-speechHypothesis session parameter, and can be accessed using a syntax such as this:
$session.params.event-speechHypothesis.text
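As a hedged sketch of the webhook side (assuming a Dialogflow CX webhook that receives sessionInfo.parameters in its request body; the Express route name is illustrative), the same fields could be read like this:

```typescript
// Hedged sketch: reading the event-speechHypothesis session parameter in a
// Dialogflow CX webhook. The "/cx-webhook" route name is illustrative only.
import express from "express";

const app = express();
app.use(express.json());

app.post("/cx-webhook", (req, res) => {
  const params = req.body?.sessionInfo?.parameters ?? {};
  const hypothesis = params["event-speechHypothesis"];
  if (hypothesis?.text) {
    console.log(`Hypothesis so far: ${hypothesis.text}`);
  }
  res.json({}); // return an empty fulfillment response
});

app.listen(8080);
```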