Speech-to-Text API
The API is based on WebSocket. The client (VoiceAI Connect) opens one WebSocket connection to a pre-defined URL per conversation, and the connection remains open for the entire duration of the conversation. The same connection can be used for several sequential (not concurrent) recognition-sessions, during each of which the speech-to-text engine performs recognition of an audio stream. Note that the connection might remain open during periods when no recognition-session is active.
Control messages are sent as textual JSON messages. Media is sent as binary frames (per Section 5.6 of RFC 6455). Because the binary frames carry no “session” information, a single WebSocket connection can carry only one recognition-session at a time. After a recognition-session ends, the same WebSocket connection can be used to start additional recognition-sessions.
If an error prevents the server from handling further messages on the WebSocket connection, the server must close the connection.
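As a rough illustration of the framing rules above, the following Python sketch shows how a server might route frames arriving on one connection: text frames are parsed as JSON control messages, and binary frames are counted as audio for the single active recognition-session. The class name `ConnectionState`, its attributes, and the choice to raise `RuntimeError` are hypothetical and not part of the API; a real server would also send replies and close the connection on unrecoverable errors.

```python
import json


class ConnectionState:
    """Tracks the single active recognition-session on one WebSocket connection."""

    def __init__(self):
        self.active = False    # True between a "start" and a "stop" message
        self.audio_bytes = 0   # audio received during the current session

    def on_frame(self, frame):
        """Handle one frame: str is a JSON control message, bytes is audio."""
        if isinstance(frame, (bytes, bytearray)):
            # Binary frames carry no session information, so they can only
            # belong to the one session that is currently active.
            if not self.active:
                raise RuntimeError("binary frame with no active recognition-session")
            self.audio_bytes += len(frame)
            return "audio"

        kind = json.loads(frame).get("type")
        if kind == "start":
            if self.active:
                raise RuntimeError("a recognition-session is already active")
            self.active = True
            self.audio_bytes = 0
        elif kind == "stop":
            self.active = False
        return kind
```

Keeping the state per connection rather than per message mirrors the rule that sessions on one connection are sequential, never concurrent.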
Control Messages Sent by Client (VoiceAI Connect)
start
Starts a recognition-session.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| language | String | Defines the BCP-47 language code for speech recognition of the supplied audio. |
| conversationId | String | Defines the conversation Session ID. VoiceAI Connect Enterprise supports this parameter from Version 3.20 and later. |
| format | String | Defines the format of the audio file (as configured by the sttPreferWave parameter). |
| encoding | String | Defines the manner in which the audio is stored and transmitted. Currently, only 16-bit linear pulse-code modulation (PCM) encoding (LINEAR16) is supported. |
| sampleRateHz | Number | Defines the sample rate (in Hertz) of the supplied audio. Currently, only 16,000 Hz is supported. |
| | String | This field is sent with the value of the |
| | Array of Objects | This field is sent with the value of the |
| | String | This field is sent with the value of the |
Example:
{ "type": "start", "language": "en-US", "conversationId": "8745555-8f1a-48ba-9ec9-46e90dc5aa18", "format": "raw", "encoding": "LINEAR16", "sampleRateHz": 16000 }
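A minimal sketch of how a client could assemble this message in Python, using only the fields that appear in the example above. The function name `build_start_message` and its arguments are hypothetical. The encoding and sample rate are fixed to the only values this section describes as supported, and `"raw"` is the only `format` value shown here, so the string used for WAV audio is not assumed.

```python
import json


def build_start_message(language, conversation_id, audio_format="raw"):
    """Return the JSON text of a "start" control message."""
    message = {
        "type": "start",
        "language": language,               # BCP-47 code, e.g. "en-US"
        "conversationId": conversation_id,
        "format": audio_format,             # "raw" is the value shown in the example
        "encoding": "LINEAR16",             # 16-bit linear PCM
        "sampleRateHz": 16000,              # only 16,000 Hz is supported
    }
    return json.dumps(message)
```

The returned string would be sent as a WebSocket text frame, for example `build_start_message("en-US", "8745555-8f1a-48ba-9ec9-46e90dc5aa18")`.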
stop
Stops the current recognition-session.
A "stop" message is sent only for a recognition-session that has been started.
Example:
{ "type": "stop" }
Configuration
Define the audio file format (WAV or RAW) using the sttPreferWave parameter (see Speech customization).
Control Messages Sent by Server (Speech-to-Text Provider)
started
This message is sent to indicate that the recognition-session has started and that the stream (binary messages) can be sent by the client.
Example:
{ "type": "started" }
hypothesis
This message is sent for a partial recognition result.
Example:
{ "type": "hypothesis", "alternatives": [ { "text": "Hi" } ] }
recognition
This message is sent for each utterance that is recognized by the speech-to-text provider.
Several "recognition" messages can be sent for a single recognition-session.
Example:
{ "type": "recognition", "alternatives": [ { "text": "Hi there", "confidence": 0.8355 } ] }
end
This message is sent to indicate that the current recognition-session has ended.
In case only a single utterance is recognized per recognition-session, an "end" message must be sent immediately after the "recognition" message.
The "end" message must be sent after the server has received a "stop" message, to indicate that the recognition-session has ended.
Example:
{ "type": "end", "reason": "some reason" }
error
This message indicates that the current recognition-session has ended with a failure condition.
Example:
{ "type": "error", "reason": "some error" }
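To show how a client might consume the server messages described above, here is a hedged Python sketch that parses one control message and reports its type, the transcript text (if any), and whether the recognition-session is over. The function name `handle_server_message` and the rule of preferring the alternative with the highest `confidence` are assumptions for illustration; the API text does not say how multiple alternatives are ordered.

```python
import json

# Both "end" and "error" indicate that the current recognition-session has ended.
SESSION_ENDING_TYPES = {"end", "error"}


def handle_server_message(text):
    """Parse one server control message.

    Returns a tuple (message type, transcript or None, session_over flag).
    """
    message = json.loads(text)
    kind = message["type"]
    transcript = None
    if kind in ("hypothesis", "recognition"):
        alternatives = message.get("alternatives", [])
        if alternatives:
            # Assumption: pick the highest confidence; a hypothesis may omit
            # the field, in which case it is treated as 0.0.
            best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
            transcript = best["text"]
    return kind, transcript, kind in SESSION_ENDING_TYPES
```

A client loop could call this for every text frame and stop sending audio as soon as the third element of the result is true.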
Binary Messages Sent by Client
The client sends the audio stream as WebSocket binary messages, according to the encoding and sample-rate indicated in the start message.
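As an illustration, raw 16-bit PCM at 16,000 Hz amounts to 32,000 bytes per second per channel, so the audio can be cut into fixed-duration binary messages. The Python sketch below does that split. The function name `iter_audio_chunks`, the 100 ms chunk duration, and the single-channel (mono) default are assumptions; this section does not specify a chunk size or channel count, and the sketch applies to headerless raw audio only.

```python
def iter_audio_chunks(pcm, sample_rate_hz=16000, chunk_ms=100, channels=1):
    """Yield consecutive slices of raw 16-bit PCM, each covering chunk_ms of audio."""
    bytes_per_sample = 2  # 16-bit linear PCM
    chunk_size = sample_rate_hz * bytes_per_sample * channels * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield pcm[offset:offset + chunk_size]
```

Each yielded `bytes` object would be sent as one WebSocket binary message after the server has replied with "started".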
Authentication
An Authorization header is sent by the client on the HTTP request that creates the WebSocket connection, containing a shared token. The token can be used by the server to identify the client. For example:
Authorization: Bearer <token>
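The header handling could be sketched in Python as follows. Both helper names (`authorization_header`, `is_authorized`) are hypothetical, and the headers are modeled as a plain dictionary keyed by the exact name `Authorization`; a real WebSocket library will expose request headers through its own interface. The constant-time comparison is a general precaution for shared secrets, not a requirement stated by this API.

```python
import hmac


def authorization_header(token):
    """Client side: build the header sent on the WebSocket handshake request."""
    return {"Authorization": f"Bearer {token}"}


def is_authorized(headers, expected_token):
    """Server side: check the bearer token against the shared token."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

A server that rejects the token would typically fail the HTTP handshake instead of opening the WebSocket connection.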
Example Flow
| Direction | Message |
|---|---|
| Client > Server | { "type": "start", "language": "en-US", "conversationId": "8745555-8f1a-48ba-9ec9-46e90dc5aa18", "format": "raw", "encoding": "LINEAR16", "sampleRateHz": 16000 } |
| Server > Client | { "type": "started" } |
| Client > Server | <binary messages> |
| Server > Client | { "type": "hypothesis", "alternatives": [ { "text": "Hi" } ] } |
| Server > Client | { "type": "recognition", "alternatives": [ { "text": "Hi there.", "confidence": 0.8355 } ] } |
| Server > Client | { "type": "recognition", "alternatives": [ { "text": "My name is John.", "confidence": 0.83 } ] } |
| Client > Server | { "type": "stop" } |
| Server > Client | { "type": "end", "reason": "stop by client" } |
| | <new recognition-session; same connection> |
| Client > Server | { "type": "start", "language": "en-US", "conversationId": "8745555-8f1a-48ba-9ec9-46e90dc5aa18", "format": "raw", "encoding": "LINEAR16", "sampleRateHz": 16000 } |
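The ordering in this flow can be expressed as a small state machine. The Python sketch below replays a list of (direction, payload) pairs and raises if a message arrives out of order. The state names, the `TRANSITIONS` table, and the `replay` function are hypothetical; in particular, accepting results after "stop" and an "error" in reply to "start" are assumptions rather than rules stated by the API.

```python
import json

# (current state, sender, message type) -> next state
TRANSITIONS = {
    ("idle", "client", "start"): "starting",
    ("starting", "server", "started"): "streaming",
    ("starting", "server", "error"): "idle",             # assumption
    ("streaming", "client", "audio"): "streaming",
    ("streaming", "server", "hypothesis"): "streaming",
    ("streaming", "server", "recognition"): "streaming",
    ("streaming", "client", "stop"): "stopping",
    ("streaming", "server", "end"): "idle",
    ("streaming", "server", "error"): "idle",
    ("stopping", "server", "recognition"): "stopping",   # assumption: late results
    ("stopping", "server", "end"): "idle",
    ("stopping", "server", "error"): "idle",
}


def replay(flow, state="idle"):
    """Walk a list of (direction, payload) pairs and return the final state."""
    for direction, payload in flow:
        # bytes payloads stand for binary audio messages; str payloads are JSON.
        kind = "audio" if isinstance(payload, bytes) else json.loads(payload)["type"]
        key = (state, direction, kind)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {kind!r} from {direction} in state {state!r}")
        state = TRANSITIONS[key]
    return state
```

Replaying the flow in the table ends in the `starting` state, because the last message opens a new recognition-session on the same connection.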