Speech-to-Text API

The API is based on WebSocket. For each conversation, the client (VoiceAI Connect Enterprise) opens a WebSocket connection to a predefined URL, and the connection remains open for the entire duration of the conversation. The same connection can be used for several sequential (not concurrent) recognition-sessions, in each of which the speech-to-text engine performs recognition of an audio stream. Note that the connection might remain open during periods with no active recognition-session.

Control messages are sent as textual JSON messages. Media is sent as binary frames (per Section 5.6 of RFC 6455). Because the binary frames carry no “session” information, a single WebSocket connection can carry only one recognition-session at a time.

After a recognition-session ends, the same WebSocket connection can be used to start another recognition-session.

If an error prevents the server from handling further messages on the WebSocket connection, the server must close the connection.
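The rule of one recognition-session at a time per connection can be tracked on the client with a small state machine. The following Python sketch is illustrative only: the names SessionTracker and State are hypothetical, and allowing "stop" only after "started" has been received is an assumption about the intended ordering, not something the API states.

```python
import json
from enum import Enum


class State(Enum):
    IDLE = "idle"          # connection open, no recognition-session
    STARTING = "starting"  # "start" sent, waiting for "started"
    ACTIVE = "active"      # "started" received, audio may be sent
    STOPPING = "stopping"  # "stop" sent, waiting for "end"


class SessionTracker:
    """Tracks the single recognition-session allowed on one connection."""

    def __init__(self):
        self.state = State.IDLE

    def on_send(self, text):
        """Call before sending a control message; raises if it is out of order."""
        msg_type = json.loads(text)["type"]
        if msg_type == "start" and self.state is State.IDLE:
            self.state = State.STARTING
        elif msg_type == "stop" and self.state is State.ACTIVE:
            # Assumption: "stop" is only sent once "started" was received.
            self.state = State.STOPPING
        else:
            raise RuntimeError(f"cannot send {msg_type!r} while {self.state.value}")

    def on_receive(self, text):
        """Call for every control message received from the server."""
        msg_type = json.loads(text)["type"]
        if msg_type == "started" and self.state is State.STARTING:
            self.state = State.ACTIVE
        elif msg_type in ("end", "error"):
            # Both messages end the current recognition-session.
            self.state = State.IDLE
        # "hypothesis" and "recognition" do not change the state.

    def can_send_audio(self):
        return self.state is State.ACTIVE
```

A client would call on_send before each control message and check can_send_audio before each binary frame, so that a second "start" is never issued while a session is still open.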

Control Messages Sent by Client (VoiceAI Connect Enterprise)

start

Starts a recognition-session.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| language | String | Defines the BCP-47 language code for speech recognition of the supplied audio. |
| conversationId | String | Defines the conversation Session ID. VoiceAI Connect Enterprise supports this parameter from Version 3.20 and later. |
| format | String | Defines the format of the audio file (as configured by the sttPreferWave parameter): raw (audio without headers) or wav (audio with WAV headers). |
| encoding | String | Defines the manner in which the audio is stored and transmitted. Currently, only 16-bit linear pulse-code modulation (PCM) encoding (LINEAR16) is supported. |
| sampleRateHz | Number | Defines the sample rate (in Hertz) of the supplied audio. Currently, only 16,000 Hz is supported. |
| sttContextId | String | This field is sent with the value of the sttContextId bot parameter (if configured). For details, see the description of that bot parameter. |
| sttSpeechContexts | Array of Objects | This field is sent with the value of the sttSpeechContexts bot parameter (if configured). For details, see the description of that bot parameter. |
| sttGenericData | String | This field is sent with the value of the sttGenericData bot parameter (if configured). For details, see the description of that bot parameter. |
| participant | String | Defines the identifier of the participant for Agent assist calls. |

Example:

{
  "type": "start",
  "language": "en-US",
  "conversationId": "8745555-8f1a-48ba-9ec9-46e90dc5aa18",
  "format": "raw",
  "encoding": "LINEAR16",
  "sampleRateHz": 16000
} 
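As an illustration of how the parameters above fit together, the following Python sketch builds the JSON text of a "start" message. The function name build_start_message and its argument names are hypothetical; the fixed encoding and sample rate reflect the only values the API currently supports.

```python
import json

SUPPORTED_FORMATS = ("raw", "wav")
SUPPORTED_ENCODING = "LINEAR16"      # only encoding currently supported
SUPPORTED_SAMPLE_RATE_HZ = 16000     # only sample rate currently supported
OPTIONAL_FIELDS = {"sttContextId", "sttSpeechContexts", "sttGenericData", "participant"}


def build_start_message(language, conversation_id, audio_format="raw", **optional):
    """Return the JSON text of a "start" control message.

    `optional` may carry any of OPTIONAL_FIELDS; only the fields that are
    actually supplied are included in the message.
    """
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    message = {
        "type": "start",
        "language": language,
        "conversationId": conversation_id,
        "format": audio_format,
        "encoding": SUPPORTED_ENCODING,
        "sampleRateHz": SUPPORTED_SAMPLE_RATE_HZ,
    }
    message.update(optional)
    return json.dumps(message)
```

Calling build_start_message("en-US", "8745555-8f1a-48ba-9ec9-46e90dc5aa18") yields a message with the same fields as the example above.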

stop

Stops the current recognition-session.

The client sends "stop" only for a recognition-session that has been started.

Example:

{
  "type": "stop"
}

Configuration

Define the audio file format (WAV or RAW) using the sttPreferWave parameter (see Speech customization).

Control Messages Sent by Server (Speech-to-Text Provider)

started

This message is sent to indicate that the recognition-session has started and that the stream (binary messages) can be sent by the client.

Example:

{
  "type": "started"
}

hypothesis

This message is sent for partial (interim) recognition results.

Example:

{
  "type": "hypothesis",
  "alternatives": [
    {
      "text": "Hi"
    }
  ]
}

recognition

This message is sent for each utterance that is recognized by the speech-to-text provider.

Several "recognition" messages can be sent for a single recognition-session.

Example:

{
  "type": "recognition",
  "alternatives": [
    {
      "text": "Hi there",
      "confidence": 0.8355
    }
  ]
}

end

This message is sent to indicate that the current recognition-session has ended.

If only a single utterance is recognized per recognition-session, an "end" message must be sent immediately after the "recognition" message.

The "end" message must be sent after the server has received a "stop" message, to indicate that the recognition-session has ended.

Example:

{
  "type": "end",
  "reason": "some reason"
}

error

This message indicates that the current recognition-session has ended with a failure condition.

Example:

{
  "type": "error",
  "reason": "some error"
}
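A client might dispatch the server messages described above along the following lines. This Python sketch is illustrative: the function name parse_server_message is hypothetical, and taking the first entry of "alternatives" as the preferred result is an assumption, since the ordering of alternatives is not specified here.

```python
import json

# "end" and "error" both close the current recognition-session.
TERMINAL_TYPES = {"end", "error"}


def parse_server_message(text):
    """Return (type, payload, session_over) for one server control message."""
    message = json.loads(text)
    msg_type = message.get("type")
    if msg_type == "started":
        payload = None
    elif msg_type in ("hypothesis", "recognition"):
        # Assumption: the first alternative is the preferred one.
        best = message["alternatives"][0]
        # "confidence" does not appear in the hypothesis example, so it may be None.
        payload = (best["text"], best.get("confidence"))
    elif msg_type in TERMINAL_TYPES:
        payload = message.get("reason")
    else:
        raise ValueError(f"unexpected message type: {msg_type!r}")
    return msg_type, payload, msg_type in TERMINAL_TYPES
```

The third element of the returned tuple tells the caller when the connection is free for a new "start" message.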

Binary Messages Sent by Client

The client sends the audio stream as WebSocket binary messages, according to the encoding and sample-rate indicated in the start message.
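For the raw format, the audio can be cut into fixed-duration chunks, each sent as one binary message. The Python sketch below shows the arithmetic for LINEAR16 at 16,000 Hz. The 20 ms chunk duration and mono audio are assumptions made for illustration; the API does not prescribe a chunk size, and the function name split_into_frames is hypothetical.

```python
SAMPLE_RATE_HZ = 16000   # the only sample rate currently supported
BYTES_PER_SAMPLE = 2     # LINEAR16 uses 16-bit samples
CHANNELS = 1             # assumption: mono audio


def split_into_frames(pcm, frame_ms=20):
    """Yield successive chunks of raw PCM, each holding frame_ms of audio.

    The last chunk may be shorter if the input is not a whole number of frames.
    """
    samples_per_frame = SAMPLE_RATE_HZ * frame_ms // 1000
    frame_bytes = samples_per_frame * BYTES_PER_SAMPLE * CHANNELS
    if frame_bytes <= 0:
        raise ValueError("frame_ms is too small")
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]
```

With these values, each 20 ms chunk holds 320 samples (640 bytes), so one second of audio is sent as 50 binary messages.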

Authentication

An Authorization header is sent by the client on the HTTP request that creates the WebSocket connection, containing a shared token. The token can be used by the server to identify the client. For example:

Authorization: Bearer <token>
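On the server side, the header can be checked before the WebSocket upgrade is accepted. The following Python sketch assumes the server already knows the shared token; the function name is_authorized is hypothetical, and how the header value is obtained depends on the WebSocket library in use.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Return True if the header is "Bearer <token>" carrying the expected token."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```

A server would typically reject the HTTP request (for example, with a 401 response) when this check fails, instead of completing the WebSocket handshake.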

Example Flow

Direction

Message

Client > Server

{
  "type": "start",
  "language": "en-US",
  "conversationId": "8745555-8f1a-48ba-9ec9-46e90dc5aa18",
  "format": "raw",
  "encoding": "LINEAR16",
  "sampleRateHz": 16000
}

Server > Client

{
  "type": "started"
}

Client > Server

<binary messages>

Server > Client

{
  "type": "hypothesis",
  "alternatives": [
    {
      "text": "Hi"
    }
  ]
}

Server > Client

{
  "type": "recognition",
  "alternatives": [
    {
      "text": "Hi there.",
      "confidence": 0.8355
    }
  ]
}

Server > Client

{
  "type": "recognition",
  "alternatives": [
    {
      "text": "My name is John.",
      "confidence": 0.83
    }
  ]
}

Client > Server

{
  "type": "stop"
}

Server > Client

{
  "type": "end",
  "reason": "stop by client"
}

 

<new recognition-session; same connection>

Client > Server

{
  "type": "start",
  "language": "en-US",
  "conversationId": "8745555-8f1a-48ba-9ec9-46e90dc5aa18",
  "format": "raw",
  "encoding": "LINEAR16",
  "sampleRateHz": 16000
}
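The flow above can also be read from the server's point of view. The Python sketch below is a minimal, transport-independent illustration of the per-connection logic; the class name RecognitionServerSession and the recognizer callable are hypothetical stand-ins for a real speech-to-text engine. Replying with "error" to an out-of-sequence control message and ignoring audio received outside a recognition-session are assumptions, not behavior defined by the API.

```python
import json


class RecognitionServerSession:
    """Per-connection server logic; each handler returns the messages to send back."""

    def __init__(self, recognizer):
        # Hypothetical engine: callable(bytes) -> list of (text, confidence).
        self.recognizer = recognizer
        self.active = False

    def on_text(self, text):
        """Handle one textual (JSON) control message from the client."""
        msg_type = json.loads(text).get("type")
        if msg_type == "start" and not self.active:
            self.active = True
            return [{"type": "started"}]
        if msg_type == "stop" and self.active:
            self.active = False
            return [{"type": "end", "reason": "stop by client"}]
        # Assumption: anything out of sequence ends the session with an error.
        self.active = False
        return [{"type": "error", "reason": f"unexpected message: {msg_type}"}]

    def on_binary(self, audio):
        """Handle one binary message carrying audio."""
        if not self.active:
            return []  # assumption: audio outside a session is ignored
        return [
            {"type": "recognition",
             "alternatives": [{"text": spoken, "confidence": score}]}
            for spoken, score in self.recognizer(audio)
        ]
```

Because the session object is reset by "stop", the same instance can serve the second "start" message shown at the end of the flow.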