Speech Recognition

Using the JavaScript SDK Version 5 or Above?

The @charisma-ai/sdk:5.0.0 npm package uses an Audio Manager to handle speech recognition instead of this system. See the JavaScript SDK Docs for more information.

Charisma supports speech recognition services for Pro stories, to allow your players to speak to characters using their voice.

Speech-to-text integrates through the Charisma SDKs and supports several speech recognition providers. It is enabled by default in Pro stories. You can disable it on the story overview screen using the checkbox under "Premium Features".

Once a playthrough is connected, audio can be streamed from the player to Charisma, with results being sent back continuously as the player speaks.

Stream Events

From the client to the Charisma server

speech-recognition-start

Sending a "speech-recognition-start" event starts the downstream service so that it is ready to accept audio chunks.

`Label`	`Type`	`Optional`	`Default`	`Comment`
service	must be one of the strings: "unified", "unified:google", "unified:aws", or "unified:deepgram"	`No`		"unified" uses Deepgram but may change.
sampleRate	number	`Yes`	16000	Sample rate in Hertz of the audio data sent. 16000 is optimal.
languageCode	string	`Yes`	"en-US"
encoding	string	`Yes`	"linear16" for Google and Deepgram, "pcm" for AWS
customServiceParameters	object	`Yes`	`{}`	See Service Specific Options below.
returnRaw	boolean	`Yes`	false	Use for debugging. Returns the response from the downstream service without changes.

For the most recent supported values for sampleRate, languageCode, and encoding, see each provider's documentation, which is linked for each service under Service Specific Options.
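The parameter table above can be captured as a type so that a start payload is checked before it is sent. The following is a minimal sketch, not the SDK's own API: the names SpeechRecognitionStartOptions and withDocumentedDefaults are hypothetical, and it assumes the encoding row lists per-service defaults. How the payload is actually emitted depends on the SDK or socket client in use.

```typescript
// Service identifiers listed in the table above.
type SpeechService =
  | "unified"
  | "unified:google"
  | "unified:aws"
  | "unified:deepgram";

// Shape of the "speech-recognition-start" payload, following the table above.
interface SpeechRecognitionStartOptions {
  service: SpeechService;
  sampleRate?: number;
  languageCode?: string;
  encoding?: string;
  customServiceParameters?: Record<string, unknown>;
  returnRaw?: boolean;
}

// Hypothetical helper: fills in the documented defaults on the client so the
// effective configuration is visible before the event is sent.
function withDocumentedDefaults(
  options: SpeechRecognitionStartOptions,
): Required<SpeechRecognitionStartOptions> {
  // Assumption: "pcm" is the default for AWS and "linear16" for the others.
  const defaultEncoding =
    options.service === "unified:aws" ? "pcm" : "linear16";
  return {
    service: options.service,
    sampleRate: options.sampleRate ?? 16000,
    languageCode: options.languageCode ?? "en-US",
    encoding: options.encoding ?? defaultEncoding,
    customServiceParameters: options.customServiceParameters ?? {},
    returnRaw: options.returnRaw ?? false,
  };
}

const startPayload = withDocumentedDefaults({
  service: "unified:aws",
  languageCode: "en-GB",
});
console.log(startPayload);
```

The server applies its own defaults, so this helper is optional. Its main use is to make the effective settings explicit when debugging.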

speech-recognition-chunk

Streams a chunk of audio data to the server.
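The exact payload expected by speech-recognition-chunk is not specified here. As a minimal sketch, assuming audio is captured as Float32 samples (the format the Web Audio API produces) and the stream was started with the "linear16" encoding, the samples could be converted to 16-bit PCM and split into chunks before sending. The function names floatTo16BitPcm and splitIntoChunks are hypothetical.

```typescript
// Convert Float32 samples in the range [-1, 1] to signed 16-bit PCM.
// Out-of-range samples are clamped rather than wrapped.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] =
      clamped < 0
        ? Math.round(clamped * 0x8000)
        : Math.round(clamped * 0x7fff);
  }
  return pcm;
}

// Split a PCM buffer into fixed-size chunks. The last chunk may be shorter.
function splitIntoChunks(pcm: Int16Array, samplesPerChunk: number): Int16Array[] {
  if (!Number.isInteger(samplesPerChunk) || samplesPerChunk <= 0) {
    throw new RangeError("samplesPerChunk must be a positive integer");
  }
  const chunks: Int16Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += samplesPerChunk) {
    // subarray creates a view over the same memory and clamps the end index.
    chunks.push(pcm.subarray(offset, offset + samplesPerChunk));
  }
  return chunks;
}

const samples = new Float32Array([0, 1, -1, 0.25]);
const chunks = splitIntoChunks(floatTo16BitPcm(samples), 2);
console.log(chunks.length); // 2
```

At the default sample rate of 16000 Hz, a chunk of 1600 samples corresponds to 100 ms of audio.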

speech-recognition-stop

Requests that the server stop processing the stream.

From the Charisma server to the client

Speech recognition results and errors, and started/stopped events are streamed to the client.

speech-recognition-started

After sending a request to the server to start a speech recognition stream, the server will respond with a speech-recognition-started event if successful, once the stream is connected.

`Label`	`Type`	`Always provided`
id	string	`Yes`
playerSessionId	string	`Yes`
service	string	`Yes`
parameters	object (showing the parameters used to start the service from the validated start request, and defaults)	`Yes`
startedAt	date	`Yes`

speech-recognition-stopped

For a started service, when a speech-recognition-stop is received by the server, this response indicates the server acknowledges the end of the streaming, with information about the stream.

`Label`	`Type`	`Always provided`
id	string	`Yes`
playerSessionId	string	`Yes`
service	string	`Yes`
parameters	object (showing the parameters used to start the service from the validated start request, and defaults)	`Yes`
startedAt	date	`Yes`
endedAt	date	`Yes`
creditCount	number	`Yes`

speech-recognition-result

The successful speech recognition response is adapted from the results provided by the downstream service. To see the original result without it being generalised, set the returnRaw parameter to true.

`Label`	`Type`	`Always provided`
text	string	`Yes`
isFinal	boolean or undefined	`Yes`
speechFinal	boolean	`No`
confidence	number	`No`
durationInSeconds	number	`No`

As results are streamed, you might wish to replace the text on screen with the latest transcription, until a result with isFinal equal to true is returned. You can then display the final text value on screen and send it as a reply in the Charisma conversation.

The speechFinal field indicates whether the intonation or other characteristics of the speech suggest that the speaker has finished. This feature is currently only available from Deepgram. For other services it will always be false.
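The display logic described above can be sketched as a small state update: every result replaces the on-screen text, and a result with isFinal equal to true is also queued as a reply. This is an illustrative sketch only. TranscriptState and applyResult are hypothetical names, and sending the reply to the conversation is left to the SDK in use.

```typescript
// Fields of a speech-recognition-result event, following the table above.
interface SpeechRecognitionResult {
  text: string;
  isFinal?: boolean;
  speechFinal?: boolean;
  confidence?: number;
  durationInSeconds?: number;
}

interface TranscriptState {
  // Text currently shown on screen (the latest transcription).
  displayText: string;
  // Final transcriptions that are ready to be sent as replies.
  replies: string[];
}

// Returns a new state rather than mutating, so it fits UI state containers.
function applyResult(
  state: TranscriptState,
  result: SpeechRecognitionResult,
): TranscriptState {
  if (result.isFinal === true) {
    return {
      displayText: result.text,
      replies: [...state.replies, result.text],
    };
  }
  // Interim result: replace the on-screen text only.
  return { displayText: result.text, replies: state.replies };
}

const results: SpeechRecognitionResult[] = [
  { text: "hel" },
  { text: "hello", isFinal: false },
  { text: "hello there", isFinal: true, speechFinal: true },
];

const finalState = results.reduce(applyResult, { displayText: "", replies: [] });
console.log(finalState.replies); // ["hello there"]
```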

speech-recognition-error

Errors can occur if speech-recognition-start parameters are not accepted by the downstream service, or for other reasons which will be outlined in errorDetails.

`Label`	`Type`	`Always provided`
errorDetails	unknown	`Yes`
errorOccurredWhen	string	`Yes`
message	string	`No`
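Because errorDetails is typed as unknown and message is optional, client code has to handle both defensively. The following is a minimal sketch of turning an error event into a loggable string. The helper name formatSpeechRecognitionError and the fallback wording are assumptions for illustration.

```typescript
// Fields of a speech-recognition-error event, following the table above.
interface SpeechRecognitionError {
  errorDetails: unknown;
  errorOccurredWhen: string;
  message?: string;
}

function formatSpeechRecognitionError(error: SpeechRecognitionError): string {
  // message is not always provided, so fall back to a generic summary.
  const summary = error.message ?? "Speech recognition failed";

  let details: string;
  if (typeof error.errorDetails === "string") {
    details = error.errorDetails;
  } else {
    try {
      // JSON.stringify returns undefined for some inputs (e.g. undefined).
      details = JSON.stringify(error.errorDetails) ?? String(error.errorDetails);
    } catch {
      // Circular structures and similar cannot be serialised.
      details = String(error.errorDetails);
    }
  }

  return `${summary} (when: ${error.errorOccurredWhen}; details: ${details})`;
}

console.log(
  formatSpeechRecognitionError({
    errorDetails: { code: 400 },
    errorOccurredWhen: "start",
  }),
);
```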

Service Specific Options

Additional speech recognition parameters can be provided, which pass straight through to the service you have chosen. Not all parameters are supported, so please consult the lists below.

Warning! Using these parameters does not add any additional fields to the generalised speech-recognition-result payloads. To see the additional fields, turn on returnRaw. If you have further requirements, please contact us at hello@charisma.ai.

For each service below, the listed optional parameters can be added to customServiceParameters and will be passed to that service. Be sure to provide values that the service will accept.

AWS

For more information see https://docs.aws.amazon.com/transcribe/latest/APIReference/API_streaming_StartStreamTranscription.html

Supported fields:

SessionId
ShowSpeakerLabel
EnableChannelIdentification
NumberOfChannels
EnablePartialResultsStabilization
PartialResultsStability
ContentIdentificationType
ContentRedactionType
PiiEntityTypes

Deepgram

For more information see https://developers.deepgram.com/reference/streaming

Supported fields:

model
tier
version
punctuate
profanity_filter
redact
diarize
diarize_version
smart_format
multichannel
alternatives
numerals
search
replace
callback
keywords
interim_results
endpointing
channels

Google

For more information see https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig

Supported fields:

audioChannelCount
enableSeparateRecognitionPerChannel
alternativeLanguageCodes
maxAlternatives
profanityFilter
adaptation
speechContexts
enableWordTimeOffsets
enableWordConfidence
enableAutomaticPunctuation
enableSpokenPunctuation
enableSpokenEmojis
diarizationConfig
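The per-service field lists above can be used as a client-side allow-list, so that unsupported keys are caught before a start request is sent. This is a sketch under assumptions: the helper pickSupportedParameters is hypothetical, the lists are copied from this page and may go out of date, and "unified" is left out because its underlying provider may change.

```typescript
// Supported customServiceParameters keys, copied from the lists above.
const SUPPORTED_FIELDS = {
  "unified:aws": [
    "SessionId", "ShowSpeakerLabel", "EnableChannelIdentification",
    "NumberOfChannels", "EnablePartialResultsStabilization",
    "PartialResultsStability", "ContentIdentificationType",
    "ContentRedactionType", "PiiEntityTypes",
  ],
  "unified:deepgram": [
    "model", "tier", "version", "punctuate", "profanity_filter", "redact",
    "diarize", "diarize_version", "smart_format", "multichannel",
    "alternatives", "numerals", "search", "replace", "callback", "keywords",
    "interim_results", "endpointing", "channels",
  ],
  "unified:google": [
    "audioChannelCount", "enableSeparateRecognitionPerChannel",
    "alternativeLanguageCodes", "maxAlternatives", "profanityFilter",
    "adaptation", "speechContexts", "enableWordTimeOffsets",
    "enableWordConfidence", "enableAutomaticPunctuation",
    "enableSpokenPunctuation", "enableSpokenEmojis", "diarizationConfig",
  ],
} as const;

type ExplicitService = keyof typeof SUPPORTED_FIELDS;

// Splits the given parameters into keys the chosen service supports and
// keys it does not. Values are passed through unchanged and not validated.
function pickSupportedParameters(
  service: ExplicitService,
  params: Record<string, unknown>,
): { accepted: Record<string, unknown>; rejected: string[] } {
  const allowed: readonly string[] = SUPPORTED_FIELDS[service];
  const accepted: Record<string, unknown> = {};
  const rejected: string[] = [];
  for (const [key, value] of Object.entries(params)) {
    if (allowed.includes(key)) {
      accepted[key] = value;
    } else {
      rejected.push(key);
    }
  }
  return { accepted, rejected };
}

const checked = pickSupportedParameters("unified:deepgram", {
  punctuate: true,
  enableAutomaticPunctuation: true, // a Google field, not a Deepgram one
});
console.log(checked.rejected); // ["enableAutomaticPunctuation"]
```

Only the key names are checked here. Whether a value is valid is still decided by the downstream service.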
