Speech Recognition

Using the JavaScript SDK Version 5 or Above?

The @charisma-ai/sdk:5.0.0 NPM package uses an Audio Manager to handle speech recognition instead of this system. See the JavaScript SDK docs for more information.

Charisma supports speech recognition services for Pro stories, to allow your players to speak to characters using their voice.

Speech-to-text can be integrated easily via the Charisma SDKs and supports several speech recognition providers. It is enabled by default in Pro stories, and can be disabled in the story overview screen by ticking the box under "Premium Features".

Once a playthrough is connected, audio can be streamed from the player to Charisma, with results being sent back continuously as the player speaks.

Stream Events

From the client to the Charisma server

speech-recognition-start

Sending a "speech-recognition-start" starts up the downstream service ready to accept audio chunks.

| Label | Type | Optional | Default | Comment |
| --- | --- | --- | --- | --- |
| service | string: one of "unified", "unified:google", "unified:aws", or "unified:deepgram" | No | | "unified" uses Deepgram but may change. |
| sampleRate | number | Yes | 16000 | Sample rate in Hertz of the audio data sent. 16000 is optimal. |
| languageCode | string | Yes | "en-US" | |
| encoding | string | Yes | "linear16" for Google and Deepgram, "pcm" for AWS | |
| customServiceParameters | object | Yes | {} | See Service Specific Options below. |
| returnRaw | boolean | Yes | false | Use for debugging; returns the response from the downstream service without changes. |

For the most recent supported values of sampleRate, languageCode, and encoding, see the provider documentation linked for each service under Service Specific Options.
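
As a concrete illustration, here is a minimal sketch of sending the start event. The PlaythroughConnection interface, its emit/on methods, and the startSpeechRecognition helper are assumptions standing in for however your SDK exposes the event stream; consult the SDK docs for the exact call.

```ts
// Hypothetical interface standing in for however your SDK exposes the
// playthrough's event stream; the real object and method names may differ.
interface PlaythroughConnection {
  emit(eventName: string, payload?: unknown): void;
  on(eventName: string, handler: (payload: any) => void): void;
}

function startSpeechRecognition(connection: PlaythroughConnection): void {
  connection.emit("speech-recognition-start", {
    service: "unified:deepgram", // or "unified", "unified:google", "unified:aws"
    sampleRate: 16000,           // 16000 Hz is optimal
    languageCode: "en-US",
    encoding: "linear16",        // use "pcm" for AWS
    customServiceParameters: {},
    returnRaw: false,
  });
}
```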

speech-recognition-chunk

Streams the audio data from the client to the server, one chunk per event.
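
Below is a sketch of capturing microphone audio in the browser, converting it to 16-bit PCM ("linear16"), and forwarding each chunk. The payload framing is an assumption (raw PCM buffers via the hypothetical emit method from the sketch above), and ScriptProcessorNode is deprecated in favour of AudioWorklet but keeps the example short.

```ts
// Convert Web Audio float samples (-1..1) to 16-bit PCM ("linear16").
function floatTo16BitPcm(input: Float32Array): ArrayBuffer {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    output[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return output.buffer;
}

async function streamMicrophone(connection: PlaythroughConnection): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  // Match the sampleRate declared in speech-recognition-start.
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const source = audioContext.createMediaStreamSource(stream);

  // ScriptProcessorNode is deprecated (AudioWorklet is the modern
  // equivalent) but is widely supported and keeps this sketch brief.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0);
    connection.emit("speech-recognition-chunk", floatTo16BitPcm(samples));
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}
```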

speech-recognition-stop

Requests that the server stop processing the stream.
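
Continuing the sketches above, stopping is a single event:

```ts
// Ask the server to finish processing; a speech-recognition-stopped
// event (see below) acknowledges the end of the stream.
connection.emit("speech-recognition-stop");
```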

From the Charisma server to the client

Speech recognition results, errors, and started/stopped events are streamed to the client.

speech-recognition-started

After the client requests a speech recognition stream, the server responds with a speech-recognition-started event once the stream is successfully connected.

| Label | Type | Always provided |
| --- | --- | --- |
| id | string | Yes |
| playerSessionId | string | Yes |
| service | string | Yes |
| parameters | object (the parameters used to start the service, taken from the validated start request, including defaults) | Yes |
| startedAt | date | Yes |

speech-recognition-stopped

When the server receives a speech-recognition-stop for a started service, it acknowledges the end of streaming with this event, which includes information about the stream.

| Label | Type | Always provided |
| --- | --- | --- |
| id | string | Yes |
| playerSessionId | string | Yes |
| service | string | Yes |
| parameters | object (the parameters used to start the service, taken from the validated start request, including defaults) | Yes |
| startedAt | date | Yes |
| endedAt | date | Yes |
| creditCount | number | Yes |
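
A sketch of listening for both lifecycle events, reusing the hypothetical connection interface from the earlier sketches:

```ts
connection.on("speech-recognition-started", (event) => {
  // Stream is live; audio chunks will now be transcribed.
  console.log(`Stream ${event.id} started at ${event.startedAt} via ${event.service}`);
});

connection.on("speech-recognition-stopped", (event) => {
  // Stream has ended; creditCount reports usage for the session.
  console.log(`Stream ${event.id} ended at ${event.endedAt}, credits used: ${event.creditCount}`);
});
```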

speech-recognition-result

The successful speech recognition response is adapted from the results provided by the downstream service. To see the original result without this generalisation, set the returnRaw parameter to true.

| Label | Type | Always provided |
| --- | --- | --- |
| text | string | Yes |
| isFinal | boolean or undefined | Yes |
| speechFinal | boolean | No |
| confidence | number | No |
| durationInSeconds | number | No |

As results are streamed, you might wish to replace the text on screen with the latest transcription, until a result with isFinal equal to true is returned. You can then display the final text value on screen and send it as a reply in the Charisma conversation.

The speechFinal field indicates whether the intonation and other characteristics of the speech suggest the speaker has finished. This feature is currently only available from Deepgram; for other services it will always be false.
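
Putting this together, here is a sketch of a result handler that shows interim text on screen and sends the final transcription as a reply. The "transcript" element id and the sendReply helper are hypothetical, not SDK functions; substitute your own UI code and reply call.

```ts
// Assumed shape of the generalised result payload, based on the table above.
interface SpeechRecognitionResult {
  text: string;
  isFinal?: boolean;
  speechFinal?: boolean;
  confidence?: number;
  durationInSeconds?: number;
}

// Hypothetical helper that sends the reply into the Charisma conversation;
// substitute your SDK's actual reply call.
declare function sendReply(text: string): void;

const transcriptElement = document.getElementById("transcript")!;

connection.on("speech-recognition-result", (result: SpeechRecognitionResult) => {
  // Replace the on-screen text with the latest interim transcription.
  transcriptElement.textContent = result.text;

  if (result.isFinal) {
    // Final transcription: keep it on screen and send it as the player's reply.
    sendReply(result.text);
  }
});
```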

speech-recognition-error

Errors can occur if the speech-recognition-start parameters are not accepted by the downstream service, or for other reasons, which are outlined in errorDetails.

| Label | Type | Always provided |
| --- | --- | --- |
| errorDetails | unknown | Yes |
| errorOccurredWhen | string | Yes |
| message | string | No |
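
A minimal error listener, again using the hypothetical connection interface from the sketches above:

```ts
connection.on("speech-recognition-error", (error) => {
  // errorOccurredWhen describes the stage at which the failure happened;
  // errorDetails carries the underlying, service-specific error.
  console.error(
    `Speech recognition error (${error.errorOccurredWhen}):`,
    error.message ?? "no message",
    error.errorDetails,
  );
});
```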

Service Specific Options

Additional speech recognition parameters can be provided, which pass straight through to the service you have chosen. Not all parameters are supported; please consult the lists below.

Warning! Using these parameters does not add any additional fields to the generalised speech-recognition-result payloads. If you want to see these, either turn on returnRaw, or contact us at hello@charisma.ai if you have further requirements.

For each service below, the listed optional parameters can be added to customServiceParameters and will be passed through to that service. Be sure to provide values the service will accept.
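
For example, a start payload passing Deepgram-specific options might look like the sketch below. The option values shown are illustrative only; check Deepgram's documentation (linked in the Deepgram section) for accepted values.

```ts
connection.emit("speech-recognition-start", {
  service: "unified:deepgram",
  sampleRate: 16000,
  languageCode: "en-US",
  encoding: "linear16",
  customServiceParameters: {
    // Deepgram-specific options, passed through unchanged.
    punctuate: true,
    interim_results: true,
    endpointing: 300, // milliseconds of silence before finalising
  },
});
```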

AWS

For more information see https://docs.aws.amazon.com/transcribe/latest/APIReference/API_streaming_StartStreamTranscription.html

Supported fields:
  • SessionId
  • ShowSpeakerLabel
  • EnableChannelIdentification
  • NumberOfChannels
  • EnablePartialResultsStabilization
  • PartialResultsStability
  • ContentIdentificationType
  • ContentRedactionType
  • PiiEntityTypes

Deepgram

For more information see https://developers.deepgram.com/reference/streaming

Supported fields:
  • model
  • tier
  • version
  • punctuate
  • profanity_filter
  • redact
  • diarize
  • diarize_version
  • smart_format
  • multichannel
  • alternatives
  • numerals
  • search
  • replace
  • callback
  • keywords
  • interim_results
  • endpointing
  • channels

Google

For more information see https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig

Supported fields:
  • audioChannelCount
  • enableSeparateRecognitionPerChannel
  • alternativeLanguageCodes
  • maxAlternatives
  • profanityFilter
  • adaptation
  • speechContexts
  • enableWordTimeOffsets
  • enableWordConfidence
  • enableAutomaticPunctuation
  • enableSpokenPunctuation
  • enableSpokenEmojis
  • diarizationConfig