Speech Recognition

Charisma supports speech recognition services for Pro stories, to allow your players to speak to characters using their voice.

Speech-to-text is integrated through the Charisma SDKs and supports several speech recognition providers. It is enabled by default in Pro stories, and can be disabled on the story overview screen using the checkbox under "Premium Features".

Once a playthrough is connected, audio can be streamed from the player to Charisma, with results being sent back continuously as the player speaks.

Data Models

From the client to the Charisma server


Sending a "speech-recognition-start" event starts the downstream service so that it is ready to accept audio chunks.

| Label | Type | Optional | Default | Description |
|---|---|---|---|---|
| service | must be one of the strings: "unified", "unified:google", "unified:aws", or "unified:deepgram" | No | | "unified" uses Deepgram but may change. |
| sampleRate | number | Yes | 16000 | Sample rate in Hertz of the audio data sent. 16000 is optimal. |
| encoding | string | Yes | "linear16" for Google and Deepgram, "pcm" for AWS | |
| customServiceParameters | object | Yes | {} | See Service Specific Options below. |
| returnRaw | boolean | Yes | false | Use for debugging. Returns the response from the downstream service without changes. |

For the most recent supported values of sampleRate, languageCode, and encoding, see each provider's documentation, which is linked for each service under Service Specific Options.
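As a rough illustration of the start parameters described above, the following TypeScript sketch builds a "speech-recognition-start" payload with the documented defaults filled in. The type and function names (SpeechService, SpeechRecognitionStartOptions, buildStartPayload) are hypothetical and not part of any Charisma SDK. Treating "unified" like Deepgram for the default encoding is an assumption based on the note that "unified" currently uses Deepgram. How the payload is actually sent to the server is not shown.

```typescript
// Service identifiers listed in the parameter description above.
type SpeechService = "unified" | "unified:google" | "unified:aws" | "unified:deepgram";

// Shape of the start payload, following the documented parameters.
interface SpeechRecognitionStartOptions {
  service: SpeechService;
  sampleRate?: number;
  encoding?: string;
  customServiceParameters?: Record<string, unknown>;
  returnRaw?: boolean;
}

// Hypothetical helper: fills in the documented defaults so the payload is explicit.
function buildStartPayload(
  options: SpeechRecognitionStartOptions,
): Required<SpeechRecognitionStartOptions> {
  // "pcm" is the documented default for AWS, "linear16" for Google and Deepgram.
  // Assumption: "unified" currently resolves to Deepgram, so it also gets "linear16".
  const defaultEncoding = options.service === "unified:aws" ? "pcm" : "linear16";
  return {
    service: options.service,
    sampleRate: options.sampleRate ?? 16000,
    encoding: options.encoding ?? defaultEncoding,
    customServiceParameters: options.customServiceParameters ?? {},
    returnRaw: options.returnRaw ?? false,
  };
}

const awsPayload = buildStartPayload({ service: "unified:aws" });
const deepgramPayload = buildStartPayload({ service: "unified:deepgram", returnRaw: true });
```

Making every default explicit on the client is a design choice for readability. Omitting the optional fields should be equivalent if the server applies the same defaults.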


For streaming the audio data.
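A minimal sketch of preparing audio for streaming, assuming the chosen encoding denotes 16-bit signed little-endian PCM (the usual meaning of "linear16") and that the microphone delivers floating-point samples in the range -1 to 1. The helper names floatTo16BitPcm and splitIntoChunks are hypothetical. The event used to send each chunk and the chunk size expected by the server are not specified here, so the transport step is left out.

```typescript
// Convert floating-point samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
function floatTo16BitPcm(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  samples.forEach((sample, index) => {
    // Clamp to the valid range before scaling to avoid integer overflow.
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(index * 2, Math.round(scaled), true); // true = little-endian
  });
  return bytes;
}

// Split encoded audio into fixed-size chunks that can be streamed one at a time.
// chunkSize is in bytes and should be even so that no sample is split in two.
function splitIntoChunks(bytes: Uint8Array, chunkSize: number): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    chunks.push(bytes.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

const pcm = floatTo16BitPcm(new Float32Array([0, 1, -1, 0.5]));
const chunks = splitIntoChunks(pcm, 4);
```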

From the Charisma server to the client

Speech recognition results and errors are streamed to the client.


The successful speech recognition response is adapted from the results provided by the downstream service. To see the original result without it being generalised, set the returnRaw parameter to true.

| Label | Type | Always provided |
|---|---|---|
| isFinal | boolean or undefined | Yes |

As results are streamed, you might wish to replace the text on screen with the latest transcription, until a result with isFinal equal to true is returned. You can then display the final text value on screen and send it as a reply in the Charisma conversation.
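The handling described above can be sketched as a small tracker that replaces the on-screen text with each new result and collects the text of final results. The RecognitionResult shape is an assumption that uses only the text and isFinal fields mentioned in this section, and TranscriptTracker is a hypothetical name. Sending the reply to the Charisma conversation is left to the caller.

```typescript
// Assumed minimal shape of a recognition result; only `text` and `isFinal` are used.
interface RecognitionResult {
  text: string;
  isFinal?: boolean;
}

// Hypothetical tracker: keeps the text to show on screen and collects finished utterances.
class TranscriptTracker {
  displayText = "";
  readonly replies: string[] = [];

  handleResult(result: RecognitionResult): void {
    // Always show the latest transcription, replacing the previous interim text.
    this.displayText = result.text;
    if (result.isFinal === true) {
      // A final result is ready to be sent as a reply in the conversation.
      this.replies.push(result.text);
    }
  }
}

const tracker = new TranscriptTracker();
tracker.handleResult({ text: "hello" });
tracker.handleResult({ text: "hello there", isFinal: false });
tracker.handleResult({ text: "Hello there.", isFinal: true });
```

Comparing isFinal strictly against true means that a missing or undefined value is treated as an interim result.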

The field speechFinal indicates whether the intonation or other characteristics of the speech suggest that the speaker has finished. This feature is currently only available from Deepgram; for other services it will always be false.


Errors can occur if the speech-recognition-start parameters are not accepted by the downstream service, or for other reasons, which are outlined in errorDetails.

| Label | Type | Always provided |

Service Specific Options

Additional speech recognition parameters can be provided, which are passed straight through to the service you have chosen. Not all parameters are supported; please consult the lists below.

Warning! Using these parameters does not add any fields to the generalised speech-recognition-response payloads. If you want to see these, either turn on returnRaw or, if you have further requirements, please discuss them with us by contacting

For each service below, the listed optional parameters can be added to customServiceParameters and will be passed to that service. Be sure to provide values that the service will accept.

AWS

For more information, see the provider's documentation.

View full list of supported fields
  • SessionId
  • ShowSpeakerLabel
  • EnableChannelIdentification
  • NumberOfChannels
  • EnablePartialResultsStabilization
  • PartialResultsStability
  • ContentIdentificationType
  • ContentRedactionType
  • PiiEntityTypes

Deepgram

For more information, see the provider's documentation.

View full list of supported fields
  • model
  • tier
  • version
  • punctuate
  • profanity_filter
  • redact
  • diarize
  • diarize_version
  • smart_format
  • multichannel
  • alternatives
  • numerals
  • search
  • replace
  • callback
  • keywords
  • interim_results
  • endpointing
  • channels

Google

For more information, see the provider's documentation.

View full list of supported fields
  • audioChannelCount
  • enableSeparateRecognitionPerChannel
  • alternativeLanguageCodes
  • maxAlternatives
  • profanityFilter
  • adaptation
  • speechContexts
  • enableWordTimeOffsets
  • enableWordConfidence
  • enableAutomaticPunctuation
  • enableSpokenPunctuation
  • enableSpokenEmojis
  • diarizationConfig
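Because only the listed fields are passed through, it can help to check customServiceParameters on the client before starting recognition. The sketch below is one way to do that. It assumes the three lists above correspond, in order, to AWS, Deepgram, and Google, which is inferred from the parameter naming. The names SUPPORTED_FIELDS, Provider, and findUnsupportedKeys are hypothetical. The check covers key names only and does not validate values.

```typescript
type Provider = "aws" | "deepgram" | "google";

// Field names copied from the lists above; the provider mapping is an assumption.
const SUPPORTED_FIELDS: Record<Provider, readonly string[]> = {
  aws: [
    "SessionId", "ShowSpeakerLabel", "EnableChannelIdentification", "NumberOfChannels",
    "EnablePartialResultsStabilization", "PartialResultsStability",
    "ContentIdentificationType", "ContentRedactionType", "PiiEntityTypes",
  ],
  deepgram: [
    "model", "tier", "version", "punctuate", "profanity_filter", "redact", "diarize",
    "diarize_version", "smart_format", "multichannel", "alternatives", "numerals",
    "search", "replace", "callback", "keywords", "interim_results", "endpointing",
    "channels",
  ],
  google: [
    "audioChannelCount", "enableSeparateRecognitionPerChannel", "alternativeLanguageCodes",
    "maxAlternatives", "profanityFilter", "adaptation", "speechContexts",
    "enableWordTimeOffsets", "enableWordConfidence", "enableAutomaticPunctuation",
    "enableSpokenPunctuation", "enableSpokenEmojis", "diarizationConfig",
  ],
};

// Returns the keys in `params` that are not in the supported list for the provider.
function findUnsupportedKeys(provider: Provider, params: Record<string, unknown>): string[] {
  const supported = SUPPORTED_FIELDS[provider];
  return Object.keys(params).filter((key) => supported.indexOf(key) === -1);
}

const unsupportedForDeepgram = findUnsupportedKeys("deepgram", {
  punctuate: true,
  ShowSpeakerLabel: true, // an AWS-style field, not in the Deepgram list
});
const unsupportedForAws = findUnsupportedKeys("aws", { SessionId: "example-session" });
```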