Speech Recognition

Charisma supports speech recognition services for Pro stories, so that your players can speak to characters using their voice.

Speech-to-text is integrated through the Charisma SDKs and supports several speech recognition providers. It is enabled by default in Pro stories, and can be disabled on the story overview screen using the checkbox under "Premium Features".

Once a playthrough is connected, audio can be streamed from the player to Charisma, with results being sent back continuously as the player speaks.

Data Models

From the client to the Charisma server

speech-recognition-start

Sending "speech-recognition-start" starts the downstream service so that it is ready to accept audio chunks.

| Label | Type | Optional | Default | Comment |
| --- | --- | --- | --- | --- |
| service | One of the strings "unified", "unified:google", "unified:aws", or "unified:deepgram" | No | | "unified" uses Deepgram but may change. |
| sampleRate | number | Yes | 16000 | Sample rate in Hertz of the audio data sent. 16000 is optimal. |
| languageCode | string | Yes | "en-US" | |
| encoding | string | Yes | "linear16" for Google and Deepgram, "pcm" for AWS | |
| customServiceParameters | object | Yes | {} | See Service Specific Options below. |
| returnRaw | boolean | Yes | false | Use for debugging. Returns the response from the downstream service without changes. |

For the most recent supported values for sampleRate, languageCode, and encoding, consult the provider's documentation, which is linked for each service under Service Specific Options.
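As an illustration of the parameters above, the following sketch builds a "speech-recognition-start" payload with the documented defaults filled in explicitly. The helper name `buildStartPayload` and the type names are hypothetical and not part of the Charisma SDK. The sketch also assumes that the per-service encoding values listed in the table are the defaults, and it does not show how the payload is sent, since that depends on the SDK you use.

```typescript
// Field names and default values are taken from the table above.
// The helper and type names are illustrative, not Charisma SDK APIs.
type SpeechRecognitionService =
  | "unified"
  | "unified:google"
  | "unified:aws"
  | "unified:deepgram";

interface SpeechRecognitionStartOptions {
  service: SpeechRecognitionService;
  sampleRate?: number;
  languageCode?: string;
  encoding?: string;
  customServiceParameters?: Record<string, unknown>;
  returnRaw?: boolean;
}

// Hypothetical helper: makes every optional field explicit.
function buildStartPayload(
  options: SpeechRecognitionStartOptions
): Required<SpeechRecognitionStartOptions> {
  // Assumption: "pcm" applies to AWS and "linear16" to everything else,
  // including "unified", which currently uses Deepgram.
  const defaultEncoding =
    options.service === "unified:aws" ? "pcm" : "linear16";
  return {
    service: options.service,
    sampleRate: options.sampleRate ?? 16000,
    languageCode: options.languageCode ?? "en-US",
    encoding: options.encoding ?? defaultEncoding,
    customServiceParameters: options.customServiceParameters ?? {},
    returnRaw: options.returnRaw ?? false,
  };
}

const startPayload = buildStartPayload({ service: "unified:aws" });
console.log(startPayload);
```

Spelling out the defaults on the client is optional. It can make debugging easier, because the exact values sent to the downstream service are visible in one place.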

speech-recognition-chunk

Used to stream the audio data.
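Audio captured in a browser or game engine is often available as floating-point samples, whereas the "linear16" and "pcm" encodings expect 16-bit signed little-endian PCM. The sketch below shows one way to convert and split such audio before streaming it. The function names and the chunk size are assumptions made for illustration, and the exact chunk payload expected by Charisma is not specified here, so check your SDK before relying on this.

```typescript
// Convert Float32 samples in the range [-1, 1] to 16-bit signed
// little-endian PCM. Function names here are illustrative only.
function floatToLinear16(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  samples.forEach((sample, index) => {
    // Clamp first so out-of-range input cannot overflow the int16 range.
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(index * 2, Math.round(scaled), true); // true = little-endian
  });
  return buffer;
}

// Split a PCM buffer into chunks of at most `chunkBytes` bytes.
// The chunk size is an arbitrary choice in this sketch.
function splitIntoChunks(pcm: ArrayBuffer, chunkBytes: number): ArrayBuffer[] {
  if (chunkBytes <= 0 || chunkBytes % 2 !== 0) {
    throw new Error("chunkBytes must be a positive even number");
  }
  const chunks: ArrayBuffer[] = [];
  for (let offset = 0; offset < pcm.byteLength; offset += chunkBytes) {
    chunks.push(pcm.slice(offset, offset + chunkBytes));
  }
  return chunks;
}

// Example: 100 ms of silence at 16000 Hz, split into 1024-byte chunks.
const silence = new Float32Array(1600);
const audioChunks = splitIntoChunks(floatToLinear16(silence), 1024);
console.log(audioChunks.length);
```

Each resulting chunk would then be passed to whatever method your SDK provides for sending "speech-recognition-chunk".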

From the Charisma server to the client

Speech recognition results and errors are streamed to the client.

speech-recognition-result

A successful speech recognition response is adapted from the results provided by the downstream service. To see the original result without it being generalised, set the returnRaw parameter to true.

| Label | Type | Always provided |
| --- | --- | --- |
| text | string | Yes |
| isFinal | boolean or undefined | Yes |
| speechFinal | boolean | No |
| confidence | number | No |
| durationInSeconds | number | No |

As results are streamed, you might wish to replace the text on screen with the latest transcription, until a result with isFinal equal to true is returned. You can then display the final text value on screen and send it as a reply in the Charisma conversation.

The speechFinal field indicates whether the intonation or other characteristics of the speech suggest that the player has finished speaking. This feature is currently only available from Deepgram; for other services it will always be false.
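The interim and final result handling described above can be sketched as a small tracker that keeps the latest transcription for display and hands the final text to a callback. The names `createTranscriptTracker` and `onFinal` are hypothetical. How results are received and how the reply is sent depend on the SDK, so the callback below only collects the text.

```typescript
// Shape of a "speech-recognition-result" payload, following the table above.
interface SpeechRecognitionResult {
  text: string;
  isFinal: boolean | undefined;
  speechFinal?: boolean;
  confidence?: number;
  durationInSeconds?: number;
}

// Hypothetical helper: tracks the text to show on screen and calls
// `onFinal` once a result with isFinal equal to true arrives.
function createTranscriptTracker(onFinal: (text: string) => void) {
  let displayText = "";
  return {
    handleResult(result: SpeechRecognitionResult): void {
      // Always replace the on-screen text with the latest transcription.
      displayText = result.text;
      if (result.isFinal === true) {
        onFinal(result.text);
      }
    },
    getDisplayText(): string {
      return displayText;
    },
  };
}

// Usage: the callback stands in for sending the reply to the conversation.
const sentReplies: string[] = [];
const tracker = createTranscriptTracker((text) => {
  sentReplies.push(text);
});
tracker.handleResult({ text: "hello", isFinal: false });
tracker.handleResult({ text: "hello there", isFinal: true });
console.log(tracker.getDisplayText(), sentReplies);
```

Checking `isFinal === true` rather than truthiness keeps the behaviour explicit when the field is undefined.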

speech-recognition-error

Errors can occur if speech-recognition-start parameters are not accepted by the downstream service, or for other reasons which will be outlined in errorDetails.

| Label | Type | Always provided |
| --- | --- | --- |
| errorDetails | unknown | Yes |
| errorOccurredWhen | string | Yes |
| message | string | No |
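Because errorDetails has no fixed shape and message is not always provided, client code benefits from handling both defensively. The following sketch turns an error payload into a single log line. The function name, the fallback text, and the sample errorOccurredWhen value are made up for illustration, as the actual values are not listed here.

```typescript
// Shape of a "speech-recognition-error" payload, following the table above.
interface SpeechRecognitionError {
  errorDetails: unknown;
  errorOccurredWhen: string;
  message?: string;
}

// Hypothetical helper: builds a readable one-line description.
function describeSpeechRecognitionError(error: SpeechRecognitionError): string {
  const summary = error.message ?? "Speech recognition failed";
  let details: string;
  if (typeof error.errorDetails === "string") {
    details = error.errorDetails;
  } else {
    try {
      // JSON.stringify returns undefined for values such as undefined.
      const json: string | undefined = JSON.stringify(error.errorDetails);
      details = json === undefined ? String(error.errorDetails) : json;
    } catch {
      // Circular structures cannot be serialised, so fall back to String().
      details = String(error.errorDetails);
    }
  }
  return `${summary} (when: ${error.errorOccurredWhen}; details: ${details})`;
}

// "example-stage" is a placeholder, not a documented value.
console.log(
  describeSpeechRecognitionError({
    errorDetails: { code: 400 },
    errorOccurredWhen: "example-stage",
    message: "Bad request",
  })
);
```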

Service Specific Options

Additional speech recognition parameters can be provided, which pass straight through to the service you have chosen. Not all parameters are supported; please consult the lists below.

Warning! Using these parameters does not add any additional fields to the generalised speech-recognition-result payloads. If you want to see these, turn on returnRaw, or contact us at hello@charisma.ai to discuss any further requirements.

For each service below, the listed optional parameters can be added to customServiceParameters and will be passed to that service. Be sure to provide values that the service will accept.

AWS

For more information see https://docs.aws.amazon.com/transcribe/latest/APIReference/API_streaming_StartStreamTranscription.html

Supported fields:
  • SessionId
  • ShowSpeakerLabel
  • EnableChannelIdentification
  • NumberOfChannels
  • EnablePartialResultsStabilization
  • PartialResultsStability
  • ContentIdentificationType
  • ContentRedactionType
  • PiiEntityTypes

Deepgram

For more information see https://developers.deepgram.com/reference/streaming

Supported fields:
  • model
  • tier
  • version
  • punctuate
  • profanity_filter
  • redact
  • diarize
  • diarize_version
  • smart_format
  • multichannel
  • alternatives
  • numerals
  • search
  • replace
  • callback
  • keywords
  • interim_results
  • endpointing
  • channels

Google

For more information see https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig

Supported fields:
  • audioChannelCount
  • enableSeparateRecognitionPerChannel
  • alternativeLanguageCodes
  • maxAlternatives
  • profanityFilter
  • adaptation
  • speechContexts
  • enableWordTimeOffsets
  • enableWordConfidence
  • enableAutomaticPunctuation
  • enableSpokenPunctuation
  • enableSpokenEmojis
  • diarizationConfig
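Since only the fields listed above are passed through, it can help to check customServiceParameters on the client before sending them. The sketch below copies the three lists from this page and separates supported keys from unsupported ones. The constant and function names are hypothetical, and this check validates only the key names, not whether the values will be accepted by the service.

```typescript
// Supported field names copied from the lists above.
const SUPPORTED_CUSTOM_PARAMETERS = {
  aws: [
    "SessionId", "ShowSpeakerLabel", "EnableChannelIdentification",
    "NumberOfChannels", "EnablePartialResultsStabilization",
    "PartialResultsStability", "ContentIdentificationType",
    "ContentRedactionType", "PiiEntityTypes",
  ],
  deepgram: [
    "model", "tier", "version", "punctuate", "profanity_filter", "redact",
    "diarize", "diarize_version", "smart_format", "multichannel",
    "alternatives", "numerals", "search", "replace", "callback", "keywords",
    "interim_results", "endpointing", "channels",
  ],
  google: [
    "audioChannelCount", "enableSeparateRecognitionPerChannel",
    "alternativeLanguageCodes", "maxAlternatives", "profanityFilter",
    "adaptation", "speechContexts", "enableWordTimeOffsets",
    "enableWordConfidence", "enableAutomaticPunctuation",
    "enableSpokenPunctuation", "enableSpokenEmojis", "diarizationConfig",
  ],
} as const;

type Provider = keyof typeof SUPPORTED_CUSTOM_PARAMETERS;

// Hypothetical helper: splits parameters into supported and unsupported keys.
function splitCustomParameters(
  provider: Provider,
  params: Record<string, unknown>
): { accepted: Record<string, unknown>; rejected: string[] } {
  const supported: readonly string[] = SUPPORTED_CUSTOM_PARAMETERS[provider];
  const accepted: Record<string, unknown> = {};
  const rejected: string[] = [];
  for (const [key, value] of Object.entries(params)) {
    if (supported.includes(key)) {
      accepted[key] = value;
    } else {
      rejected.push(key);
    }
  }
  return { accepted, rejected };
}

const checked = splitCustomParameters("deepgram", {
  punctuate: true,
  enableAutomaticPunctuation: true, // a Google field, so rejected for Deepgram
});
console.log(checked.accepted, checked.rejected);
```

Field names are case-sensitive and differ in style between providers, so a key that is valid for one service is not necessarily valid for another.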