Speech Recognition
Using the JavaScript SDK Version 5 or Above?
The @charisma-ai/sdk:5.0.0 NPM package uses an Audio Manager to handle speech recognition instead of this system. See the JavaScript SDK docs for more information.
Charisma supports speech recognition services for Pro stories, to allow your players to speak to characters using their voice.
Speech to text can be integrated easily via the Charisma SDKs and supports different speech recognition providers. It is enabled in Pro stories by default. You can also disable it in the story overview screen by ticking the box under "Premium Features".
Once a playthrough is connected, audio can be streamed from the player to Charisma, with results being sent back continuously as the player speaks.
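The start/chunk/stop flow described above can be sketched as follows. This is a minimal illustration assuming a socket-style `emit` function; the exact client API depends on the SDK version you use.

```typescript
// Sketch of the client-side streaming flow. The `emit` signature here is
// an assumption, standing in for whatever transport your SDK exposes.
type Emit = (event: string, payload?: unknown) => void;

function streamSpeech(emit: Emit, audioChunks: ArrayBuffer[]): void {
  // 1. Ask the server to start the downstream recognition service.
  emit("speech-recognition-start", { service: "unified", sampleRate: 16000 });
  // 2. Stream audio chunks as the player speaks; results arrive continuously.
  for (const chunk of audioChunks) {
    emit("speech-recognition-chunk", chunk);
  }
  // 3. Ask the server to stop processing the stream.
  emit("speech-recognition-stop");
}
```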
Stream Events
From the client to the Charisma server
speech-recognition-start
Sending a speech-recognition-start event starts up the downstream service, ready to accept audio chunks.
Label | Type | Optional | Default | Comment |
---|---|---|---|---|
service | one of the strings "unified", "unified:google", "unified:aws", or "unified:deepgram" | No | | "unified" currently uses Deepgram, but this may change. |
sampleRate | number | Yes | 16000 | Sample rate in Hertz of the audio data sent. 16000 is optimal. |
languageCode | string | Yes | "en-US" | |
encoding | string | Yes | "linear16" for Google and Deepgram, "pcm" for AWS | |
customServiceParameters | object | Yes | {} | See Service Specific Options below. |
returnRaw | boolean | Yes | false | Use for debugging; returns the response from the downstream service without changes. |
To see the most recent supported values for sampleRate, languageCode, and encoding, see the provider's documentation which is linked for each service under Service Specific Options.
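The defaults in the table above can be applied like this. The field names come from the table; the helper itself is a hypothetical illustration, not part of the SDK.

```typescript
// Illustrative helper that fills in the documented defaults for a
// speech-recognition-start payload.
interface SpeechRecognitionStartOptions {
  service: "unified" | "unified:google" | "unified:aws" | "unified:deepgram";
  sampleRate?: number;
  languageCode?: string;
  encoding?: string;
  customServiceParameters?: Record<string, unknown>;
  returnRaw?: boolean;
}

function withDefaults(opts: SpeechRecognitionStartOptions) {
  return {
    service: opts.service,
    sampleRate: opts.sampleRate ?? 16000,
    languageCode: opts.languageCode ?? "en-US",
    // Per the table: "linear16" for Google and Deepgram, "pcm" for AWS.
    encoding:
      opts.encoding ?? (opts.service === "unified:aws" ? "pcm" : "linear16"),
    customServiceParameters: opts.customServiceParameters ?? {},
    returnRaw: opts.returnRaw ?? false,
  };
}
```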
speech-recognition-chunk
For streaming the audio data.
speech-recognition-stop
Requests that the server stop processing the stream.
From the Charisma server to the client
Speech recognition results and errors, and started/stopped events are streamed to the client.
speech-recognition-started
After the client sends a request to start a speech recognition stream, the server responds with a speech-recognition-started event once the stream is successfully connected.
Label | Type | Always provided |
---|---|---|
id | string | Yes |
playerSessionId | string | Yes |
service | string | Yes |
parameters | object (the parameters used to start the service, from the validated start request plus defaults) | Yes |
startedAt | date | Yes |
speech-recognition-stopped
When the server receives a speech-recognition-stop for a started service, it responds with this event to acknowledge the end of the stream, along with information about it.
Label | Type | Always provided |
---|---|---|
id | string | Yes |
playerSessionId | string | Yes |
service | string | Yes |
parameters | object (the parameters used to start the service, from the validated start request plus defaults) | Yes |
startedAt | date | Yes |
endedAt | date | Yes |
creditCount | number | Yes |
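A client might use the stopped event's timestamps to report how long the recognition stream ran. The field names below come from the table above; the ISO-string date encoding is an assumption.

```typescript
// Illustrative handler for the speech-recognition-stopped payload.
interface SpeechRecognitionStopped {
  id: string;
  playerSessionId: string;
  service: string;
  parameters: Record<string, unknown>;
  startedAt: string; // assumed ISO 8601 string on the wire
  endedAt: string;
  creditCount: number;
}

function streamDurationSeconds(event: SpeechRecognitionStopped): number {
  // How long the recognition stream ran, from the server's timestamps.
  return (Date.parse(event.endedAt) - Date.parse(event.startedAt)) / 1000;
}
```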
speech-recognition-result
The successful speech recognition response is adapted from the results provided by the downstream service. To see the original result without it being generalised, set the returnRaw parameter to true.
Label | Type | Always provided |
---|---|---|
text | string | Yes |
isFinal | boolean or undefined | Yes |
speechFinal | boolean | No |
confidence | number | No |
durationInSeconds | number | No |
As results are streamed, you might wish to replace the text on screen with the latest transcription, until a result with isFinal equal to true is returned. You can then display the final text value on screen and send it as a reply in the Charisma conversation.
The speechFinal field indicates whether the intonation and other characteristics of the speech suggest the speaker has finished. This feature is currently only available from Deepgram; for other services it will always be false.
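The replace-until-final pattern described above can be sketched like this. The result shape follows the table; the view helper itself is hypothetical.

```typescript
// Minimal sketch: keep replacing the displayed text with the latest
// transcription, and collect final results to send as replies.
interface SpeechRecognitionResult {
  text: string;
  isFinal?: boolean;
  speechFinal?: boolean;
  confidence?: number;
  durationInSeconds?: number;
}

function createTranscriptView() {
  let display = "";
  const replies: string[] = [];
  return {
    onResult(result: SpeechRecognitionResult) {
      // Replace (don't append) the on-screen text with the latest transcription.
      display = result.text;
      if (result.isFinal) {
        // A final result can be sent as the player's reply in the conversation.
        replies.push(result.text);
      }
    },
    get display() {
      return display;
    },
    get replies() {
      return replies;
    },
  };
}
```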
speech-recognition-error
Errors can occur if the speech-recognition-start parameters are not accepted by the downstream service, or for other reasons, which will be outlined in errorDetails.
Label | Type | Always provided |
---|---|---|
errorDetails | unknown | Yes |
errorOccurredWhen | string | Yes |
message | string | No |
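A client-side error handler might surface these fields to the player or to logging. The field names come from the table above; the formatting function is illustrative.

```typescript
// Illustrative formatter for a speech-recognition-error payload.
interface SpeechRecognitionError {
  errorDetails: unknown;
  errorOccurredWhen: string;
  message?: string;
}

function describeError(error: SpeechRecognitionError): string {
  const base = `Speech recognition error (${error.errorOccurredWhen})`;
  return error.message ? `${base}: ${error.message}` : base;
}
```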
Service Specific Options
Additional speech recognition parameters can be provided, which pass straight through to the service you have chosen. Not all parameters are supported; please consult the lists below.
Warning! Using these parameters does not add any additional fields to the generalised speech-recognition-result payloads. If you want to see them, either turn on returnRaw, or contact us at hello@charisma.ai if you have further requirements.
For each service below, the listed optional parameters can be added to customServiceParameters and will be passed through to that service. Be sure to provide values the service will accept.
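As an example, a start payload passing Deepgram-specific options through customServiceParameters might look like this. The option names below are from Deepgram's streaming API; check Deepgram's documentation for the values it accepts.

```typescript
// Hypothetical speech-recognition-start payload with Deepgram pass-through
// options in customServiceParameters.
const startPayload = {
  service: "unified:deepgram",
  sampleRate: 16000,
  languageCode: "en-US",
  encoding: "linear16",
  customServiceParameters: {
    punctuate: true, // add punctuation to transcripts
    interim_results: true, // stream partial results as well as finals
  },
};
```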
AWS
For more information see https://docs.aws.amazon.com/transcribe/latest/APIReference/API_streaming_StartStreamTranscription.html
View full list of supported fields
- SessionId
- ShowSpeakerLabel
- EnableChannelIdentification
- NumberOfChannels
- EnablePartialResultsStabilization
- PartialResultsStability
- ContentIdentificationType
- ContentRedactionType
- PiiEntityTypes
Deepgram
For more information see https://developers.deepgram.com/reference/streaming
View full list of supported fields
- model
- tier
- version
- punctuate
- profanity_filter
- redact
- diarize
- diarize_version
- smart_format
- multichannel
- alternatives
- numerals
- search
- replace
- callback
- keywords
- interim_results
- endpointing
- channels
Google
For more information see https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
View full list of supported fields
- audioChannelCount
- enableSeparateRecognitionPerChannel
- alternativeLanguageCodes
- maxAlternatives
- profanityFilter
- adaptation
- speechContexts
- enableWordTimeOffsets
- enableWordConfidence
- enableAutomaticPunctuation
- enableSpokenPunctuation
- enableSpokenEmojis
- diarizationConfig