Azure Speech JS SDK Returns Single Item in NBest Array
When using the Cognitive Services JavaScript Speech SDK with OutputFormat.Detailed and the recognizeOnceAsync approach, the NBest array consistently contains only a single object instead of the expected multiple alternatives. For example, when the word 'flour' is spoken, I would expect the array to contain additional likely alternatives such as 'flower'.
Is this a bug in the SDK, or do I need to provide a different setting when creating the recognizer?
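For context, a minimal sketch of the setup in question (the subscription key, region, and language below are placeholders, not the actual values used):

```typescript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

// Sketch of the scenario: detailed output format with a single recognizeOnceAsync call.
// Key, region, and language are placeholders.
const speechConfig = sdk.SpeechConfig.fromSubscription("<subscription-key>", "<region>");
speechConfig.speechRecognitionLanguage = "en-US";
speechConfig.outputFormat = sdk.OutputFormat.Detailed;

const recognizer = new sdk.SpeechRecognizer(
  speechConfig,
  sdk.AudioConfig.fromDefaultMicrophoneInput()
);

recognizer.recognizeOnceAsync(result => {
  if (result.reason === sdk.ResultReason.RecognizedSpeech) {
    // With Detailed output, the full payload (including NBest) is carried in the raw JSON result.
    const detailed = JSON.parse(
      result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
    );
    console.log("NBest alternatives:", detailed.NBest?.length, detailed.NBest);
  }
  recognizer.close();
});
```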
Azure AI Speech
-
navba-MSFT 26,885 Reputation points • Microsoft Employee
2024-12-19T04:33:40.0333333+00:00
@Tom D Welcome to the Microsoft Q&A forum, and thank you for posting your query here!
Plan 1:
To receive all of the recognition alternatives, please set the OutputFormat to Detailed:
speechConfig.outputFormat = SpeechSDK.OutputFormat.Detailed;
More info here.
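For reference, a minimal sketch of this setting when using the npm package (the subscription key and region are placeholders):

```typescript
import * as SpeechSDK from "microsoft-cognitiveservices-speech-sdk";

// Request the detailed output format before the recognizer is created.
// Key and region are placeholders.
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<subscription-key>", "<region>");
speechConfig.outputFormat = SpeechSDK.OutputFormat.Detailed;
```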
Plan 2:
Also, please check by using the startContinuousRecognitionAsync function instead of recognizeOnceAsync.
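A rough sketch of this approach, under the same placeholder key/region assumptions as above:

```typescript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

// Sketch of Plan 2: continuous recognition, reading the detailed result on the
// recognized event. Key and region are placeholders.
const speechConfig = sdk.SpeechConfig.fromSubscription("<subscription-key>", "<region>");
speechConfig.outputFormat = sdk.OutputFormat.Detailed;

const recognizer = new sdk.SpeechRecognizer(
  speechConfig,
  sdk.AudioConfig.fromDefaultMicrophoneInput()
);

recognizer.recognized = (_sender, event) => {
  if (event.result.reason === sdk.ResultReason.RecognizedSpeech) {
    const detailed = JSON.parse(
      event.result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
    );
    console.log("NBest alternatives:", detailed.NBest);
  }
};

recognizer.startContinuousRecognitionAsync();
// Later, when done:
// recognizer.stopContinuousRecognitionAsync(() => recognizer.close());
```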
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
-
Tom D 0 Reputation points
2024-12-19T10:51:54.59+00:00
The first part is already in place, but we prefer recognizeOnceAsync. Is it possible to get NBest with recognizeOnceAsync? If not, could you explain the rationale behind this design choice?
I've tested your suggestion with startContinuousRecognitionAsync and inspected the result on the recognized event. The NBest array still contains only a single element:

```json
{
  "Id": "3ca940b309db4a3e999475cb43d60881",
  "RecognitionStatus": 0,
  "Offset": 9700000,
  "Duration": 5100000,
  "Channel": 0,
  "DisplayText": "Ate.",
  "SNR": 18.23229,
  "NBest": [
    {
      "Confidence": 0.44882,
      "Lexical": "ate",
      "ITN": "ate",
      "MaskedITN": "ate",
      "Display": "Ate.",
      "PronunciationAssessment": {
        "AccuracyScore": 98,
        "FluencyScore": 100,
        "ProsodyScore": 0,
        "CompletenessScore": 100,
        "PronScore": 39.6
      },
      "Words": [
        {
          "Word": "ate",
          "Offset": 9700000,
          "Duration": 5100000,
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "ErrorType": "None",
            "Feedback": {
              "Prosody": {
                "Break": {
                  "ErrorTypes": [ "None" ],
                  "BreakLength": 0
                },
                "Intonation": {
                  "ErrorTypes": [],
                  "Monotone": {
                    "SyllablePitchDeltaConfidence": 0.50218755
                  }
                }
              }
            }
          },
          "Syllables": [
            {
              "Syllable": "",
              "PronunciationAssessment": { "AccuracyScore": 84 },
              "Offset": 9700000,
              "Duration": 5100000
            }
          ],
          "Phonemes": [
            {
              "Phoneme": "",
              "PronunciationAssessment": { "AccuracyScore": 74 },
              "Offset": 9700000,
              "Duration": 2300000
            },
            {
              "Phoneme": "",
              "PronunciationAssessment": { "AccuracyScore": 92 },
              "Offset": 12100000,
              "Duration": 2700000
            }
          ]
        }
      ]
    }
  ]
}
```
-
Tom D 0 Reputation points
2024-12-19T11:06:45.54+00:00 @navba-MSFT (Apologies if this comment shows up twice; I posted it earlier, but it doesn't appear.)
-
navba-MSFT 26,885 Reputation points • Microsoft Employee
2024-12-19T13:34:39.8166667+00:00 @Tom D Thanks for getting back. Please share the code snippet you are using and the text that is being tested against. Awaiting your reply.
-
navba-MSFT 26,885 Reputation points • Microsoft Employee
2024-12-20T07:22:41.6333333+00:00 -
Tom D 0 Reputation points
2024-12-23T13:32:56.1233333+00:00 @navba-MSFT
Here's a snippet of the relevant initialization of the recognizer and the configuration function:

```typescript
async init(languageTag: string = defaultLanguageTag, facts: Fact[] = []) {
  const logFileName = `speech-to-text-${Date.now()}`;
  this.answers = facts.flatMap(fact => fact.answers);
  const singleWordAnswers = this.answers.filter(answer => answer.split(' ').length === 1).length > 0;
  this.recognizedSpeech = new EventEmitter();
  await this.getAzureSpeechToken();
  this.recognizer = new sdk.SpeechRecognizer(this.getSpeechConfig(languageTag, singleWordAnswers), this.getAudioConfig());
  this.recognizer.recognizing = (s: sdk.SpeechRecognizer, e: sdk.SpeechRecognitionEventArgs) => {
    console.log('Recognizing:', e.result);
  };
  this.recognizer.recognized = (s: sdk.SpeechRecognizer, e: sdk.SpeechRecognitionEventArgs) => {
    this.processSpeechResult(e.result);
  };
  const phraseListGrammar = sdk.PhraseListGrammar.fromRecognizer(this.recognizer);
  phraseListGrammar.addPhrases(this.answers);
  this.recognizer.speechStartDetected = () => {
    this.isListening = true;
    this.startSpeechTime = performance.now();
  };
  this.recognizer.speechEndDetected = () => {
    this.isListening = false;
  };
  let callbackInvoked: boolean = false;
  const connection: sdk.Connection = sdk.Connection.fromRecognizer(this.recognizer);
  connection.openConnection();
  WaitForCondition((): boolean => callbackInvoked, (): void => {
    connection.closeConnection();
  });
  sdk.Diagnostics.onLogOutput = (s: string) => {
    this.server.post('/api/speech/log', {}, { fileName: logFileName, log: s });
    callbackInvoked = true;
  };
}

private getSpeechConfig(languageTag: string, singleWordAnswers: boolean): sdk.SpeechConfig {
  const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(this.speechToken.token, this.speechToken.region);
  speechConfig.speechRecognitionLanguage = languageTag;
  speechConfig.requestWordLevelTimestamps();
  speechConfig.outputFormat = sdk.OutputFormat.Detailed;
  // https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech?pivots=programming-language-csharp#change-how-silence-is-handled
  speechConfig.setProperty(sdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, singleWordAnswers ? this.SEGMENTATION_SILENCE_TIMEOUT_MS_SINGLE_WORD : this.SEGMENTATION_SILENCE_TIMEOUT_MS_MULTIPLE_WORDS);
  speechConfig.setProperty(sdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, singleWordAnswers ? this.END_SILENCE_TIMEOUT_MS_SINGLE_WORD : this.END_SILENCE_TIMEOUT_MS_MULTIPLE_WORDS);
  speechConfig.enableAudioLogging();
  return speechConfig;
}
```
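The processSpeechResult helper is not shown above; purely as an illustrative placeholder (not the actual implementation), a version that surfaces the NBest contents from the recognized event might look like this:

```typescript
// Hypothetical placeholder for processSpeechResult, only to show where the
// NBest array can be inspected; not the actual implementation.
private processSpeechResult(result: sdk.SpeechRecognitionResult): void {
  if (result.reason !== sdk.ResultReason.RecognizedSpeech) {
    return;
  }
  const detailed = JSON.parse(
    result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
  );
  console.log('NBest count:', detailed.NBest?.length, detailed.NBest);
}
```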
-
Tom D 0 Reputation points
2024-12-23T13:46:32.6566667+00:00
speech-to-text-1734961275388.log
And here is the log file where I speak the words 'male', which sounds similar to 'mail', and 'eye', which sounds similar to 'I'. For both words the NBest array consists of a single item.
-
Tom D 0 Reputation points
2024-12-23T13:46:46.56+00:00 Hope this helps.