Azure Speech JS SDK Returns Single Item in NBest Array
When using the Cognitive Services JavaScript Speech SDK with OutputFormat.Detailed and the recognizeOnceAsync approach, the NBest array consistently contains only a single object instead of the expected multiple alternatives. For example, when the word 'flour' is spoken, I would expect the array to contain additional likely alternatives such as 'flower'.
Is this a bug in the SDK, or do I need to provide a different setting when creating the recognizer?
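For context, a minimal sketch of the setup in question (the subscription key, region, and language below are placeholders, not the actual values used):

```typescript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

// Sketch of the scenario: detailed output format with a single recognizeOnceAsync call.
// Key, region, and language are placeholders.
const speechConfig = sdk.SpeechConfig.fromSubscription("<subscription-key>", "<region>");
speechConfig.speechRecognitionLanguage = "en-US";
speechConfig.outputFormat = sdk.OutputFormat.Detailed;

const recognizer = new sdk.SpeechRecognizer(
  speechConfig,
  sdk.AudioConfig.fromDefaultMicrophoneInput()
);

recognizer.recognizeOnceAsync(result => {
  if (result.reason === sdk.ResultReason.RecognizedSpeech) {
    // With Detailed output, the full payload (including NBest) is carried in the raw JSON result.
    const detailed = JSON.parse(
      result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
    );
    console.log("NBest alternatives:", detailed.NBest?.length, detailed.NBest);
  }
  recognizer.close();
});
```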
Azure AI Speech
-
navba-MSFT 26,885 Reputation points • Microsoft Employee
2024-12-19T04:33:40.0333333+00:00
@Tom D Welcome to the Microsoft Q&A forum, and thank you for posting your query here!
Plan 1:
To receive all of the recognition alternatives, please set the OutputFormat to Detailed:
speechConfig.outputFormat = SpeechSDK.OutputFormat.Detailed;
More info here.
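For reference, a minimal sketch of this setting when using the npm package (the subscription key and region are placeholders):

```typescript
import * as SpeechSDK from "microsoft-cognitiveservices-speech-sdk";

// Request the detailed output format before the recognizer is created.
// Key and region are placeholders.
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<subscription-key>", "<region>");
speechConfig.outputFormat = SpeechSDK.OutputFormat.Detailed;
```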
Plan 2:
Also, please check by using the startContinuousRecognitionAsync function instead of recognizeOnceAsync.
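A rough sketch of this approach, under the same placeholder key/region assumptions as above:

```typescript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

// Sketch of Plan 2: continuous recognition, reading the detailed result on the
// recognized event. Key and region are placeholders.
const speechConfig = sdk.SpeechConfig.fromSubscription("<subscription-key>", "<region>");
speechConfig.outputFormat = sdk.OutputFormat.Detailed;

const recognizer = new sdk.SpeechRecognizer(
  speechConfig,
  sdk.AudioConfig.fromDefaultMicrophoneInput()
);

recognizer.recognized = (_sender, event) => {
  if (event.result.reason === sdk.ResultReason.RecognizedSpeech) {
    const detailed = JSON.parse(
      event.result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
    );
    console.log("NBest alternatives:", detailed.NBest);
  }
};

recognizer.startContinuousRecognitionAsync();
// Later, when done:
// recognizer.stopContinuousRecognitionAsync(() => recognizer.close());
```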
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
-
Tom D 0 Reputation points
2024-12-19T10:51:54.59+00:00
The first part is already in place, but we prefer recognizeOnceAsync. Is it possible to get NBest with recognizeOnceAsync? If not, could you explain the rationale behind this design choice?
I've tested your suggestion with startContinuousRecognitionAsync and inspected the result on the recognized event. The NBest array still contains only a single element:

```json
{
  "Id": "3ca940b309db4a3e999475cb43d60881",
  "RecognitionStatus": 0,
  "Offset": 9700000,
  "Duration": 5100000,
  "Channel": 0,
  "DisplayText": "Ate.",
  "SNR": 18.23229,
  "NBest": [
    {
      "Confidence": 0.44882,
      "Lexical": "ate",
      "ITN": "ate",
      "MaskedITN": "ate",
      "Display": "Ate.",
      "PronunciationAssessment": {
        "AccuracyScore": 98,
        "FluencyScore": 100,
        "ProsodyScore": 0,
        "CompletenessScore": 100,
        "PronScore": 39.6
      },
      "Words": [
        {
          "Word": "ate",
          "Offset": 9700000,
          "Duration": 5100000,
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "ErrorType": "None",
            "Feedback": {
              "Prosody": {
                "Break": {
                  "ErrorTypes": [ "None" ],
                  "BreakLength": 0
                },
                "Intonation": {
                  "ErrorTypes": [],
                  "Monotone": {
                    "SyllablePitchDeltaConfidence": 0.50218755
                  }
                }
              }
            }
          },
          "Syllables": [
            {
              "Syllable": "",
              "PronunciationAssessment": { "AccuracyScore": 84 },
              "Offset": 9700000,
              "Duration": 5100000
            }
          ],
          "Phonemes": [
            {
              "Phoneme": "",
              "PronunciationAssessment": { "AccuracyScore": 74 },
              "Offset": 9700000,
              "Duration": 2300000
            },
            {
              "Phoneme": "",
              "PronunciationAssessment": { "AccuracyScore": 92 },
              "Offset": 12100000,
              "Duration": 2700000
            }
          ]
        }
      ]
    }
  ]
}
```
-
Tom D 0 Reputation points
2024-12-19T11:06:45.54+00:00 @navba-MSFT (Apologies if this comment shows up twice; I posted it earlier, but it doesn't appear.)
-
navba-MSFT 26,885 Reputation points • Microsoft Employee
2024-12-19T13:34:39.8166667+00:00 @Tom D Thanks for getting back. Please share the code snippet you are using and the text that is being tested against. Awaiting your reply.
-
navba-MSFT 26,885 Reputation points • Microsoft Employee
2024-12-20T07:22:41.6333333+00:00 -
Tom D 0 Reputation points
2024-12-23T13:32:56.1233333+00:00 @navba-MSFT
Here's a snippet of the relevant initialization of the recognizer and the configuration function:

```typescript
async init(languageTag: string = defaultLanguageTag, facts: Fact[] = []) {
  const logFileName = `speech-to-text-${Date.now()}`;
  this.answers = facts.flatMap(fact => fact.answers);
  const singleWordAnswers = this.answers.filter(answer => answer.split(' ').length === 1).length > 0;
  this.recognizedSpeech = new EventEmitter();
  await this.getAzureSpeechToken();
  this.recognizer = new sdk.SpeechRecognizer(this.getSpeechConfig(languageTag, singleWordAnswers), this.getAudioConfig());
  this.recognizer.recognizing = (s: sdk.SpeechRecognizer, e: sdk.SpeechRecognitionEventArgs) => {
    console.log('Recognizing:', e.result);
  };
  this.recognizer.recognized = (s: sdk.SpeechRecognizer, e: sdk.SpeechRecognitionEventArgs) => {
    this.processSpeechResult(e.result);
  };
  const phraseListGrammar = sdk.PhraseListGrammar.fromRecognizer(this.recognizer);
  phraseListGrammar.addPhrases(this.answers);
  this.recognizer.speechStartDetected = () => {
    this.isListening = true;
    this.startSpeechTime = performance.now();
  };
  this.recognizer.speechEndDetected = () => {
    this.isListening = false;
  };
  let callbackInvoked: boolean = false;
  const connection: sdk.Connection = sdk.Connection.fromRecognizer(this.recognizer);
  connection.openConnection();
  WaitForCondition((): boolean => callbackInvoked, (): void => {
    connection.closeConnection();
  });
  sdk.Diagnostics.onLogOutput = (s: string) => {
    this.server.post('/api/speech/log', {}, { fileName: logFileName, log: s });
    callbackInvoked = true;
  };
}

private getSpeechConfig(languageTag: string, singleWordAnswers: boolean): sdk.SpeechConfig {
  const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(this.speechToken.token, this.speechToken.region);
  speechConfig.speechRecognitionLanguage = languageTag;
  speechConfig.requestWordLevelTimestamps();
  speechConfig.outputFormat = sdk.OutputFormat.Detailed;
  // https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech?pivots=programming-language-csharp#change-how-silence-is-handled
  speechConfig.setProperty(sdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, singleWordAnswers ? this.SEGMENTATION_SILENCE_TIMEOUT_MS_SINGLE_WORD : this.SEGMENTATION_SILENCE_TIMEOUT_MS_MULTIPLE_WORDS);
  speechConfig.setProperty(sdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, singleWordAnswers ? this.END_SILENCE_TIMEOUT_MS_SINGLE_WORD : this.END_SILENCE_TIMEOUT_MS_MULTIPLE_WORDS);
  speechConfig.enableAudioLogging();
  return speechConfig;
}
```
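The processSpeechResult helper is not shown above; purely as an illustrative placeholder (not the actual implementation), a version that surfaces the NBest contents from the recognized event might look like this:

```typescript
// Hypothetical placeholder for processSpeechResult, only to show where the
// NBest array can be inspected; not the actual implementation.
private processSpeechResult(result: sdk.SpeechRecognitionResult): void {
  if (result.reason !== sdk.ResultReason.RecognizedSpeech) {
    return;
  }
  const detailed = JSON.parse(
    result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
  );
  console.log('NBest count:', detailed.NBest?.length, detailed.NBest);
}
```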
-
Tom D 0 Reputation points
2024-12-23T13:46:32.6566667+00:00
speech-to-text-1734961275388.log
And here is the log file where I speak the words 'male', which sounds similar to 'mail', and 'eye', which sounds similar to 'I'. For both words the NBest array consists of a single item.
-
Tom D 0 Reputation points
2024-12-23T13:46:46.56+00:00 Hope this helps.