When using batch speech transcription, the ITN feature only applies to the first option of the nBest results.

Question

When using batch transcription, the ITN feature only applies to the first option of the nBest results, which is not necessarily the one with the highest confidence.

The batch transcription service returns a JSON result with the following structure (anonymized):

{
  "source": "{a url}",
  "timestamp": "2024-10-09T10:49:38Z",
  "durationInTicks": 4897800000,
  "duration": "PT8M9.78S",
  "combinedRecognizedPhrases": [
    {
      "channel": 1,
      "lexical": "{content}",
      "itn": "{content - itn works}",
      "maskedITN": "{content - itn works}",
      "display": "{content - itn works}"
    }
  ],
  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "channel": 1,
      "offset": "PT0.77S",
      "duration": "PT1.48S",
      "offsetInTicks": 7700000.0,
      "durationInTicks": 14800000.0,
      "nBest": [
        {
          "confidence": 0.44051075,
          "lexical": "{content}",
          "itn": "{content - itn works}",
          "maskedITN": "{content- itn works}",
          "display": "{content- itn works}"
        },
        {
          "confidence": 0.52692604,
          "lexical": "{content}",
          "itn": "{content - no itn}",
          "maskedITN": "{content - no itn}",
          "display": "{content - no itn}"
        }
      ],
      "locale": "da-DK"
    }
  ]
}

Am I doing something wrong?

Answer

Hello Julian Kopka Heerup,

Welcome to Microsoft Q&A, and thank you for posting your question here.

I understand that the Inverse Text Normalization (ITN) feature is only being applied to the first option in the nBest results, regardless of its confidence score.

Based on your explanation and the JSON you provided, you can add a post-processing step in your application so that you work with the most accurate transcription result. This involves parsing the JSON response from the batch transcription service, identifying the nBest result with the highest confidence score, and then reading the itn field of that result, falling back to lexical if itn is missing. Note that this step only reads whatever the service placed in that entry; it does not run ITN itself. I put together an example in Python from your JSON:

import json

# Sample JSON response (trimmed to the fields used below)
response = '''{
    "recognizedPhrases": [
        {
            "nBest": [
                {
                    "confidence": 0.44051075,
                    "lexical": "content",
                    "itn": "content - itn works",
                    "maskedITN": "content- itn works",
                    "display": "content- itn works"
                },
                {
                    "confidence": 0.52692604,
                    "lexical": "content",
                    "itn": "content - no itn",
                    "maskedITN": "content - no itn",
                    "display": "content - no itn"
                }
            ]
        }
    ]
}'''

# Parse the JSON response
data = json.loads(response)

# Find the nBest result with the highest confidence in the first phrase
best_result = max(data["recognizedPhrases"][0]["nBest"], key=lambda x: x["confidence"])

# Read the ITN text of the best result, falling back to the lexical text
# if the entry has no "itn" field
best_result_itn = best_result.get("itn", best_result["lexical"])
print("Best result (itn field):", best_result_itn)
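The example above only looks at the first entry of recognizedPhrases. A real transcription usually contains many phrases, so the same selection can be repeated for each one. The following is a minimal sketch of that idea, not an official API: the function name pick_best_per_phrase and the output key is_first_option are hypothetical names chosen for illustration, and the field names are the ones from the JSON in the question. The is_first_option flag marks the phrases where the highest-confidence entry is not the first nBest entry, which is the case described in the question.

```python
import json


def pick_best_per_phrase(transcription: dict) -> list:
    """Return the highest-confidence nBest entry for every recognized phrase.

    Hypothetical helper for illustration; field names follow the JSON
    structure shown in the question.
    """
    results = []
    for phrase in transcription.get("recognizedPhrases", []):
        candidates = phrase.get("nBest", [])
        if not candidates:
            continue  # skip phrases that carry no hypotheses

        # Index of the candidate with the highest confidence
        best_index = max(
            range(len(candidates)),
            key=lambda i: candidates[i]["confidence"],
        )
        best = candidates[best_index]

        results.append({
            "offset": phrase.get("offset"),
            "confidence": best["confidence"],
            # Read the itn field, falling back to lexical if it is missing
            "text": best.get("itn", best["lexical"]),
            # False when the best entry is not the first nBest entry
            "is_first_option": best_index == 0,
        })
    return results


# Made-up sample data with the same shape as the question's JSON
sample = json.loads('''{
    "recognizedPhrases": [
        {
            "offset": "PT0.77S",
            "nBest": [
                {"confidence": 0.44, "lexical": "a", "itn": "a - itn works"},
                {"confidence": 0.52, "lexical": "a", "itn": "a - no itn"}
            ]
        },
        {
            "offset": "PT2.50S",
            "nBest": [
                {"confidence": 0.90, "lexical": "b", "itn": "b - itn works"},
                {"confidence": 0.31, "lexical": "b"}
            ]
        }
    ]
}''')

best_phrases = pick_best_per_phrase(sample)
for entry in best_phrases:
    print(entry)
```

Phrases where is_first_option is False are the ones to check by hand, or to report to the service team, since their selected text comes from an entry other than the first one.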

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

Please don't forget to close the thread by upvoting this answer and accepting it if it was helpful.
