How to disable the default "Disfluency Removal" of filler words after STT transcription in Azure AI Speech?
Azure AI Speech Services defaults to removing many filler words (uh, eh, etc.) via post-transcription "Disfluency Removal". My use case includes presentation analysis for filler words, which requires a verbatim transcript. Is there a transcription configuration property to disable "Disfluency Removal"? I'm using Python swagger-client and processing as in this sample: (https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/batch/python/python-client/main.py).
Perhaps something of the form:
properties = swagger_client.TranscriptionProperties()
properties.SpeechServiceResponse_PostProcessingOption = "None"
Do trained custom speech models include/default-to "Disfluency Removal"? Perhaps that is a solution?
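For context, the sample configures its transcription properties roughly like this (the property names below are what I see in the swagger-client models; the commented line is the hypothetical switch I'm looking for):

properties = swagger_client.TranscriptionProperties()
properties.word_level_timestamps_enabled = True        # word-level timestamps
properties.diarization_enabled = True                  # speaker diarization
properties.punctuation_mode = "DictatedAndAutomatic"
properties.profanity_filter_mode = "Masked"
# Hypothetical -- I can't find any property like this in the swagger-client models:
# properties.disfluency_removal_enabled = False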
Azure AI Speech
-
Dennis 0 Reputation points
2024-10-20T16:50:27.4766667+00:00 Saw this from a code search, which I'm still trying to get working:
from azure.ai.openai import OpenAIClient
from azure.identity import DefaultAzureCredential

# Initialize the OpenAI client with Azure credentials
credential = DefaultAzureCredential()
client = OpenAIClient(
    endpoint="https://<your-openai-endpoint>.openai.azure.com/",
    credential=credential)

# Configure the Whisper model for verbatim transcription
def transcribe_audio(audio_file_path):
    with open(audio_file_path, "rb") as audio_file:
        response = client.transcribe(
            model="whisper-1",
            audio=audio_file,
            language="en",
            verbatim=True  # Enable verbatim transcription
        )
    return response["text"]

# Example usage
audio_file_path = "path_to_your_audio_file.wav"
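I can't actually find that OpenAIClient/transcribe API in any published Azure package, so take the snippet above with a grain of salt. The closest documented pattern I'm aware of goes through the openai package against an Azure OpenAI Whisper deployment, roughly as sketched below, and I don't see a verbatim parameter documented there either (endpoint, key, API version, and deployment name are placeholders):

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-openai-endpoint>.openai.azure.com/",  # placeholder
    api_key="<your-api-key>",                                           # placeholder
    api_version="2024-06-01",                                           # placeholder
)

def transcribe_audio(audio_file_path):
    # Send the audio file to the Whisper deployment and return the transcript text
    with open(audio_file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper",      # name of your Whisper deployment in Azure OpenAI
            file=audio_file,
            language="en",
        )
    return result.text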
I tried adding this to the (otherwise working) main.py code referenced above:
properties.verbatim = True # Enable verbatim transcription ???
with no effect. The code runs, but this property gets ignored.
-
navba-MSFT 24,910 Reputation points • Microsoft Employee
2024-10-21T06:43:40.9066667+00:00 @Dennis Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!
The SpeechServiceResponse_PostProcessingOption property is mentioned in this article. May I know which version of the Python SDK you are using? Did you try the below code snippet, setting this property to TrueText, and check?
speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_PostProcessingOption, value='TrueText')
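In case it helps, here is a minimal end-to-end sketch of where that property fits with the Speech SDK (key, region, and file name are placeholders):

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")
# Post-processing option; 'TrueText' turns on the TrueText post-processing
speech_config.set_property(
    property_id=speechsdk.PropertyId.SpeechServiceResponse_PostProcessingOption,
    value="TrueText")

audio_config = speechsdk.audio.AudioConfig(filename="<your-audio>.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()
print(result.text)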
Awaiting your reply.
-
Dennis 0 Reputation points
2024-10-21T20:21:28.21+00:00 @navba-MSFT - thanks for your prompt response. From the Python Cognitive Services Speech SDK v1.41.1, with the Speech to Text API v3.2, we're adopting the Python swagger-client and processing from this sample for our use case: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/batch/python/python-client/main.py
The API doesn't seem to have a property to disable the default "Disfluency Removal". We need something of the form below, but none of these work:
properties.verbatim = True
# OR
properties.disfluencyremoval = False
# OR
properties.SpeechServiceResponse_PostProcessingOption = "None"
This Azure sample's speaker diarization & word-level timestamps work beautifully, but our popular & in-demand industry use-case for Azure AI Speech includes analyzing important high-level presentations for development, improvement, refinement, and practice. A portion of our NLTK analysis identifies concordances of filler words (um, uh, er, hmm, so, etc.) which requires a verbatim transcript, therefore no Disfluency Removal.
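For reference, the filler-word portion of our NLTK analysis is essentially the sketch below (the transcript file name is a placeholder), which is why the disfluencies must survive transcription:

import nltk
from nltk.tokenize import word_tokenize

# Requires the 'punkt' tokenizer data: nltk.download('punkt')
with open("verbatim_transcript.txt", encoding="utf-8") as f:
    tokens = word_tokenize(f.read().lower())
text = nltk.Text(tokens)

# Count and show concordance lines for each filler word of interest
for filler in ["um", "uh", "er", "hmm", "so"]:
    print(f"--- {filler}: {tokens.count(filler)} occurrences ---")
    text.concordance(filler, width=60, lines=5)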
Thanks for any help with this issue.
-
navba-MSFT 24,910 Reputation points • Microsoft Employee
2024-10-22T03:34:24.03+00:00 @Dennis I am checking this internally with the Product Owners. I will keep you posted.
-
navba-MSFT 24,910 Reputation points • Microsoft Employee
2024-10-24T05:24:26.71+00:00 @Dennis Apologies for the delay. I got confirmation from the Product Owners. As far as I know, disfluency removal is off by default and you should be able to see the filler words.
Disfluency removal is enabled when you set the parameter "truetext", but this is not officially documented.
You can try audio with filler words like "uh" and "em"; they should be transcribed as is. If the speaking speed is too fast and the filler word is not obvious, it may be missed, because the model is trained that way. You can try the attached audio at AIServices - Azure AI Studio; you should be able to see the filler words. I was able to test this and got the filler words in the transcription.
Hope this answers your question.
-
Dennis 0 Reputation points
2024-10-24T22:25:35.9+00:00 @navba-MSFT thanks again for your responses.
We experimented with the "AIServices - Azure AI Studio" you referenced above. We found that many filler words are not always displayed when transcribing from the microphone, and they were never displayed when transcribing our audio file containing numerous filler words.
We did identify a peculiarity with the Speech Services Batch Transcription API from Python, which we use. It turns out that, even though the main.py sample we're using (https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/batch/python/python-client/main.py) is included under the "Cognitive Services Speech SDK" on GitHub, it does not use the SDK:
import azure.cognitiveservices.speech as speechsdk # <<-- NOT USED --
Instead, it solely uses the Speech to Text REST API v3.2 in a swagger-client configuration. Unfortunately, this API, which includes Disfluency Removal by default, has no means to disable it that we can find. The issue is, can this API be updated to allow disabling Disfluency Removal?
Unfortunately, we are bound to batch transcription & custom speech model management. From https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-sdk:
"In some cases, you can't or shouldn't use the Speech SDK. In those cases, you can use REST APIs to access the Speech service. For example, use the Speech to text REST API for batch transcription and custom speech model management." Our use case exactly. We don't see any other options. Except for this Disfluency issue, the Speech to Text REST API v3.2 works perfectly for our use-case (speaker diarization, word-level timestamps, BYOS, etc.).
Additional information:
Speech Services Batch Transcription API from Python
https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/batch/python
-
navba-MSFT 24,910 Reputation points • Microsoft Employee
2024-11-03T05:11:42.3266667+00:00 @Dennis Apologies for the late reply. The Speech to Text REST API doesn't include Disfluency Removal by default. If the filler words in your own audio file don't get recognized, and you have tried the sample audio I shared (attached MyTestSpeech.txt - rename it from .txt to .wav) using our AI Studio, it is highly possible that our model is not sensitive to the filler words you care about, and it does not relate to turning disfluency removal on or off.
Could you please try the below suggestions?
- Try the audio I shared in our Studio to see if you can see the filler words.
- Try the audio I shared through our SDK with the "truetext" configuration to see if you can still see the filler words. I think the answer is probably no.
- Try the audio I shared using the REST API v3.2 with your code to see if you can see the filler words.
If all of the above aligns with my understanding, then most likely our model can't output the filler words you have. In that case, we would need you to send us some audio exhibiting the issue so we can check whether we can reproduce it and find a solution to handle it.
Awaiting your reply.
-
Dennis 0 Reputation points
2024-11-05T19:25:51.5333333+00:00 @navba-MSFT thanks for your continued support with this issue.
Our task has been to port an existing cloud application which includes transcription to Azure for building a greatly expanded user application using Azure AI. Our existing content consists of a hundred or so roughly hour-long videos with verbatim transcriptions from another cloud service. Below is a comparison of the filler words reported for one video run with the Azure Speech to Text REST API v3.2 versus another cloud service's verbatim transcription.
[Filler-word counts for one video: Azure Speech to Text REST API v3.2 vs. another cloud service's verbatim transcription]
Filler words "Uh", "Um", and "Gonna" are completely absent from the Azure transcription. Disfluency removal at work? It seems so. If disfluency removal is off by default, what is the API property that enables it? With that property identified, we could perhaps definitively disable disfluency removal.
Thanks again
-
navba-MSFT 24,910 Reputation points • Microsoft Employee
2024-11-07T05:15:54.73+00:00 @Dennis Thanks for getting back.
Yes, it doesn't include Disfluency Removal by default. The only way to enable it is by setting the post-processing option to TrueText:
speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_PostProcessingOption, "TrueText")
Could you please try the below suggestions?
- Try the audio I shared in our Studio to see if you can see the filler words.
- Try the audio I shared through our SDK with the "truetext" configuration to see if you can still see the filler words. I think the answer is probably no.
- Try the audio I shared using the REST API v3.2 with your code to see if you can see the filler words.
Awaiting your reply.