Realtime Recognizer not utilising Semantic Segmentation

Thomas Bauer 0 Reputation points
2024-11-08T02:34:01.7166667+00:00

Hi all!

I'm using the Azure speechsdk.SpeechRecognizer to transcribe streamed real-time audio. While the transcription works, continuous talking results in large paragraphs being output rather than sentence-by-sentence results. I included speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic") in order to utilise semantic segmentation rather than audio-based segmentation. However, the output is 100% identical. For example, the recognizer collected this much speech before producing any output:
a classic example of early remote control the idea was simple club and your lights will turn on off you didn't need to physically handle a wire remote we didn't have to find a clicker to point at a TV and you could make automations happen at a very limited scale i remember that i'd clap along to a song and the lights would go wild

The output:

A classic example of early remote control. The idea was simple, club and your lights will turn on off. You didn't need to physically handle a wire remote, we didn't have to find a clicker to point at a TV, and you could make automations happen at a very limited scale. I remember that I'd clap along to a song and the lights would go wild.
The output is correct, and I can see that the recognizer is collecting word by word, but it took this whole paragraph until the formatted result was returned. The behaviour is exactly as if semantic segmentation is not being used.

This is how I set up the recognizer; is there some step missing?

speech_config = speechsdk.SpeechConfig(subscription=os.getenv('AZURE_SPEECH_KEY'), region=os.getenv('AZURE_SPEECH_REGION'))
speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic")
speech_config.speech_recognition_language = "en-US"
self.push_stream = speechsdk.audio.PushAudioInputStream()
audio_config = speechsdk.audio.AudioConfig(stream=self.push_stream)
self.speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

Sort by: Most helpful
  1. santoshkc 9,400 Reputation points Microsoft Vendor
    2024-11-08T14:37:03.5466667+00:00

    Hi @Thomas Bauer,

    Thank you for reaching out to the Microsoft Q&A forum!

    To achieve sentence-by-sentence transcription with the Azure Speech SDK, use a wrapper class (SpeechRecognizerFromFile in the example below) with the following setup:

    • Segmentation strategy: set speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic") to leverage semantic segmentation.
    • Sentence-by-sentence output: a custom recognized_handler splits each final result on sentence-ending punctuation (periods, question marks, exclamation marks), so each sentence is printed as soon as it is fully recognized.
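    The punctuation-based split used in the handler can be tried on its own. It is just Python's re.split with a lookbehind and is independent of the Speech SDK; the sample text below is an illustration, not SDK output:

    ```python
    import re

    def split_sentences(text):
        # Split on whitespace that follows a sentence-ending punctuation mark.
        # The lookbehind (?<=[.!?]) keeps the punctuation attached to its sentence.
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    recognized = ("A classic example of early remote control. "
                  "The idea was simple. Did you need a clicker? No!")
    for sentence in split_sentences(recognized):
        print(sentence)
    ```

    Note that this only splits on punctuation the recognizer has already inserted, so it reformats a large final result; it does not by itself make results arrive earlier.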

    Here’s a complete code example implementing this approach:

    import azure.cognitiveservices.speech as speechsdk
    import time
    import re
    
    class SpeechRecognizerFromFile:
        def __init__(self, subscription_key, region, audio_file):
            # Initialize speech configuration
            self.speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
            self.speech_config.speech_recognition_language = "en-US"
            
            # Set segmentation strategy to 'Semantic'
            self.speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic")
            
            # Create audio configuration using the provided audio file
            audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
            
            # Initialize the speech recognizer with the audio configuration
            self.speech_recognizer = speechsdk.SpeechRecognizer(speech_config=self.speech_config, audio_config=audio_config)
            
            # Initialize the state for recognizing speech continuously
            self.done = False
    
        def stop_cb(self, evt):
            """Callback function to stop continuous recognition."""
            print(f"CLOSING on {evt}")
            self.speech_recognizer.stop_continuous_recognition()
            self.done = True
    
        def recognized_handler(self, evt):
            """Callback for final recognition results."""
            if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
                # Split text by periods, question marks, or exclamation marks followed by a space
                sentences = re.split(r'(?<=[.!?])\s+', evt.result.text)
                for sentence in sentences:
                    print(f"{sentence.strip()}")
            elif evt.result.reason == speechsdk.ResultReason.NoMatch:
                print("No speech could be recognized.")
            elif evt.result.reason == speechsdk.ResultReason.Canceled:
                cancellation_details = evt.result.cancellation_details
                print(f"Speech Recognition canceled: {cancellation_details.reason}")
                if cancellation_details.reason == speechsdk.CancellationReason.Error:
                    print(f"Error details: {cancellation_details.error_details}")
    
        def start_recognition(self):
            """Start continuous speech recognition."""
            # Connect events to handlers
            self.speech_recognizer.recognized.connect(self.recognized_handler)  # Only handle final recognized results
            self.speech_recognizer.session_started.connect(lambda evt: print(f"SESSION STARTED: {evt}"))
            self.speech_recognizer.session_stopped.connect(lambda evt: print(f"SESSION STOPPED: {evt}"))
            self.speech_recognizer.canceled.connect(lambda evt: print(f"CANCELED: {evt}"))
            
            # Connect the stop callback to stop recognition when needed
            self.speech_recognizer.session_stopped.connect(self.stop_cb)
            self.speech_recognizer.canceled.connect(self.stop_cb)
    
            # Start continuous recognition
            print("Starting continuous recognition...")
            self.speech_recognizer.start_continuous_recognition()
    
            # Keep the program running until 'done' is set to True
            while not self.done:
                time.sleep(0.5)
    
    # Example usage
    if __name__ == "__main__":
        subscription_key = "SPEECH_KEY"
        region = "SPEECH_REGION"  # e.g., "eastus"
        audio_file = r"C:\Users\XXXXXXXXXX\Downloads\Untitled.wav"  # path to your audio file here
    
        recognizer = SpeechRecognizerFromFile(subscription_key, region, audio_file)
        recognizer.start_recognition()
    

    Output: (screenshot showing the transcription printed sentence by sentence)

    I hope this helps. If you still face any errors, do let us know and we will try to figure out the issue.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".

