Realtime Recognizer not utilising Semantic Segmentation

Thomas Bauer 0 Reputation points
2024-11-08T02:34:01.7166667+00:00

Hi all!

I'm using the Azure speechsdk.SpeechRecognizer to transcribe streamed real-time audio. While the transcription works, continuous talking results in large paragraphs being output rather than sentence-by-sentence results. I included speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic") in order to utilise semantic segmentation rather than audio-based segmentation. However, the output is 100% identical. For example, the recognizer collected this much speech before producing any output:
a classic example of early remote control the idea was simple club and your lights will turn on off you didn't need to physically handle a wire remote we didn't have to find a clicker to point at a TV and you could make automations happen at a very limited scale i remember that i'd clap along to a song and the lights would go wild

The output:

A classic example of early remote control. The idea was simple, club and your lights will turn on off. You didn't need to physically handle a wire remote, we didn't have to find a clicker to point at a TV, and you could make automations happen at a very limited scale. I remember that I'd clap along to a song and the lights would go wild.
The output is correct, and I can see that the recognizer is collecting word by word, but it took this whole paragraph until the formatted result was returned. The behaviour is exactly as if semantic segmentation is not being used.

This is how I set up the recognizer; is there some step missing?

speech_config = speechsdk.SpeechConfig(subscription=os.getenv('AZURE_SPEECH_KEY'), region=os.getenv('AZURE_SPEECH_REGION'))
speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic")
speech_config.speech_recognition_language = "en-US"
self.push_stream = speechsdk.audio.PushAudioInputStream()
audio_config = speechsdk.audio.AudioConfig(stream=self.push_stream)
self.speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

Sort by: Most helpful
  1. santoshkc 9,400 Reputation points Microsoft Vendor
    2024-11-08T14:37:03.5466667+00:00

    Hi @Thomas Bauer,

    Thank you for reaching out to the Microsoft Q&A forum!

    To achieve sentence-by-sentence transcription with the Azure Speech SDK, use a wrapper class (SpeechRecognizerFromFile in the example below) with the following setup:

    • Segmentation strategy: set speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic") to leverage semantic segmentation.
    • Sentence-by-sentence output: a custom recognized_handler splits each final result on sentence-ending punctuation (periods, question marks, exclamation marks), so each sentence is printed as soon as it is fully recognized.
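    The punctuation-based split used in the handler can be tried on its own. It is just Python's re.split with a lookbehind and is independent of the Speech SDK; the sample text below is an illustration, not SDK output:

    ```python
    import re

    def split_sentences(text):
        # Split on whitespace that follows a sentence-ending punctuation mark.
        # The lookbehind (?<=[.!?]) keeps the punctuation attached to its sentence.
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    recognized = ("A classic example of early remote control. "
                  "The idea was simple. Did you need a clicker? No!")
    for sentence in split_sentences(recognized):
        print(sentence)
    ```

    Note that this only splits on punctuation the recognizer has already inserted, so it reformats a large final result; it does not by itself make results arrive earlier.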

    Here’s a complete code example implementing this approach:

    import azure.cognitiveservices.speech as speechsdk
    import time
    import re
    
    class SpeechRecognizerFromFile:
        def __init__(self, subscription_key, region, audio_file):
            # Initialize speech configuration
            self.speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
            self.speech_config.speech_recognition_language = "en-US"
            
            # Set segmentation strategy to 'Semantic'
            self.speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationStrategy, "Semantic")
            
            # Create audio configuration using the provided audio file
            audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
            
            # Initialize the speech recognizer with the audio configuration
            self.speech_recognizer = speechsdk.SpeechRecognizer(speech_config=self.speech_config, audio_config=audio_config)
            
            # Initialize the state for recognizing speech continuously
            self.done = False
    
        def stop_cb(self, evt):
            """Callback function to stop continuous recognition."""
            print(f"CLOSING on {evt}")
            self.speech_recognizer.stop_continuous_recognition()
            self.done = True
    
        def recognized_handler(self, evt):
            """Callback for final recognition results."""
            if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
                # Split text by periods, question marks, or exclamation marks followed by a space
                sentences = re.split(r'(?<=[.!?])\s+', evt.result.text)
                for sentence in sentences:
                    print(f"{sentence.strip()}")
            elif evt.result.reason == speechsdk.ResultReason.NoMatch:
                print("No speech could be recognized.")
            elif evt.result.reason == speechsdk.ResultReason.Canceled:
                cancellation_details = evt.result.cancellation_details
                print(f"Speech Recognition canceled: {cancellation_details.reason}")
                if cancellation_details.reason == speechsdk.CancellationReason.Error:
                    print(f"Error details: {cancellation_details.error_details}")
    
        def start_recognition(self):
            """Start continuous speech recognition."""
            # Connect events to handlers
            self.speech_recognizer.recognized.connect(self.recognized_handler)  # Only handle final recognized results
            self.speech_recognizer.session_started.connect(lambda evt: print(f"SESSION STARTED: {evt}"))
            self.speech_recognizer.session_stopped.connect(lambda evt: print(f"SESSION STOPPED: {evt}"))
            self.speech_recognizer.canceled.connect(lambda evt: print(f"CANCELED: {evt}"))
            
            # Connect the stop callback to stop recognition when needed
            self.speech_recognizer.session_stopped.connect(self.stop_cb)
            self.speech_recognizer.canceled.connect(self.stop_cb)
    
            # Start continuous recognition
            print("Starting continuous recognition...")
            self.speech_recognizer.start_continuous_recognition()
    
            # Keep the program running until 'done' is set to True
            while not self.done:
                time.sleep(0.5)
    
    # Example usage
    if __name__ == "__main__":
        subscription_key = "SPEECH_KEY"
        region = "SPEECH_REGION"  # e.g., "eastus"
        audio_file = r"C:\Users\XXXXXXXXXX\Downloads\Untitled.wav"  # path to your audio file here
    
        recognizer = SpeechRecognizerFromFile(subscription_key, region, audio_file)
        recognizer.start_recognition()
    

    Output: (screenshot showing the transcription printed sentence by sentence)

    I hope this helps. If you still face any errors, do let us know and we will try to figure out the issue.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".

