Which Azure Speech SDK Feature to Use for Real-Time Meeting Transcription with Speaker Diarization?

Su Myat Hlaing 160 Reputation points
2025-01-29 03:01 UTC

Hi,

I am working on real-time meeting transcription using Azure Speech SDK and need:

  • Accurate speaker diarization (identify who is speaking).
  • Sentence-level segmentation (avoid merging multiple sentences into one recognition event).
  • Improved low-volume speech recognition (ensure all speech is captured, even if quiet).

🔹 What I Have Tried:

1. ConversationTranscriber

  • Works well for file-based transcription (fromWavFileInput).
  • When used in real-time, it merges multiple sentences into one recognition result.
  • Speaker diarization assigns one speaker ID to an entire chunk instead of per sentence.

2. SpeechRecognizer with Diarization

  • Doesn't seem to support speaker diarization in real-time.

3. Audio Preprocessing for Low Volume

  • Tried using the Web Audio API with gain amplification and noise suppression before feeding audio into the SDK (a minimal sketch of this chain follows this list).
  • Adjusted Azure Speech SDK properties:
speechConfig.setProperty(
  SpeechSDK.PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs,
  "1000" // Reduce silence timeout to 1 second
);
speechConfig.setProperty(
  SpeechSDK.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs,
  "1000" // Reduce speech end timeout to 1 second
);
  • Still experiencing missed words at low volume.
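
For reference, here is a minimal sketch of the preprocessing chain I am using (simplified; the gain value and helper name are illustrative):

// Boost gain on the raw microphone stream, then hand the processed
// MediaStream to the Speech SDK instead of the default microphone.
async function createBoostedAudioConfig() {
  const audioContext = new AudioContext();
  const rawStream = await navigator.mediaDevices.getUserMedia({ audio: true });

  const source = audioContext.createMediaStreamSource(rawStream);
  const gainNode = audioContext.createGain();
  gainNode.gain.value = 2.0; // amplify quiet speech; tune per environment

  const destination = audioContext.createMediaStreamDestination();
  source.connect(gainNode).connect(destination);

  return SpeechSDK.AudioConfig.fromStreamInput(destination.stream);
}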

🔹 Questions:

  1. Which Azure Speech SDK feature is best for real-time meeting transcription with speaker diarization?
  2. How can I ensure that sentences are properly segmented in real-time?
  3. What settings or configurations can improve recognition of low-volume speech?
  4. Would using a custom speech model improve accuracy for low-volume conversations?

🔹 Current Setup:

  • SDK: microsoft-cognitiveservices-speech-sdk (latest version)
  • Recognition Languages: "en-US", "ja-JP"
  • Recognition Mode: Continuous Recognition
  • Audio Source: Live microphone
  • Environment: Browser (Speech SDK for JavaScript)

Any guidance, best practices, or SDK settings that can help optimize real-time transcription accuracy, speaker diarization, and low-volume speech recognition would be highly appreciated.

Thanks in advance!


1 answer

  1. Amira Bedhiafi 28,381 Reputation points
    2025-01-29 10:41 UTC

    The Azure Speech SDK supports real-time speaker diarization through the ConversationTranscriber API, which lets you distinguish between different speakers during live transcription.

    How does it work? The service assigns a unique speaker ID (for example, Guest-1) to each participant, which allows you to identify who is speaking in real time.

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization
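
    A minimal sketch of real-time diarization with ConversationTranscriber in browser JavaScript (the key, region, and language here are placeholders):

    // Create the transcriber against the live microphone.
    const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<key>", "<region>");
    speechConfig.speechRecognitionLanguage = "en-US";
    const audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();
    const transcriber = new SpeechSDK.ConversationTranscriber(speechConfig, audioConfig);

    // Fires once per finalized utterance, with a speaker ID such as "Guest-1".
    transcriber.transcribed = (sender, event) => {
      if (event.result.reason === SpeechSDK.ResultReason.RecognizedSpeech) {
        console.log(`${event.result.speakerId}: ${event.result.text}`);
      }
    };

    transcriber.startTranscribingAsync();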

    For sentence segmentation during real-time transcription, use the Speech SDK's continuous recognition mode: it processes the audio stream continuously and emits recognition results at phrase boundaries, and that boundary behavior can be tuned (see the sketch after the link below).

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-to-text
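
    If separate sentences are merging into one recognition result, you can tighten phrase boundaries with the segmentation silence timeout. A sketch (the 500 ms value is illustrative; the documented range is roughly 100-5000 ms):

    // Finalize a recognition result after ~0.5 s of trailing silence,
    // so consecutive sentences are less likely to merge into one event.
    speechConfig.setProperty(
      SpeechSDK.PropertyId.Speech_SegmentationSilenceTimeoutMs,
      "500"
    );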

    To improve recognition of low-volume speech, audio preprocessing is essential: apply gain amplification and noise suppression to improve audio clarity before it reaches the service (a browser-level sketch follows).
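
    Before adding custom gain staging, you can also request the browser's built-in processing through standard getUserMedia constraints. A minimal sketch (run inside an async function):

    // Ask the browser for noise suppression and automatic gain control,
    // then feed the resulting stream to the SDK instead of the raw mic.
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        noiseSuppression: true, // browser-level noise reduction
        autoGainControl: true,  // helps normalize quiet speakers
        echoCancellation: true  // useful when meeting audio plays over speakers
      }
    });
    const audioConfig = SpeechSDK.AudioConfig.fromStreamInput(stream);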

    Alternatively, you can train a Custom Speech model with audio data that includes low-volume speech; given enough such examples, the model learns to transcribe quiet input more accurately.

    https://azure.microsoft.com/en-us/blog/improve-speechtotext-accuracy-with-azure-custom-speech
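
    Once a Custom Speech model is trained and deployed, pointing the SDK at it is a one-line change. A sketch (the endpoint ID is a placeholder copied from Speech Studio):

    const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<key>", "<region>");
    speechConfig.endpointId = "<your-custom-speech-endpoint-id>"; // deployed model's endpoint ID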

    This is also a good tutorial you can follow: https://blog.gopenai.com/real-time-transcription-with-diarization-using-azure-speech-sdk-a9bd801499a8

