Which Azure Speech SDK Feature to Use for Real-Time Meeting Transcription with Speaker Diarization?

Su Myat Hlaing 160 Reputation points
2025-01-29 03:01 UTC

Hi,

I am working on real-time meeting transcription using Azure Speech SDK and need:

  • Accurate speaker diarization (identify who is speaking).
  • Sentence-level segmentation (avoid merging multiple sentences into one recognition event).
  • Improved low-volume speech recognition (ensure all speech is captured, even if quiet).

🔹 What I Have Tried:

1. ConversationTranscriber

  • Works well for file-based transcription (fromWavFileInput).
  • When used in real-time, it merges multiple sentences into one recognition result.
  • Speaker diarization assigns one speaker ID to an entire chunk instead of per sentence.

2. SpeechRecognizer with Diarization

  • Doesn't seem to support speaker diarization in real-time.

3. Audio Preprocessing for Low Volume

  • Tried using the Web Audio API with gain amplification and noise suppression before feeding audio into the SDK (a minimal sketch of this chain follows this list).
  • Adjusted Azure Speech SDK properties:
speechConfig.setProperty(
  SpeechSDK.PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs,
  "1000" // Reduce silence timeout to 1 second
);
speechConfig.setProperty(
  SpeechSDK.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs,
  "1000" // Reduce speech end timeout to 1 second
);
  • Still experiencing missed words at low volume.
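
For reference, here is a minimal sketch of the preprocessing chain I am using (simplified; the gain value and helper name are illustrative):

// Boost gain on the raw microphone stream, then hand the processed
// MediaStream to the Speech SDK instead of the default microphone.
async function createBoostedAudioConfig() {
  const audioContext = new AudioContext();
  const rawStream = await navigator.mediaDevices.getUserMedia({ audio: true });

  const source = audioContext.createMediaStreamSource(rawStream);
  const gainNode = audioContext.createGain();
  gainNode.gain.value = 2.0; // amplify quiet speech; tune per environment

  const destination = audioContext.createMediaStreamDestination();
  source.connect(gainNode).connect(destination);

  return SpeechSDK.AudioConfig.fromStreamInput(destination.stream);
}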

🔹 Questions:

  1. Which Azure Speech SDK feature is best for real-time meeting transcription with speaker diarization?
  2. How can I ensure that sentences are properly segmented in real-time?
  3. What settings or configurations can improve recognition of low-volume speech?
  4. Would using a custom speech model improve accuracy for low-volume conversations?

🔹 Current Setup:

  • SDK: microsoft-cognitiveservices-speech-sdk (latest version)
  • Recognition Languages: "en-US", "ja-JP"
  • Recognition Mode: Continuous Recognition
  • Audio Source: Live microphone
  • Environment: Browser (Speech SDK for JavaScript)

Any guidance, best practices, or SDK settings that can help optimize real-time transcription accuracy, speaker diarization, and low-volume speech recognition would be highly appreciated.

Thanks in advance!


1 answer

  1. Amira Bedhiafi 28,381 Reputation points
    2025-01-29 10:41 UTC

    The Azure Speech SDK supports real-time speaker diarization through the ConversationTranscriber API, which lets you distinguish between different speakers during live transcription.

    How does it work? The service assigns a unique speaker ID (for example, Guest-1) to each participant, which allows you to identify who is speaking in real time.

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization
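
    A minimal sketch of real-time diarization with ConversationTranscriber in browser JavaScript (the key, region, and language here are placeholders):

    // Create the transcriber against the live microphone.
    const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<key>", "<region>");
    speechConfig.speechRecognitionLanguage = "en-US";
    const audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();
    const transcriber = new SpeechSDK.ConversationTranscriber(speechConfig, audioConfig);

    // Fires once per finalized utterance, with a speaker ID such as "Guest-1".
    transcriber.transcribed = (sender, event) => {
      if (event.result.reason === SpeechSDK.ResultReason.RecognizedSpeech) {
        console.log(`${event.result.speakerId}: ${event.result.text}`);
      }
    };

    transcriber.startTranscribingAsync();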

    For sentence segmentation during real-time transcription, use the Speech SDK's continuous recognition mode: it processes the audio stream continuously and emits recognition results at phrase boundaries, and that boundary behavior can be tuned (see the sketch after the link below).

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-to-text
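
    If separate sentences are merging into one recognition result, you can tighten phrase boundaries with the segmentation silence timeout. A sketch (the 500 ms value is illustrative; the documented range is roughly 100-5000 ms):

    // Finalize a recognition result after ~0.5 s of trailing silence,
    // so consecutive sentences are less likely to merge into one event.
    speechConfig.setProperty(
      SpeechSDK.PropertyId.Speech_SegmentationSilenceTimeoutMs,
      "500"
    );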

    To improve recognition of low-volume speech, audio preprocessing is essential: apply gain amplification and noise suppression to improve audio clarity before it reaches the service (a browser-level sketch follows).
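
    Before adding custom gain staging, you can also request the browser's built-in processing through standard getUserMedia constraints. A minimal sketch (run inside an async function):

    // Ask the browser for noise suppression and automatic gain control,
    // then feed the resulting stream to the SDK instead of the raw mic.
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        noiseSuppression: true, // browser-level noise reduction
        autoGainControl: true,  // helps normalize quiet speakers
        echoCancellation: true  // useful when meeting audio plays over speakers
      }
    });
    const audioConfig = SpeechSDK.AudioConfig.fromStreamInput(stream);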

    Alternatively, you can train a Custom Speech model with audio data that includes low-volume speech; given enough such examples, the model learns to transcribe quiet input more accurately.

    https://azure.microsoft.com/en-us/blog/improve-speechtotext-accuracy-with-azure-custom-speech
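
    Once a Custom Speech model is trained and deployed, pointing the SDK at it is a one-line change. A sketch (the endpoint ID is a placeholder copied from Speech Studio):

    const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("<key>", "<region>");
    speechConfig.endpointId = "<your-custom-speech-endpoint-id>"; // deployed model's endpoint ID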

    This is also a good tutorial you can follow: https://blog.gopenai.com/real-time-transcription-with-diarization-using-azure-speech-sdk-a9bd801499a8

