According to the documentation:
You use the Azure OpenAI Whisper model for speech-to-text.
The file size limit for the Azure OpenAI Whisper model is 25 MB. If you need to transcribe a file larger than 25 MB, you can use the Azure AI Speech batch transcription API.
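That size check can be sketched in a few lines of Python. This is purely illustrative routing logic — the returned strings are labels for the two options above, not real API identifiers:

```python
import os

# 25 MB limit for the Azure OpenAI Whisper endpoint (per the docs above).
WHISPER_MAX_BYTES = 25 * 1024 * 1024

def pick_transcription_route(path: str) -> str:
    """Decide which service fits a given audio file by its size on disk."""
    if os.path.getsize(path) <= WHISPER_MAX_BYTES:
        return "azure-openai-whisper"        # small enough for Whisper
    return "azure-ai-speech-batch"           # use batch transcription instead
```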
As for the real-time option: Whisper does not natively support streaming audio input, so you'll need to approximate it by breaking the audio into chunks and processing them sequentially. This introduces a small delay but can come close to real-time transcription.
- Chunking Audio: Divide the continuous audio stream into manageable chunks. The size of these chunks can affect the latency and accuracy of the transcription, so you might need to experiment to find the best balance.
- Transcribing with Whisper: For each audio chunk, use Whisper to transcribe the audio to text. This involves loading the Whisper model and passing the audio data to it for transcription.
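The chunking step above can be sketched in plain Python over raw PCM bytes. The 30-second default here is an illustrative starting point, not a value from the docs — it is exactly the parameter you'd experiment with:

```python
def chunk_pcm(audio: bytes, sample_rate: int = 16000,
              sample_width: int = 2, chunk_seconds: float = 30.0):
    """Split raw mono PCM audio into fixed-length chunks.

    Each chunk can then be wrapped in a WAV header and sent to Whisper
    independently. Chunk length is the latency/accuracy trade-off:
    shorter chunks lower latency, longer chunks give the model more
    context and tend to improve accuracy.
    """
    bytes_per_chunk = int(sample_rate * sample_width * chunk_seconds)
    return [audio[i:i + bytes_per_chunk]
            for i in range(0, len(audio), bytes_per_chunk)]
```

Note that naive fixed-size splitting can cut words in half at chunk boundaries; overlapping the chunks slightly, or splitting on silence, is a common refinement.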
You can use Python for this task, leveraging libraries such as `pyaudio` for audio capture and the `transformers` library from Hugging Face for running Whisper.
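A minimal sketch of that capture-and-transcribe loop, assuming `pyaudio` and `transformers` are installed (imports are deferred so the module still loads without them). `openai/whisper-base` is one of the published Whisper checkpoints on the Hugging Face Hub; the 5-second chunk length is an illustrative choice:

```python
RATE = 16000           # Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 5      # illustrative chunk length; tune for latency vs accuracy
FRAMES_PER_CHUNK = RATE * CHUNK_SECONDS

def transcribe_microphone(num_chunks: int = 4):
    """Capture microphone audio in fixed-size chunks and transcribe
    each chunk with a local Whisper checkpoint.

    Optional dependencies are imported lazily so this sketch can be
    loaded even where pyaudio/transformers are not installed.
    """
    import numpy as np
    import pyaudio
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=1024)
    try:
        for _ in range(num_chunks):
            raw = stream.read(FRAMES_PER_CHUNK)
            # Convert 16-bit PCM to the float32 array the pipeline expects.
            audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
            result = asr({"raw": audio, "sampling_rate": RATE})
            print(result["text"])
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```

Processing chunks sequentially like this means transcription time adds to the delay; running the `asr` call in a worker thread while the next chunk is being captured is the usual way to hide that cost.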