According to the documentation:
You use the Azure OpenAI Whisper model for speech-to-text.
The file size limit for the Azure OpenAI Whisper model is 25 MB. If you need to transcribe a file larger than 25 MB, you can use the Azure AI Speech batch transcription API.
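That size check can be sketched in a few lines of Python. This is purely illustrative routing logic — the returned strings are labels for the two options above, not real API identifiers:

```python
import os

# 25 MB limit for the Azure OpenAI Whisper endpoint (per the docs above).
WHISPER_MAX_BYTES = 25 * 1024 * 1024

def pick_transcription_route(path: str) -> str:
    """Decide which service fits a given audio file by its size on disk."""
    if os.path.getsize(path) <= WHISPER_MAX_BYTES:
        return "azure-openai-whisper"        # small enough for Whisper
    return "azure-ai-speech-batch"           # use batch transcription instead
```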
As for the real-time option: Whisper does not natively support streaming audio input, so you'll need to approximate it by breaking the audio into chunks and processing them sequentially. This introduces a small delay but can come close to real-time transcription.
- Chunking Audio: Divide the continuous audio stream into manageable chunks. The size of these chunks can affect the latency and accuracy of the transcription, so you might need to experiment to find the best balance.
- Transcribing with Whisper: For each audio chunk, use Whisper to transcribe the audio to text. This involves loading the Whisper model and passing the audio data to it for transcription.
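The chunking step above can be sketched in plain Python over raw PCM bytes. The 30-second default here is an illustrative starting point, not a value from the docs — it is exactly the parameter you'd experiment with:

```python
def chunk_pcm(audio: bytes, sample_rate: int = 16000,
              sample_width: int = 2, chunk_seconds: float = 30.0):
    """Split raw mono PCM audio into fixed-length chunks.

    Each chunk can then be wrapped in a WAV header and sent to Whisper
    independently. Chunk length is the latency/accuracy trade-off:
    shorter chunks lower latency, longer chunks give the model more
    context and tend to improve accuracy.
    """
    bytes_per_chunk = int(sample_rate * sample_width * chunk_seconds)
    return [audio[i:i + bytes_per_chunk]
            for i in range(0, len(audio), bytes_per_chunk)]
```

Note that naive fixed-size splitting can cut words in half at chunk boundaries; overlapping the chunks slightly, or splitting on silence, is a common refinement.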
You can use Python for this task, leveraging libraries such as `pyaudio` for audio capture and the `transformers` library from Hugging Face for running Whisper.
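A minimal sketch of that capture-and-transcribe loop, assuming `pyaudio` and `transformers` are installed (imports are deferred so the module still loads without them). `openai/whisper-base` is one of the published Whisper checkpoints on the Hugging Face Hub; the 5-second chunk length is an illustrative choice:

```python
RATE = 16000           # Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 5      # illustrative chunk length; tune for latency vs accuracy
FRAMES_PER_CHUNK = RATE * CHUNK_SECONDS

def transcribe_microphone(num_chunks: int = 4):
    """Capture microphone audio in fixed-size chunks and transcribe
    each chunk with a local Whisper checkpoint.

    Optional dependencies are imported lazily so this sketch can be
    loaded even where pyaudio/transformers are not installed.
    """
    import numpy as np
    import pyaudio
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=1024)
    try:
        for _ in range(num_chunks):
            raw = stream.read(FRAMES_PER_CHUNK)
            # Convert 16-bit PCM to the float32 array the pipeline expects.
            audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
            result = asr({"raw": audio, "sampling_rate": RATE})
            print(result["text"])
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```

Processing chunks sequentially like this means transcription time adds to the delay; running the `asr` call in a worker thread while the next chunk is being captured is the usual way to hide that cost.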