Stream Audio Issue with Speech

Diomedes Kastanis (Microsoft Employee) · 2025-02-05

We’re using a Python FastAPI server to stream audio from the browser over a WebSocket and pass it to Azure Speech. Our goal is to automatically recognize the input language, translate it to English, and stream both the translated text and audio back to the browser. The challenge is feeding the audio to Azure Speech through a push audio input stream: when we use use_default_microphone=True, everything works perfectly, but streaming the audio input instead of using the default microphone does not work. Thanks. Here's the code:

import azure.cognitiveservices.speech as speechsdk
from fastapi import WebSocket

# router, AZURE_SPEECH_SUBS_KEY and AZURE_SPEECH_REGION are defined elsewhere in the project.

@router.websocket("/translate/speech")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    await websocket.send_text("Connected to the translator service")

    # Create a PushAudioInputStream to act as a bucket for incoming audio data.
    audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=16000, bits_per_sample=16, channels=1)
    audio_stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_format)
    audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)

    # Create a speech translation config with the subscription key and service region.
    translation_config = speechsdk.translation.SpeechTranslationConfig(subscription=AZURE_SPEECH_SUBS_KEY, region=AZURE_SPEECH_REGION)

    # Replace with the languages of your choice, from the list at https://aka.ms/speech/sttt-languages
    from_language = "en-US"
    to_language = "es"
    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)
    translation_config.voice_name = "en-US-JennyNeural"  # Optional: set the voice used for the translated audio.

    # Create the TranslationRecognizer with the push-stream audio configuration.
    recognizer = speechsdk.translation.TranslationRecognizer(translation_config=translation_config, audio_config=audio_config)

    # Configure speech synthesis (for translated speech output).
    speech_config = speechsdk.SpeechConfig(subscription=AZURE_SPEECH_SUBS_KEY, region=AZURE_SPEECH_REGION)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
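
The snippet stops right after the recognizer and synthesizer are created. For completeness, here is a simplified sketch of how the rest of the handler could push the WebSocket audio into the recognizer and stream results back. It uses the recognizer's synthesizing event (which is what the voice_name setting above drives) rather than the separate SpeechSynthesizer, assumes the browser sends raw 16 kHz, 16-bit, mono PCM matching the AudioStreamFormat above, and assumes asyncio and WebSocketDisconnect are imported at module level; the callback names are illustrative.

    loop = asyncio.get_running_loop()

    def on_recognized(evt):
        # The SDK fires this on its own worker thread, so hop back onto the
        # asyncio event loop before touching the WebSocket.
        if evt.result.reason == speechsdk.ResultReason.TranslatedSpeech:
            translated_text = evt.result.translations.get(to_language, "")
            asyncio.run_coroutine_threadsafe(websocket.send_text(translated_text), loop)

    def on_synthesizing(evt):
        # evt.result.audio carries the synthesized translation audio; an empty
        # buffer signals that synthesis of the current utterance is complete.
        if len(evt.result.audio) > 0:
            asyncio.run_coroutine_threadsafe(websocket.send_bytes(evt.result.audio), loop)

    recognizer.recognized.connect(on_recognized)
    recognizer.synthesizing.connect(on_synthesizing)
    recognizer.start_continuous_recognition()

    try:
        while True:
            # Raw PCM chunks from the browser go straight into the push stream.
            chunk = await websocket.receive_bytes()
            audio_stream.write(chunk)
    except WebSocketDisconnect:
        pass
    finally:
        audio_stream.close()
        recognizer.stop_continuous_recognition()

If the browser is actually sending compressed audio (for example WebM/Opus from MediaRecorder), it would either need to be decoded to PCM on the server before being written to the push stream, or the stream would need to be created with a compressed AudioStreamFormat, which additionally requires GStreamer on the server.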
