Why does Azure Speech-to-Text detect French accurately in a standalone Python script but perform poorly in a real-time video call integration?
I'm working on a real-time translation project using Azure Speech Services. When I run my translation code in a standalone Python script, it accurately recognizes and translates French and English speech. However, when the same Speech-to-Text functionality is integrated into a video call (using WebSocket connections), the recognition of French is significantly less accurate.
Here’s a summary of my setup:
- Python script: I use Azure Cognitive Services for real-time speech recognition, and language detection works very well, especially for French.
- Video call integration: In a Node.js application with the same Azure Speech Services language configuration, I capture and process audio from live video calls over a WebSocket, but French detection is consistently inaccurate.
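One thing I'm not 100% sure about is the audio format on the WebSocket path. Azure's push streams default to 16 kHz, 16-bit, mono PCM, while call audio is typically 48 kHz stereo, so the Node.js side has to convert before pushing. The stdlib-only sketch below (naive decimation, no low-pass filter; a real pipeline would use a proper resampler) just illustrates the transformation I believe needs to happen:

```python
import struct

def to_16k_mono(pcm_bytes: bytes, src_rate: int = 48000, channels: int = 2) -> bytes:
    """Downmix interleaved 16-bit PCM to mono and decimate to 16 kHz.

    Naive illustration only: averages channels, then keeps every Nth sample
    (N = src_rate // 16000). No anti-aliasing filter is applied.
    """
    samples = struct.unpack("<%dh" % (len(pcm_bytes) // 2), pcm_bytes)
    # Average each interleaved frame down to one mono sample.
    mono = [sum(samples[i:i + channels]) // channels
            for i in range(0, len(samples), channels)]
    # Keep every Nth sample to reach 16 kHz (48000 / 16000 = 3).
    step = src_rate // 16000
    out = mono[::step]
    return struct.pack("<%dh" % len(out), *out)
```

If the Node.js side is pushing 48 kHz or stereo frames into a stream the SDK interprets as 16 kHz mono, recognition quality would degrade exactly the way I'm seeing, so I've been checking this path too.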
I've ensured that the audio quality is similar in both cases and that the language configurations match. Unless there is an underlying issue elsewhere, the model recognizes English (not perfectly, but the synthesized speech comes through), while it barely processes French at all. I have also tried Italian and Spanish with similarly poor results. Could this be a language-code issue, since I'm chaining speech-to-text translation with text-to-speech?
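To make the language-code question concrete: as I understand it, Azure expects different code shapes at each stage — full BCP-47 locales (e.g. "fr-FR") for recognition and auto-detection, usually a bare language code (e.g. "fr") for translation targets, and a locale-qualified voice name for synthesis. This sketch shows the mapping I think each stage needs (the voice names are illustrative examples, not necessarily what my app uses):

```python
# Candidate locales passed to auto language detection (full BCP-47 locales).
RECOGNITION_LOCALES = ["en-US", "fr-FR", "it-IT", "es-ES"]

# Illustrative neural voice names for synthesis (locale-qualified).
VOICES = {
    "en-US": "en-US-JennyNeural",
    "fr-FR": "fr-FR-DeniseNeural",
}

def translation_target(detected_locale: str) -> str:
    """Translation targets are generally bare language codes: 'fr-FR' -> 'fr'."""
    return detected_locale.split("-")[0]

def synthesis_voice(detected_locale: str) -> str:
    """Synthesis wants a full voice name tied to the locale."""
    return VOICES[detected_locale]
```

If my Node.js integration is passing "fr-FR" where "fr" is expected (or the reverse) at any of these hand-off points, that could explain why one pipeline works and the other doesn't — is that a plausible cause?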