Why do I get wrong visemes when using German with English phrases?

Veljko Markovic | Babylon Engineer 20 Reputation points
2024-12-18T11:52:12.4866667+00:00

I am using "Azure Speech" to synthesize speech from text input and also to generate visemes. When the language is set to German and the text contains an English phrase, it sends back wrong visemes. The ts (timestamp) values are not right: the last viseme has ts: 0, which should not happen. You can reproduce it by setting the language to German and using this sentence:

Hallo, Ich bin der neue virtuelle Assistent der Fresh Food and Beverage Group. Es freut mich, euch hier begrüssen zu dürfen. In Zukunft werde ich verschiedene Aktivitäten übernehmen dürfen. Insbesondere im Bereich Schulung und Qualitätssicherung.

If "Fresh Food and Beverage Group" is removed, it works fine.

So after the English phrase, the visemes are broken.

[Screenshot 2024-12-18 124822]

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. Sina Salam 15,006 Reputation points
    2024-12-18T13:57:38.22+00:00

    Hello Veljko Markovic | Babylon Engineer,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to resolve the issue of getting wrong visemes when using German with English phrases.

    Since you need to keep "Fresh Food and Beverage Group" in the text, here are a few specific suggestions to address the viseme issue:

    1. Use SSML to explicitly mark the English phrase. This can help the speech synthesis engine handle the language switch more accurately. For example:
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="de-DE">
           Hallo, Ich bin der neue virtuelle Assistent der <lang xml:lang="en-US">Fresh Food and Beverage Group</lang>. Es freut mich, euch hier begrüssen zu dürfen. In Zukunft werde ich verschiedene Aktivitäten übernehmen dürfen. Insbesondere im Bereich Schulung und Qualitätssicherung.
    </speak>
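
    If you want to produce that SSML from code, here is a minimal sketch. The function name build_ssml and its parameters are my own and not part of any SDK. It wraps every occurrence of a known phrase in a <lang> element and escapes the rest of the text. As far as I know, Azure also expects a <voice> element inside <speak>, so the optional voice argument covers that; which voice you pass is up to you.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, phrase, phrase_lang="en-US", default_lang="de-DE", voice=None):
    """Wrap every occurrence of `phrase` in a <lang> element (illustrative helper).

    `voice` is an optional voice name; when given, the content is placed
    inside a <voice> element.
    """
    if phrase:
        wrapped = f"<lang xml:lang={quoteattr(phrase_lang)}>{escape(phrase)}</lang>"
        # Escape the surrounding text so &, < and > cannot break the XML.
        body = wrapped.join(escape(part) for part in text.split(phrase))
    else:
        body = escape(text)
    if voice:
        body = f"<voice name={quoteattr(voice)}>{body}</voice>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(default_lang)}>{body}</speak>"
    )
```

    For example, build_ssml(text, "Fresh Food and Beverage Group", voice="YOUR-VOICE-NAME"), where YOUR-VOICE-NAME is a placeholder for the voice you actually use.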
    

    Other things you can do are to:

    • Try breaking the text into smaller segments and processing them separately. This might help in isolating the issue.
    • Define a custom pronunciation for the English phrase within the SSML tags. This can sometimes help in generating more accurate visemes.
    • If the issue persists, contact Azure support with your specific use case and the issues you are seeing.
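
    For the first suggestion, a rough way to break the text up is to split it on sentence boundaries and synthesize one sentence at a time, so you can see which sentence produces the bad visemes. The sketch below is only an assumption about how you might do that: split_sentences is a hypothetical helper, and the regular expression is deliberately simple (it does not handle abbreviations such as "z. B.").

```python
import re

# A boundary is ., ! or ? followed by whitespace and an uppercase letter
# (German umlauts included). Abbreviations are not handled.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")


def split_sentences(text):
    """Return the sentences of `text` as a list of stripped strings."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]
```

    With your sample text this should give four sentences, the first of which contains the English phrase.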

    Regarding your clarification:

    Since the input comes from customers, you can preprocess it programmatically to detect language changes, so that language switches are handled dynamically, which should improve viseme accuracy. You can use the Azure Language Detection API (part of Azure Cognitive Services) to identify segments of different languages in the text and wrap them in the appropriate <lang> tags in SSML:

       from xml.sax.saxutils import escape

       def create_ssml(text, default_language="de-DE"):
           # detect_language_segments is a placeholder you need to implement: it should
           # return a list of {'text': ..., 'language': ...} dicts in reading order.
           detected_segments = detect_language_segments(text)
           ssml = f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{default_language}">'
           for segment in detected_segments:
               # Escape &, < and > so customer input cannot break the XML
               segment_text = escape(segment['text'])
               if segment['language'] == default_language:
                   ssml += segment_text
               else:
                   ssml += f'<lang xml:lang="{segment["language"]}">{segment_text}</lang>'
           ssml += '</speak>'
           return ssml
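
    The detect_language_segments helper above is not an existing function, so you have to supply it. One simple stand-in, sketched below, does not call the Language Detection API at all: it swaps in a plain lookup of known phrases (brand names, product names) that you maintain yourself. The known_phrases argument and the returned dict layout are my assumptions. Because it takes the phrase table as a second argument, you would bind that argument first (for example with functools.partial) before using it where a one-argument function is expected.

```python
import re


def detect_language_segments(text, known_phrases, default_language="de-DE"):
    """Split `text` into segments tagged with a language (illustrative helper).

    `known_phrases` maps a phrase to its language code, for example
    {"Fresh Food and Beverage Group": "en-US"}. All other text is tagged
    with `default_language`.
    """
    # Ignore empty keys; try longer phrases first so they win over shorter ones.
    ordered = sorted((p for p in known_phrases if p), key=len, reverse=True)
    if not ordered:
        return [{"text": text, "language": default_language}] if text else []
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    segments, pos = [], 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            segments.append({"text": text[pos:match.start()], "language": default_language})
        segments.append({"text": match.group(), "language": known_phrases[match.group()]})
        pos = match.end()
    if pos < len(text):
        segments.append({"text": text[pos:], "language": default_language})
    return segments
```

    A phrase table only catches phrases you already know about, so treat it as a starting point and not as real language detection.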
    
    

    Secondly, if the SDK issue is not resolved, you can use a custom approach to viseme generation as a workaround: break the text into smaller segments, process them individually, and stitch the viseme timelines together. For example:

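         def process_text_segments(text, language="de-DE"):
             # detect_language_segments and synthesize_speech are placeholders for your own helpers
             segments = detect_language_segments(text)  # Detect language and split text
             viseme_data = []
             for segment in segments:
                 # Fall back to the default language if a segment has none
                 response = synthesize_speech(segment['text'], language=segment.get('language', language))
                 # Offsets are relative to each segment's audio; shift them by the
                 # accumulated duration of earlier segments before using the merged list
                 viseme_data.extend(response['visemes'])
             return viseme_data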
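
    One detail to watch when stitching: each synthesis call reports viseme times relative to the start of its own audio, so every segment after the first has to be shifted by the total duration of the segments before it. Here is a sketch of that step, under my own assumptions: synthesize is a function you supply, and it returns a dict with a "visemes" list of {"id", "ts"} entries and a "duration", both in milliseconds. In the Speech SDK for Python the viseme event exposes audio_offset in 100-nanosecond ticks, so you would divide by 10,000 to get milliseconds.

```python
def stitch_visemes(segments, synthesize):
    """Synthesize each segment and merge the visemes onto one timeline.

    `synthesize(text, language)` is a caller-supplied function (hypothetical)
    returning {"visemes": [{"id": ..., "ts": ...}, ...], "duration": ...},
    with `ts` and `duration` in milliseconds relative to that segment's audio.
    """
    timeline = []
    offset_ms = 0.0
    for segment in segments:
        response = synthesize(segment["text"], segment["language"])
        for viseme in response["visemes"]:
            # Shift the segment-relative time onto the combined timeline.
            timeline.append({"id": viseme["id"], "ts": viseme["ts"] + offset_ms})
        offset_ms += response["duration"]
    return timeline
```

    Keep in mind that separately synthesized segments may not sound as smooth as one continuous request, so please test the audio as well as the visemes.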
    

    Other things you can do:

    a. If the German model has persistent issues, explore alternative voices or models within Azure Speech that might handle mixed-language inputs better.

    b. Report this German-English viseme inconsistency to Azure support with the following details:

    • Provide examples of problematic and non-problematic text inputs.
    • Include SSML scripts and their outputs for German, Hungarian, and English cases.
    • Request a fix or clarification on handling mixed-language visemes for German.
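
    To collect evidence for the support request, you could run a small check over the visemes you receive and attach the flagged entries. The sketch below assumes each viseme is a dict with "id" and "ts" (milliseconds from the start of the audio), which matches the ts field you described but is otherwise my assumption. It flags negative timestamps and any timestamp that goes backwards, such as a final viseme with ts: 0.

```python
def find_viseme_anomalies(visemes):
    """Return (index, reason) pairs for suspicious entries in a viseme list.

    Each entry is assumed to look like {"id": ..., "ts": ...}, with `ts`
    in milliseconds from the start of the audio.
    """
    anomalies = []
    latest_ts = None  # highest timestamp seen so far
    for index, viseme in enumerate(visemes):
        ts = viseme["ts"]
        if ts < 0:
            anomalies.append((index, f"negative timestamp {ts}"))
        elif latest_ts is not None and ts < latest_ts:
            anomalies.append((index, f"timestamp {ts} is earlier than {latest_ts}"))
        else:
            latest_ts = ts
    return anomalies
```

    Running it on the output for the text with and without the English phrase gives you a compact comparison to include in the report.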

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.

    1 person found this answer helpful.
