Hello Veljko Markovic | Babylon Engineer,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are getting incorrect visemes when German text contains English phrases, and that you would like to resolve this.
Since you need to keep "Fresh Food and Beverage Group" in the text, here are a few specific suggestions to address the viseme issue:
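- Use SSML to explicitly mark the English phrase. This can help the speech synthesis engine handle the language switch more accurately. In Azure Speech the text normally also has to sit inside a <voice> element for the voice you use; that element is not shown here. For example: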
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="de-DE">
Hallo, Ich bin der neue virtuelle Assistent der <lang xml:lang="en-US">Fresh Food and Beverage Group</lang>. Es freut mich, euch hier begrüssen zu dürfen. In Zukunft werde ich verschiedene Aktivitäten übernehmen dürfen. Insbesondere im Bereich Schulung und Qualitätssicherung.
</speak>
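Before sending SSML like this to the service, it can help to confirm that it is well-formed and that the English phrase really ended up inside a <lang> element. The following is only a sketch using Python's standard library; the helper name list_lang_spans is my own and is not part of any Azure SDK.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def list_lang_spans(ssml):
    """Return (language, text) for every <lang> element in the SSML.

    Hypothetical helper. Raises ET.ParseError if the SSML is not well-formed XML.
    """
    root = ET.fromstring(ssml)
    spans = []
    for element in root.iter(f"{{{SSML_NS}}}lang"):
        language = element.get(f"{{{XML_NS}}}lang", "")
        spans.append((language, "".join(element.itertext())))
    return spans


ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="de-DE">'
    'Ich bin der neue virtuelle Assistent der '
    '<lang xml:lang="en-US">Fresh Food and Beverage Group</lang>.'
    '</speak>'
)
print(list_lang_spans(ssml))  # [('en-US', 'Fresh Food and Beverage Group')]
```

If the list comes back empty, or the call raises a parse error, the problem is in the SSML itself rather than in the viseme generation.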
Other things you can do are to:
- Try breaking the text into smaller segments and processing them separately. This might help in isolating the issue.
- Define a custom pronunciation for the English phrase within the SSML tags. This can sometimes help in generating more accurate visemes.
- If the issue persists, contact Azure support with your specific use case and the details of the issue.
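For the first suggestion in this list, one simple way to break the text into smaller segments is to split it at sentence boundaries. This is a naive sketch, assuming sentences end with ".", "!" or "?" followed by whitespace; the function name is my own, and abbreviations such as "z. B." would be split incorrectly.

```python
import re


def split_into_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    A naive heuristic for isolating the problem, not a full sentence tokenizer.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = (
    "Es freut mich, euch hier begrüssen zu dürfen. "
    "In Zukunft werde ich verschiedene Aktivitäten übernehmen dürfen."
)
for sentence in split_into_sentences(text):
    print(sentence)
```

Synthesizing each sentence on its own should show whether the wrong visemes appear only in the sentence that contains the English phrase.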
Regarding your clarification:
Since the input comes from customers, you can preprocess it programmatically to detect language changes, so that language switches are handled dynamically and viseme accuracy improves. You can use the Azure Language Detection API (part of Azure Cognitive Services) to identify segments of different languages in the text and wrap them with the appropriate <lang> tags in SSML:
from xml.sax.saxutils import escape

def create_ssml(text, default_language="de-DE"):
    # detect_language_segments is a placeholder you need to implement. It is assumed
    # to return a list of {"text": ..., "language": ...} dicts in reading order.
    detected_segments = detect_language_segments(text)
    ssml = f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{default_language}">'
    for segment in detected_segments:
        # Escape &, < and > so that customer input cannot break the SSML markup
        segment_text = escape(segment['text'])
        if segment['language'] == default_language:
            ssml += segment_text
        else:
            ssml += f'<lang xml:lang="{segment["language"]}">{segment_text}</lang>'
    ssml += '</speak>'
    return ssml
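The detect_language_segments helper above is left for you to implement. If the English phrases are known in advance (a company or brand name, for example), a plain lookup may be enough. Please note that the sketch below is a dictionary match, not real language detection, and the KNOWN_FOREIGN_PHRASES table and the extra parameters are my own assumptions:

```python
import re

# Hypothetical lookup table: phrases that should be spoken in another language.
KNOWN_FOREIGN_PHRASES = {"Fresh Food and Beverage Group": "en-US"}


def detect_language_segments(text, default_language="de-DE", phrases=KNOWN_FOREIGN_PHRASES):
    """Split text into {'text', 'language'} dicts by matching known phrases.

    Everything that is not a known phrase is tagged with default_language.
    Matching is exact and case-sensitive.
    """
    if not phrases:
        return [{"text": text, "language": default_language}] if text else []
    # Try longer phrases first so they win over shorter phrases they contain.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(phrase) for phrase in ordered))
    segments = []
    position = 0
    for match in pattern.finditer(text):
        if match.start() > position:
            segments.append({"text": text[position:match.start()], "language": default_language})
        segments.append({"text": match.group(0), "language": phrases[match.group(0)]})
        position = match.end()
    if position < len(text):
        segments.append({"text": text[position:], "language": default_language})
    return segments


for segment in detect_language_segments("Assistent der Fresh Food and Beverage Group. Willkommen!"):
    print(segment)
```

For free-form customer input where the foreign phrases are not known beforehand, you would replace the lookup with a call to a language detection service and keep the same return shape.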
Secondly, if the SDK issue is not resolved, you can use a custom approach to viseme generation as a workaround: break the text into smaller segments, process them individually, and stitch the viseme timelines together. For example:
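def process_text_segments(text, language="de-DE"):
    # detect_language_segments and synthesize_speech are placeholders you need to implement
    segments = detect_language_segments(text)  # Detect language and split text
    viseme_data = []
    for segment in segments:
        # Fall back to the default language when a segment carries none
        response = synthesize_speech(segment['text'], language=segment.get('language', language))
        # Offsets are relative to each segment's own audio, so shift them by the
        # duration of the preceding segments if you need one continuous timeline
        viseme_data.extend(response['visemes'])
    return viseme_data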
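When each segment is synthesized separately, its viseme offsets start again at zero, so stitching the timelines means shifting every segment by the total duration of the audio before it. Here is a minimal sketch of that step. The input shape (duration_ms, offset_ms, viseme_id) is an assumption of mine, so please adapt it to whatever your synthesis wrapper actually returns:

```python
def stitch_viseme_timelines(segment_results):
    """Merge per-segment viseme lists into one continuous timeline.

    Each item is assumed (hypothetical shape) to look like:
    {"duration_ms": 1200, "visemes": [{"offset_ms": 50, "viseme_id": 12}, ...]}
    """
    timeline = []
    elapsed_ms = 0  # Total audio duration of the segments processed so far
    for result in segment_results:
        for viseme in result["visemes"]:
            timeline.append({
                "offset_ms": viseme["offset_ms"] + elapsed_ms,
                "viseme_id": viseme["viseme_id"],
            })
        elapsed_ms += result["duration_ms"]
    return timeline


results = [
    {"duration_ms": 1000, "visemes": [{"offset_ms": 0, "viseme_id": 0}, {"offset_ms": 500, "viseme_id": 12}]},
    {"duration_ms": 800, "visemes": [{"offset_ms": 100, "viseme_id": 5}]},
]
for entry in stitch_viseme_timelines(results):
    print(entry)
```

The same shift has to be applied when you concatenate the audio, otherwise the mouth animation and the sound will drift apart at the segment boundaries.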
So, other things you can do:
a. If the German model has persistent issues, explore alternative voices or models within Azure Speech that might handle mixed-language inputs better.
b. Report this German-English viseme inconsistency to Azure support with the following details:
- Provide examples of problematic and non-problematic text inputs.
- Include SSML scripts and their outputs for German, Hungarian, and English cases.
- Request a fix or clarification on handling mixed-language visemes for German.
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting this response and accepting it as an answer if it is helpful.