Azure Speech Studio Andrew Multilingual voice sounds glitchy
I'm having some issues with the Andrew Multilingual (en-US-AndrewMultilingualNeural) voice in the Azure Speech Studio. There are a few instances in which the voice sounds raspy and really kind of glitchy. It seems to have a lot of trouble with the word "Move" in particular, but there are other sentences where I've noticed it as well. I could provide some audio files to show the problem. Any idea how to fix this problem or when this will be fixed?
Azure AI Speech
-
Avinash Devarakonda 330 Reputation points • Microsoft Vendor
2024-10-23T18:47:33.7166667+00:00 Hi Rene Lems,
Welcome to Microsoft Q&A Forum, thank you for posting your query here!
I tested the word “Move” in Speech Studio and it works fine on my end. It might have been a temporary network issue when you tried. Please try again after some time.
Ensure you are using the latest version of the Azure Speech SDK.
Thank you!!
-
Rene Lems 0 Reputation points
2024-10-24T08:17:11.91+00:00 Hi @Avinash Devarakonda . I'm using the AI Speech Studio in the browser. Below is the SSML, and it's still the same for me. Here's the audio file of the word move: https://drive.google.com/file/d/1IeYXlRLabP0O_78RTcIYudqQ9pHw3LKK/view?usp=sharing
<!--ID=B7267351-473F-409D-9765-754A8EBCDE05;Version=1|{"VoiceNameToIdMapItems":[{"Id":"8b8dfa1b-b07b-4fa8-b47f-c90666bc2488","Name":"Microsoft Server Speech Text to Speech Voice (en-US, AndrewMultilingualNeural)","ShortName":"en-US-AndrewMultilingualNeural","Locale":"en-US","VoiceType":"StandardVoice"}]}--> <!--ID=FCB40C2B-1F9F-4C26-B1A1-CF8E67BE07D1;Version=1|{"Files":{}}--> <!--ID=5B95B1CC-2C7B-494F-B746-CF22A0E779B7;Version=1|{"Locales":{"en-US":{"AutoApplyCustomLexiconFiles":[{}]},"de-DE":{"AutoApplyCustomLexiconFiles":[{}]}}}--> <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-AndrewMultilingualNeural">Move</voice></speak>
The audio sounds the same for me with the latest SDK.
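For anyone reproducing this outside Speech Studio, a minimal sketch: the export above carries Speech Studio metadata in XML comments ahead of the `<speak>` element. The Python below strips those comments and checks that what remains is well-formed SSML, which gives a bare document to send through the SDK or REST API. The helper name `strip_studio_comments` is hypothetical, and whether the comments influence synthesis at all is an assumption that this thread does not confirm.

```python
import re
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def strip_studio_comments(exported: str) -> str:
    """Remove XML comments and return the bare <speak> document (hypothetical helper)."""
    bare = re.sub(r"<!--.*?-->", "", exported, flags=re.DOTALL).strip()
    # ET.fromstring raises ParseError if the remainder is not well-formed XML.
    root = ET.fromstring(bare)
    if root.tag != f"{{{SSML_NS}}}speak":
        raise ValueError(f"expected an SSML <speak> root, got {root.tag}")
    return bare


# One of the metadata comments from the export, followed by the <speak> element.
exported = (
    '<!--ID=FCB40C2B-1F9F-4C26-B1A1-CF8E67BE07D1;Version=1|{"Files":{}}--> '
    '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="en-US">'
    '<voice name="en-US-AndrewMultilingualNeural">Move</voice></speak>'
)

minimal = strip_studio_comments(exported)
print(minimal)
```

If the bare document still sounds distorted when synthesized, the Studio metadata can be ruled out as a factor.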
-
Avinash Devarakonda 330 Reputation points • Microsoft Vendor
2024-10-24T17:04:28.9666667+00:00 Hi Rene Lems,
I tried with only the word "Move", as you said, and I am seeing the same issue. When I add "Move" alongside other words, I no longer see any issue. Below is the SSML for that test.
<!--ID=B7267351-473F-409D-9765-754A8EBCDE05;Version=1|{"VoiceNameToIdMapItems":[{"Id":"8b8dfa1b-b07b-4fa8-b47f-c90666bc2488","Name":"Microsoft Server Speech Text to Speech Voice (en-US, AndrewMultilingualNeural)","ShortName":"en-US-AndrewMultilingualNeural","Locale":"en-US","VoiceType":"StandardVoice"}]}--> <!--ID=FCB40C2B-1F9F-4C26-B1A1-CF8E67BE07D1;Version=1|{"Files":{}}--> <!--ID=5B95B1CC-2C7B-494F-B746-CF22A0E779B7;Version=1|{"Locales":{"en-US":{"AutoApplyCustomLexiconFiles":[{}]}}}--> <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-AndrewMultilingualNeural">hi, how are you move and move to the right, take a break move</voice></speak>
Thank you!!
-
Rene Lems 0 Reputation points
2024-10-24T18:52:48.8266667+00:00 Thank you for your reply. Unfortunately I need the word individually and not in a sentence, so that doesn't work for me. Are you from Microsoft? If so, is there any plan to improve on the voice? It was working great before half a year ago or something. I remember that all of a sudden the voice changed overnight, for the worse I'd say. I remember other users complaining about this. Like here for instance: https://learn.microsoft.com/en-us/answers/questions/1632552/the-andrew-neutral-voice-is-not-as-good-as-it-used
Best,
René
-
Avinash Devarakonda 330 Reputation points • Microsoft Vendor
2024-10-25T16:02:08.6066667+00:00 Hi Rene Lems,
In Azure Speech Studio, single-word inputs may occasionally sound less natural or lack clear intonation. This is because the TTS engine performs better with contextual information: it can generate more accurate pronunciation, rhythm, and intonation when given full sentences or phrases.
Use sentences or phrase groups rather than isolated words. This provides the model with enough context to produce smoother, more natural-sounding audio.
Thank you!!
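To check the isolated-word explanation above against a specific word, one option is to synthesize the same word twice, alone and inside a sentence, and compare the audio. A minimal Python sketch that only builds the two SSML documents; `build_ssml`, `comparison_pair`, and the carrier sentence are hypothetical names chosen for illustration, not part of any Azure API.

```python
from xml.sax.saxutils import escape

VOICE = "en-US-AndrewMultilingualNeural"


def build_ssml(text: str, voice: str = VOICE, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document for one voice."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">'
        f'<voice name="{voice}">{escape(text)}</voice></speak>'
    )


def comparison_pair(word: str, carrier: str = "Please say {word} again.") -> dict[str, str]:
    """Return SSML for the word alone and inside a (hypothetical) carrier sentence."""
    return {
        "isolated": build_ssml(word),
        "in_context": build_ssml(carrier.format(word=word)),
    }


for label, ssml in comparison_pair("Move").items():
    print(label, ssml)
```

Each document can then be pasted into Speech Studio or sent through the SDK, so the two recordings differ only in the surrounding context.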
-
Avinash Devarakonda 330 Reputation points • Microsoft Vendor
2024-10-28T16:17:02.53+00:00 Hi Rene Lems,
We haven’t heard from you since the last response and were just checking back to see if the given response was helpful.
Thank you!!
-
Alexis Toro 0 Reputation points
2024-11-04T13:39:07.0133333+00:00 Hello @Avinash Devarakonda, I am working with René on the project. The issue is not limited to isolated words; we encounter it regularly at the beginning of sentences. For example, in the following SSML:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US"> <voice name="en-US-AndrewMultilingualNeural"> <prosody rate="8.00%"> Indeed! we can better understand how this technology works with concrete examples.<break time="500ms" /> Together let’s increase the sustainability rating of this data center. <break time="1000ms" />We start with something that's often unused: waste heat.<break time="500ms" /> With all the servers running non-stop, data centers produce heat.<break time="500ms" /> And a lot of it.<break time="850ms" /> To illustrate: a mid-sized data center (with something along 2000-10000 servers) could heat around 12000 households! <break time="1000ms" /> </prosody> </voice> </speak>
You can hear that "Indeed" is pretty distorted.
-
Rene Lems 0 Reputation points
2024-11-25T12:55:50.4466667+00:00 Sorry, I was out of office for a while, so that's why I didn't respond to your answer. Alexis has added another comment about the same project. He's given another example. The single word is often used in combination with a sentence afterwards. Seems like a pretty common use case to me. Is that not enough context? Any way to improve on the way Andrew says these words?
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US"> <voice name="en-US-AndrewMultilingualNeural"> <prosody rate="8.00%"> Indeed! we can better understand how this technology works with concrete examples.<break time="500ms" /> Together let’s increase the sustainability rating of this data center. <break time="1000ms" />We start with something that's often unused: waste heat.<break time="500ms" /> With all the servers running non-stop, data centers produce heat.<break time="500ms" /> And a lot of it.<break time="850ms" /> To illustrate: a mid-sized data center (with something along 2000-10000 servers) could heat around 12000 households! <break time="1000ms" /> </prosody> </voice> </speak>
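Since the distortion is reported at the start of sentences, one diagnostic approach is to split SSML like the above at each `<break>` and synthesize the chunks one at a time, to see which openings are affected. A minimal Python sketch under that assumption; `text_segments` is a hypothetical helper, the sample is a shortened copy of the SSML above, and the code assumes the `<speak><voice><prosody>` layout used in this thread.

```python
import xml.etree.ElementTree as ET

NS = {"s": "http://www.w3.org/2001/10/synthesis"}


def text_segments(ssml: str) -> list[str]:
    """Return the text chunks inside <prosody>, split at each <break>."""
    root = ET.fromstring(ssml)
    prosody = root.find("s:voice/s:prosody", NS)
    if prosody is None:
        raise ValueError("expected <speak><voice><prosody> structure")
    chunks = [prosody.text or ""]
    for child in prosody:
        # Text that follows a child element (here, a <break>) is stored in its tail.
        chunks.append(child.tail or "")
    return [c.strip() for c in chunks if c.strip()]


# Shortened copy of the SSML from this thread.
sample = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-AndrewMultilingualNeural"><prosody rate="8.00%">'
    'Indeed! we can better understand how this technology works with concrete examples.'
    '<break time="500ms" /> And a lot of it.<break time="850ms" />'
    '</prosody></voice></speak>'
)

for segment in text_segments(sample):
    # The first token of each chunk is where the distortion was reported.
    print(segment.split()[0], "|", segment)
```

Each chunk can be wrapped in its own `<speak>` document and synthesized separately, which narrows the report down to specific sentence openings.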