Azure TTS: Getting non-speech audio bytes at the beginning and end of TTS speech

Tom Westrick 0 Reputation points
2025-03-06T21:22:56.8833333+00:00

We use Azure's REST API for the TTS service to generate audio for one of our products. According to our logs, starting on February 28, 2025, the returned audio began to include non-speech bytes (two audio blips) at the beginning and end when using the voice zh-CN-XiaochenMultilingualNeural in English. I have an example MP3 file, but it seems audio files cannot be uploaded here.

Here is an example request to replicate the issue:

POST /cognitiveservices/v1 HTTP/1.1
Host: eastus.tts.speech.microsoft.com
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: audio-48khz-192kbitrate-mono-mp3
Content-Length: 309

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural">
        <lang xml:lang="en-US">
            Good morning, this is for testing.
        </lang>
    </voice>
</speak>

After a lot of trial and error, it seems that when all line breaks and extra whitespace are removed from the XML, the non-speech bytes are no longer generated. This feels like a workaround rather than a permanent fix.

This works as expected:

POST /cognitiveservices/v1 HTTP/1.1
Host: eastus.tts.speech.microsoft.com
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: audio-48khz-192kbitrate-mono-mp3
Content-Length: 267

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US"><voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural"><lang xml:lang="en-US">Good morning, this is for testing.</lang></voice></speak>
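
The whitespace-stripping workaround can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper name `compact_ssml` is made up, and it assumes pretty-printed SSML where indentation only appears next to line breaks. It removes whitespace runs that contain a line break and touch a tag boundary, and leaves single spaces inside sentences alone.

```python
import re


def compact_ssml(ssml: str) -> str:
    """Collapse pretty-printed SSML onto one line (hypothetical helper).

    Only whitespace runs that include a line break and sit directly
    after '>' or directly before '<' are removed, so spaces inside
    the spoken text are preserved.
    """
    # Strip leading/trailing whitespace around the whole document.
    compact = ssml.strip()
    # Remove "newline + indentation" that follows a tag's closing '>'.
    compact = re.sub(r">[ \t]*[\r\n]\s*", ">", compact)
    # Remove "newline + indentation" that precedes a tag's opening '<'.
    compact = re.sub(r"\s*[\r\n][ \t]*<", "<", compact)
    return compact


if __name__ == "__main__":
    pretty = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice xml:lang="en-US" name="zh-CN-XiaochenMultilingualNeural">
        <lang xml:lang="en-US">
            Good morning, this is for testing.
        </lang>
    </voice>
</speak>"""
    body = compact_ssml(pretty).encode("utf-8")
    # The byte length of the compacted body is what Content-Length should carry.
    print(len(body))
    print(body.decode("utf-8"))
```

Because the request body changes, Content-Length has to be computed from the compacted bytes rather than from the original string. Whether this reliably avoids the blips is exactly the open question here; the sketch only reproduces the single-line form shown above.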

As I said, this only started happening recently, and all the other Azure voices we use seem to work fine.

My question is: can this be confirmed as a bug with this specific voice? And can my fix be considered a permanent solution, or is it just by chance that it avoids the issue?

The docs show requests being made with XML that contains line breaks and whitespace.

Azure AI Speech