Why has the batch synthesis TTS word.json output changed when using <break /> tags in SSML?

swal 0 Reputation points
2025-01-09T18:54:17.3133333+00:00

A month ago I was using the batch synthesis TTS API and was receiving correct responses in the word.json file. Today I am receiving different responses in word.json.

I haven't changed my code at all.

The audio output is correct.

Here's my SSML input:

<speak version='1.0' xml:lang='en-US'><voice name='en-US-AvaNeural'>First paragraph<break strength="strong" />this is a paragraph<break strength="strong" />this is another paragraph<break strength="strong" /></speak>


Here's the word.json output

[
  {
    "Text": "First",
    "AudioOffset": 50,
    "Duration": 400
  },
  {
    "Text": "paragraphthis is a paragraphthis is another paragraph",
    "AudioOffset": 462,
    "Duration": 850
  },
  {
    "Text": "this",
    "AudioOffset": 2362,
    "Duration": 250
  },
  {
    "Text": "is",
    "AudioOffset": 2625,
    "Duration": 100
  },
  {
    "Text": "a",
    "AudioOffset": 2737,
    "Duration": 62
  },
  {
    "Text": "paragraphthis is another paragraph",
    "AudioOffset": 2812,
    "Duration": 900
  },
  {
    "Text": "this",
    "AudioOffset": 4712,
    "Duration": 275
  },
  {
    "Text": "is",
    "AudioOffset": 5000,
    "Duration": 87
  },
  {
    "Text": "another",
    "AudioOffset": 5100,
    "Duration": 325
  },
  {
    "Text": "paragraph",
    "AudioOffset": 5437,
    "Duration": 875
  }
]

As you can see in the output, the text is repeated where the <break /> tag is used.
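
Until the service output is corrected, the merged entries can be detected and trimmed on the client side. The following is a minimal sketch, not an official fix: the function names `find_merged_entries` and `repair_merged_entries` are hypothetical, and the repair is a heuristic that assumes a merged `Text` is the real word followed by the text of the words that come next. It leaves `AudioOffset` and `Duration` untouched.

```python
def find_merged_entries(words):
    """Return the indexes of entries whose Text contains whitespace.

    A word-boundary entry is expected to hold a single token, so
    whitespace inside Text is treated as a sign of merged text.
    """
    return [i for i, w in enumerate(words)
            if any(c.isspace() for c in w["Text"])]


def repair_merged_entries(words):
    """Heuristic repair: cut a merged Text where the next entry's Text starts.

    Assumption: the merged Text is the real word followed by the text of
    the following words. Entries are copied; the input list is not modified.
    """
    repaired = [dict(w) for w in words]
    # Walk backwards so the next entry is already repaired when it is used.
    for i in range(len(repaired) - 2, -1, -1):
        text = repaired[i]["Text"]
        if not any(c.isspace() for c in text):
            continue
        next_text = repaired[i + 1]["Text"]
        # Search from position 1 so at least one character is kept.
        cut = text.find(next_text, 1)
        if cut > 0:
            repaired[i]["Text"] = text[:cut]
    return repaired
```

For the word.json shown above, loaded with `json.load`, `find_merged_entries` returns `[1, 5]`, and `repair_merged_entries` turns both merged entries back into `"paragraph"`. The heuristic can cut too early if the real word itself contains the next word, so the result should be checked against the SSML text.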

Azure AI Speech

1 answer

  1. Saideep Anchuri 4,940 Reputation points Microsoft External Staff
    2025-01-12T15:13:58.11+00:00

    Hi swal

    Here is an update. I was able to get the expected output with the input and command below.

     input - "<speak version=\"1.0\" xml:lang=\"en-US\"><voice name=\"en-US-JennyNeural\">The rainbow has<break strength=\"strong\"/>seven colors.<break strength=\"strong\"/>Each color has its own beauty.<break strength=\"strong\"/></voice></speak>"
    
    
    curl -v -X PUT \
      -H "Ocp-Apim-Subscription-Key: yoursubkey" \
      -H "Content-Type: application/json" \
      -d '{
        "description": "my ssml test",
        "inputKind": "SSML",
        "inputs": [
            {
                "content": "<speak version=\"1.0\" xml:lang=\"en-US\"><voice name=\"en-US-JennyNeural\">The rainbow has<break strength=\"strong\"/>seven colors.<break strength=\"strong\"/>Each color has its own beauty.<break strength=\"strong\"/></voice></speak>"
            }
        ],
        "properties": {
            "outputFormat": "riff-24khz-16bit-mono-pcm",
            "wordBoundaryEnabled": true,
            "sentenceBoundaryEnabled": false,
            "concatenateResult": false,
            "decompressOutputFiles": false
        }
    }' \
      "https://northeurope.api.cognitive.microsoft.com/texttospeech/batchsyntheses/idm0756?api-version=2024-04-01"
    
    output- [
      {
        "Text": "The",
        "AudioOffset": 50,
        "Duration": 137
      },
      {
        "Text": "rainbow",
        "AudioOffset": 200,
        "Duration": 350
      },
      {
        "Text": "has",
        "AudioOffset": 562,
        "Duration": 475
      },
      {
        "Text": "seven",
        "AudioOffset": 2050,
        "Duration": 362
      },
      {
        "Text": "colors",
        "AudioOffset": 2425,
        "Duration": 612
      },
      {
        "Text": ".",
        "AudioOffset": 3050,
        "Duration": 100
      },
      {
        "Text": "Each",
        "AudioOffset": 4900,
        "Duration": 287
      },
      {
        "Text": "color",
        "AudioOffset": 5200,
        "Duration": 350
      },
      {
        "Text": "has",
        "AudioOffset": 5562,
        "Duration": 175
      },
      {
        "Text": "its",
        "AudioOffset": 5750,
        "Duration": 150
      },
      {
        "Text": "own",
        "AudioOffset": 5912,
        "Duration": 162
      },
      {
        "Text": "beauty",
        "AudioOffset": 6087,
        "Duration": 462
      },
      {
        "Text": ".",
        "AudioOffset": 6562,
        "Duration": 100
      }
    ]
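
The same PUT request can be sketched in Python using only the standard library. This is an illustrative sketch rather than an official client: the function name `build_batch_synthesis_request` and its parameters are assumptions, while the URL pattern, headers, and body fields are copied from the curl command above. Building the body with `json.dumps` avoids escaping the quotes in the SSML by hand.

```python
import json
import urllib.request


def build_batch_synthesis_request(ssml, synthesis_id, region, key,
                                  api_version="2024-04-01"):
    """Build (but do not send) the PUT request that creates a batch synthesis."""
    body = {
        "description": "my ssml test",
        "inputKind": "SSML",
        "inputs": [{"content": ssml}],
        "properties": {
            "outputFormat": "riff-24khz-16bit-mono-pcm",
            "wordBoundaryEnabled": True,  # needed to get word.json
            "sentenceBoundaryEnabled": False,
            "concatenateResult": False,
            "decompressOutputFiles": False,
        },
    }
    url = (f"https://{region}.api.cognitive.microsoft.com/texttospeech/"
           f"batchsyntheses/{synthesis_id}?api-version={api_version}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request, for example with `region="northeurope"` and `synthesis_id="idm0756"` as in the curl command.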
    
    
    

    Thank You.
