Will word boundary event always be triggered before the Synthesizing event?

Yu Lan 76 Reputation points Microsoft Employee
2024-11-06T03:09:33.8733333+00:00

We are using the Speech SDK for text to speech, and we need to highlight each word as it is spoken by using the WordBoundary event.

From https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-javascript, it looks like the audio data is returned by the Synthesizing event, while the word boundary information is returned by the WordBoundary event. How can we ensure that the word boundary information is always available when we play the audio? Is the WordBoundary event always raised before the corresponding Synthesizing event?

Synthesizing Signals that speech synthesis is ongoing. This event fires each time the SDK receives an audio chunk from the Speech service. You can confirm when synthesis is in progress.
VisemeReceived Signals that a viseme event was received. Visemes are often used to represent the key poses in observed speech. Key poses include the position of the lips, jaw, and tongue in producing a particular phoneme. You can use visemes to animate the face of a character as speech audio plays.
WordBoundary Signals that a word boundary was received. This event is raised at the beginning of each new spoken word, punctuation, and sentence. The event reports the current word's time offset, in ticks, from the beginning of the output audio. This event also reports the character position in the input text or SSML immediately before the word that's about to be spoken. This event is commonly used to get relative positions of the text and corresponding audio. You might want to know about a new word, and then take action based on the timing. For example, you can get information that can help you decide when and for how long to highlight words as they're spoken.
Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. Avinash Devarakonda 245 Reputation points Microsoft Vendor
    2024-11-06T05:56:22.9566667+00:00

    Hi Yu Lan,

    The WordBoundary event in the Azure Speech SDK is designed to be triggered before the corresponding word is spoken, providing the timing information needed to highlight words as they are spoken.

    However, the Synthesizing event, which delivers the audio data, is raised separately, each time the SDK receives an audio chunk, so the two events are not processed in lockstep. In practice, the WordBoundary event for a word is generally fired before the audio for that word is played.

    This means you should receive the word boundary information in time to highlight the word as the audio plays. Each WordBoundary event also reports the word's time offset from the beginning of the output audio, so you can schedule the highlight from the playback position. Make sure the handlers for both WordBoundary and Synthesizing are registered before you start synthesis.
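
One way to make highlighting independent of the order in which events arrive is to store each word boundary by its audio offset and look up the current word from the playback position. The following is a minimal sketch of that idea, not an official SDK pattern. `WordTimeline` is a hypothetical helper, and it assumes the event arguments expose `audioOffset` (in ticks of 100 ns), `text`, `textOffset`, and `wordLength`.

```javascript
// Hypothetical helper: a timeline of word boundaries keyed by audio offset.
// Highlighting then depends on the playback position, not on whether a
// WordBoundary event arrived before or after a given audio chunk.

const TICKS_PER_MS = 10000; // 1 tick = 100 ns, so 10,000 ticks = 1 ms

class WordTimeline {
  constructor() {
    this.words = []; // kept sorted by startMs
  }

  // Store one word boundary. `e` is assumed to be shaped like the SDK's
  // word boundary event arguments.
  add(e) {
    const word = {
      text: e.text,
      textOffset: e.textOffset,
      wordLength: e.wordLength,
      startMs: e.audioOffset / TICKS_PER_MS,
    };
    // Insert in sorted position so out-of-order arrival is harmless.
    let i = this.words.length;
    while (i > 0 && this.words[i - 1].startMs > word.startMs) i--;
    this.words.splice(i, 0, word);
  }

  // Return the most recent word that started at or before playbackMs,
  // or null if no word has started yet.
  wordAt(playbackMs) {
    let current = null;
    for (const w of this.words) {
      if (w.startMs > playbackMs) break;
      current = w;
    }
    return current;
  }
}

// Small demo with hand-written events (values are illustrative only).
const demo = new WordTimeline();
demo.add({ text: "world", textOffset: 6, wordLength: 5, audioOffset: 5000000 });
demo.add({ text: "Hello", textOffset: 0, wordLength: 5, audioOffset: 500000 });
console.log(demo.wordAt(100).text); // "Hello" (started at 50 ms)
console.log(demo.wordAt(600).text); // "world" (started at 500 ms)

// Hypothetical wiring, assuming `synthesizer` is a SpeechSynthesizer,
// `audio` is an HTMLAudioElement playing the synthesized audio, and
// `highlight` is your own UI function:
//   synthesizer.wordBoundary = (s, e) => demo.add(e);
//   audio.ontimeupdate = () => highlight(demo.wordAt(audio.currentTime * 1000));
```

With this arrangement, a word whose boundary has not arrived yet is simply not highlighted until its event is stored, so the UI never depends on a strict ordering between the two events.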

    Hope this helps. Do let us know if you have any further queries.



    Thank You.

