Will the WordBoundary event always be triggered before the Synthesizing event?

Question

We are using the Speech SDK for text to speech, and we need to highlight the word being spoken by using the WordBoundary event.

From https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-javascript, it looks like the audio data is returned by the Synthesizing event, while the word boundary information is returned by the WordBoundary event. How can we ensure that the word boundary information is always available when we play the audio? Is the WordBoundary event always raised before the Synthesizing event?

`Synthesizing`	Signals that speech synthesis is ongoing. This event fires each time the SDK receives an audio chunk from the Speech service.	You can confirm when synthesis is in progress.
`VisemeReceived`	Signals that a viseme event was received.	Visemes are often used to represent the key poses in observed speech. Key poses include the position of the lips, jaw, and tongue in producing a particular phoneme. You can use visemes to animate the face of a character as speech audio plays.
`WordBoundary`	Signals that a word boundary was received. This event is raised at the beginning of each new spoken word, punctuation, and sentence. The event reports the current word's time offset, in ticks, from the beginning of the output audio. This event also reports the character position in the input text or SSML immediately before the word that's about to be spoken.	This event is commonly used to get relative positions of the text and corresponding audio. You might want to know about a new word, and then take action based on the timing. For example, you can get information that can help you decide when and for how long to highlight words as they're spoken.

Accepted Answer

Hi Yu Lan,

The WordBoundary event in the Azure Speech SDK is designed to be triggered before the corresponding word is spoken. It reports the word's time offset, in ticks, from the beginning of the output audio, which is the timing information needed to highlight words as they are spoken.

However, the Synthesizing event, which delivers the audio data in chunks, can be processed at a different pace, so the arrival order of the two events is not the thing to depend on. What you can rely on is that the WordBoundary event is generally fired before the audio for that word is played, and that its offset is measured against the audio timeline rather than against the moment the event arrives.

This means you should receive the word boundary information in time to highlight each word as the audio plays. Make sure your handlers for both WordBoundary and Synthesizing are registered before you start synthesis.
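One way to make highlighting independent of event arrival order is to store each boundary with its audio offset and look up the current word from the playback position. The following is a minimal sketch, not an official pattern: `createBoundaryTimeline` and `ticksToMs` are hypothetical helper names, and the fields `audioOffset`, `text`, `textOffset`, and `wordLength` are assumed to match the arguments the JavaScript SDK passes to the WordBoundary handler.

```javascript
// One tick is 100 nanoseconds, so 10,000 ticks make one millisecond.
const TICKS_PER_MS = 10000;

// Converts an audio offset in ticks to milliseconds.
function ticksToMs(ticks) {
  return ticks / TICKS_PER_MS;
}

// Hypothetical helper: keeps word boundaries sorted by their start time
// so the current word can be looked up from the playback position.
function createBoundaryTimeline() {
  const entries = [];
  return {
    // Accepts an object shaped like the WordBoundary event arguments
    // (assumed fields: audioOffset, text, textOffset, wordLength).
    add(e) {
      const entry = {
        startMs: ticksToMs(e.audioOffset),
        text: e.text,
        textOffset: e.textOffset,
        wordLength: e.wordLength,
      };
      // Insert in sorted position, in case events arrive out of order.
      let i = entries.length;
      while (i > 0 && entries[i - 1].startMs > entry.startMs) {
        i--;
      }
      entries.splice(i, 0, entry);
    },
    // Returns the last boundary that starts at or before playbackMs,
    // or null if playback has not reached the first boundary yet.
    wordAt(playbackMs) {
      let current = null;
      for (const entry of entries) {
        if (entry.startMs > playbackMs) break;
        current = entry;
      }
      return current;
    },
  };
}
```

In an application, the WordBoundary handler would call `timeline.add(e)`, and a playback timer would call `timeline.wordAt(positionMs)` with the player's current position to decide which span of the input text to highlight. How the playback position is obtained depends on how the audio is played, so that part is left as an assumption.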

Hope this helps. Do let us know if you have any further queries.

If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful".

Thank You.
