How can an application using the Azure Communication Services Play APIs determine which words have been spoken by TTS?

Question

Hi,

We're using Azure Communication Services to receive calls from users. When a user calls in, we use the ACS Play API - https://learn.microsoft.com/en-ca/azure/communication-services/concepts/call-automation/play-action for TTS.

We need to accurately track which words the TTS has spoken to the caller in order to handle interruptions by the user.

The PlayStarted and PlayCompleted events don't provide sufficient granularity. We need to determine exactly which words were spoken, and when, by the TTS. Is there an option to receive transcription data from the TTS in the same way as receiving transcription data for a human caller?

Answer

Hi Sameer,

Welcome to Microsoft Q&A,

You can roughly track what has been spoken by inserting pauses with the SSML break element, which lets you control the pacing of Azure Communication Services TTS. Place breaks after the key phrases you want to time against, and correlate them with the PlayStarted and PlayCompleted events. For instance, put a break after every sentence. Log the time of each event to roughly work out when each section was read. This strikes a balance between simplicity and accuracy without requiring extensive setup, and it can be plugged into your backend via the ACS Play API.
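
As a rough illustration of the break-per-sentence idea, here is a minimal Python sketch that builds the SSML string you would pass to the Play API. It is only a sketch under assumptions: the helper names `split_sentences` and `build_ssml`, the naive sentence splitter, the `en-US-JennyNeural` voice and the 300 ms pause are all illustrative choices, not something ACS requires.

```python
import re
from xml.sax.saxutils import escape


def split_sentences(text):
    """Naively split text into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def build_ssml(text, voice="en-US-JennyNeural", break_ms=300):
    """Return an SSML document with a <break/> after every sentence.

    The voice name and break length are assumptions for this example.
    """
    sentences = split_sentences(text)
    # Escape each sentence so characters such as '&' do not break the XML.
    body = "".join(
        f'{escape(s)}<break time="{break_ms}ms"/>' for s in sentences
    )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name="{voice}">{body}</voice></speak>'
    )


if __name__ == "__main__":
    print(build_ssml("Your order has shipped. It should arrive on Friday."))
```

Keeping the list returned by `split_sentences` alongside the time you receive PlayStarted gives you the sections to time against.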

Then enable real-time transcription for the user's speech on the same call using Azure's Call Automation features. The transcription data is streamed to your application over a WebSocket connection. Correlate the TTS events with the incoming transcription events to handle user interruptions accurately.
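
The correlation step could look roughly like the Python sketch below. It assumes you recorded the time at which PlayStarted arrived and the time of the first transcription event from the caller. The speaking rate (150 words per minute), the pause length, and the names `Segment`, `estimate_timeline` and `spoken_before` are assumptions for illustration. The result is only an estimate, since the real speaking rate depends on the voice and the text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    start_s: float  # estimated offset from PlayStarted, in seconds
    end_s: float


def estimate_timeline(sentences, words_per_minute=150.0, break_s=0.3):
    """Estimate when each sentence starts and ends, assuming a constant speaking rate."""
    timeline = []
    t = 0.0
    for sentence in sentences:
        duration = len(sentence.split()) * 60.0 / words_per_minute
        timeline.append(Segment(sentence, t, t + duration))
        t += duration + break_s  # account for the pause after the sentence
    return timeline


def spoken_before(timeline, play_started_at, interrupted_at):
    """Return (sentences fully spoken, sentence in progress or None) at the interruption."""
    elapsed = interrupted_at - play_started_at
    completed = [seg.text for seg in timeline if seg.end_s <= elapsed]
    in_progress = next(
        (seg.text for seg in timeline if seg.start_s <= elapsed < seg.end_s),
        None,
    )
    return completed, in_progress


if __name__ == "__main__":
    tl = estimate_timeline(["Your order has shipped.", "It should arrive on Friday."])
    # Timestamps are seconds on any shared clock, e.g. time.monotonic().
    print(spoken_before(tl, play_started_at=10.0, interrupted_at=12.5))
```

If the caller starts speaking during a pause, `in_progress` is `None` and only the completed sentences are reported.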

For reference:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup

If you need any further assistance, do let me know.
