How can an application using the Azure Communication Services Play APIs determine which words have been spoken by TTS?

Sameer 0 Reputation points
2025-02-27T15:48:58.98+00:00

Hi,

We're using Azure Communication Services to receive calls from users. When a user calls in, we use the ACS Play API - https://learn.microsoft.com/en-ca/azure/communication-services/concepts/call-automation/play-action - for TTS.

We need to accurately track which words were spoken by the TTS to the caller in order to handle interruptions by the user.

The PlayStarted and PlayCompleted events don't provide sufficient granularity. We need to determine exactly which words were spoken, and when, by the TTS. Is there an option to receive transcription data from the TTS in the same way as receiving transcription data for a human caller?

Azure Communication Services
An Azure communication platform for deploying applications across devices and platforms.

2 answers

  1. Siva Nair 575 Reputation points Microsoft External Staff
    2025-02-27T16:51:16.1066667+00:00

    Hi Sameer,

    Welcome to Microsoft Q&A,

    You can approximate when each part of the prompt is spoken by inserting pauses with the SSML <break> element to control pacing - for instance, putting <break time="500ms"/> after every sentence. Place breaks after the key phrases you want to time, play each segment as its own Play request, and correlate the segments with their PlayStarted and PlayCompleted events, logging the timestamp of each event to roughly determine when each section was read. This strikes a balance between simplicity and accuracy without extensive setup, and it plugs into your backend via the ACS Play API. A rough sketch follows.
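
    Here's a minimal sketch of that approach in Python. It assumes the azure-communication-callautomation SDK; SsmlSource, play_media_to_all, and operation_context are the names in that SDK, but double-check them against the version you're using, and the voice name and break length are just examples.

    ```python
    # Sketch: play a prompt one sentence at a time so each segment gets its own
    # PlayStarted/PlayCompleted event pair. Assumes the Python
    # azure-communication-callautomation SDK; verify names against your version.
    import time
    from azure.communication.callautomation import CallAutomationClient, SsmlSource

    client = CallAutomationClient.from_connection_string("<acs-connection-string>")

    def play_sentence(call_connection_id: str, index: int, sentence: str) -> None:
        """Play one sentence, with a trailing 500ms break, as its own Play request."""
        ssml = (
            '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
            'xml:lang="en-US"><voice name="en-US-JennyNeural">'
            f'{sentence}<break time="500ms"/></voice></speak>'
        )
        call_connection = client.get_call_connection(call_connection_id)
        # operation_context round-trips in the PlayStarted/PlayCompleted webhook
        # events, so each event can be mapped back to the sentence it covers.
        call_connection.play_media_to_all(
            SsmlSource(ssml_text=ssml),
            operation_context=f"sentence-{index}",
        )
        print(f"queued sentence {index} at {time.time():.3f}")

    # In your callback handler, log the timestamp of each PlayStarted and
    # PlayCompleted event, and only queue the next sentence after the previous
    # PlayCompleted arrives, so each logged interval brackets exactly one sentence.
    ```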

    Enable real-time transcription for the user's speech on the same call using Azure's Call Automation features. The transcription data is sent via the same WebSocket connection. Correlate the TTS events with the incoming transcription events to handle user interruptions accurately.
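
    And a small sketch of the correlation side, assuming the transcription messages arrive over your WebSocket as JSON with a "TranscriptionData" kind and a nested text field, as described in the real-time transcription docs - adjust the field names to the actual payload you receive:

    ```python
    # Sketch: decide which TTS sentence the caller interrupted by combining
    # PlayStarted events with incoming transcription messages. The message
    # shape ("kind", "transcriptionData", "text") is an assumption taken from
    # the real-time transcription how-to; verify it against your payloads.
    import json
    import time

    current = {"sentence": None, "started_at": None}

    def on_play_started(operation_context: str) -> None:
        # operation_context (e.g. "sentence-3") was set when the Play was issued.
        current["sentence"] = operation_context
        current["started_at"] = time.time()

    def on_transcription_message(raw: str) -> None:
        message = json.loads(raw)
        if message.get("kind") != "TranscriptionData":
            return
        text = message["transcriptionData"]["text"]
        # The caller spoke while `current` was playing: treat this as an
        # interruption of that sentence and cancel the remaining audio.
        print(f"caller said {text!r} during {current['sentence']}")
    ```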

    For reference:

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup

    If you need any further assistance, do let me know.


  2. Grmacjon-MSFT 18,741 Reputation points
    2025-03-06T21:57:51.86+00:00

    @Sameer Ah, I see - thanks for clarifying. Currently, that specific functionality does not exist in ACS; however, the ACS engineering team is working on such a feature, which will be available later this year (no exact ETA to share at this time).

    For now, the workaround they suggest is to use the Connect API to join the call as another participant and start transcription on that leg, which will give you what you need.

    Add real-time transcription into your applications - An Azure Communication Services how-to document | Microsoft Learn


    Instead of RoomCallLocator, you should use ServerCallLocator; a rough sketch follows the link below.

    Azure Communication Services Call Automation how-to for managing calls with Call Automation - An Azure Communication Services how-to document | Microsoft Learn
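
    Here's a minimal sketch of that workaround, assuming the Python azure-communication-callautomation SDK. connect_call and the TranscriptionOptions fields reflect my reading of the SDK and may differ by version, so treat the names as assumptions and confirm them against the how-to articles above:

    ```python
    # Sketch: connect a second Call Automation leg to the ongoing call with
    # ServerCallLocator, then start transcription on that leg. Method and
    # option names are assumptions; check them against your SDK version.
    from azure.communication.callautomation import (
        CallAutomationClient,
        ServerCallLocator,
        TranscriptionOptions,
    )

    client = CallAutomationClient.from_connection_string("<acs-connection-string>")

    # ServerCallLocator (not RoomCallLocator) identifies the existing server call.
    locator = ServerCallLocator("<server-call-id>")

    props = client.connect_call(
        call_locator=locator,
        callback_url="https://<your-app>/api/callbacks",
        transcription=TranscriptionOptions(
            transport_url="wss://<your-app>/ws",  # WebSocket receiving transcripts
            transport_type="websocket",
            locale="en-US",
            start_transcription=False,
        ),
    )

    # After the CallConnected callback confirms the leg is up, start transcription.
    call_connection = client.get_call_connection(props.call_connection_id)
    call_connection.start_transcription()
    ```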

    hope that helps :)

