How can an application using the Azure Communication Services Play APIs determine which words have been spoken by TTS?

Sameer 0 Reputation points
2025-02-27T15:48:58.98+00:00

Hi,

We're using Azure Communication Services to receive calls from users. When a user calls in, we use the ACS Play API - https://learn.microsoft.com/en-ca/azure/communication-services/concepts/call-automation/play-action - for TTS.

We need to accurately track which words were spoken by the TTS to the caller in order to handle interruptions by the user.

The PlayStarted and PlayCompleted events don't provide sufficient granularity. We need to determine exactly which words were spoken, and when, by the TTS. Is there an option to receive transcription data from the TTS in the same way as receiving transcription data for a human caller?

Azure Communication Services
An Azure communication platform for deploying applications across devices and platforms.

2 answers

  1. Siva Nair 575 Reputation points Microsoft External Staff
    2025-02-27T16:51:16.1066667+00:00

    Hi Sameer,

    Welcome to Microsoft Q&A,

    You can approximate when each part of the prompt is spoken by inserting pauses with the SSML <break> element to control pacing - for instance, putting <break time="500ms"/> after every sentence. Place breaks after the key phrases you want to time, play each segment as its own Play request, and correlate the segments with their PlayStarted and PlayCompleted events, logging the timestamp of each event to roughly determine when each section was read. This strikes a balance between simplicity and accuracy without extensive setup, and it plugs into your backend via the ACS Play API. A rough sketch follows.
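
    Here's a minimal sketch of that approach in Python. It assumes the azure-communication-callautomation SDK; SsmlSource, play_media_to_all, and operation_context are the names in that SDK, but double-check them against the version you're using, and the voice name and break length are just examples.

    ```python
    # Sketch: play a prompt one sentence at a time so each segment gets its own
    # PlayStarted/PlayCompleted event pair. Assumes the Python
    # azure-communication-callautomation SDK; verify names against your version.
    import time
    from azure.communication.callautomation import CallAutomationClient, SsmlSource

    client = CallAutomationClient.from_connection_string("<acs-connection-string>")

    def play_sentence(call_connection_id: str, index: int, sentence: str) -> None:
        """Play one sentence, with a trailing 500ms break, as its own Play request."""
        ssml = (
            '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
            'xml:lang="en-US"><voice name="en-US-JennyNeural">'
            f'{sentence}<break time="500ms"/></voice></speak>'
        )
        call_connection = client.get_call_connection(call_connection_id)
        # operation_context round-trips in the PlayStarted/PlayCompleted webhook
        # events, so each event can be mapped back to the sentence it covers.
        call_connection.play_media_to_all(
            SsmlSource(ssml_text=ssml),
            operation_context=f"sentence-{index}",
        )
        print(f"queued sentence {index} at {time.time():.3f}")

    # In your callback handler, log the timestamp of each PlayStarted and
    # PlayCompleted event, and only queue the next sentence after the previous
    # PlayCompleted arrives, so each logged interval brackets exactly one sentence.
    ```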

    Enable real-time transcription for the user's speech on the same call using Azure's Call Automation features. The transcription data is sent via the same WebSocket connection. Correlate the TTS events with the incoming transcription events to handle user interruptions accurately.
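
    And a small sketch of the correlation side, assuming the transcription messages arrive over your WebSocket as JSON with a "TranscriptionData" kind and a nested text field, as described in the real-time transcription docs - adjust the field names to the actual payload you receive:

    ```python
    # Sketch: decide which TTS sentence the caller interrupted by combining
    # PlayStarted events with incoming transcription messages. The message
    # shape ("kind", "transcriptionData", "text") is an assumption taken from
    # the real-time transcription how-to; verify it against your payloads.
    import json
    import time

    current = {"sentence": None, "started_at": None}

    def on_play_started(operation_context: str) -> None:
        # operation_context (e.g. "sentence-3") was set when the Play was issued.
        current["sentence"] = operation_context
        current["started_at"] = time.time()

    def on_transcription_message(raw: str) -> None:
        message = json.loads(raw)
        if message.get("kind") != "TranscriptionData":
            return
        text = message["transcriptionData"]["text"]
        # The caller spoke while `current` was playing: treat this as an
        # interruption of that sentence and cancel the remaining audio.
        print(f"caller said {text!r} during {current['sentence']}")
    ```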

    For reference:

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup

    If you need any further assistance, do let me know.


  2. Grmacjon-MSFT 18,741 Reputation points
    2025-03-06T21:57:51.86+00:00

    @Sameer Ah, I see - thanks for clarifying. Currently, that specific functionality does not exist in ACS; however, the ACS engineering team is working on such a feature, which will be available later this year (no exact ETA to share at this time).

    For now, the workaround they suggest is to use the Connect API to join the call as another participant and start transcription on that leg, which will give you what you need.

    Add real-time transcription into your applications - An Azure Communication Services how-to document | Microsoft Learn


    Instead of RoomCallLocator, you should use ServerCallLocator; a rough sketch follows the link below.

    Azure Communication Services Call Automation how-to for managing calls with Call Automation - An Azure Communication Services how-to document | Microsoft Learn
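
    Here's a minimal sketch of that workaround, assuming the Python azure-communication-callautomation SDK. connect_call and the TranscriptionOptions fields reflect my reading of the SDK and may differ by version, so treat the names as assumptions and confirm them against the how-to articles above:

    ```python
    # Sketch: connect a second Call Automation leg to the ongoing call with
    # ServerCallLocator, then start transcription on that leg. Method and
    # option names are assumptions; check them against your SDK version.
    from azure.communication.callautomation import (
        CallAutomationClient,
        ServerCallLocator,
        TranscriptionOptions,
    )

    client = CallAutomationClient.from_connection_string("<acs-connection-string>")

    # ServerCallLocator (not RoomCallLocator) identifies the existing server call.
    locator = ServerCallLocator("<server-call-id>")

    props = client.connect_call(
        call_locator=locator,
        callback_url="https://<your-app>/api/callbacks",
        transcription=TranscriptionOptions(
            transport_url="wss://<your-app>/ws",  # WebSocket receiving transcripts
            transport_type="websocket",
            locale="en-US",
            start_transcription=False,
        ),
    )

    # After the CallConnected callback confirms the leg is up, start transcription.
    call_connection = client.get_call_connection(props.call_connection_id)
    call_connection.start_transcription()
    ```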

    hope that helps :)

