Hi Sameer,
Welcome to Microsoft Q&A,
You can approximate which part of the text is being spoken by inserting pauses with the SSML <break> element, which controls the pacing of Azure Communication Services TTS. Place breaks after the key phrases you want to time events against, and correlate them with the PlayStarted and PlayCompleted events. For instance, put <break time="500ms"/> after every sentence. Log the timestamp of each event to estimate roughly when each section is read. This strikes a balance between simplicity and accuracy without extensive setup, and it can be plugged into your backend via the ACS Play API.
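As a rough illustration of the break placement described above, here is a minimal Python sketch that builds an SSML string with a pause after every sentence. The helper name build_ssml, the default voice en-US-JennyNeural, and the 500 ms pause are assumptions chosen for the example; it only produces the SSML text and does not call any ACS API.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=500, voice="en-US-JennyNeural", lang="en-US"):
    """Return an SSML document with a <break> after each sentence.

    The voice name and pause length are illustrative defaults (assumptions),
    not values taken from the original question.
    """
    # Escape each sentence so characters such as & or < stay valid XML.
    body = "".join(
        f'{escape(sentence)}<break time="{pause_ms}ms"/>' for sentence in sentences
    )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{body}</voice>'
        "</speak>"
    )


if __name__ == "__main__":
    print(build_ssml(["Your order has shipped.", "It should arrive on Friday."]))
```

The resulting string can then be passed as the SSML source of your play request, and the timestamps you log for PlayStarted and PlayCompleted give you the window in which these sentences were read.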
In addition, enable real-time transcription of the user's speech on the same call using Azure's Call Automation features. The transcription data is streamed to your backend over a WebSocket connection. Correlate the TTS events with the incoming transcription events to handle user interruptions accurately.
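The correlation step could look something like the following sketch. It assumes your backend has already recorded, as plain timestamps in seconds, when PlayStarted and PlayCompleted were received and when a transcription event arrived. The names PlayWindow and interruption_offset are hypothetical helpers for this example, not part of any ACS SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlayWindow:
    """Timestamps (in seconds) logged by your backend for one play operation."""
    started_at: float                      # when PlayStarted was received
    completed_at: Optional[float] = None   # None while playback is still running


def interruption_offset(window: PlayWindow, speech_at: float) -> Optional[float]:
    """Return how many seconds into playback the user spoke.

    Returns None when the speech falls outside the playback window,
    meaning it was not an interruption of this prompt.
    """
    if speech_at < window.started_at:
        return None
    if window.completed_at is not None and speech_at > window.completed_at:
        return None
    return speech_at - window.started_at


if __name__ == "__main__":
    window = PlayWindow(started_at=10.0, completed_at=15.0)
    print(interruption_offset(window, 12.5))
```

Comparing the returned offset with the cumulative length of each sentence plus its break gives a rough estimate of which section the user interrupted.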
For reference:
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup
If you need any further assistance, do let me know.