How can I perform speech recognition to capture speech input during a telephony phone call in Microsoft Azure?
Things We Have Done:
- Created an Azure Communication Service (ACS) instance and acquired an active phone number.
- Set up an event subscription to host the callback link required to interact with the purchased phone number.
- Deployed Azure Speech Service to generate an endpoint and API key for text-to-speech and speech-to-text functionalities.
- Developed Python and C# code to integrate these functionalities and to connect to and interact on the phone number.
Achievements with Python Code:
- Successfully connected calls.
- Enabled basic text-to-speech for trial questions.
Challenges with Python Code:
- Debugging an issue where the converted text-to-speech audio files for questions are not playing on the call.
Help Needed for Python Code:
- Understanding if the issue with playing media on the call is due to limitations in the Azure Communication Service SDK for Python and, if so, identifying possible workarounds.
Achievements with C# Code:
- Successfully connected calls.
- Enabled the bot to ask questions during the call (extracting questions/prompts from Excel and playing them in the call).
Challenges with C# Code:
- The call appears to be listening for responses, but we are unable to capture what the user actually says during the call (user speech input).
- Despite providing an InitialSilenceTimeout of 10 seconds, once the bot finishes reading a prompt, if I say nothing it moves on to the next question within 1-2 seconds and does not re-prompt the current question. Even if I try to say something within that 1-2 second window, the bot does not appear to receive my speech input and moves on to the next question regardless.
Help Needed for C# Code:
- Validating if we are using the correct service (Azure Speech Service) for speech-to-text integration.
- Guidance on how to capture real-time speech-to-text responses effectively during a call.
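On the validation question: ACS Call Automation has a built-in Recognize API that does speech-to-text on the call audio itself, backed by an Azure AI Services (Cognitive Services) resource that must be attached when the call is answered; the Speech SDK's microphone-based recognizers are not used for telephony. If that resource is missing or not linked, the recognize operation fails almost immediately with a RecognizeFailed event, which could explain the 1-2 second skip despite a 10-second InitialSilenceTimeout. A sketch of the flow using the Python SDK — all values are placeholders, and parameter names follow the `azure-communication-callautomation` 1.x package, so verify them against your installed version:

```python
# Sketch only: requires `pip install azure-communication-callautomation`
# and live ACS resources; every "<...>" value below is a placeholder.
from azure.communication.callautomation import (
    CallAutomationClient,
    PhoneNumberIdentifier,
    RecognizeInputType,
    TextSource,
)

client = CallAutomationClient.from_connection_string("<ACS_CONNECTION_STRING>")

# When answering the incoming call, the Cognitive Services endpoint must be
# attached or speech recognition on the call will fail immediately:
# client.answer_call(
#     incoming_call_context="<FROM_INCOMING_CALL_EVENT>",
#     callback_url="<DEVTUNNEL_URL>/api/callbacks",
#     cognitive_services_endpoint="<AZURE_AI_SERVICES_ENDPOINT>",
# )

call_connection = client.get_call_connection("<CALL_CONNECTION_ID>")

# Ask the question and listen for a spoken answer in one operation.
call_connection.start_recognizing_media(
    input_type=RecognizeInputType.SPEECH,
    target_participant=PhoneNumberIdentifier("<CALLER_PHONE_NUMBER>"),
    play_prompt=TextSource(
        text="Please answer the question after the tone.",
        voice_name="en-US-JennyNeural",
    ),
    initial_silence_timeout=10,  # seconds of silence before RecognizeFailed
    end_silence_timeout=1,       # seconds of silence that ends the utterance
    speech_language="en-US",
)
```

The result does not come back from this call synchronously; it arrives later as a RecognizeCompleted (or RecognizeFailed) event on your callback URI, so the reprompt-on-silence logic belongs in the callback handler.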
Additional Context:
- The Python approach is relatively new (a 2-day-old effort); we pivoted to it after hitting roadblocks with the C# implementation despite 5-6 days of extensive debugging.
- The intended solution is telephony-based: speech input must come from a phone on a live call, not from the microphone of a laptop/computer.
- For testing, we follow the setup instructions in the GitHub reference link below, including setting up an Azure DevTunnel and running the app. With those steps followed and the Azure services configured properly, calling the ACS phone number connects successfully.
- GitHub reference link for C#: https://github.com/Azure-Samples/communication-services-dotnet-quickstarts/tree/main/callautomation-openai-sample-csharp
- Python Version: 3.12.6
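Since recognition results arrive as CloudEvents POSTed to the callback URI (the DevTunnel URL), capturing "what the user said" means parsing those callback bodies. A minimal stdlib sketch — the field names (`recognitionType`, `speechResult.speech`) follow the documented Microsoft.Communication.RecognizeCompleted payload, but confirm them against the JSON your own callback actually receives:

```python
import json

# ACS Call Automation delivers recognition results asynchronously as an
# array of CloudEvents POSTed to the registered callback URI. The recognized
# utterance sits in data.speechResult.speech of a RecognizeCompleted event.

SAMPLE_CALLBACK_BODY = json.dumps([{
    "type": "Microsoft.Communication.RecognizeCompleted",
    "data": {
        "recognitionType": "speech",
        "speechResult": {"speech": "my answer is yes"},
        "callConnectionId": "example-call-connection-id",
    },
}])

def extract_speech(body: str) -> list:
    """Return recognized phrases found in a callback request body."""
    phrases = []
    for event in json.loads(body):
        if event.get("type") == "Microsoft.Communication.RecognizeCompleted":
            data = event.get("data", {})
            if data.get("recognitionType") == "speech":
                phrases.append(data.get("speechResult", {}).get("speech"))
    return phrases

print(extract_speech(SAMPLE_CALLBACK_BODY))  # expected: ['my answer is yes']
```

A RecognizeFailed event on the same endpoint carries a result code (e.g. the initial-silence timeout), which is the natural hook for re-prompting the current question instead of advancing.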
C# approach code:
using Azure;
using Azure.AI.OpenAI;
using Azure.Communication;
using Azure.Communication.CallAutomation;
using Azure.Messaging;
using Azure.Messaging.EventGrid;
using Azure.Messaging.EventGrid.SystemEvents;
using Microsoft.AspNetCore.Mvc;
using System.ComponentModel.DataAnnotations;
using System.Text.RegularExpressions;
using Microsoft.CognitiveServices.Speech;
var builder = WebApplication.CreateBuilder(args);
int currentQuestionId = 1; // Start with the first question
var excelFilePath = builder.Configuration.GetValue