I have C# code working that does speech translation for both microphone audio and system audio via Azure Cognitive Services. (By “system audio,” I mean, for example, the voices of remote Web meeting participants or audio files played locally on the PC.)
The display of intermediate speech translation results for microphone audio is quite fast (within a few seconds after the source-language utterance starts), and I was able to set it up very easily with event handlers (using code similar to here). For system audio, however, the intermediate speech translation results appear much more slowly: they only start after the source utterance is finished.
Questions:
1. Is it possible to set up Azure API event handlers to do speech translation of system audio (NOT microphone audio) directly, without first saving the audio to WAV files (or to a stream) and then running speech translation on those WAV files (or that stream)?
The documentation here seems to indicate that this is not possible at present, but I would like to confirm.
I’m currently doing speech translation of system audio via code similar to the following:
using (var audioInput = AudioConfig.FromWavFileInput(curAudioFileForSpeechRecognitionProcessing))
{
    using (var recognizerFromSystemAudio = new TranslationRecognizer(config, autoDetectSourceLanguageConfig, audioInput))
    {
        ...
        recognizerFromSystemAudio.Recognizing += (s, e) =>
        {
            var lidResult = e.Result.Properties.GetProperty(PropertyId.SpeechServiceConnection_AutoDetectSourceLanguageResult);
            ...
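
For context, here is a fuller, self-contained sketch of that WAV-file-based setup. The subscription key, region, candidate/target languages, and file path are placeholders, and the program structure is simplified from my actual code:

using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Translation;

class Program
{
    static async Task Main()
    {
        // Placeholder credentials and languages; my real code reads these from settings.
        var config = SpeechTranslationConfig.FromSubscription("<subscription-key>", "<region>");
        config.AddTargetLanguage("en");
        var autoDetectSourceLanguageConfig =
            AutoDetectSourceLanguageConfig.FromLanguages(new[] { "en-US", "ja-JP" });

        // Placeholder path: a WAV file previously captured from system audio.
        var curAudioFileForSpeechRecognitionProcessing = @"C:\temp\systemAudio.wav";

        using (var audioInput = AudioConfig.FromWavFileInput(curAudioFileForSpeechRecognitionProcessing))
        using (var recognizerFromSystemAudio =
                   new TranslationRecognizer(config, autoDetectSourceLanguageConfig, audioInput))
        {
            // Intermediate (partial) results while an utterance is still in progress.
            recognizerFromSystemAudio.Recognizing += (s, e) =>
            {
                var lidResult = e.Result.Properties.GetProperty(
                    PropertyId.SpeechServiceConnection_AutoDetectSourceLanguageResult);
                Console.WriteLine($"RECOGNIZING [{lidResult}]: {e.Result.Text}");
            };

            // Final result for each utterance, including the translations.
            recognizerFromSystemAudio.Recognized += (s, e) =>
            {
                if (e.Result.Reason == ResultReason.TranslatedSpeech)
                {
                    foreach (var translation in e.Result.Translations)
                        Console.WriteLine($"TRANSLATED ({translation.Key}): {translation.Value}");
                }
            };

            var stopRecognition = new TaskCompletionSource<int>();
            recognizerFromSystemAudio.SessionStopped += (s, e) => stopRecognition.TrySetResult(0);
            recognizerFromSystemAudio.Canceled += (s, e) => stopRecognition.TrySetResult(0);

            await recognizerFromSystemAudio.StartContinuousRecognitionAsync();
            await stopRecognition.Task;   // the session stops when the WAV file is exhausted
            await recognizerFromSystemAudio.StopContinuousRecognitionAsync();
        }
    }
}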
I’d like to be able to set up speech translation of system audio more simply, via code like the following (which is what I’m currently using for microphone audio), and have the intermediate speech translation results displayed quickly, even before the utterance is finished.
using (var recognizerFromMicrophone = new TranslationRecognizer(config, autoDetectSourceLanguageConfig))
{
    ...
    recognizerFromMicrophone.Recognizing += (s, e) =>
    {
        var lidResult = e.Result.Properties.GetProperty(PropertyId.SpeechServiceConnection_AutoDetectSourceLanguageResult);
        ...
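
To show what I mean by displaying intermediate results, the Recognizing handler I’m using looks roughly like the following (the console output is simplified for illustration; "en" corresponds to a target language added via config.AddTargetLanguage("en")):

recognizerFromMicrophone.Recognizing += (s, e) =>
{
    var lidResult = e.Result.Properties.GetProperty(
        PropertyId.SpeechServiceConnection_AutoDetectSourceLanguageResult);

    // Partial recognition and partial translations, available while the
    // utterance is still in progress.
    Console.WriteLine($"RECOGNIZING [{lidResult}]: {e.Result.Text}");
    foreach (var translation in e.Result.Translations)
        Console.WriteLine($"  partial ({translation.Key}): {translation.Value}");
};

await recognizerFromMicrophone.StartContinuousRecognitionAsync();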
2. If setting up speech translation of system audio via event handlers (without reading from WAV files or a stream) is not possible, do you have any ideas on how I can speed up the display of intermediate speech translation results for system audio?
Ideally, I would NOT need to do the following steps myself (a rough sketch of this manual workflow follows the list); the Azure API would handle this automatically, just as it does for microphone audio:
a. Record the first half of the speaker's sentence.
b. Do speech translation on this audio fragment and display the intermediate speech translation results.
c. Record the remainder of the sentence.
d. Do speech translation on the audio of the complete sentence and display that result.
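
To make steps (a) and (c) concrete, here is a rough sketch of the kind of manual chunked recording I mean. RecordSystemAudioChunkAsync is a hypothetical helper; it assumes NAudio's WasapiLoopbackCapture as the system-audio source (not necessarily what I would actually use) and glosses over converting the loopback float format to the 16-bit PCM WAV that the Speech SDK expects:

using System;
using System.IO;
using System.Threading.Tasks;
using NAudio.Wave;

static class SystemAudioChunkRecorder
{
    // Records 'duration' worth of whatever is playing on the default output device
    // into a temporary WAV file; that file would then be fed to the
    // TranslationRecognizer shown earlier (steps b and d).
    public static async Task<string> RecordSystemAudioChunkAsync(TimeSpan duration)
    {
        string path = Path.Combine(Path.GetTempPath(), $"chunk_{Guid.NewGuid():N}.wav");

        using (var capture = new WasapiLoopbackCapture())
        using (var writer = new WaveFileWriter(path, capture.WaveFormat))
        {
            var stopped = new TaskCompletionSource<bool>();
            capture.DataAvailable += (s, e) => writer.Write(e.Buffer, 0, e.BytesRecorded);
            capture.RecordingStopped += (s, e) => stopped.TrySetResult(true);

            capture.StartRecording();
            await Task.Delay(duration);   // e.g., long enough for "half a sentence"
            capture.StopRecording();
            await stopped.Task;           // wait until the last buffers are written
        }

        return path;
    }
}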
NOTE: To reduce latency, I need to do both speech-to-text and machine translation in one call to the Azure service (i.e., NOT speech-to-text and machine translation as two separate calls).
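
For clarity, this single-call requirement is why I'm using SpeechTranslationConfig with AddTargetLanguage (as in the sketch above): one TranslationRecognizer session returns both the recognized text (e.Result.Text) and its translations (e.Result.Translations). The key, region, and languages below are placeholders:

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Translation;

// One configuration drives both recognition and translation in the same session,
// so no separate call to a machine translation service is needed.
var config = SpeechTranslationConfig.FromSubscription("<subscription-key>", "<region>");
config.AddTargetLanguage("en");   // more target languages can be added the same way

var autoDetectSourceLanguageConfig =
    AutoDetectSourceLanguageConfig.FromLanguages(new[] { "en-US", "ja-JP" });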
Environment I’m using:
· Windows 10 (Version 22H2 (OS Build 19045.4529))
· Microsoft .NET Framework (Version 4.8.04084)
· Microsoft Visual Studio Professional 2019 (Version 16.11.35)