How to recognize and translate speech
Reference documentation | Package (NuGet) | Additional samples on GitHub
In this how-to guide, you learn how to recognize human speech and translate it to another language.
See the speech translation overview for more information about:
- Translating speech to text
- Translating speech to multiple target languages
- Performing direct speech to speech translation
Sensitive data and environment variables
The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's key and region. The Program class contains two static readonly string values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these fields are at the class scope, so they're accessible within method bodies of the class:
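public class Program
{
    static readonly string SPEECH__SUBSCRIPTION__KEY =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SUBSCRIPTION__KEY));
    static readonly string SPEECH__SERVICE__REGION =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SERVICE__REGION));
    static Task Main() => Task.CompletedTask;
}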
For more information on environment variables, see Environment variables and application configuration.
Use API keys with caution. Don't include the API key directly in your code, and never post it publicly. If you use an API key, store it securely in Azure Key Vault. For more information about using API keys securely in your apps, see API keys with Azure Key Vault.
For more information about AI services security, see Authenticate requests to Azure AI services.
Create a speech translation configuration
To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig
instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.
Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.
You can initialize SpeechTranslationConfig
in a few ways:
- With a subscription: pass in a key and the associated region.
- With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
- With a host: pass in a host address. A key or authorization token is optional.
- With an authorization token: pass in an authorization token and the associated region.
Let's look at how you create a SpeechTranslationConfig
instance by using a key and region. Get the Speech resource key and region in the Azure portal.
public class Program
{
    static readonly string SPEECH__SUBSCRIPTION__KEY =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SUBSCRIPTION__KEY));
    static readonly string SPEECH__SERVICE__REGION =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SERVICE__REGION));

    static Task Main() => TranslateSpeechAsync();
    static async Task TranslateSpeechAsync()
    {
        var speechTranslationConfig =
            SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    }
}
Change the source language
One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, interact with the SpeechTranslationConfig instance by assigning a value to its SpeechRecognitionLanguage property:
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    // Source (input) language
    speechTranslationConfig.SpeechRecognitionLanguage = "it-IT";
}
The SpeechRecognitionLanguage
property expects a language-locale format string. Refer to the list of supported speech translation locales.
Add a translation language
Another common task of speech translation is to specify target translation languages. At least one is required, but multiples are supported. The following code snippet sets both French and German as translation language targets:
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    speechTranslationConfig.SpeechRecognitionLanguage = "it-IT";
    speechTranslationConfig.AddTargetLanguage("fr");
    speechTranslationConfig.AddTargetLanguage("de");
}
With every call to AddTargetLanguage, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.
Initialize a translation recognizer
After you create a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the Speech service requires to validate your request.
If you're recognizing speech by using your device's default microphone, here's what the TranslationRecognizer instance should look like:
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    // Register every target language on the configuration.
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);
}
If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer.
First, reference the AudioConfig object as follows:
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    // Register every target language on the configuration.
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);
}
If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, when you create an AudioConfig class instance, instead of calling FromDefaultMicrophoneInput, you call FromWavFileInput and pass the filename parameter:
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    // Register every target language on the configuration.
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var audioConfig = AudioConfig.FromWavFileInput("YourAudioFile.wav");
    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);
}
Translate speech
To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call the recognize-once function and get the result:
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    // Register every target language on the configuration.
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);

    Console.Write($"Say something in '{fromLanguage}' and ");
    Console.WriteLine($"we'll translate into '{string.Join("', '", toLanguages)}'.\n");

    var result = await translationRecognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.TranslatedSpeech)
    {
        Console.WriteLine($"Recognized: \"{result.Text}\":");
        foreach (var element in result.Translations)
        {
            Console.WriteLine($"    TRANSLATED into '{element.Key}': {element.Value}");
        }
    }
}
For more information about speech to text, see the basics of speech recognition.
Event-based translation
The TranslationRecognizer
object exposes a Recognizing
event. The event fires several times and provides a mechanism to retrieve the intermediate translation results.
Intermediate translation results aren't available when you use multi-lingual speech translation.
The following example prints the intermediate translation results to the console:
// This fragment assumes that config, fromLanguage, and stopTranslation
// (for example, a TaskCompletionSource<int>) are defined by the surrounding code.
using (var audioInput = AudioConfig.FromWavFileInput(@"whatstheweatherlike.wav"))
{
    using (var translationRecognizer = new TranslationRecognizer(config, audioInput))
    {
        // Subscribes to events.
        translationRecognizer.Recognizing += (s, e) =>
        {
            Console.WriteLine($"RECOGNIZING in '{fromLanguage}': Text={e.Result.Text}");
            foreach (var element in e.Result.Translations)
            {
                Console.WriteLine($"    TRANSLATING into '{element.Key}': {element.Value}");
            }
        };

        translationRecognizer.Recognized += (s, e) => {
            if (e.Result.Reason == ResultReason.TranslatedSpeech)
            {
                Console.WriteLine($"RECOGNIZED in '{fromLanguage}': Text={e.Result.Text}");
                foreach (var element in e.Result.Translations)
                {
                    Console.WriteLine($"    TRANSLATED into '{element.Key}': {element.Value}");
                }
            }
            else if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
                Console.WriteLine($"    Speech not translated.");
            }
            else if (e.Result.Reason == ResultReason.NoMatch)
            {
                Console.WriteLine($"NOMATCH: Speech could not be recognized.");
            }
        };

        // Starts continuous recognition. Use StopContinuousRecognitionAsync() to stop recognition.
        Console.WriteLine("Start translation...");
        await translationRecognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

        // Waits for completion.
        // Use Task.WaitAny to keep the task rooted.
        Task.WaitAny(new[] { stopTranslation.Task });

        // Stops translation.
        await translationRecognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
    }
}
Synthesize translations
After a successful speech recognition and translation, the result contains all the translations in a dictionary. The Translations
dictionary key is the target translation language, and the value is the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).
Event-based synthesis
The TranslationRecognizer
object exposes a Synthesizing
event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.
Specify the synthesis voice by assigning a value to the VoiceName property, and provide an event handler for the Synthesizing event to get the audio. The following example saves the translated audio as a .wav file.
The event-based synthesis works only with a single translation. Don't add multiple target translation languages. Additionally, the VoiceName value should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    var fromLanguage = "en-US";
    var toLanguage = "de";
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    // Event-based synthesis supports a single target language only.
    speechTranslationConfig.AddTargetLanguage(toLanguage);
    speechTranslationConfig.VoiceName = "de-DE-Hedda";

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);
    translationRecognizer.Synthesizing += (_, e) =>
    {
        var audio = e.Result.GetAudio();
        Console.WriteLine($"Audio synthesized: {audio.Length:#,0} byte(s) {(audio.Length == 0 ? "(Complete)" : "")}");
        if (audio.Length > 0)
        {
            File.WriteAllBytes("YourAudioFile.wav", audio);
        }
    };

    Console.Write($"Say something in '{fromLanguage}' and ");
    Console.WriteLine($"we'll translate into '{toLanguage}'.\n");

    var result = await translationRecognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.TranslatedSpeech)
    {
        Console.WriteLine($"Recognized: \"{result.Text}\"");
        Console.WriteLine($"Translated into '{toLanguage}': {result.Translations[toLanguage]}");
    }
}
Manual synthesis
You can use the Translations
dictionary to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer
instance, the SpeechConfig
object needs to have its SpeechSynthesisVoiceName
property set to the desired voice.
The following example translates to five languages. Each translation is then synthesized to an audio file by using a neural voice for the corresponding language.
static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "de", "en", "it", "pt", "zh-Hans" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    // Register every target language on the configuration.
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);

    Console.Write($"Say something in '{fromLanguage}' and ");
    Console.WriteLine($"we'll translate into '{string.Join("', '", toLanguages)}'.\n");

    var result = await translationRecognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.TranslatedSpeech)
    {
        var languageToVoiceMap = new Dictionary<string, string>
        {
            ["de"] = "de-DE-KatjaNeural",
            ["en"] = "en-US-AriaNeural",
            ["it"] = "it-IT-ElsaNeural",
            ["pt"] = "pt-BR-FranciscaNeural",
            ["zh-Hans"] = "zh-CN-XiaoxiaoNeural"
        };

        Console.WriteLine($"Recognized: \"{result.Text}\"");
        foreach (var (language, translation) in result.Translations)
        {
            Console.WriteLine($"Translated into '{language}': {translation}");

            // Synthesize each translation with the voice mapped to its language.
            var speechConfig =
                SpeechConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
            speechConfig.SpeechSynthesisVoiceName = languageToVoiceMap[language];

            using var audioConfig = AudioConfig.FromWavFileOutput($"{language}-translation.wav");
            using var speechSynthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
            await speechSynthesizer.SpeakTextAsync(translation);
        }
    }
}
For more information about speech synthesis, see the basics of speech synthesis.
Multi-lingual translation with language identification
In many scenarios, you might not know which input languages to specify. By using language identification, you can detect up to 10 possible input languages and automatically translate to your target languages.
The following example anticipates that en-US or zh-CN should be detected because they're defined in AutoDetectSourceLanguageConfig. Then, the speech is translated to de and fr as specified in the calls to AddTargetLanguage():
var autoDetectSourceLanguageConfig = AutoDetectSourceLanguageConfig.FromLanguages(new string[] { "en-US", "zh-CN" });
var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);
For a complete code sample, see language identification.
Multi-lingual speech translation without source language candidates
Multi-lingual speech translation unlocks capabilities such as translating without a specified input language and handling language switches within the same session.
Currently when you use Language ID with speech translation, you must create the SpeechTranslationConfig
object from the v2 endpoint. Replace the string "YourServiceRegion" with your Speech resource region (such as "westus"). Replace "YourSubscriptionKey" with your Speech resource key.
var v2EndpointInString = String.Format("wss://{0}", "YourServiceRegion");
var v2EndpointUrl = new Uri(v2EndpointInString);
var speechTranslationConfig = SpeechTranslationConfig.FromEndpoint(v2EndpointUrl, "YourSubscriptionKey");
Specify the translation target languages. Replace with languages of your choice. You can add more lines.
A key differentiator with multi-lingual speech translation is that you don't need to specify the source language, because the service detects it automatically. Create the AutoDetectSourceLanguageConfig object with the FromOpenRange method to let the service know that you want to use multi-lingual speech translation with no specified source language.
AutoDetectSourceLanguageConfig autoDetectSourceLanguageConfig = AutoDetectSourceLanguageConfig.FromOpenRange();
var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);
For a complete code sample with the Speech SDK, see speech translation samples on GitHub.
Using custom translation in speech translation
The custom translation feature in speech translation integrates with the Azure Custom Translation service so that you can get more accurate, tailored translations. Because the integration uses the Azure Custom Translation service directly, you need a multi-service resource for the complete set of features to work correctly. For detailed instructions, see Create a multi-service resource for Azure AI services.
For offline training of a custom translator and for obtaining a "Category ID," see the step-by-step script in Quickstart: Build, deploy, and use a custom model - Custom Translator.
// Creates an instance of a translation recognizer using speech translation configuration
// You should use the same subscription key, which you used to generate the custom model before.
// V2 endpoint is required for the "Custom Translation" feature. Example: "wss://"
try (SpeechTranslationConfig config = SpeechTranslationConfig.fromEndpoint(URI.create(endpointUrl), speechSubscriptionKey)) {
    // Sets source and target language(s).
    // Set the category id
}
Reference documentation | Package (NuGet) | Additional samples on GitHub
In this how-to guide, you learn how to recognize human speech and translate it to another language.
See the speech translation overview for more information about:
- Translating speech to text
- Translating speech to multiple target languages
- Performing direct speech to speech translation
Sensitive data and environment variables
The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's key and region. The C++ code file contains two string values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these values are at file scope, so they're accessible within the function bodies in the file:
auto SPEECH__SUBSCRIPTION__KEY = getenv("SPEECH__SUBSCRIPTION__KEY");
auto SPEECH__SERVICE__REGION = getenv("SPEECH__SERVICE__REGION");
For more information on environment variables, see Environment variables and application configuration.
Use API keys with caution. Don't include the API key directly in your code, and never post it publicly. If you use an API key, store it securely in Azure Key Vault. For more information about using API keys securely in your apps, see API keys with Azure Key Vault.
For more information about AI services security, see Authenticate requests to Azure AI services.
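Because std::getenv returns a null pointer when a variable isn't set, it can help to fail early with a clear message instead of passing a null value on to the SDK. The following is a minimal sketch that uses only the C++ standard library. The helper names requireSetting and readEnvironmentSetting are hypothetical and aren't part of the Speech SDK.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a raw value (as returned by std::getenv)
// into a std::string, and throws if the value is missing or empty.
std::string requireSetting(const std::string& name, const char* rawValue) {
    if (rawValue == nullptr || *rawValue == '\0') {
        throw std::runtime_error("Environment variable '" + name + "' is not set.");
    }
    return std::string(rawValue);
}

// Hypothetical helper: reads a required setting from the host environment.
std::string readEnvironmentSetting(const std::string& name) {
    return requireSetting(name, std::getenv(name.c_str()));
}
```

With this approach, a call such as readEnvironmentSetting("SPEECH__SUBSCRIPTION__KEY") throws before any request is made if the variable is missing, which tends to be easier to diagnose than a later authentication failure.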
Create a speech translation configuration
To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig
instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.
Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.
You can initialize SpeechTranslationConfig
in a few ways:
- With a subscription: pass in a key and the associated region.
- With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
- With a host: pass in a host address. A key or authorization token is optional.
- With an authorization token: pass in an authorization token and the associated region.
Let's look at how you create a SpeechTranslationConfig
instance by using a key and region. Get the Speech resource key and region in the Azure portal.
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
}

int main(int argc, char** argv) {
    setlocale(LC_ALL, "");
    translateSpeech();
    return 0;
}
Change the source language
One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, interact with the SpeechTranslationConfig instance by calling the SetSpeechRecognitionLanguage method:
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    // Source (input) language
    speechTranslationConfig->SetSpeechRecognitionLanguage("it-IT");
}
The SetSpeechRecognitionLanguage method expects a language-locale format string. Refer to the list of supported speech translation locales.
Add a translation language
Another common task of speech translation is to specify target translation languages. At least one is required, but multiples are supported. The following code snippet sets both French and German as translation language targets:
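void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    speechTranslationConfig->AddTargetLanguage("fr");
    speechTranslationConfig->AddTargetLanguage("de");
}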
With every call to AddTargetLanguage
, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.
Initialize a translation recognizer
After you create a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the Speech service requires to validate your request.
If you're recognizing speech by using your device's default microphone, here's what TranslationRecognizer should look like:
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);
}
If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer.
First, reference the AudioConfig object as follows:
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto audioConfig = AudioConfig::FromDefaultMicrophoneInput();
    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig, audioConfig);
}
If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, when you create an AudioConfig class instance, instead of calling FromDefaultMicrophoneInput, you call FromWavFileInput and pass the filename parameter:
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto audioConfig = AudioConfig::FromWavFileInput("YourAudioFile.wav");
    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig, audioConfig);
}
Translate speech
To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call the recognize-once function and get the result:
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    string fromLanguage = "en-US";
    string toLanguages[3] = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);
    cout << "Say something in '" << fromLanguage << "' and we'll translate...\n";

    auto result = translationRecognizer->RecognizeOnceAsync().get();
    if (result->Reason == ResultReason::TranslatedSpeech)
    {
        cout << "Recognized: \"" << result->Text << "\"" << std::endl;
        for (auto pair : result->Translations)
        {
            auto language = pair.first;
            auto translation = pair.second;
            cout << "Translated into '" << language << "': " << translation << std::endl;
        }
    }
}
For more information about speech to text, see the basics of speech recognition.
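The translations in the result are keyed by target language, so printing them is a matter of iterating over key/value pairs. The following sketch doesn't call the Speech service. It uses a plain std::map<std::string, std::string> as a stand-in for the result's Translations member (an assumption about its shape), and formatTranslations is a hypothetical helper name.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: builds one output line per target language.
// The map key is the target language and the value is the translated text.
std::string formatTranslations(const std::map<std::string, std::string>& translations) {
    std::ostringstream out;
    for (const auto& pair : translations) {
        out << "Translated into '" << pair.first << "': " << pair.second << "\n";
    }
    return out.str();
}
```

Because std::map iterates in key order, the lines come out sorted by language code rather than in the order in which the target languages were added.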
Synthesize translations
After a successful speech recognition and translation, the result contains all the translations in a dictionary. The Translations
dictionary key is the target translation language, and the value is the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).
Event-based synthesis
The TranslationRecognizer
object exposes a Synthesizing
event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.
Specify the synthesis voice by calling SetVoiceName, and provide an event handler for the Synthesizing event to get the audio. The following example saves the translated audio as a .wav file.
The event-based synthesis works only with a single translation. Don't add multiple target translation languages. Additionally, the voice that you pass to SetVoiceName should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguage = "de";
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    // Event-based synthesis supports a single target language only.
    speechTranslationConfig->AddTargetLanguage(toLanguage);
    speechTranslationConfig->SetVoiceName("de-DE-Hedda");

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);
    translationRecognizer->Synthesizing.Connect([](const TranslationSynthesisEventArgs& e)
        {
            auto audio = e.Result->Audio;
            auto size = audio.size();
            cout << "Audio synthesized: " << size << " byte(s)" << (size == 0 ? "(COMPLETE)" : "") << std::endl;
            if (size > 0) {
                ofstream file("translation.wav", ios::out | ios::binary);
                auto audioData = audio.data();
                file.write((const char*)audioData, sizeof(audio[0]) * size);
                file.close();
            }
        });

    cout << "Say something in '" << fromLanguage << "' and we'll translate...\n";

    auto result = translationRecognizer->RecognizeOnceAsync().get();
    if (result->Reason == ResultReason::TranslatedSpeech)
    {
        cout << "Recognized: \"" << result->Text << "\"" << std::endl;
        for (auto pair : result->Translations)
        {
            auto language = pair.first;
            auto translation = pair.second;
            cout << "Translated into '" << language << "': " << translation << std::endl;
        }
    }
}
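The file-writing step inside the Synthesizing handler can be pulled out into a small function, which makes it easier to check for I/O errors. The following is a sketch that uses only the standard library. It assumes the audio arrives as a std::vector of bytes, and saveAudio is a hypothetical helper name that isn't part of the Speech SDK.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: writes an audio buffer to a binary file.
// Returns false without touching the file when the buffer is empty,
// which the sample above reports as the "complete" signal.
// Throws std::runtime_error if the file can't be opened or written.
bool saveAudio(const std::string& path, const std::vector<std::uint8_t>& audio) {
    if (audio.empty()) {
        return false;
    }
    std::ofstream file(path, std::ios::out | std::ios::binary);
    file.write(reinterpret_cast<const char*>(audio.data()),
               static_cast<std::streamsize>(audio.size()));
    if (!file) {
        throw std::runtime_error("Could not write audio to '" + path + "'.");
    }
    return true;
}
```

Inside the event handler, the body would then reduce to a single call such as saveAudio("translation.wav", audio), assuming the buffer type matches.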
Manual synthesis
You can use the Translations dictionary to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer instance, the SpeechConfig object needs to have the desired voice set by calling SetSpeechSynthesisVoiceName.
The following example translates to five languages. Each translation is then synthesized to an audio file by using a neural voice for the corresponding language.
void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "de", "en", "it", "pt", "zh-Hans" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);
    cout << "Say something in '" << fromLanguage << "' and we'll translate...\n";

    auto result = translationRecognizer->RecognizeOnceAsync().get();
    if (result->Reason == ResultReason::TranslatedSpeech)
    {
        map<string, string> languageToVoiceMap;
        languageToVoiceMap["de"] = "de-DE-KatjaNeural";
        languageToVoiceMap["en"] = "en-US-AriaNeural";
        languageToVoiceMap["it"] = "it-IT-ElsaNeural";
        languageToVoiceMap["pt"] = "pt-BR-FranciscaNeural";
        languageToVoiceMap["zh-Hans"] = "zh-CN-XiaoxiaoNeural";

        cout << "Recognized: \"" << result->Text << "\"" << std::endl;
        for (auto pair : result->Translations)
        {
            auto language = pair.first;
            auto translation = pair.second;
            cout << "Translated into '" << language << "': " << translation << std::endl;

            // Synthesize each translation with the voice mapped to its language.
            auto speechConfig =
                SpeechConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
            speechConfig->SetSpeechSynthesisVoiceName(languageToVoiceMap[language]);

            auto audioConfig = AudioConfig::FromWavFileOutput(language + "-translation.wav");
            auto speechSynthesizer = SpeechSynthesizer::FromConfig(speechConfig, audioConfig);
            speechSynthesizer->SpeakTextAsync(translation).get();
        }
    }
}
For more information about speech synthesis, see the basics of speech synthesis.
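One detail worth noting in the manual synthesis approach: indexing a std::map with operator[] silently inserts an empty string when the key is missing, so an unmapped language would yield an empty voice name. The following sketch shows a lookup that reports the problem instead. It uses only the standard library, and the helper names voiceFor and outputFileFor are hypothetical.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper: looks up the voice for a language without modifying
// the map, and throws if no voice is configured for that language.
std::string voiceFor(const std::map<std::string, std::string>& languageToVoiceMap,
                     const std::string& language) {
    auto it = languageToVoiceMap.find(language);
    if (it == languageToVoiceMap.end()) {
        throw std::out_of_range("No voice configured for language '" + language + "'.");
    }
    return it->second;
}

// Hypothetical helper: builds the per-language output file name
// in the same "<language>-translation.wav" form that the sample uses.
std::string outputFileFor(const std::string& language) {
    return language + "-translation.wav";
}
```

Failing on an unknown language is one possible design choice. Falling back to a default voice is another, but that risks synthesizing text with a voice in the wrong language.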
Multi-lingual translation with language identification
In many scenarios, you might not know which input languages to specify. By using language identification, you can detect up to 10 possible input languages and automatically translate to your target languages.
The following example anticipates that en-US or zh-CN should be detected because they're defined in AutoDetectSourceLanguageConfig. Then, the speech is translated to de and fr as specified in the calls to AddTargetLanguage():
auto autoDetectSourceLanguageConfig = AutoDetectSourceLanguageConfig::FromLanguages({ "en-US", "zh-CN" });
auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);
For a complete code sample, see language identification.
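Since this article states a limit of 10 candidate input languages, it can be useful to check the candidate list before you pass it on. The following is a sketch under that assumption, written with the standard library only. The constant kMaxCandidateLanguages and the function validateCandidateLanguages are hypothetical names, and the service might enforce the limit differently.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Assumption taken from the text above: at most 10 candidate languages.
const std::size_t kMaxCandidateLanguages = 10;

// Hypothetical helper: throws if the candidate list is empty or too long.
void validateCandidateLanguages(const std::vector<std::string>& candidates) {
    if (candidates.empty()) {
        throw std::invalid_argument("At least one candidate language is required.");
    }
    if (candidates.size() > kMaxCandidateLanguages) {
        throw std::invalid_argument("Too many candidate languages were specified.");
    }
}
```

A call such as validateCandidateLanguages({ "en-US", "zh-CN" }) returns normally, while a list of 11 locales throws before any recognizer is created.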
Reference documentation | Package (Go) | Additional samples on GitHub
The Speech SDK for Go does not support speech translation. Please select another programming language or the Go reference and samples linked from the beginning of this article.
Reference documentation | Additional samples on GitHub
In this how-to guide, you learn how to recognize human speech and translate it to another language.
See the speech translation overview for more information about:
- Translating speech to text
- Translating speech to multiple target languages
- Performing direct speech to speech translation
Sensitive data and environment variables
The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's key and region. The Java code file contains two static final String values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these fields are at the class scope, so they're accessible within method bodies of the class:
public class App {
    static final String SPEECH__SUBSCRIPTION__KEY = System.getenv("SPEECH__SUBSCRIPTION__KEY");
    static final String SPEECH__SERVICE__REGION = System.getenv("SPEECH__SERVICE__REGION");

    public static void main(String[] args) { }
}
For more information on environment variables, see Environment variables and application configuration.
Use API keys with caution. Don't include the API key directly in your code, and never post it publicly. If you use an API key, store it securely in Azure Key Vault. For more information about using API keys securely in your apps, see API keys with Azure Key Vault.
For more information about AI services security, see Authenticate requests to Azure AI services.
Create a speech translation configuration
To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig
instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.
Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.
You can initialize a SpeechTranslationConfig
instance in a few ways:
- With a subscription: pass in a key and the associated region.
- With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
- With a host: pass in a host address. A key or authorization token is optional.
- With an authorization token: pass in an authorization token and the associated region.
Let's look at how you create a SpeechTranslationConfig
instance by using a key and region. Get the Speech resource key and region in the Azure portal.
public class App {
    static final String SPEECH__SUBSCRIPTION__KEY = System.getenv("SPEECH__SUBSCRIPTION__KEY");
    static final String SPEECH__SERVICE__REGION = System.getenv("SPEECH__SERVICE__REGION");

    public static void main(String[] args) {
        try {
            translateSpeech();
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
    static void translateSpeech() {
        SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
            SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    }
}
Change the source language
One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, interact with the SpeechTranslationConfig instance by calling the setSpeechRecognitionLanguage method:
static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    // Source (input) language
    speechTranslationConfig.setSpeechRecognitionLanguage("it-IT");
}
The setSpeechRecognitionLanguage
function expects a language-locale format string. Refer to the list of supported speech translation locales.
Add a translation language
Another common task of speech translation is to specify target translation languages. At least one is required, but multiples are supported. The following code snippet sets both French and German as translation language targets:
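static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    // Translate to languages.
    speechTranslationConfig.addTargetLanguage("fr");
    speechTranslationConfig.addTargetLanguage("de");
}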
With every call to addTargetLanguage
, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.
Initialize a translation recognizer
After you create a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the Speech service requires to validate your request.
If you're recognizing speech by using your device's default microphone, here's what TranslationRecognizer
should look like:
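static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
    }
}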
If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer. First, reference the AudioConfig object as follows:
static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }
    AudioConfig audioConfig = AudioConfig.fromDefaultMicrophoneInput();
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig)) {
    }
}
If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, when you create an AudioConfig class instance, instead of calling fromDefaultMicrophoneInput, you call fromWavFileInput and pass the file name:
static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }
    AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig)) {
    }
}
Translate speech
To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call the recognize-once function and get the result:
static void translateSpeech() throws ExecutionException, InterruptedException {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
        System.out.printf("Say something in '%s' and we'll translate...", fromLanguage);
        TranslationRecognitionResult translationRecognitionResult = translationRecognizer.recognizeOnceAsync().get();
        if (translationRecognitionResult.getReason() == ResultReason.TranslatedSpeech) {
            System.out.printf("Recognized: \"%s\"\n", translationRecognitionResult.getText());
            for (Map.Entry<String, String> pair : translationRecognitionResult.getTranslations().entrySet()) {
                System.out.printf("Translated into '%s': %s\n", pair.getKey(), pair.getValue());
            }
        }
    }
}
For more information about speech to text, see the basics of speech recognition.
Synthesize translations
After a successful speech recognition and translation, the result contains all the translations in a dictionary. The getTranslations
function returns a dictionary with the key as the target translation language and the value as the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).
Event-based synthesis
The TranslationRecognizer
object exposes a synthesizing
event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.
Specify the synthesis voice by calling setVoiceName, and provide an event handler for the synthesizing event to get the audio. The following example saves the translated audio as a .wav file.
The event-based synthesis works only with a single translation. Don't add multiple target translation languages. Additionally, the setVoiceName value should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".
static void translateSpeech() throws ExecutionException, FileNotFoundException, InterruptedException, IOException {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    String fromLanguage = "en-US";
    String toLanguage = "de";
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    speechTranslationConfig.addTargetLanguage(toLanguage);
    // The voice should match the single target translation language.
    speechTranslationConfig.setVoiceName("de-DE-Hedda");
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
        translationRecognizer.synthesizing.addEventListener((s, e) -> {
            byte[] audio = e.getResult().getAudio();
            int size = audio.length;
            System.out.println("Audio synthesized: " + size + " byte(s)" + (size == 0 ? "(COMPLETE)" : ""));
            if (size > 0) {
                try (FileOutputStream file = new FileOutputStream("translation.wav")) {
                    file.write(audio);
                } catch (IOException ex) {
                    ex.printStackTrace();
                }
            }
        });
        System.out.printf("Say something in '%s' and we'll translate...", fromLanguage);
        TranslationRecognitionResult translationRecognitionResult = translationRecognizer.recognizeOnceAsync().get();
        if (translationRecognitionResult.getReason() == ResultReason.TranslatedSpeech) {
            System.out.printf("Recognized: \"%s\"\n", translationRecognitionResult.getText());
            for (Map.Entry<String, String> pair : translationRecognitionResult.getTranslations().entrySet()) {
                String language = pair.getKey();
                String translation = pair.getValue();
                System.out.printf("Translated into '%s': %s\n", language, translation);
            }
        }
    }
}
Manual synthesis
The getTranslations function returns a dictionary that you can use to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer instance, the SpeechConfig object needs to have its voice set to the desired voice by calling setSpeechSynthesisVoiceName.
The following example translates to five languages. Each translation is then synthesized to an audio file by using the corresponding neural voice.
static void translateSpeech() throws ExecutionException, InterruptedException {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    String fromLanguage = "en-US";
    String[] toLanguages = { "de", "en", "it", "pt", "zh-Hans" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
        System.out.printf("Say something in '%s' and we'll translate...", fromLanguage);
        TranslationRecognitionResult translationRecognitionResult = translationRecognizer.recognizeOnceAsync().get();
        if (translationRecognitionResult.getReason() == ResultReason.TranslatedSpeech) {
            // Map each target language code to a neural voice for synthesis.
            Map<String, String> languageToVoiceMap = new HashMap<String, String>();
            languageToVoiceMap.put("de", "de-DE-KatjaNeural");
            languageToVoiceMap.put("en", "en-US-AriaNeural");
            languageToVoiceMap.put("it", "it-IT-ElsaNeural");
            languageToVoiceMap.put("pt", "pt-BR-FranciscaNeural");
            languageToVoiceMap.put("zh-Hans", "zh-CN-XiaoxiaoNeural");
            System.out.printf("Recognized: \"%s\"\n", translationRecognitionResult.getText());
            for (Map.Entry<String, String> pair : translationRecognitionResult.getTranslations().entrySet()) {
                String language = pair.getKey();
                String translation = pair.getValue();
                System.out.printf("Translated into '%s': %s\n", language, translation);
                SpeechConfig speechConfig =
                    SpeechConfig.fromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
                speechConfig.setSpeechSynthesisVoiceName(languageToVoiceMap.get(language));
                AudioConfig audioConfig = AudioConfig.fromWavFileOutput(language + "-translation.wav");
                try (SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(speechConfig, audioConfig)) {
                    speechSynthesizer.SpeakTextAsync(translation).get();
                }
            }
        }
    }
}
For more information about speech synthesis, see the basics of speech synthesis.
Reference documentation | Package (npm) | Additional samples on GitHub | Library source code
Create a translation configuration
To call the translation service by using the Speech SDK, you need to create a SpeechTranslationConfig
instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.
Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.
You can initialize SpeechTranslationConfig
in a few ways:
- With a subscription: pass in a key and the associated region.
- With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
- With a host: pass in a host address. A key or authorization token is optional.
- With an authorization token: pass in an authorization token and the associated region.
Let's look at how you create a SpeechTranslationConfig
instance by using a key and region. Get the Speech resource key and region in the Azure portal.
const speechTranslationConfig = SpeechTranslationConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
Initialize a translator
After you create a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the translation service requires to validate your request.
If you're translating speech provided through your device's default microphone, here's what TranslationRecognizer
should look like:
const translationRecognizer = new TranslationRecognizer(speechTranslationConfig);
If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer. Reference the AudioConfig object as follows:
const audioConfig = AudioConfig.fromDefaultMicrophoneInput();
const translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);
If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, you can do this only when you're targeting Node.js. When you create an AudioConfig class instance, instead of calling fromDefaultMicrophoneInput, you call fromWavFileInput and pass the file name:
const audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
const translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);
Translate speech
The TranslationRecognizer class for the Speech SDK for JavaScript exposes methods that you can use for speech translation:
- Single-shot translation (async): Performs translation in a nonblocking (asynchronous) mode. It translates a single utterance. It determines the end of a single utterance by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
- Continuous translation (async): Asynchronously initiates a continuous translation operation. The user registers to events and handles various application states. To stop asynchronous continuous translation, call stopContinuousRecognitionAsync.
To learn more about how to choose a speech recognition mode, see Get started with speech to text.
Specify a target language
To translate, you must specify both a source language and at least one target language.
You can choose a source language by using a locale listed in the Speech translation table. Find your options for translated language at the same link.
Your options for target languages differ when you want to view text or you want to hear synthesized translated speech. To translate from English to German, modify the translation configuration object:
speechTranslationConfig.speechRecognitionLanguage = "en-US";
speechTranslationConfig.addTargetLanguage("de");
Single-shot recognition
Here's an example of asynchronous single-shot translation via recognizeOnceAsync:
translationRecognizer.recognizeOnceAsync(result => {
    // Interact with result
});
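You need to write some code to handle the result. This sample reads the German translation from result.translations and logs any error:
translationRecognizer.recognizeOnceAsync(
    function (result) {
        let translation = result.translations.get("de");
        console.log(translation);
    },
    function (err) { console.log(err); });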
Your code can also handle updates provided while the translation is processing. You can use these updates to provide visual feedback about the translation progress. This JavaScript Node.js example shows these kinds of updates. The following code also displays details produced during the translation process:
translationRecognizer.recognizing = function (s, e) {
    var str = ("(recognizing) Reason: " + SpeechSDK.ResultReason[e.result.reason] +
        " Text: " + e.result.text +
        " Translation:");
    str += e.result.translations.get("de");
    console.log(str);
};
translationRecognizer.recognized = function (s, e) {
    var str = "\r\n(recognized) Reason: " + SpeechSDK.ResultReason[e.result.reason] +
        " Text: " + e.result.text +
        " Translation:";
    str += e.result.translations.get("de");
    str += "\r\n";
    console.log(str);
};
Continuous translation
Continuous translation is a bit more involved than single-shot recognition. It requires you to subscribe to the recognizing, recognized, and canceled events to get the recognition results. To stop translation, you must call stopContinuousRecognitionAsync.
Here's an example of how continuous translation is performed on an audio input file. Let's start by defining the input and initializing TranslationRecognizer:
const translationRecognizer = new TranslationRecognizer(speechTranslationConfig);
In the following code, you subscribe to the events sent from TranslationRecognizer:
- recognizing: Signal for events that contain intermediate translation results.
- recognized: Signal for events that contain final translation results. These results indicate a successful translation attempt.
- sessionStopped: Signal for events that indicate the end of a translation session (operation).
- canceled: Signal for events that contain canceled translation results. These events indicate a translation attempt that was canceled as a result of a direct cancellation. Alternatively, they indicate a transport or protocol failure.
translationRecognizer.recognizing = (s, e) => {
    console.log(`TRANSLATING: Text=${e.result.text}`);
};
translationRecognizer.recognized = (s, e) => {
    // A successful translation reports TranslatedSpeech.
    if (e.result.reason == ResultReason.TranslatedSpeech) {
        console.log(`TRANSLATED: Text=${e.result.text}`);
    }
    else if (e.result.reason == ResultReason.NoMatch) {
        console.log("NOMATCH: Speech could not be translated.");
    }
};
translationRecognizer.canceled = (s, e) => {
    console.log(`CANCELED: Reason=${e.reason}`);
    if (e.reason == CancellationReason.Error) {
        console.log(`CANCELED: ErrorCode=${e.errorCode}`);
        console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
        console.log("CANCELED: Did you set the speech resource key and region values?");
    }
    translationRecognizer.stopContinuousRecognitionAsync();
};
translationRecognizer.sessionStopped = (s, e) => {
    console.log("\n Session stopped event.");
    translationRecognizer.stopContinuousRecognitionAsync();
};
With everything set up, you can call startContinuousRecognitionAsync:
// Starts continuous recognition. Use stopContinuousRecognitionAsync() to stop recognition.
translationRecognizer.startContinuousRecognitionAsync();
// Something later can call the following to stop recognition:
// translationRecognizer.stopContinuousRecognitionAsync();
Choose a source language
A common task for speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, find your SpeechTranslationConfig
instance and add the following line directly below it:
speechTranslationConfig.speechRecognitionLanguage = "it-IT";
The speechRecognitionLanguage
property expects a language-locale format string. Refer to the list of supported speech translation locales.
Choose one or more target languages
The Speech SDK can translate to multiple target languages in parallel. The available target languages are somewhat different from the source language list. You specify target languages by using a language code, rather than a locale.
For a list of language codes for text targets, see the speech translation table on the language support page. You can also find details about translation to synthesized languages there.
The following code adds German as a target language:
speechTranslationConfig.addTargetLanguage("de");
Because multiple target language translations are possible, your code must specify the target language when examining the result. The following code gets translation results for German:
translationRecognizer.recognized = function (s, e) {
    var str = "\r\n(recognized) Reason: " +
        sdk.ResultReason[e.result.reason] +
        " Text: " + e.result.text + " Translations:";
    var language = "de";
    str += " [" + language + "] " + e.result.translations.get(language);
    str += "\r\n";
    // show str somewhere
};
Reference documentation | Package (download) | Additional samples on GitHub
The Speech SDK for Objective-C does support speech translation, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Objective-C reference and samples linked from the beginning of this article.
Reference documentation | Package (download) | Additional samples on GitHub
The Speech SDK for Swift does support speech translation, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Swift reference and samples linked from the beginning of this article.
Reference documentation | Package (PyPi) | Additional samples on GitHub
Sensitive data and environment variables
The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's subscription key and region. The Python code file contains two values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these variables are at the global scope, so they're accessible within the function definition of the code file:
speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']
For more information on environment variables, see Environment variables and application configuration.
Use API keys with caution. Don't include the API key directly in your code, and never post it publicly. If you use an API key, store it securely in Azure Key Vault. For more information about using API keys securely in your apps, see API keys with Azure Key Vault.
For more information about AI services security, see Authenticate requests to Azure AI services.
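Reading the two variables with os.environ[...] raises a bare KeyError when either one is unset. As a minimal sketch, you could check both values up front and report which ones are missing. The helper name read_speech_settings is hypothetical and isn't part of the Speech SDK:

```python
import os

# Environment variable names used in this article.
KEY_VAR = "SPEECH__SUBSCRIPTION__KEY"
REGION_VAR = "SPEECH__SERVICE__REGION"


def read_speech_settings(environ=None):
    """Return (key, region); raise RuntimeError naming any missing variable.

    Hypothetical helper for illustration, not part of the Speech SDK.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in (KEY_VAR, REGION_VAR) if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return environ[KEY_VAR], environ[REGION_VAR]
```

Accepting a mapping as a parameter keeps the check easy to exercise without touching the real environment.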
Create a speech translation configuration
To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig
instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.
Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.
You can initialize SpeechTranslationConfig
in a few ways:
- With a subscription: pass in a key and the associated region.
- With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
- With a host: pass in a host address. A key or authorization token is optional.
- With an authorization token: pass in an authorization token and the associated region.
Let's look at how you can create a SpeechTranslationConfig
instance by using a key and region. Get the Speech resource key and region in the Azure portal.
from_language, to_language = 'en-US', 'de'
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
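The four initialization options listed above can be sketched as a small helper that picks the keyword arguments for the constructor. This is only an illustration: the helper name is hypothetical, and the keyword names endpoint, host, and auth_token are assumed to match the Python SpeechTranslationConfig constructor (only subscription and region appear in this article), so check them against the SDK reference before relying on them.

```python
def translation_config_kwargs(key=None, region=None, endpoint=None,
                              host=None, auth_token=None):
    """Choose keyword arguments for one of the four initialization options.

    Hypothetical helper; keyword names other than subscription and region
    are assumptions about the SDK constructor.
    """
    if key and auth_token:
        raise ValueError("Pass a key or an authorization token, not both")
    if endpoint and host:
        raise ValueError("Pass an endpoint or a host, not both")
    if endpoint or host:
        # Endpoint or host: a key or authorization token is optional.
        kwargs = {"endpoint": endpoint} if endpoint else {"host": host}
        if key:
            kwargs["subscription"] = key
        if auth_token:
            kwargs["auth_token"] = auth_token
        return kwargs
    if not region:
        raise ValueError("A key or an authorization token needs a region")
    if key:
        return {"subscription": key, "region": region}
    if auth_token:
        return {"auth_token": auth_token, "region": region}
    raise ValueError("Provide a key, an endpoint, a host, or an authorization token")


# Intended use (assumes the SDK is installed and imported as speechsdk):
# speechsdk.translation.SpeechTranslationConfig(
#     **translation_config_kwargs(key=speech_key, region=service_region))
```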
Change the source language
One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, interact with the SpeechTranslationConfig instance by assigning a value to its speech_recognition_language property:
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    # Source (input) language
    from_language = "it-IT"
    translation_config.speech_recognition_language = from_language
The speech_recognition_language
property expects a language-locale format string. Refer to the list of supported speech translation locales.
Add a translation language
Another common task of speech translation is to specify target translation languages. At least one is required, but multiples are supported. The following code snippet sets both French and German as translation language targets:
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = "it-IT"
    # Translate to languages.
    translation_config.add_target_language("fr")
    translation_config.add_target_language("de")
With every call to add_target_language
, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.
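Because at least one target language is required, it can help to clean the list before you add the languages to the configuration. The following is a minimal sketch; normalize_targets is a hypothetical helper, not an SDK function:

```python
def normalize_targets(languages):
    """Drop blanks and duplicates from target codes, keeping first-seen order.

    Hypothetical helper for illustration, not part of the Speech SDK.
    """
    cleaned = []
    for code in languages:
        code = code.strip()
        if code and code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("At least one target translation language is required")
    return cleaned


targets = normalize_targets(["fr", "de", "fr", " "])
print(targets)  # ['fr', 'de']
```

Each entry of the returned list would then be passed to add_target_language in turn.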
Initialize a translation recognizer
After you create a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your translation_config instance. The configuration object provides the credentials that the Speech service requires to validate your request.
If you're recognizing speech by using your device's default microphone, here's what TranslationRecognizer
should look like:
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)
If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audio_config parameter when initializing TranslationRecognizer. First, reference the AudioConfig object as follows:
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    for lang in to_languages:
        translation_config.add_target_language(lang)
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)
If you want to provide an audio file instead of using a microphone, you still need to provide an audio_config parameter. However, when you create an AudioConfig class instance, instead of passing use_default_microphone=True, you pass filename="path-to-file.wav" to provide the file name:
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    for lang in to_languages:
        translation_config.add_target_language(lang)
    audio_config = speechsdk.audio.AudioConfig(filename="path-to-file.wav")
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)
Translate speech
To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call the recognize-once function and get the result:
import os
import azure.cognitiveservices.speech as speechsdk
speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']
from_language, to_language = 'en-US', 'de'
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)
    print('Say something...')
    translation_recognition_result = translation_recognizer.recognize_once()
    print(get_result_text(reason=translation_recognition_result.reason,
        translation_recognition_result=translation_recognition_result))
def get_result_text(reason, translation_recognition_result):
    reason_format = {
        # .get avoids a KeyError when the result holds no translation.
        speechsdk.ResultReason.TranslatedSpeech:
            f'RECOGNIZED "{from_language}": {translation_recognition_result.text}\n' +
            f'TRANSLATED into "{to_language}": {translation_recognition_result.translations.get(to_language)}',
        speechsdk.ResultReason.RecognizedSpeech: f'Recognized: "{translation_recognition_result.text}"',
        speechsdk.ResultReason.NoMatch: f'No speech could be recognized: {translation_recognition_result.no_match_details}',
        speechsdk.ResultReason.Canceled: f'Speech Recognition canceled: {translation_recognition_result.cancellation_details}'
    }
    return reason_format.get(reason, 'Unable to recognize speech')
translate_speech_to_text()
For more information about speech to text, see the basics of speech recognition.
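A dictionary of f-strings, as in get_result_text, formats every message when the dictionary is built, including the translation lookup, even when recognition failed. One alternative is a dictionary of callables so that only the message for the actual reason is formatted. The sketch below uses a stand-in Reason enum and SimpleNamespace results so that it runs without the SDK; with the SDK you would use speechsdk.ResultReason and the real result object instead. The name describe_result is hypothetical.

```python
from enum import Enum, auto
from types import SimpleNamespace


class Reason(Enum):
    """Stand-in for speechsdk.ResultReason so the sketch runs without the SDK."""
    TranslatedSpeech = auto()
    RecognizedSpeech = auto()
    NoMatch = auto()
    Canceled = auto()


def describe_result(result, from_language, to_language):
    """Format only the message that matches result.reason."""
    formatters = {
        Reason.TranslatedSpeech: lambda r: (
            f'RECOGNIZED "{from_language}": {r.text}\n'
            f'TRANSLATED into "{to_language}": {r.translations[to_language]}'),
        Reason.RecognizedSpeech: lambda r: f'Recognized: "{r.text}"',
        Reason.NoMatch: lambda r: f'No speech could be recognized: {r.no_match_details}',
        Reason.Canceled: lambda r: f'Speech Recognition canceled: {r.cancellation_details}',
    }
    formatter = formatters.get(result.reason)
    return formatter(result) if formatter else 'Unable to recognize speech'


# Stand-in result with no translations: the translation lookup is never evaluated.
no_match = SimpleNamespace(reason=Reason.NoMatch, text='', translations={},
                           no_match_details='silence')
print(describe_result(no_match, 'en-US', 'de'))
```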
Synthesize translations
After a successful speech recognition and translation, the result contains all the translations in a dictionary. The translations
dictionary key is the target translation language, and the value is the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).
Event-based synthesis
The TranslationRecognizer object exposes a synthesizing event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.
Specify the synthesis voice by assigning the voice_name property, and provide an event handler for the synthesizing event to get the audio. The following example saves the translated audio as a .wav file.
The event-based synthesis works only with a single translation. Don't add multiple target translation languages. Additionally, the voice_name value should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".
import os
import azure.cognitiveservices.speech as speechsdk
speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']
from_language, to_language = 'en-US', 'de'
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)
    # The voice should match the single target translation language.
    translation_config.voice_name = "de-DE-Hedda"
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)
    def synthesis_callback(evt):
        size = len(evt.result.audio)
        print(f'Audio synthesized: {size} byte(s) {"(COMPLETED)" if size == 0 else ""}')
        if size > 0:
            with open('translation.wav', 'wb+') as file:
                file.write(evt.result.audio)
    translation_recognizer.synthesizing.connect(synthesis_callback)
    print(f'Say something in "{from_language}" and we\'ll translate into "{to_language}".')
    translation_recognition_result = translation_recognizer.recognize_once()
    print(get_result_text(reason=translation_recognition_result.reason,
        translation_recognition_result=translation_recognition_result))
def get_result_text(reason, translation_recognition_result):
    reason_format = {
        # .get avoids a KeyError when the result holds no translation.
        speechsdk.ResultReason.TranslatedSpeech:
            f'Recognized "{from_language}": {translation_recognition_result.text}\n' +
            f'Translated into "{to_language}": {translation_recognition_result.translations.get(to_language)}',
        speechsdk.ResultReason.RecognizedSpeech: f'Recognized: "{translation_recognition_result.text}"',
        speechsdk.ResultReason.NoMatch: f'No speech could be recognized: {translation_recognition_result.no_match_details}',
        speechsdk.ResultReason.Canceled: f'Speech Recognition canceled: {translation_recognition_result.cancellation_details}'
    }
    return reason_format.get(reason, 'Unable to recognize speech')
translate_speech_to_text()
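Because the synthesizing event can fire several times, a handler that reopens the output file on every event keeps only the audio from the last non-empty event. If the audio for one utterance does arrive in more than one piece, you could buffer the pieces and write the file once when the zero-length completion event arrives. The sketch below assumes that consecutive events carry consecutive pieces of the same audio stream, which you should verify against the SDK behavior. AudioChunkCollector is a hypothetical class, not part of the SDK.

```python
class AudioChunkCollector:
    """Buffers audio chunks and writes one file when synthesis completes.

    Hypothetical helper; assumes chunks concatenate into the complete audio.
    """

    def __init__(self, path):
        self.path = path
        self.chunks = []
        self.completed = False

    def on_audio(self, audio):
        """Call from the synthesizing handler with the event's audio bytes."""
        if len(audio) > 0:
            self.chunks.append(bytes(audio))
            return
        # A zero-length chunk marks completion: write everything once.
        with open(self.path, "wb") as file:
            file.write(b"".join(self.chunks))
        self.completed = True
```

In the handler from the example, you would call collector.on_audio(evt.result.audio) instead of writing the file directly.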
Manual synthesis
You can use the translations dictionary to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer instance, the SpeechConfig object needs to have its speech_synthesis_voice_name property set to the desired voice.
The following example translates to five languages. Each translation is then synthesized to an audio file by using the corresponding neural voice.
import os
import azure.cognitiveservices.speech as speechsdk
speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']
from_language, to_languages = 'en-US', [ 'de', 'en', 'it', 'pt', 'zh-Hans' ]
def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    for lang in to_languages:
        translation_config.add_target_language(lang)
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)
    print('Say something...')
    translation_recognition_result = translation_recognizer.recognize_once()
    synthesize_translations(translation_recognition_result=translation_recognition_result)
def synthesize_translations(translation_recognition_result):
    language_to_voice_map = {
        "de": "de-DE-KatjaNeural",
        "en": "en-US-AriaNeural",
        "it": "it-IT-ElsaNeural",
        "pt": "pt-BR-FranciscaNeural",
        "zh-Hans": "zh-CN-XiaoxiaoNeural"
    }
    print(f'Recognized: "{translation_recognition_result.text}"')
    for language in translation_recognition_result.translations:
        translation = translation_recognition_result.translations[language]
        print(f'Translated into "{language}": {translation}')
        speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
        speech_config.speech_synthesis_voice_name = language_to_voice_map.get(language)
        audio_config = speechsdk.audio.AudioOutputConfig(filename=f'{language}-translation.wav')
        speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
        speech_synthesizer.speak_text_async(translation).get()
translate_speech_to_text()
For more information about speech synthesis, see the basics of speech synthesis.
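The lookup language_to_voice_map.get(language) returns None for a language that has no entry in the map. A minimal sketch of separating the languages that can be synthesized from those that can't, before any synthesizer is created, might look like the following. The helper plan_synthesis is hypothetical, and the file naming follows the example above.

```python
def plan_synthesis(translations, voice_map):
    """Split translations into synthesis jobs and languages without a voice.

    Returns (jobs, skipped): jobs holds (language, voice, filename, text)
    tuples and skipped holds language codes missing from voice_map.
    Hypothetical helper for illustration, not part of the Speech SDK.
    """
    jobs, skipped = [], []
    for language, text in translations.items():
        voice = voice_map.get(language)
        if voice is None:
            skipped.append(language)
            continue
        jobs.append((language, voice, f"{language}-translation.wav", text))
    return jobs, skipped


jobs, skipped = plan_synthesis(
    {"de": "Hallo", "fr": "Bonjour"},
    {"de": "de-DE-KatjaNeural"},
)
print(jobs)     # [('de', 'de-DE-KatjaNeural', 'de-translation.wav', 'Hallo')]
print(skipped)  # ['fr']
```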
Multi-lingual translation with language identification
In many scenarios, you might not know which input languages to specify. By using language identification, you can detect up to 10 possible input languages and automatically translate to your target languages.
For a complete code sample, see language identification.
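Before you pass candidate languages to language identification, you might check the list against the limit of 10 mentioned above. This is a small sketch with a hypothetical helper name; it doesn't call the SDK.

```python
MAX_CANDIDATE_LANGUAGES = 10  # limit stated in this article


def check_candidate_languages(locales):
    """Return the unique candidate locales in order, or raise if out of range.

    Hypothetical helper for illustration, not part of the Speech SDK.
    """
    unique = list(dict.fromkeys(locales))  # drop duplicates, keep order
    if not unique:
        raise ValueError("Provide at least one candidate input language")
    if len(unique) > MAX_CANDIDATE_LANGUAGES:
        raise ValueError(
            f"Too many candidate languages: {len(unique)} > {MAX_CANDIDATE_LANGUAGES}")
    return unique
```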
Speech to text REST API reference | Speech to text REST API for short audio reference | Additional samples on GitHub
You can use the REST API for speech translation, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts.
In this how-to guide, you learn how to recognize human speech and translate it to another language.
See the speech translation overview for more information about:
- Translating speech to text
- Translating speech to multiple target languages
- Performing direct speech to speech translation
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Download and install
Follow these steps and see the Speech CLI quickstart for other requirements for your platform.
Run the following .NET CLI command to install the Speech CLI:
dotnet tool install --global Microsoft.CognitiveServices.Speech.CLI
Run the following commands to configure your Speech resource key and region. Replace SUBSCRIPTION-KEY with your Speech resource key and replace REGION with your Speech resource region.
spx config @key --set SUBSCRIPTION-KEY
spx config @region --set REGION
Set source and target languages
This command calls the Speech CLI to translate speech from the microphone from Italian to French:
spx translate --microphone --source it-IT --target fr
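If you drive the Speech CLI from a script, you could assemble the same command as an argument list. The sketch below only builds and prints the command; it uses the flags shown above and nothing else. The helper name spx_translate_command is hypothetical.

```python
import shlex


def spx_translate_command(source, target, use_microphone=True):
    """Build the argument list for `spx translate` with one source and one target."""
    args = ["spx", "translate"]
    if use_microphone:
        args.append("--microphone")
    args += ["--source", source, "--target", target]
    return args


command = spx_translate_command("it-IT", "fr")
print(shlex.join(command))  # spx translate --microphone --source it-IT --target fr
```

To run it, you could pass the list to subprocess.run(command, check=True), assuming spx is installed and on your PATH.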