What are high definition voices? (Preview)

Article
10/23/2024

Note

This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Azure AI Speech continues to advance text to speech technology with the introduction of neural text to speech high definition (HD) voices. HD voices can understand the content, automatically detect emotions in the input text, and adjust the speaking tone in real time to match the sentiment. HD voices keep the same voice persona as their neural (non-HD) counterparts, and deliver even more value through enhanced features.

Key features of neural text to speech HD voices

The following are the key features of Azure AI Speech HD voices:

Key features	Description
Human-like speech generation	Neural text to speech HD voices can generate highly natural and human-like speech. The model is trained on millions of hours of multilingual data, enabling it to accurately interpret input text and generate speech with the appropriate emotion, pace, and rhythm without manual adjustments.
Conversational	Neural text to speech HD voices can replicate natural speech patterns, including spontaneous pauses and emphasis. When given conversational text, the model can reproduce common conversational features such as pauses and filler words. The generated voice sounds as if someone is conversing directly with you.
Prosody variations	Neural text to speech HD voices introduce slight variations in each output to enhance realism. These variations make the speech sound more natural, as human voices naturally exhibit variation.
High fidelity	The primary objective of neural text to speech HD voices is to generate high-fidelity audio. The synthetic speech produced by our system can closely mimic human speech in both quality and naturalness.
Version control	With neural text to speech HD voices, we release different versions of the same voice, each with a unique base model size and recipe. This offers you the opportunity to experience new voice variations or continue using a specific version of a voice.

Comparison of Azure AI Speech HD voices to other Azure text to speech voices

How do Azure AI Speech HD voices compare to other Azure text to speech voices? How do they differ in terms of features and capabilities?

Here's a comparison of features between Azure AI Speech HD voices, Azure OpenAI HD voices, and Azure AI Speech voices:

Feature	Azure AI Speech HD voices	Azure OpenAI HD voices	Azure AI Speech voices (not HD)
Region	East US, Southeast Asia, West Europe	North Central US, Sweden Central	Available in dozens of regions. See the region list.
Number of voices	13	6	More than 500
Multilingual	No (performs on the primary language only)	Yes	Yes (applicable only to multilingual voices)
SSML support	Support for a subset of SSML elements.	Support for a subset of SSML elements.	Support for the full set of SSML in Azure AI Speech.
Development options	Speech SDK, Speech CLI, REST API	Speech SDK, Speech CLI, REST API	Speech SDK, Speech CLI, REST API
Deployment options	Cloud only	Cloud only	Cloud, embedded, hybrid, and containers.
Real-time or batch synthesis	Real-time only	Real-time and batch synthesis	Real-time and batch synthesis
Latency	Less than 300 ms	Greater than 500 ms	Less than 300 ms
Sample rate of synthesized audio	8, 16, 24, and 48 kHz	8, 16, 24, and 48 kHz	8, 16, 24, and 48 kHz
Speech output audio format	opus, mp3, pcm, truesilk	opus, mp3, pcm, truesilk	opus, mp3, pcm, truesilk

Supported Azure AI Speech HD voices

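Azure AI Speech HD voice values follow the format voicename:basemodel:version. In the values listed in this article, the base model and version appear together after a single colon, as in en-US-Ava:DragonHDLatestNeural. The name before the colon, such as en-US-Ava, is the voice persona name and its original locale. Updates to the base model are tracked as versions.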

Currently, DragonHD is the only base model available for Azure AI Speech HD voices. To ensure that you're using the latest version of the base model that we provide without having to make a code change, use the LatestNeural version.

For example, for the persona en-US-Ava you can specify the following HD voice values:

en-US-Ava:DragonHDLatestNeural: Always uses the latest version of the base model that we provide.
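The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper names build_hd_voice and split_hd_voice are hypothetical and not part of the Speech SDK, and the sketch assumes that the base model and version are concatenated after a single colon, as in the values shown in this article.

```python
def build_hd_voice(persona: str, base_model: str = "DragonHD",
                   version: str = "LatestNeural") -> str:
    """Compose an HD voice value such as 'en-US-Ava:DragonHDLatestNeural'.

    Hypothetical helper: assumes the base model and version are
    concatenated after one colon, as in the values listed in this article.
    """
    return f"{persona}:{base_model}{version}"


def split_hd_voice(value: str) -> dict:
    """Split an HD voice value into persona, original locale, and model part."""
    persona, _, model = value.partition(":")
    parts = persona.split("-", 2)  # for example ['en', 'US', 'Ava']
    if len(parts) != 3 or not model:
        raise ValueError(f"not an HD voice value: {value!r}")
    language, region, _name = parts
    return {"persona": persona, "locale": f"{language}-{region}", "model": model}
```

For example, split_hd_voice("de-DE-Seraphina:DragonHDLatestNeural") reports de-DE as the original locale of the persona.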

The following table lists the Azure AI Speech HD voices that are currently available.

Neural voice persona	HD voices
de-DE-Seraphina	de-DE-Seraphina:DragonHDLatestNeural
en-US-Andrew	en-US-Andrew:DragonHDLatestNeural
en-US-Andrew2	en-US-Andrew2:DragonHDLatestNeural
en-US-Aria	en-US-Aria:DragonHDLatestNeural
en-US-Ava	en-US-Ava:DragonHDLatestNeural
en-US-Brian	en-US-Brian:DragonHDLatestNeural
en-US-Davis	en-US-Davis:DragonHDLatestNeural
en-US-Emma	en-US-Emma:DragonHDLatestNeural
en-US-Emma2	en-US-Emma2:DragonHDLatestNeural
en-US-Jenny	en-US-Jenny:DragonHDLatestNeural
en-US-Steffan	en-US-Steffan:DragonHDLatestNeural
ja-JP-Masaru	ja-JP-Masaru:DragonHDLatestNeural
zh-CN-Xiaochen	zh-CN-Xiaochen:DragonHDLatestNeural

How to use Azure AI Speech HD voices

You can use HD voices with the same Speech SDK and REST APIs as the non-HD voices.
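As a rough sketch of the REST route, the following Python builds (but doesn't send) a text to speech request with the standard library only. The function name build_tts_request and the HD_REGIONS set are illustrative assumptions; the region identifiers correspond to the East US, Southeast Asia, and West Europe regions listed in the comparison table, and the key is a placeholder for your own Speech resource key.

```python
import urllib.request

# Assumption: region identifiers for the regions this article lists for HD voices.
HD_REGIONS = {"eastus", "southeastasia", "westeurope"}


def build_tts_request(ssml: str, region: str, key: str,
                      output_format: str = "audio-24khz-48kbitrate-mono-mp3"):
    """Build a POST request for the text to speech REST endpoint.

    Hypothetical helper: it only constructs the request object, so you can
    inspect it before sending it with urllib.request.urlopen.
    """
    if region not in HD_REGIONS:
        raise ValueError(f"HD voices aren't listed for region {region!r}")
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,        # your Speech resource key
        "Content-Type": "application/ssml+xml",  # the request body is SSML
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "hd-voice-sample",         # any short application name
    }
    return urllib.request.Request(url, data=ssml.encode("utf-8"),
                                  headers=headers, method="POST")
```

To synthesize audio, you would pass the returned request to urllib.request.urlopen and write the response body to a file.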

Here are some key points to consider when using Azure AI Speech HD voices:

Voice locale: The locale in the voice name indicates its original language and region.
Base models:
- HD voices come with a base model that understands the input text and predicts the speaking pattern accordingly. You can specify the desired model (such as DragonHDLatestNeural) according to the availability of each voice.
SSML usage: To reference a voice in SSML, use the format voicename:basemodel:version, for example de-DE-Seraphina:DragonHDLatestNeural. The name before the colon, such as de-DE-Seraphina, is the voice persona name and its original locale. Updates to the base model are tracked as versions.
Temperature parameter:
- The temperature value is a float ranging from 0 to 1 that controls the randomness, and therefore the variation, of the output. The default temperature is 1.0.
- A lower temperature yields less randomness and more stable, predictable output. A higher temperature yields more diverse output at the cost of consistency.

Here's an example of how to use Azure AI Speech HD voices in SSML:

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
<voice name='en-US-Ava:DragonHDLatestNeural' parameters='temperature=0.8'>Here is a test</voice>
</speak>
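If you generate SSML like this from code, a small helper can enforce the temperature range described above. The following Python is a minimal sketch; build_hd_ssml is a hypothetical name, and the function simply reproduces the structure of the preceding SSML example.

```python
from xml.sax.saxutils import escape


def build_hd_ssml(text: str, voice: str = "en-US-Ava:DragonHDLatestNeural",
                  temperature=None, lang: str = "en-US") -> str:
    """Return an SSML document for an HD voice, mirroring the example above.

    Hypothetical helper: the temperature is validated against the 0 to 1
    range, and the parameters attribute is omitted when it isn't given.
    """
    params = ""
    if temperature is not None:
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("temperature must be between 0 and 1")
        params = f" parameters='temperature={temperature}'"
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='https://www.w3.org/2001/mstts' "
        f"xml:lang='{lang}'>"
        # escape() protects characters such as & and < in the input text
        f"<voice name='{voice}'{params}>{escape(text)}</voice>"
        "</speak>"
    )
```

Calling build_hd_ssml("Here is a test", temperature=0.8) produces a document equivalent to the example, while a value such as 1.5 raises a ValueError before any request is made.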

Supported and unsupported SSML elements for Azure AI Speech HD voices

Speech Synthesis Markup Language (SSML), together with the input text, determines the structure, content, and other characteristics of the text to speech output. For example, you can use SSML to define a paragraph, a sentence, a break or a pause, or silence. You can wrap text with event tags such as bookmark or viseme that your application processes later.

The Azure AI Speech HD voices don't support all SSML elements or events that other Azure AI Speech voices support. Of particular note, Azure AI Speech HD voices don't support word boundary events.

For detailed information on the supported and unsupported SSML elements for Azure AI Speech HD voices, refer to the following table. For instructions on how to use SSML elements, refer to the Speech Synthesis Markup Language (SSML) documentation.

SSML element	Description	Supported in Azure AI Speech HD voices
`<voice>`	Specifies the voice and optional effects (`eq_car` and `eq_telecomhp8k`).	Yes
`<mstts:express-as>`	Specifies speaking styles and roles.	No
`<mstts:ttsembedding>`	Specifies the `speakerProfileId` property for a personal voice.	No
`<lang xml:lang>`	Specifies the speaking language.	Yes
`<prosody>`	Adjusts pitch, contour, range, rate, and volume.	No
`<emphasis>`	Adds or removes word-level stress for the text.	No
`<audio>`	Embeds prerecorded audio into an SSML document.	No
`<mstts:audioduration>`	Specifies the duration of the output audio.	No
`<mstts:backgroundaudio>`	Adds background audio to your SSML documents or mixes an audio file with text to speech.	No
`<phoneme>`	Specifies phonetic pronunciation in SSML documents.	No
`<lexicon>`	Defines how multiple entities are read in SSML.	Yes (only supports alias)
`<say-as>`	Indicates the content type, such as number or date, of the element's text.	Yes
`<sub>`	Indicates that the alias attribute's text value should be pronounced instead of the element's enclosed text.	Yes
`<math>`	Uses the MathML as input text to properly pronounce mathematical notations in the output audio.	No
`<bookmark>`	Gets the offset of each marker in the audio stream.	No
`<break>`	Overrides the default behavior of breaks or pauses between words.	No
`<mstts:silence>`	Inserts pause before or after text, or between two adjacent sentences.	No
`<mstts:viseme>`	Defines the position of the face and mouth while a person is speaking.	No
`<p>`	Denotes paragraphs in SSML documents.	Yes
`<s>`	Denotes sentences in SSML documents.	Yes
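Before sending existing SSML to an HD voice, it can help to check it against the table above. The following Python is a minimal sketch with a hypothetical function name; the UNSUPPORTED set is copied from the rows marked "No", and namespace prefixes such as mstts: are dropped so that only local element names are compared.

```python
import xml.etree.ElementTree as ET

# Local names of the elements marked "No" in the table above.
UNSUPPORTED = {
    "express-as", "ttsembedding", "prosody", "emphasis", "audio",
    "audioduration", "backgroundaudio", "phoneme", "math", "bookmark",
    "break", "silence", "viseme",
}


def find_unsupported_elements(ssml: str) -> list:
    """Return the sorted local names of elements that HD voices don't support."""
    root = ET.fromstring(ssml)
    found = set()
    for element in root.iter():
        # ElementTree reports namespaced tags as '{namespace}name'.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in UNSUPPORTED:
            found.add(local_name)
    return sorted(found)
```

An empty result means that the document uses none of the elements listed as unsupported. It doesn't check attribute-level limits, such as the alias-only support for lexicon.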

Note

Although a previous section in this guide also compared Azure AI Speech HD voices to Azure OpenAI HD voices, the SSML elements supported by Azure AI Speech aren't applicable to Azure OpenAI voices.
