Note
Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.
Microsoft Speech Platform
Speech Synthesis API Overview
This page gives an overview of the interfaces for speech synthesis (text-to-speech, or TTS) in the Microsoft Speech Platform and links to additional topics and examples.
Manage text-to-speech
Applications control text-to-speech (TTS) through the ISpVoice Component Object Model (COM) interface. Once an application has created an ISpVoice object, it only needs to call ISpVoice::Speak to generate speech output from text data. The ISpVoice interface also provides methods for changing voice and synthesis properties, such as the speaking rate (ISpVoice::SetRate), the output volume (ISpVoice::SetVolume), and the current speaking voice (ISpVoice::SetVoice).
The ISpVoice::Speak method can operate either synchronously (returning only when it has completely finished speaking) or asynchronously (returning immediately and speaking as a background process). When speaking asynchronously (SPF_ASYNC), you can poll for real-time status information, such as the speaking state and the current text location, using ISpVoice::GetStatus. Also while speaking asynchronously, you can generate new speech output either by immediately interrupting the current output (SPF_PURGEBEFORESPEAK) or by automatically appending new text to the end of the current output.
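The difference between appending and purging during asynchronous speech can be illustrated with a small model. The sketch below is not the ISpVoice API: ToyVoiceQueue and speak_async are hypothetical names used only to show the queueing behavior described above, and the sketch assumes that purging discards everything queued, including the utterance in progress.

```cpp
#include <deque>
#include <string>

// Hypothetical stand-in for an asynchronous voice. This is NOT ISpVoice;
// it only models how queued output reacts to "append" versus "purge".
class ToyVoiceQueue {
public:
    // purge_before_speak == false: the text is appended after anything
    // already queued (the default asynchronous behavior).
    // purge_before_speak == true: everything queued so far is discarded
    // before the new text is queued (the SPF_PURGEBEFORESPEAK behavior).
    void speak_async(const std::string& text, bool purge_before_speak = false) {
        if (purge_before_speak) {
            pending_.clear();
        }
        pending_.push_back(text);
    }

    // Utterances that are speaking or still waiting, oldest first.
    const std::deque<std::string>& pending() const { return pending_; }

private:
    std::deque<std::string> pending_;
};
```

Queuing "one" and then "two" without purging leaves both utterances pending in that order. Queuing "three" with purge_before_speak set to true leaves only "three".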
In addition to the ISpVoice interface, the Speech Platform API also provides many utility COM interfaces for more advanced TTS applications.
Customize TTS using SSML
Using the Speech Platform API, you can also embed XML that conforms to the Speech Synthesis Markup Language (SSML) Version 1.0 in the input text to change synthesis properties such as voice, pitch, speaking rate, and volume in real time. SSML markup is a simple but powerful way to customize TTS output, independent of the specific engine or voice currently in use. See Use SSML to Create Prompts and Control TTS.
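As a rough illustration of such markup, the following sketch builds a small SSML 1.0 document that wraps text in a prosody element. The helper names xml_escape and make_ssml are hypothetical and are not part of the Speech Platform API. The namespace URI and the rate and volume attributes come from the SSML 1.0 specification, and the "en-US" default is an assumption.

```cpp
#include <string>

// Escape characters that cannot appear verbatim in XML text or in
// double-quoted attribute values.
std::string xml_escape(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;        break;
        }
    }
    return out;
}

// Wrap plain text in an SSML 1.0 <speak> document whose <prosody> element
// sets the speaking rate and volume (for example "slow" and "loud").
std::string make_ssml(const std::string& text,
                      const std::string& rate,
                      const std::string& volume,
                      const std::string& lang = "en-US") {
    std::string out;
    out += "<speak version=\"1.0\"";
    out += " xmlns=\"http://www.w3.org/2001/10/synthesis\"";
    out += " xml:lang=\"" + xml_escape(lang) + "\">";
    out += "<prosody rate=\"" + xml_escape(rate) + "\"";
    out += " volume=\"" + xml_escape(volume) + "\">";
    out += xml_escape(text);
    out += "</prosody></speak>";
    return out;
}
```

An application would typically convert the resulting string to a wide string before handing it to ISpVoice::Speak. How closely a given voice follows the requested rate and volume depends on the engine.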
Register for event notifications
The Speech Platform API communicates with applications by sending events through standard callback mechanisms (window message, callback procedure, or Win32 event). TTS events are typically used to synchronize an application with the speech output as real-time actions occur, such as word boundaries, phoneme or viseme (mouth animation) boundaries, or custom bookmarks. Applications can initialize and handle these events using ISpNotifySource, ISpNotifySink, ISpNotifyTranslator, ISpEventSink, ISpEventSource, and ISpNotifyCallback. Also see Use TTS Events for an example and a list of TTS events.
Customize pronunciations
You can provide custom word pronunciations for speech synthesis engines to use by authoring a custom application lexicon. When your application loads a prompt that contains a link to a lexicon, the TTS engine will use the pronunciations specified in the custom application lexicon instead of those in its internal lexicon. See Create Custom Pronunciations with Lexicons.
You can also specify custom pronunciations inline in SSML-format prompts. See Guide the pronunciation of specific words.
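For the inline case, SSML 1.0 defines a phoneme element that carries the pronunciation in its ph attribute. The sketch below only assembles that element as a string. phoneme_element is a hypothetical helper, "ipa" is the alphabet name defined by the SSML 1.0 specification, and which alphabets a particular engine accepts is an assumption to verify against the engine's documentation.

```cpp
#include <string>

// Build an SSML 1.0 <phoneme> element for a single word.
// Assumption: word, ph, and alphabet contain no XML special characters
// (&, <, >, "), so no escaping is performed here.
std::string phoneme_element(const std::string& word,
                            const std::string& ph,
                            const std::string& alphabet = "ipa") {
    return "<phoneme alphabet=\"" + alphabet + "\" ph=\"" + ph + "\">"
           + word + "</phoneme>";
}
```

The returned element is meant to be placed inside a speak element, in the position where the word would otherwise appear as plain text.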
Manage resources for TTS
You can find and select speech data, such as voice files and pronunciation lexicons, using the following COM interfaces in the Speech Platform API: ISpDataKey, ISpRegDataKey, ISpObjectTokenInit, ISpObjectTokenCategory, ISpObjectToken, IEnumSpObjectTokens, ISpObjectWithToken, and ISpResourceManager.
Manage audio output
Finally, there are interfaces for directing audio output to a special destination, such as telephony or custom hardware: ISpAudio, ISpMMSysAudio, ISpStream, ISpStreamFormat, and ISpStreamFormatConverter.