Prerecorded Prompts
Text-to-speech (TTS) uses specialized software to convert text strings into synthesized speech. There are currently two different methods for doing this conversion. One, concatenative TTS, splices together tiny subsyllabic units of prerecorded speech, such as phonemes; the other, formant TTS, generates speech entirely synthetically from an acoustic model.
Though recent breakthroughs in TTS technology have improved the prosody and overall sound of TTS systems, their output is not yet indistinguishable from real speech. Because the software does not understand the content of the text, the rise and fall of the synthetic voice often sounds artificial.
Still, TTS is a necessary component in many systems that handle changing data. When a voice actor provides basic prompt coverage and TTS reads all other unpredictable content, the two forms of speech will inevitably occur in proximity. Speech Prompt Editor in the Speech Application SDK can determine which prompts in a system have not been recorded, reducing the likelihood that unrecorded prompts will slip through unnoticed.
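Speech Prompt Editor performs this audit within the SDK. Purely as an illustration of the idea, the following Python sketch checks a hypothetical prompt inventory against a directory of recordings; the prompt ids, transcripts, and one-wav-file-per-prompt naming convention are all invented for this example.

    from pathlib import Path

    # Hypothetical prompt inventory (prompt id -> transcript). A real
    # system would read this from the application's prompt database.
    PROMPTS = {
        "greeting": "Thanks for calling. How can I help you?",
        "name_intro": "I found three people at Microsoft with that name.",
        "goodbye": "Goodbye.",
    }

    def audit_prompt_coverage(recordings_dir):
        """Return ids of prompts that have no recorded audio file.

        Assumes one .wav file per prompt, named <prompt_id>.wav, a
        naming convention invented for this sketch.
        """
        directory = Path(recordings_dir)
        if not directory.is_dir():
            return sorted(PROMPTS)  # no recordings at all
        recorded = {p.stem for p in directory.glob("*.wav")}
        return sorted(set(PROMPTS) - recorded)

    for prompt_id in audit_prompt_coverage("recordings"):
        # These prompts would fall back to TTS at run time.
        print("unrecorded prompt:", prompt_id, "->", PROMPTS[prompt_id])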
What Happens at the Boundaries
No matter how closely the timbre of the voices matches, the boundary between prerecorded human speech and TTS is always noticeable. We are all experts at identifying even the mildest flaws in synthetic human speech, and a near match only draws attention to those flaws. To minimize these awkward boundaries between real and synthetic speech, it is best to make the contrast deliberate by using a TTS voice of a different gender than the voice talent.
If you are planning to run multiple applications on a single language configuration, keep in mind that you cannot associate a specific voice with an application. Speech engines preload all of their resources, including voice prompts, and when a call arrives, the application requests the first available engine from the Speech Application Deployment Service (SADS). Therefore, when selecting voice talent, choose either a male or a female voice, but not both.
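To see why, consider a toy model of first-available allocation. This is not the SADS interface, just a Python sketch of the behavior described above: if the engine pool mixed voices, consecutive calls to the same application could be answered in different voices.

    import random

    # Toy model of first-available-engine allocation (an illustration,
    # not the actual SADS API). Each engine preloads exactly one voice;
    # which engine is free first depends on call timing, modeled here
    # as a random pick from a mixed-voice pool.
    ENGINE_VOICES = ["female_voice", "male_voice"]

    def first_available_engine():
        return random.choice(ENGINE_VOICES)

    for call_number in range(4):
        print("call", call_number, "answered with", first_available_engine())

With a single-gender pool, every call sounds the same no matter which engine answers.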
Strategies for Mixing TTS and Audio
While specifically labeling TTS as the voice of the computer may be inappropriately pedantic in many cases, keeping the three-way interaction triangle in mind can be a useful approach. See the section headed Interaction Triangle in the topic Designing Persona for Speech Systems.
Voice of the computer example:
ACTOR: The computer is showing three people with that last name. Here's the first one...
TTS: Frank Miller
ACTOR: and the next
TTS: Eric Miller
ACTOR: And finally...
TTS: Geoff Miller
Most often, it is sufficient to forgo labeling the computer explicitly and simply place the TTS phrase outside any spoken text.
ACTOR: I found three people at Microsoft with that name.
TTS: Frank Miller
Avoid embedding TTS inside a real actor's phrase.
ACTOR: There are three people with the name
TTS: Frank Miller
ACTOR: here at Microsoft.
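The distinction between these two patterns is easy to enforce where prompt sequences are assembled. In the Python sketch below, a prompt is a list of segments; play_wav and speak_tts are hypothetical stand-ins for whatever audio and TTS calls the platform actually provides.

    # Recommended: TTS sits between complete recorded phrases.
    good_sequence = [
        ("recorded", "I found three people at Microsoft with that name."),
        ("tts", "Frank Miller"),
    ]

    # Avoid: TTS spliced into the middle of a single recorded sentence.
    bad_sequence = [
        ("recorded", "There are three people with the name"),
        ("tts", "Frank Miller"),
        ("recorded", "here at Microsoft."),
    ]

    def play(sequence):
        for kind, text in sequence:
            if kind == "recorded":
                print("[wav]", text)  # e.g. play_wav(lookup_recording(text))
            else:
                print("[tts]", text)  # e.g. speak_tts(text)

    play(good_sequence)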
Custom TTS
Even when using a custom-concatenated TTS persona created with the same voice actor, the boundary between the live reading of system prompts and the out-of-context concatenated speech will be noticeable. In usability studies, users have remarked that these boundaries sounded as if the actor had been "possessed by some outside force" or "had the life drained out of them."
Programmatic Solutions Versus Written Ones
Designers often employ programmatic solutions to combine phrases when creating prompts. In most cases, though it may seem counterintuitive, writing a custom version of the prompt is the better approach. Splicing a reused prompt onto longer messages (in error-handling situations, for example) is recommended only when storage space is an issue. Writing a custom version of each instance of the prompt gives the writer greater control, making it possible to design language specifically tailored to the circumstance. Even if the same text appears a number of times, it is better to record multiple readings of that text so the user does not hear the identical recording over and over.
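As an illustration of the multiple-readings approach, the Python sketch below picks among several recorded takes of the same message while avoiding an immediate repeat; the prompt id and file names are invented for this example.

    import random

    # Hypothetical store of several recorded takes per message.
    TAKES = {
        "no_match": ["no_match_1.wav", "no_match_2.wav", "no_match_3.wav"],
    }

    last_played = {}

    def next_take(prompt_id):
        """Pick a recording for the prompt, avoiding the previous take."""
        candidates = [t for t in TAKES[prompt_id]
                      if t != last_played.get(prompt_id)]
        choice = random.choice(candidates)
        last_played[prompt_id] = choice
        return choice

    for _ in range(5):
        print(next_take("no_match"))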
Automated interactions can take advantage of many of the linguistic markers on which human-to-human conversation relies. Constructs such as discourse markers and other cohesion devices can not only raise predictability within the interface, but also improve everything from recognition rates to task completion.
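For instance, a results readout can frame each unpredictable TTS item with recorded discourse markers, echoing the scripted examples earlier in this topic. A minimal sketch, with the connector phrases assumed to be prerecorded:

    # Recorded discourse markers frame each TTS item, signaling the
    # listener's position in the list (first, middle, last).
    def read_results(names):
        for i, name in enumerate(names):
            if i == 0:
                connector = "Here's the first one..."
            elif i < len(names) - 1:
                connector = "and the next..."
            else:
                connector = "And finally..."
            print("[wav]", connector)  # recorded phrase
            print("[tts]", name)       # unpredictable content

    read_results(["Frank Miller", "Eric Miller", "Geoff Miller"])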