Deployment and Iteration
After the audio files are in the system and developers and producers can hear them in context, a significant amount of fine-tuning work remains, including choosing among the prompt takes and adjusting the timing of concatenated material. Many other small details that complete the finished audio experience also remain. The audio in a speech system must be tested for bugs just as software is tested.
Avoid the temptation to treat the audio as an asset to be checked in and done with. Because audio is integral to a speech system and to the creation of its persona, it must be treated as part of the code and tested just as thoroughly. There will be bugs, including timing errors, improperly truncated phrases, and popped "P"s. The best systems are the product of diligent iteration and grooming. Audio for any project destined for mass consumption is mixed, sweetened, edited, and polished before it is presented. Speech systems are no different from film, television, or any other medium that relies heavily on audio.
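One way to treat audio like code is to run automated checks over each prompt before a human listens to it. The following is a minimal sketch of one such check, a heuristic for improperly truncated phrases. It assumes each prompt is available as a sequence of signed 16-bit PCM samples; the function name `ends_abruptly`, the 8000 Hz sample rate, the 10 ms tail window, and the amplitude threshold are all illustrative assumptions, not values from any particular toolkit. A check like this can only flag candidates for review and does not replace listening in context.

```python
from typing import Sequence


def ends_abruptly(
    samples: Sequence[int],
    sample_rate: int = 8000,   # assumed telephony-style rate
    tail_ms: int = 10,         # assumed length of the tail window
    threshold: int = 500,      # assumed "near silence" amplitude limit
) -> bool:
    """Return True if the last tail_ms of audio is louder than threshold.

    A prompt that is still loud in its final milliseconds was probably
    cut off before the speaker finished, so it is worth a manual listen.
    """
    if not samples:
        return False
    # Number of samples in the tail window, at least one.
    tail_len = max(1, sample_rate * tail_ms // 1000)
    tail = samples[-tail_len:]
    peak = max(abs(s) for s in tail)
    return peak > threshold


# A prompt that fades to silence passes; one that stops mid-sound is flagged.
clean = [10000] * 80 + [0] * 80
clipped = [0] * 100 + [10000] * 80
print(ends_abruptly(clean), ends_abruptly(clipped))
```

The threshold would need tuning against the noise floor of the actual recordings, and similar checks for other defects, such as plosive pops, would require different signal analysis.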
As the system is tested and exercised along each possible path through the dialog model, the prompt databases should be carefully groomed to emulate the natural spacing and timing of prompt playback, along with any other fine adjustments that enhance the illusion of conversational speech. In some ways, the creation and deployment of speech systems is an exercise in audio animation. Concatenation is the sequencing of audio elements to simulate a smooth sentence. Just as great animation is based on keen observation of how things move through time, great-sounding speech systems are the product of keen observation of how people naturally speak, pause, and use inflection.
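The idea of concatenation with groomed spacing can be sketched in a few lines. This is an illustration only, assuming each prompt segment is a list of PCM samples and that the pause after each segment is stored as a per-gap value in milliseconds. The function name `concatenate_prompts`, the 8000 Hz sample rate, and the sample values are hypothetical.

```python
from typing import List, Sequence


def concatenate_prompts(
    segments: Sequence[Sequence[int]],
    gaps_ms: Sequence[int],
    sample_rate: int = 8000,  # assumed sample rate
) -> List[int]:
    """Join prompt segments, inserting gaps_ms[i] of silence after segment i.

    There must be exactly one gap between each pair of adjacent segments.
    """
    expected_gaps = max(len(segments) - 1, 0)
    if len(gaps_ms) != expected_gaps:
        raise ValueError(
            f"expected {expected_gaps} gaps, got {len(gaps_ms)}"
        )
    out: List[int] = []
    for i, segment in enumerate(segments):
        out.extend(segment)
        if i < len(gaps_ms):
            # Silence is represented as zero-valued samples.
            out.extend([0] * (sample_rate * gaps_ms[i] // 1000))
    return out


# Hypothetical segments for "You have" + "three" + "new messages",
# with a shorter pause before the number than after it.
you_have = [100, 200]
three = [300]
new_messages = [400, 500]
sentence = concatenate_prompts([you_have, three, new_messages], [1, 2])
print(len(sentence))
```

Keeping the gap values as data rather than baking silence into the recordings means each pause can be adjusted independently during iteration, without re-editing the audio files.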