Postproduction Techniques

Levels

Dynamic range is the difference between the loudest and the softest information in a sample. The loudest sections should never exceed 100 percent of the reproducible amplitude. Audio that exceeds this limit is clipped, and clipping produces audible distortion. Normalization is the process of increasing or decreasing the volume of a sample so that the maximum amplitude of the file is adjusted to a specified percentage of the maximum allowable output. In other words, a sound normalized to 100 percent is adjusted so that its loudest part sits at 100 percent, and a sound normalized to 50 percent is adjusted so that its loudest part sits at 50 percent. All other material in the file is adjusted relative to the loudest portion.
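The normalization described above can be sketched in a few lines. This is only an illustration, assuming samples are held as floating-point values where 1.0 represents 100 percent amplitude; the function name normalize_peak is hypothetical and not part of any SDK.

```python
def normalize_peak(samples, target_percent):
    """Scale samples so the loudest one sits at target_percent of full scale.

    Assumption: samples are floats in -1.0..1.0, where 1.0 is 100 percent.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # empty or silent input: nothing to scale
    gain = (target_percent / 100.0) / peak
    # Every sample gets the same gain, so quieter material keeps its
    # level relative to the loudest portion.
    return [s * gain for s in samples]


quiet = [0.0, 0.1, -0.25, 0.2]
half = normalize_peak(quiet, 50)  # loudest sample now sits at 50 percent
```

Because a single gain is applied to the whole file, the relationship between loud and soft passages inside that file is unchanged.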

When normalizing, it is best practice to avoid setting audio to 100 percent. Files should not exceed 92 percent, as most sound cards are not tuned to handle material beyond this level. Unity gain, the level at which input power equals output power, is usually around 10 percent below full output level. This margin is called headroom, and it allows occasional transients, or spikes in volume, to be reproduced without distortion during playback. Many recording packages include a normalization feature.

This can be a useful way to maximize dynamic range and minimize minor discrepancies in volume among a group of recorded prompts. The danger of this technique is that some words, such as "ten", begin with a transient burst of audio energy created by the "t" sound. During normalization, this "t" is set to the normalization value, so the rest of the word stays well below it. If another file in the application contains a single word with little or no transient material, such as "on", normalization raises the whole word much closer to the normalization value. As a result, words like "on" or "one" can become louder relative to files with more transient energy.
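The effect can be reproduced with synthetic data. In the sketch below, the sample lists ten and on are hypothetical stand-ins, not real recordings: one has a single sharp spike followed by a quieter body, the other is steady throughout. Both are peak-normalized to 92 percent and then compared by root mean square level.

```python
import math


def rms(samples):
    """Root mean square level, a rough proxy for perceived volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_peak(samples, target):
    """Scale samples so the largest absolute value equals target (0..1)."""
    peak = max(abs(s) for s in samples)
    return [s * target / peak for s in samples]


# Hypothetical stand-in for "ten": one transient spike, then a quiet body.
ten = [0.8] + [0.2, -0.2] * 50
# Hypothetical stand-in for "on": steady energy with no transient.
on = [0.3, -0.3] * 50

ten_n = normalize_peak(ten, 0.92)
on_n = normalize_peak(on, 0.92)

# Both files now share the same peak, but their RMS levels differ widely.
print(f"ten: peak={max(map(abs, ten_n)):.2f} rms={rms(ten_n):.3f}")
print(f"on:  peak={max(map(abs, on_n)):.2f} rms={rms(on_n):.3f}")
```

Although both files end up with the same peak, the steady file has a much higher RMS level, which is why it sounds louder after normalization.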

The solution is not to normalize but to increase all files by the same percentage of their original volume. The difficulty with this approach is in calculating the optimum percentage based on the loudest sound in the loudest file. Because most recording packages cannot compute this, it may be advisable to normalize all files and then fix the handful of files that are abnormally elevated. A certain amount of trial and error is typical during the leveling process. The Speech Application SDK provides the capability for root mean square (RMS) normalization, which is a good measure of perceived volume.
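Where the samples are available to a script, the shared percentage can be derived from the loudest sound in the loudest file. The following is a minimal sketch under that assumption; the function names, the dictionary layout, and the float sample format (1.0 as 100 percent) are all illustrative, and the 0.92 ceiling reflects the 92 percent guideline given earlier.

```python
def common_gain(files, ceiling=0.92):
    """Return one gain that brings the loudest sample of any file to ceiling.

    files: dict mapping a file name to a list of float samples (-1.0..1.0).
    """
    loudest = max(
        (abs(s) for samples in files.values() for s in samples),
        default=0.0,
    )
    if loudest == 0.0:
        return 1.0  # nothing but silence: leave levels alone
    return ceiling / loudest


def apply_gain(files, gain):
    """Apply the same gain to every file so relative levels are preserved."""
    return {name: [s * gain for s in samples] for name, samples in files.items()}


prompts = {
    "CAL_001.wav": [0.10, -0.40],
    "CAL_002.wav": [0.20, -0.23],
}
gain = common_gain(prompts)
leveled = apply_gain(prompts, gain)
```

Unlike per-file normalization, only the loudest file reaches the ceiling; every other file keeps its original level relative to it.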

Down-Sampling

Standard telephone lines have a frequency response of approximately 300 Hz to 3000 Hz. Because of this limited range, it is standard procedure to down-sample audio files. Typically, prompts intended for use over the phone are down-sampled to 8 kHz at 8 bits per sample. Assuming an original recording rate of 44.1 kHz (44,100 samples per second), this represents a significant reduction in the audio quality of the recordings.
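A quick calculation shows the scale of the reduction. The sketch below assumes a 16-bit mono source, which the text does not state, and uses the standard rule that a sample rate can represent frequencies up to half its value (the Nyquist limit).

```python
def bits_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) data rate of a PCM stream."""
    return sample_rate_hz * bits_per_sample * channels


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


source_rate = bits_per_second(44_100, 16)  # assumed 16-bit mono source
phone_rate = bits_per_second(8_000, 8)     # 8 kHz, 8-bit telephony prompt

print(f"source: {source_rate} bits/s, phone: {phone_rate} bits/s")
print(f"reduction factor: {source_rate / phone_rate:.3f}")
print(f"8 kHz upper limit: {nyquist_hz(8_000):.0f} Hz")
```

Under these assumptions the prompt carries roughly one eleventh of the original data, while the 4000 Hz upper limit of an 8 kHz file still covers the telephone band.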

Here are a few guidelines to keep in mind when down-sampling:

  • If possible, start with sources that have the maximum dynamic range. The silence will be quieter. This is technically described as a lower noise floor in the down-sampled files.
  • High-frequency speech sounds, such as "s" and "sh", are particularly vulnerable during the down-sampling process. It may be desirable to boost the higher frequencies in the source files slightly to compensate for the loss of sibilance at reduced bandwidth.
  • Always save separate versions of the files at each stage of the process. Do not save over the source files.

Naming Files

Using file names in 8.3 format is less of an issue than it was in the past. Long file names are the norm in most file systems today, but depending on the tools used during the production process, adhering to 8.3 file names may simplify one or more steps.

Even moderately sized speech systems can accumulate thousands of individual files that must be tracked throughout the production process. One advantage of using the 8.3 format is that the file name will be visible in any environment.

When specialized prompt recording software is used, the resulting files are often sequentially numbered. This method has advantages for creating file maps based on the recording script. In order to keep the files in perfect sequential order, it is usually necessary to include number padding. This is a technique that ensures that even single digit numbers are preceded by zeros. Consequently, filenames based on CAL for calendar prompts range from CAL_001.wav through CAL_999.wav. This system keeps files in numerical order even when they are sorted alphanumerically in most file systems.
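The effect of number padding on sort order is easy to demonstrate. The helper name prompt_name below is hypothetical; the CAL prefix and three-digit width follow the example above.

```python
def prompt_name(prefix, number, width=3):
    """Build a zero-padded prompt file name such as CAL_006.wav."""
    return f"{prefix}_{number:0{width}d}.wav"


numbers = (1, 2, 10)

# Without padding, an alphanumeric sort puts 10 ahead of 2.
unpadded = sorted(f"CAL_{n}.wav" for n in numbers)

# With padding, alphanumeric order matches numerical order.
padded = sorted(prompt_name("CAL", n) for n in numbers)

print(unpadded)  # ['CAL_1.wav', 'CAL_10.wav', 'CAL_2.wav']
print(padded)    # ['CAL_001.wav', 'CAL_002.wav', 'CAL_010.wav']
```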

It is useful to limit the source file name to seven characters, because edits are often extracted from a source file and the eighth character can then identify the edit. The resulting edited file might be named CAL_006a.wav, or CAL_006b.wav, or some similar variation. Following this method, it is obvious which source was used to create an edit. When combinations of takes are used, it is best to note the edits in the recording script or file map.
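With this convention, the source of any edit can be recovered from the file name alone. The sketch below assumes names of exactly the form shown above (three uppercase letters, an underscore, three digits, and an optional lowercase suffix); the pattern and the function name source_of are illustrative only.

```python
import re

# Assumed pattern: 7-character base (e.g. CAL_006) plus an optional
# single-letter edit suffix, which keeps the whole name within 8.3.
EDIT_NAME = re.compile(r"^([A-Z]{3}_\d{3})([a-z]?)\.wav$")


def source_of(filename):
    """Return (source file name, edit suffix) for a prompt file name."""
    match = EDIT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognized prompt file name: {filename}")
    base, suffix = match.groups()
    return base + ".wav", suffix


print(source_of("CAL_006a.wav"))  # ('CAL_006.wav', 'a')
print(source_of("CAL_006.wav"))   # ('CAL_006.wav', '')
```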

Source Control

At various stages in the audio production process, it is wise to save complete backups of all the files as they progress from raw sources to trimmed, down-sampled finals.

If file names remain consistent throughout the process, it should be easy to retrace steps if one or more files must be remade.
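One simple way to keep stage-by-stage backups is to copy the working folder into a folder named for the stage just completed. The following is a sketch only; the function name snapshot_stage, the stage names, and the directory layout are assumptions, not part of any SDK.

```python
import shutil
from pathlib import Path


def snapshot_stage(work_dir, backup_root, stage):
    """Copy every file in work_dir to backup_root/stage.

    Refuses to overwrite an existing snapshot, so an earlier stage
    (for example the raw sources) can never be saved over by mistake.
    """
    destination = Path(backup_root) / stage
    if destination.exists():
        raise FileExistsError(f"snapshot already exists: {destination}")
    shutil.copytree(work_dir, destination)
    return destination
```

A call such as snapshot_stage("prompts", "backups", "01_raw") would be made after recording, followed by similar calls with new stage names after trimming and after down-sampling. Because file names stay the same in every snapshot, any file can be traced back through the stages.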

See Also

Audio Production