How to create a custom text to speech avatar

Getting started with a custom text to speech avatar is a straightforward process. All it takes is a few video files. If you'd like to train a custom neural voice for the same actor, you can do so separately.

An avatar talent is an individual or target actor who is recorded on video while speaking, and whose recordings are used to create neural avatar models. You must obtain sufficient consent under all relevant laws and regulations from the avatar talent to use their video to create the custom text to speech avatar.

You must provide a video file with a recorded statement from your avatar talent, acknowledging the use of their image and voice. Microsoft verifies that the content in the recording matches the predefined script provided by Microsoft. Microsoft also compares the face of the avatar talent in the recorded statement video with randomized videos from the training datasets to ensure that the same person appears in both the statement video and the training recordings.

You can find the verbal consent statement in multiple languages on GitHub. The verbal statement must be in the same language as your recording. See also the disclosure for voice talent.

Prepare training data for custom text to speech avatar

You're required to provide video recordings of the avatar talent speaking in a language of your choice. The voice audio in the video recordings should have a high signal-to-noise ratio. The voice in the video recording isn't used as training data for a custom neural voice; its purpose is to train the custom text to speech avatar model.
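As a rough illustration of what "high signal-to-noise ratio" means, the following minimal sketch estimates the ratio in decibels from two lists of audio samples: one taken while the talent is speaking and one taken during silence in the same recording. The function names `mean_power` and `estimate_snr_db` are hypothetical, and the approach (comparing average power of a speech segment against a noise-only segment) is an assumption for illustration, not a method or threshold prescribed by Microsoft.

```python
import math


def mean_power(samples):
    """Return the mean squared amplitude of a sequence of audio samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech, noise):
    """Estimate the signal-to-noise ratio in decibels.

    speech: samples captured while the talent is speaking (hypothetical input)
    noise:  samples captured during silence in the same recording
    """
    noise_power = mean_power(noise)
    speech_power = mean_power(speech)
    if noise_power == 0:
        return math.inf   # no measurable background noise
    if speech_power == 0:
        return -math.inf  # no measurable speech signal
    return 10 * math.log10(speech_power / noise_power)


# Synthetic example: speech amplitude 0.5, background noise amplitude 0.005.
speech_segment = [0.5, -0.5] * 100
noise_segment = [0.005, -0.005] * 100
snr = estimate_snr_db(speech_segment, noise_segment)
print(f"Estimated SNR: {snr:.1f} dB")
```

A higher value means the voice stands out more clearly from the background noise. What counts as acceptable depends on the recording guidance you follow; this sketch doesn't define a pass or fail threshold.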

For more information about preparing the training data, see How to record video samples.
