01/20/2015

Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

Using Custom Pronunciations

In some scenarios, you can improve the performance of speech recognition in your application by specifying custom pronunciations for words in its vocabulary that are uncommon or pronounced in an unusual way. Words for which you may want to create custom pronunciations include proper names, place names, fictional words, slang, and words that are specific to an educational or medical discipline. Using the information and examples in this topic, you can create custom pronunciations for the specialized vocabulary in your application.

When to Create Custom Pronunciations

You create custom pronunciations to improve the accuracy of speech recognition for vocabulary in your application that the speech recognition engine does not interpret as well as expected. Typically, you will only need to create custom pronunciations for words that are not common to a language and that do not follow the typical pronunciation rules for the orthography of that language. The Microsoft speech recognition engine is well equipped to determine the correct pronunciation both of words it is familiar with and of words it has never encountered.

Speech recognition engines include a default lexicon that specifies which words can be recognized, and how each word must be pronounced to be recognized. The scope of a speech recognition engine's lexicon is typically a single language, and the lexicon contains a large number of words that the speech recognition engine can recognize in that language. This provides the speech recognition engine with a substantial native vocabulary.

In addition to its lexicon, a speech recognition engine also has an acoustic model that describes the sounds of a language, and a language model that describes how the sounds of a language can be combined into meaningful phrases. This gives the speech recognition engine an understanding of its language that transcends the words in its lexicon. The speech recognition engine uses this understanding to create pronunciations for words it encounters that are not in its lexicon, and it does this almost instantaneously.

However, if your application includes words that feature unusual spelling or atypical pronunciation of familiar spellings, then the speech recognition engine may not create the pronunciation that works best for your application. In these cases, you can specify a custom pronunciation that may improve the recognition accuracy for the specialized vocabulary in your application. It is important to test the performance of your custom pronunciations to confirm that they provide an improved speech recognition experience for your intended audience.

Methods of Incorporation

You can create custom pronunciations either inline in a grammar, or in a lexicon file that a grammar references. A lexicon is a list of words together with their pronunciations. When deciding whether to implement custom pronunciations inline in a grammar or in a linked lexicon, consider the following:

Custom pronunciations specified inline in grammars apply only to the single occurrence of a word in the grammar.
Custom pronunciations specified in lexicons apply to all occurrences of a word in a grammar.
A lexicon linked from a grammar is only active while the grammar is active for recognition.

When deciding which pronunciation to use for a word or phrase during speech recognition, a speech recognition engine looks for pronunciations at the following locations in order:

Inline in grammar documents
In lexicon files linked from a grammar document
In the speech recognition engine's internal lexicon

If there are custom pronunciations specified for the same word both inline in a grammar and in a linked lexicon, the speech recognition engine uses only the inline pronunciations. Similarly, if there are custom pronunciations specified in a lexicon linked from a grammar, the speech recognition engine uses those pronunciations instead of, not in addition to, the pronunciations given in the engine's internal lexicon.

If the speech recognition engine does not find a pronunciation for a word, either in grammars or lexicons to which it currently has access, it will create a pronunciation using the rules of its language model and acoustic model.
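
The lookup order described above can be modeled in a few lines. The following Python sketch only illustrates the documented precedence; it is not the engine's implementation. The function name resolve_pronunciations, the dictionary-based lexicons, and the sample entries are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple


def resolve_pronunciations(
    word: str,
    inline_pron: Optional[str],
    linked_lexicon: Dict[str, List[str]],
    internal_lexicon: Dict[str, List[str]],
) -> Tuple[str, List[str]]:
    """Return (source, pronunciations) following the documented lookup order."""
    # 1. An inline pronunciation on this occurrence of the word wins outright.
    if inline_pron:
        return "inline", [inline_pron]
    # 2. A linked lexicon replaces, rather than extends, the internal lexicon.
    if word in linked_lexicon:
        return "linked lexicon", list(linked_lexicon[word])
    # 3. Otherwise fall back to the engine's internal lexicon.
    if word in internal_lexicon:
        return "internal lexicon", list(internal_lexicon[word])
    # 4. No entry anywhere: the engine would generate a pronunciation itself.
    return "generated", []


# Hypothetical lexicon contents, for illustration only.
linked = {"habanera": ["H AE . B AX . S1 N EH lng . R AX"]}
internal = {"dog": ["S1 D AO G"]}

print(resolve_pronunciations("habanera", None, linked, internal))
print(resolve_pronunciations("dog", None, linked, internal))
print(resolve_pronunciations("Eanor", None, linked, internal))
```

In this model, a word that has no entry at any level is reported as "generated", which corresponds to the engine creating a pronunciation from its language model and acoustic model.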

You can determine the pronunciation that the speech recognition engine associates with a phrase using the Check Phrase tool. See Check Phrase Reference Manual. You pass in the phrase and a grammar containing the phrase, and the tool generates a result that includes the pronunciation associated with the phrase. This can help you to decide whether or not to provide a custom pronunciation for a phrase.

You specify custom pronunciations using characters from a phonetic alphabet. A phonetic alphabet contains combinations of letters, numbers, and characters which are known as "phones". Phones describe the sounds of speech for a particular language. Similar to those used in dictionaries, phonetic spellings describe how words should be pronounced for successful speech recognition in a specified language.

Creating Inline Custom Pronunciations

You can create custom pronunciations inline in XML-format grammars that are based on the Speech Recognition Grammar Specification (SRGS) Version 1.0. To add custom pronunciations inline in a grammar document, you use special attributes that were created by Microsoft for the grammar Element (Microsoft.Speech) and the token Element (Microsoft.Speech). Remember that inline custom pronunciations apply only to a single occurrence of a word. Use the following steps to create custom pronunciations inline in a grammar document:

Create Inline Custom Pronunciations

Add the following declarations to the grammar element:
- sapi:alphabet="x-microsoft-ups". This informs the grammar that you will use Microsoft's Universal Phone Set (UPS) to specify pronunciations. You will typically use UPS to specify pronunciations in US English. This attribute is case-sensitive.
- xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions". This declares the namespace that defines Microsoft's custom attributes for the grammar and token elements.
Create a token element that encloses the word for which you want to specify a custom pronunciation. Add an empty sapi:pron attribute.
- <token sapi:pron=""> habanera </token>
Using the phone tables below, look up the phones that correspond to the sound of each syllable in your word. Use the Sample Pronunciations below to understand how to combine phones to describe the sound of a syllable.
In the sapi:pron attribute, enter the phones that describe the word's pronunciation. Phones are case-sensitive and must be space-delimited. Optionally use markers for syllable emphasis (S1, S2) and for separating syllables (.) to further refine the pronunciation. See Suprasegmentals later in this topic.

Here is an example of a grammar that specifies custom pronunciations inline.

<?xml version="1.0" encoding="UTF-8"?>

<grammar 
  version="1.0" mode="voice" root="sauce"
  xml:lang="en-US" tag-format="semantics/1.0" 
  sapi:alphabet="x-microsoft-ups" 
  xml:base="https://www.contoso.com/"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">
  
  <rule id="sauce" scope="public">
    <item> Please bring more </item>
      <one-of>
        <item><token sapi:display="habanero" sapi:pron="H AE . B AX . S1 N EH lng . R O"> habanero </token></item>
        <item><token sapi:display="habanera" sapi:pron="H AE . B AX . S1 N EH lng . R AX"> habanera </token></item>
      </one-of> 
    <item> sauce. </item>
  </rule>

</grammar>

Note

You can use the sapi:display attribute of the token element to specify the form of the word that will display in a user interface. The display form of a word is often the same as its lexical form, which is the content of the token element. However, in some languages, such as Japanese, the application may choose a display form that is different from the lexical form, as in the following example:

  <item>
    <token sapi:display="ストップ" sapi:pron="ストップ"> すとっぷ </token>
    <tag>cancel</tag>
  </item>

Creating Custom Pronunciations Using a PLS Lexicon

If your application uses specialized words repeatedly across multiple grammars, or if you are generating pronunciations separately from the grammar generation, you can create a lexicon that contains the words and their pronunciations. A lexicon is a separate document that you can link to one or more grammars. If the speech recognition engine loads a grammar that is linked to a lexicon, it uses the pronunciations that the lexicon contained when the grammar was loaded. Any pronunciations inline in a grammar still take precedence over pronunciations in a linked lexicon.

You author lexicons as XML documents that follow the format of the Pronunciation Lexicon Specification (PLS) Version 1.0. Use the following steps to author a PLS lexicon:

Create a Lexicon

Start with a new, blank XML document.
Enter the XML declaration: <?xml version="1.0" encoding="UTF-8"?>.

Enter the opening tag of the lexicon element. This must declare the single language-culture of the words in the lexicon, and the phonetic alphabet used to construct the pronunciations. Here is an example of the opening tag of a lexicon element:

<lexicon version="1.0" 
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
  http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
  alphabet="x-microsoft-ups" xml:lang="en-US">

Add a lexeme element for each word for which you want to define one or more pronunciations:
- <lexeme></lexeme>. The parent element for defining words and their pronunciations.
Within the lexeme element, add a grapheme element and a phoneme element.
- <grapheme></grapheme>. Contains the written form of a word.
- <phoneme></phoneme>. Contains phones that describe the pronunciation of a word.
In the grapheme element, enter the word for which you want to specify a pronunciation.
Using the phone tables below, look up the phones that correspond to the sound of each syllable in your word. Use the Sample Pronunciations below to understand how to combine phones to describe the sound of a syllable.
In the phoneme element, enter the phones that describe the word's pronunciation. Phones are case-sensitive and must be space-delimited. Optionally include markers for syllable emphasis (S1, S2) and for separating syllables (.) to further refine the pronunciation. See Suprasegmentals later in this topic.
If you want to specify multiple pronunciations for the same word, add more phoneme elements within the lexeme element. When specifying multiple pronunciations for a word, you can designate one pronunciation as preferred by adding the attribute/value pair prefer="true" to the phoneme element, for example: <phoneme prefer="true"> S1 L EH D </phoneme>.
Continue adding lexeme elements for each word whose pronunciation you want to specify.
Add the closing lexicon tag: </lexicon>, and save the document with a .pls extension.

You can optionally include an <example></example> element within the lexeme element that contains a sample usage of the grapheme.

Here is an example of a completed lexicon document that specifies pronunciations for fictional names:

<?xml version="1.0" encoding="UTF-8"?>

<lexicon version="1.0" 
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
  http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
  alphabet="x-microsoft-ups" xml:lang="en-US">

  <lexeme>
    <grapheme> Klhtr </grapheme>
    <phoneme> K L EH . S1 T AA R </phoneme>
  </lexeme>

  <lexeme>
    <grapheme> Eanor </grapheme>
    <phoneme> S1 I . AX . N O R </phoneme>
  </lexeme>

  <lexeme>
    <grapheme> Puntahrik </grapheme>
    <phoneme> P UH N . S1 T AA . R IH K </phoneme>
  </lexeme>

</lexicon>
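
If you generate pronunciations separately from your grammars, the authoring steps above could be scripted. The following Python sketch builds a small PLS document with the standard library. It is one possible approach under stated assumptions: the function name build_lexicon, its parameters, and the sample entries (including the two pronunciations for "lead") are hypothetical and not part of the Microsoft Speech tooling.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(
    entries: Dict[str, List[str]],
    preferred: Optional[Dict[str, str]] = None,
    lang: str = "en-US",
    alphabet: str = "x-microsoft-ups",
) -> str:
    """Return a PLS document with one lexeme per word in `entries`."""
    # Serialize PLS elements in the default namespace (no prefix).
    ET.register_namespace("", PLS_NS)
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for word, pronunciations in entries.items():
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = word
        for phones in pronunciations:
            phoneme = ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme")
            phoneme.text = phones
            # Mark at most the pronunciation named in `preferred`.
            if preferred and preferred.get(word) == phones:
                phoneme.set("prefer", "true")
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical entries, for illustration only.
document = build_lexicon(
    {"Klhtr": ["K L EH . S1 T AA R"], "lead": ["S1 L EH D", "S1 L I D"]},
    preferred={"lead": "S1 L EH D"},
)
print(document)
```

You would write the returned string to a file with a .pls extension. The sketch omits the optional xsi:schemaLocation attribute and the example element.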

Link the Lexicon to the Grammar

Now that you have created your lexicon, you must link to it from your grammar, using the following steps.

Link a Lexicon to a Grammar

After the opening tag of the grammar element, and before the first rule element, enter a lexicon element: <lexicon />
Add a uri attribute to the lexicon element and enter the path and name of the lexicon file. For example: <lexicon uri="c:\MyLexicon.pls" /> or <lexicon uri="https://contoso.com/lexiconstore/MyLexicon.pls" />

The following is an example of a grammar that references the lexicon of fictional names shown above:

<?xml version="1.0" encoding="UTF-8"?>

<grammar 
  version="1.0" mode="voice" root="warriors"
  xml:lang="en-US" tag-format="semantics/1.0" 
  sapi:alphabet="x-microsoft-ups" 
  xml:base="https://www.contoso.com/"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">

  <lexicon uri="c:\test\Warriors.pls" />
  
  <rule id="warriors" scope="public">
    <item> The warrior's name is </item>
    <one-of>
      <item> Klhtr </item> 
      <item> Eanor </item>  
      <item> Puntahrik </item>
    </one-of>
  </rule>

</grammar>
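
If you maintain many grammars, the linking step could also be automated. The following Python sketch inserts a lexicon element before the first rule element of an SRGS grammar. It is a minimal illustration using the standard library; the function name link_lexicon and the shortened sample grammar are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Shortened, hypothetical grammar used only for this example.
GRAMMAR = """<grammar version="1.0" xml:lang="en-US" root="warriors"
  xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="warriors" scope="public">
    <item> Klhtr </item>
  </rule>
</grammar>"""


def link_lexicon(grammar_xml: str, lexicon_uri: str) -> str:
    """Return the grammar with a lexicon element placed before the first rule."""
    # Serialize SRGS elements in the default namespace (no prefix).
    ET.register_namespace("", SRGS_NS)
    root = ET.fromstring(grammar_xml)
    children = list(root)
    rule_tag = f"{{{SRGS_NS}}}rule"
    # Index of the first rule; append at the end if the grammar has no rules.
    first_rule = next(
        (i for i, child in enumerate(children) if child.tag == rule_tag),
        len(children),
    )
    lexicon = ET.Element(f"{{{SRGS_NS}}}lexicon", {"uri": lexicon_uri})
    lexicon.tail = "\n  "  # keep the following element on its own line
    root.insert(first_rule, lexicon)
    return ET.tostring(root, encoding="unicode")


print(link_lexicon(GRAMMAR, r"c:\test\Warriors.pls"))
```

The sketch does not check whether the grammar already references the same lexicon, so you may want to add that check before using it on existing grammars.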

UPS Phone Tables

The following tables contain the most commonly used phones for US English from Microsoft's Universal Phone Set (UPS).

Note

UPS phones are case-sensitive.

Consonants

The following table lists the most commonly used consonant phones for US English from Microsoft's Universal Phone Set (UPS).

UPS Phone	Example
B	big
CH	chin
D	dig
DH	then
DX	butter
F	fork
G	gut
H	help
JH	joy
K	cut
L	lid
M	mat
N	no
NG	sing
P	put
R	red
S	sit
SH	she
T	talk
TH	thin
V	vat
W	with
J	yard
Z	zap
ZH	pleasure

Vowels

The following table lists the most commonly used vowel phones for US English from Microsoft's Universal Phone Set (UPS).

UPS Phone	Example
AA	father
AE	cat
AH	cut
AO	dog
AOX	four
AU	foul
AX	ago
AX rho	minor
AI	bite
EH	pet
EHX	stairs
ER	fur
ER rho	urban
EI	ate
IH	fill
I	feel
O	go
OI	toy
OWX	boa
Q	hot
UH	book
U	too, blue
UWX	lure

Suprasegmentals

Suprasegmentals are optional markers that indicate the division of a word into syllables, identify which syllables receive emphasis (stress), and specify the length of syllables. The following suprasegmental markers of the UPS are commonly used to create pronunciations for words in US English:

UPS Phone	Description	Example
S1	Indicates that the following syllable receives primary emphasis.	F AX . S1 N EH S (finesse)
S2	Indicates that the following syllable receives secondary emphasis.	K AX N . S1 S ER rho . V AX . S2 T R I (conservatory)
.	Indicates a break between syllables.	S1 K O . K O (cocoa)
lng	Extends the length of the preceding syllable.	B I . S1 W EH lng R (beware)
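
Because phones are case-sensitive and must be space-delimited, a quick check before you test a pronunciation can catch simple typing mistakes. The following Python sketch flags tokens that do not appear in the tables above. The function name unlisted_phones is hypothetical, and treating "rho" as a separate token is an assumption based on the "AX rho" and "ER rho" entries.

```python
from typing import List

# Phones copied from the consonant, vowel, and suprasegmental tables in this topic.
CONSONANTS = {
    "B", "CH", "D", "DH", "DX", "F", "G", "H", "JH", "K", "L", "M", "N",
    "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "J", "Z", "ZH",
}
VOWELS = {
    "AA", "AE", "AH", "AO", "AOX", "AU", "AX", "AI", "EH", "EHX", "ER",
    "EI", "IH", "I", "O", "OI", "OWX", "Q", "UH", "U", "UWX",
}
# Assumption: "rho" is written as its own token, as in "AX rho" and "ER rho".
MODIFIERS = {"rho"}
SUPRASEGMENTALS = {"S1", "S2", ".", "lng"}

LISTED = CONSONANTS | VOWELS | MODIFIERS | SUPRASEGMENTALS


def unlisted_phones(pronunciation: str) -> List[str]:
    """Return the space-delimited tokens that are not in the tables above."""
    return [token for token in pronunciation.split() if token not in LISTED]


print(unlisted_phones("H AE . B AX . S1 N EH lng . R O"))
print(unlisted_phones("s1 l eh d"))
```

An unlisted token is not necessarily invalid. The tables contain only the most commonly used phones for US English, and UPS defines other symbols, such as the tying symbol "+", that this check would also report.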

Sample Pronunciations

The following table contains a sampling of words in US English and their phonetic spellings (pronunciations) using UPS phones. Use this table to understand which phones represent sounds in US English with which you are already familiar, and as a guide to combining phones to create syllables and words.

Note

The words in this table are not words that require custom pronunciations.
The phones used in pronunciations are case-sensitive and must be space-delimited.
The use of the suprasegmental markers (S1, S2, .) is optional.
UPS provides a compounding or tying symbol "+" that can be used to describe composite sounds from any two phones. For more information, see Compound Tying Symbol (Microsoft.Speech).

Word	Pronunciation	Word	Pronunciation
action	S1 AE K . SH IH N	junkyard	S1 JH AH NG . K J AA R D
adverse	S1 AE D . V ER rho S	king	S1 K IH NG
analog	S1 AE . N AX . L AA G	lady	S1 L E + I . D I
around	AX . S1 R A + UH N D	leave	S1 L I V
avail	AX . S1 V E + I L	left	S1 L EH F T
beauty	S1 B J U . DX I	lion	S1 L A + I . AX N
believe	B AX . S1 L I V	little	S1 L IH . DX AX L
beware	B I . S1 W EH lng R	luscious	S1 L AH . SH AX S
bittersweet	S1 B IH . DX AX rho . S W I T	magic	S1 M AE . JH IH K
blood	S1 B L AH D	Mary	S1 M EH lng . R I
burrito	B AX . S1 R I . T O	minor	S1 M A + I . N AX rho
cell-phone	S1 S EH L _& S1 F O N	name	S1 N E + I M
coconuts	S1 K O . K AX . N AH T S	outdoors	S1 A + UH T . D AO R Z
collect	K AX . S1 L EH K T	Pisces	S1 P A + I . S I Z
comets	S1 K AA . M IH T S	quick	S1 K W IH K
conformity	K AX N . S1 F AO R . M IH . DX I	raindrop	S1 R E + I N . D R AA P
conservatory	K AX N . S1 S ER rho . V AX . S2 T R I	refreshments	R AX . S1 F R EH SH . M AX N T S
contemporary	S2 K AX N . S1 T EH M . P AX . R AX . R I	refused	R AX . S1 F J U Z D
creations	S1 K R I . E + I . SH AX N Z	revolution	R EH . V AX . S1 L U . SH AX N
cross	S1 K R AA S	rolling	S1 R O . L IH NG
deed	S1 D I D	round	S1 R A + UH N D
dog	S1 D AO G	saffron	S1 S AE . F R AA N
duckling	S1 D AH . K L IH NG	short	S1 SH AO R T
enthusiast	EH N . S1 TH U . Z I . AE S T	snow	S1 S N O
excerpt	S1 EH K . S AX rho P T	song	S1 S AO NG
facile	S1 F AE . S A + I L	sponge	S1 S P AH N JH
falling	S1 F AO L . L IH NG	strawberries	S1 S T R AO . B EH . R I Z
far	S1 F AA R	strollers	S1 S T R AO . L AX rho Z
fastball	S1 F AE S T . B AO L	subway	S1 S AH B . W E + I
friends	S1 F R EH N Z	thrill	S1 TH R IH L
furnaces	S1 F ER rho . N IH . S IH Z	toasters	S1 T O . S T AX rho Z
garden	S1 G AA R . D IH N	toothpaste	S1 T U TH . P E + I S T
great	S1 G R E + I T	towering	S1 T A + UH . AX . R IH NG
head	S1 H EH D	toy	S1 T AO + I
hothouse	S1 H AO T . H A + UH S	tractors	S1 T R AE K . T AX rho Z
howlers	S1 H A + UH . L AX rho Z	tragically	S1 T R AE . JH IH . K L I
hygienist	H A + I . S1 JH EH . N IH S T	unrest	S1 AH N . R EH S T
impossible	IH M . S1 P AA . S IH . B AX L	urban	S1 ER rho . B AX N
informant	IH N . S1 F AO R . M AX N T	vagabonds	S1 V AE . G AX . B AA N D Z
instruction	IH N . S1 S T R AH K . SH AX N	vanished	S1 V AE . N IH SH T
intruders	IH N . S1 T R U . D AX rho Z	velvet	S1 V EH L . V IH T
islanders	S1 A + I . L AX N . D AX rho Z	why	S1 H W A + I
journal	S1 JH ER rho . N AX L	why	S1 W A + I
jungle	S1 JH AH NG . G AX L	zippers	S1 Z IH . P AX rho Z
