Note
Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.
Using Custom Pronunciations
In some scenarios, you can improve speech recognition accuracy in your application by specifying custom pronunciations for vocabulary words that are uncommon or whose pronunciation is unusual. Unusual words for which you may want to create custom pronunciations include proper names, place names, fictional words, slang, and words that are specific to an educational or medical discipline. Using the information and examples in this topic, you can create custom pronunciations for the specialized vocabulary in your application.
When to Create Custom Pronunciations
You create custom pronunciations to improve the accuracy of speech recognition for vocabulary in your application that the speech recognition engine does not recognize as well as expected. Typically, you will only need to create custom pronunciations for words that are not common to a language and that do not follow the typical pronunciation rules for the orthography of a language. The Microsoft speech recognition engine is well equipped with tools that enable it to determine the correct pronunciation of words it is familiar with, as well as of words it has never encountered.
Speech recognition engines include a default lexicon that specifies which words can be recognized, and how each word must be pronounced to be recognized. The scope of a speech recognition engine's lexicon is typically a single language, and the lexicon contains a large number of words that the speech recognition engine can recognize in that language. This provides the speech recognition engine with a substantial native vocabulary.
In addition to its lexicon, a speech recognition engine also has an acoustic model that describes the sounds of a language, and a language model that describes how the sounds of a language can be combined into meaningful phrases. This gives the speech recognition engine an understanding of its language that transcends the words in its lexicon. The speech recognition engine uses this understanding to create pronunciations for words it encounters that are not in its lexicon, and it does this almost instantaneously.
However, if your application includes words that feature unusual spelling or atypical pronunciation of familiar spellings, then the speech recognition engine may not create the pronunciation that works best for your application. In these cases, you can specify a custom pronunciation that may improve the recognition accuracy for the specialized vocabulary in your application. It is important to test the performance of your custom pronunciations to confirm that they provide an improved speech recognition experience for your intended audience.
Methods of Incorporation
You can create custom pronunciations either inline in a grammar, or in a lexicon file that a grammar references. A lexicon is a list of words together with their pronunciations. When deciding whether to implement custom pronunciations inline in a grammar or in a linked lexicon, consider the following:
- Custom pronunciations specified inline in grammars apply only to the single occurrence of a word in the grammar.
- Custom pronunciations specified in lexicons apply to all occurrences of a word in a grammar.
- A lexicon linked from a grammar is only active while the grammar is active for recognition.
When deciding which pronunciation to use for a word or phrase during speech recognition, a speech recognition engine looks for pronunciations at the following locations in order:
- Inline in grammar documents
- In lexicon files linked from a grammar document
- In the speech recognition engine's internal lexicon
If there are custom pronunciations specified for the same word both inline in a grammar and in a linked lexicon, the speech recognition engine uses only the inline pronunciations. Similarly, if there are custom pronunciations specified in a lexicon linked from a grammar, the speech recognition engine uses those pronunciations instead of, not in addition to, the pronunciations given in the engine's internal lexicon.
If the speech recognition engine does not find a pronunciation for a word, either in grammars or lexicons to which it currently has access, it will create a pronunciation using the rules of its language model and acoustic model.
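The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the precedence rules, not the engine's implementation: the function name `resolve_pronunciations`, the dictionaries, and the pronunciations in them are hypothetical, and the sketch keys inline pronunciations by word even though a real inline pronunciation applies to a single occurrence of a word.

```python
def resolve_pronunciations(word, inline, linked_lexicon, engine_lexicon):
    """Return (source, pronunciations) for a word, following the lookup order.

    Each argument after `word` is a dict mapping a word to a list of
    pronunciation strings. A source that defines the word replaces all
    later sources; it does not add to them.
    """
    for source, table in (
        ("inline", inline),
        ("linked lexicon", linked_lexicon),
        ("engine lexicon", engine_lexicon),
    ):
        if word in table:
            return source, table[word]
    # No entry anywhere: a real engine would generate a pronunciation
    # from its models, which this sketch does not attempt.
    return "generated", []


# Hypothetical data for illustration only.
inline = {"habanera": ["H AE . B AX . S1 N EH lng . R AX"]}
linked = {
    "habanera": ["H AE . B AX . S1 N EH . R AX"],
    "Klhtr": ["K L EH . S1 T AA R"],
}
engine = {"Klhtr": ["K L AX T AX rho"], "sauce": ["S1 S AO S"]}

print(resolve_pronunciations("habanera", inline, linked, engine))
print(resolve_pronunciations("Klhtr", inline, linked, engine))
print(resolve_pronunciations("Puntahrik", inline, linked, engine))
```

The function returns at the first source that defines the word, which mirrors the "instead of, not in addition to" behavior.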
You can determine the pronunciation that the speech recognition engine associates with a phrase using the Check Phrase tool. See Check Phrase Reference Manual. You pass in the phrase and a grammar containing the phrase, and the tool generates a result that includes the pronunciation associated with the phrase. This can help you to decide whether or not to provide a custom pronunciation for a phrase.
You specify custom pronunciations using characters from a phonetic alphabet. A phonetic alphabet contains combinations of letters, numbers, and characters which are known as "phones". Phones describe the sounds of speech for a particular language. Similar to those used in dictionaries, phonetic spellings describe how words should be pronounced for successful speech recognition in a specified language.
Creating Inline Custom Pronunciations
You can create custom pronunciations inline in XML-format grammars that are based on the Speech Recognition Grammar Specification (SRGS) Version 1.0. To add custom pronunciations inline in a grammar document, you use special attributes that were created by Microsoft for the grammar Element (Microsoft.Speech) and the token Element (Microsoft.Speech). Remember that inline custom pronunciations apply only to a single occurrence of a word. Use the following steps to create custom pronunciations inline in a grammar document:
Create Inline Custom Pronunciations
Add the following declarations to the grammar element:
sapi:alphabet="x-microsoft-ups". This informs the grammar that you will use Microsoft's Universal Phone Set (UPS) to specify pronunciations. You will typically use UPS to specify pronunciations in US English. This attribute is case-sensitive.
xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions". This declares the namespace that defines Microsoft's custom attributes for the grammar and token elements.
Create a token element that encloses the word for which you want to specify a custom pronunciation. Add an empty sapi:pron attribute.
- <token sapi:pron=""> habanera </token>
Using the phone tables below, look up the phones that correspond to the sound of each syllable in your word. Use the Sample Pronunciations below to understand how to combine phones to describe the sound of a syllable.
In the sapi:pron attribute, enter the phones that describe the word's pronunciation. Phones are case-sensitive and must be space-delimited. Optionally use markers for syllable emphasis (S1, S2) and for separating syllables (.) to further refine the pronunciation. See Suprasegmentals later in this topic.
Here is an example of a grammar that specifies custom pronunciations inline.
<?xml version="1.0" encoding="UTF-8"?>
<grammar
  version="1.0" mode="voice" root="sauce"
  xml:lang="en-US" tag-format="semantics/1.0"
  sapi:alphabet="x-microsoft-ups"
  xml:base="https://www.contoso.com/"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">
  <rule id="sauce" scope="public">
    <item> Please bring more </item>
    <one-of>
      <item><token sapi:display="habanero" sapi:pron="H AE . B AX . S1 N EH lng . R O"> habanera </token></item>
      <item><token sapi:display="habanera" sapi:pron="H AE . B AX . S1 N EH lng . R AX"> habanera </token></item>
    </one-of>
    <item> sauce. </item>
  </rule>
</grammar>
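When you review a grammar, it can help to list every token that carries an inline pronunciation. The following Python sketch does this with the standard library's `xml.etree.ElementTree`. The function name `inline_pronunciations` and the embedded grammar are illustrative assumptions, and the namespace constants must match the URIs declared in your own grammar exactly, because XML namespaces are compared as plain strings.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; they must equal the ones declared in the grammar.
SRGS_NS = "http://www.w3.org/2001/06/grammar"
SAPI_NS = "http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"

# A small grammar used only for this illustration.
GRAMMAR = """
<grammar version="1.0" xml:lang="en-US" root="sauce"
  sapi:alphabet="x-microsoft-ups"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">
  <rule id="sauce" scope="public">
    <item> Please bring more </item>
    <item><token sapi:pron="H AE . B AX . S1 N EH lng . R AX"> habanera </token></item>
    <item> sauce. </item>
  </rule>
</grammar>
"""


def inline_pronunciations(grammar_xml):
    """Return (word, phones) pairs for each token with a sapi:pron attribute."""
    root = ET.fromstring(grammar_xml)
    results = []
    for token in root.iter(f"{{{SRGS_NS}}}token"):
        pron = token.get(f"{{{SAPI_NS}}}pron")
        if pron is not None:
            word = (token.text or "").strip()
            # Phones are space-delimited, so split() yields one entry per symbol.
            results.append((word, pron.split()))
    return results


for word, phones in inline_pronunciations(GRAMMAR):
    print(word, phones)
```

Splitting the attribute value on whitespace gives one list entry per phone or marker, which makes it easy to spot a missing space between two phones.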
Note
You can use the sapi:display attribute of the token element to specify the form of the word that will display in a user interface. The display form of a word is often the same as its lexical form, which is the content of the token element. However, in some languages, such as Japanese, the application may choose to have a display form that is different from the lexical form, as in the following example:
<item>
  <token sapi:display="ストップ" sapi:pron="ストップ"> すとっぷ </token>
  <tag>cancel</tag>
</item>
Creating Custom Pronunciations Using a PLS Lexicon
If your application uses specialized words repeatedly across multiple grammars, or if you are generating pronunciations separately from the grammar generation, you can create a lexicon that contains the words and their pronunciations. A lexicon is a separate document that you can link to one or more grammars. If the speech recognition engine loads a grammar that is linked to a lexicon, it uses the pronunciations that the lexicon contained when the grammar was loaded. Any pronunciations inline in a grammar still take precedence over pronunciations in a linked lexicon.
You author lexicons as XML documents that follow the format of the Pronunciation Lexicon Specification (PLS) Version 1.0. Use the following steps to author a PLS lexicon:
Create a Lexicon
Start with a new, blank XML document.
Enter the XML declaration: <?xml version="1.0" encoding="UTF-8"?>.
Enter the opening tag of the lexicon element. This must declare the single language-culture of the words in the lexicon, and the phonetic alphabet used to construct the pronunciations. Here is an example of the opening tag of a lexicon element:
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd" alphabet="x-microsoft-ups" xml:lang="en-US">
Add a lexeme element for each word for which you want to define one or more pronunciations:
- <lexeme></lexeme>. The parent element for defining words and their pronunciations.
Within the lexeme element, add a grapheme element and a phoneme element.
- <grapheme></grapheme>. Contains the written form of a word.
- <phoneme></phoneme>. Contains phones that describe the pronunciation of a word.
In the grapheme element, enter the word for which you want to specify a pronunciation.
Using the phone tables below, look up the phones that correspond to the sound of each syllable in your word. Use the Sample Pronunciations below to understand how to combine phones to describe the sound of a syllable.
In the phoneme element, enter the phones that describe the word's pronunciation. Phones are case-sensitive and must be space-delimited. Optionally include markers for syllable emphasis (S1, S2) and for separating syllables (.) to further refine the pronunciation. See Suprasegmentals later in this topic.
If you want to specify multiple pronunciations for the same word, add more phoneme elements within the lexeme element. When specifying multiple pronunciations for a word, you can designate one pronunciation as preferred by adding the attribute/value pair prefer="true" to the phoneme element, for example: <phoneme prefer="true"> S1 L EH D </phoneme>.
Continue adding lexeme elements for each word whose pronunciation you want to specify.
Add the closing lexicon tag: </lexicon>, and save the document with a .pls extension.
You can optionally include an <example></example> element within the lexeme element that contains a sample usage of the grapheme.
Here is an example of a completed lexicon document that specifies pronunciations for fictional names:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
    http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
  alphabet="x-microsoft-ups" xml:lang="en-US">
  <lexeme>
    <grapheme> Klhtr </grapheme>
    <phoneme> K L EH . S1 T AA R </phoneme>
  </lexeme>
  <lexeme>
    <grapheme> Eanor </grapheme>
    <phoneme> S1 I . AX . N O R </phoneme>
  </lexeme>
  <lexeme>
    <grapheme> Puntahrik </grapheme>
    <phoneme> P UH N . S1 T AA . R IH K </phoneme>
  </lexeme>
</lexicon>
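If you generate pronunciations separately from your grammars, you may prefer to produce the lexicon programmatically. The following Python sketch builds a PLS document like the one above with the standard library's `xml.etree.ElementTree`. The function name `build_lexicon` and the choice to mark the first of several pronunciations as preferred are assumptions made for this example. The sketch omits the optional xsi:schemaLocation attribute.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries, lang="en-US", alphabet="x-microsoft-ups"):
    """Build a PLS lexicon string.

    `entries` maps each word (grapheme) to a list of pronunciations.
    When a word has several pronunciations, this sketch marks the first
    one with prefer="true" (an assumption of the example).
    """
    # Serialize PLS elements in the default namespace, without a prefix.
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, pronunciations in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        for index, pron in enumerate(pronunciations):
            phoneme = ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme")
            phoneme.text = pron
            if len(pronunciations) > 1 and index == 0:
                phoneme.set("prefer", "true")
    return ET.tostring(lexicon, encoding="unicode")


names = {
    "Klhtr": ["K L EH . S1 T AA R"],
    "Eanor": ["S1 I . AX . N O R"],
    "Puntahrik": ["P UH N . S1 T AA . R IH K"],
}
print(build_lexicon(names))
```

To save the result, write the returned string to a file with a .pls extension, preceded by the XML declaration.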
Link the Lexicon to the Grammar
Now that you have created your lexicon, you must link to it from your grammar, using the following steps.
Link a Lexicon to a Grammar
After the opening tag of the grammar element, and before the first rule element, enter a lexicon element: <lexicon />
Add a uri attribute to the lexicon element and enter the path and name of the lexicon file. For example: <lexicon uri="c:\MyLexicon.pls" /> or <lexicon uri="https://contoso.com/lexiconstore/MyLexicon.pls" />
The following is an example of a grammar that references the lexicon of fictional names shown above:
<?xml version="1.0" encoding="UTF-8"?>
<grammar
  version="1.0" mode="voice" root="warriors"
  xml:lang="en-US" tag-format="semantics/1.0"
  sapi:alphabet="x-microsoft-ups"
  xml:base="https://www.contoso.com/"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">
  <lexicon uri="c:\test\Warriors.pls" />
  <rule id="warriors" scope="public">
    <item> The warrior's name is </item>
    <one-of>
      <item> Klhtr </item>
      <item> Eanor </item>
      <item> Puntahrik </item>
    </one-of>
  </rule>
</grammar>
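If you maintain many grammars, you can add the lexicon link with a script instead of editing each file by hand. The following Python sketch inserts a lexicon element before the first rule element, as the steps above require. The function name `link_lexicon` and the sample grammar are assumptions for illustration, and the sketch does not check whether the grammar already links a lexicon.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# A small grammar used only for this illustration.
GRAMMAR = """
<grammar version="1.0" mode="voice" root="warriors" xml:lang="en-US"
  xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="warriors" scope="public">
    <item> The warrior's name is </item>
    <one-of>
      <item> Klhtr </item>
      <item> Eanor </item>
    </one-of>
  </rule>
</grammar>
"""


def link_lexicon(grammar_xml, lexicon_uri):
    """Return the grammar with a <lexicon uri="..."/> before the first rule."""
    # Keep SRGS elements in the default namespace when serializing.
    ET.register_namespace("", SRGS_NS)
    root = ET.fromstring(grammar_xml)
    lexicon = ET.Element(f"{{{SRGS_NS}}}lexicon", {"uri": lexicon_uri})
    children = list(root)
    # Position of the first rule; append at the end if the grammar has none.
    position = next(
        (i for i, child in enumerate(children)
         if child.tag == f"{{{SRGS_NS}}}rule"),
        len(children),
    )
    root.insert(position, lexicon)
    return ET.tostring(root, encoding="unicode")


print(link_lexicon(GRAMMAR, r"c:\test\Warriors.pls"))
```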
UPS Phone Tables
The following tables contain the most commonly used phones for US English from Microsoft's Universal Phone Set (UPS).
Note
UPS phones are case-sensitive.
Consonants
The following table lists the most commonly used consonant phones for US English from Microsoft's Universal Phone Set (UPS).
| UPS Phone | Example |
|---|---|
| B | big |
| CH | chin |
| D | dig |
| DH | then |
| DX | butter |
| F | fork |
| G | gut |
| H | help |
| JH | joy |
| K | cut |
| L | lid |
| M | mat |
| N | no |
| NG | sing |
| P | put |
| R | red |
| S | sit |
| SH | she |
| T | talk |
| TH | thin |
| V | vat |
| W | with |
| J | yard |
| Z | zap |
| ZH | pleasure |
Vowels
The following table lists the most commonly used vowel phones for US English from Microsoft's Universal Phone Set (UPS).
| UPS Phone | Example |
|---|---|
| AA | father |
| AE | cat |
| AH | cut |
| AO | dog |
| AOX | four |
| AU | foul |
| AX | ago |
| AX rho | minor |
| AI | bite |
| EH | pet |
| EHX | stairs |
| ER | fur |
| ER rho | urban |
| EI | ate |
| IH | fill |
| I | feel |
| O | go |
| OI | toy |
| OWX | boa |
| Q | hot |
| UH | book |
| U | too, blue |
| UWX | lure |
Suprasegmentals
Suprasegmentals are optional markers that indicate the division of a word into syllables, identify which syllables receive emphasis (stress), and specify the length of syllables. The following suprasegmental markers of the UPS are commonly used to create pronunciations for words in US English:
| UPS Phone | Description | Example |
|---|---|---|
| S1 | Indicates that the following syllable receives primary emphasis. | F AX . S1 N EH S (finesse) |
| S2 | Indicates that the following syllable receives secondary emphasis. | K AX N . S1 S ER rho . V AX . S2 T R I (conservatory) |
| . | Indicates a break between syllables. | S1 K O . K O (cocoa) |
| lng | Extends the length of the preceding syllable. | B I . S1 W EH lng R (beware) |
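To see how these markers structure a pronunciation, the following Python sketch splits a space-delimited UPS string into syllables at each "." marker and records which syllables carry S1 or S2. The function name `parse_pronunciation` and the dictionary layout are assumptions for illustration. The sketch leaves the lng marker in the phone list, next to the phone it follows.

```python
def parse_pronunciation(pron):
    """Split a UPS pronunciation into syllables with their stress level.

    Returns a list of dicts such as {"stress": "primary", "phones": [...]}.
    Because the "." marker is optional, a string without it is returned
    as a single syllable.
    """
    syllables = []
    current = {"stress": None, "phones": []}
    for symbol in pron.split():
        if symbol == ".":
            # Syllable break: close the current syllable and start a new one.
            syllables.append(current)
            current = {"stress": None, "phones": []}
        elif symbol == "S1":
            current["stress"] = "primary"
        elif symbol == "S2":
            current["stress"] = "secondary"
        else:
            current["phones"].append(symbol)
    syllables.append(current)
    return syllables


for syllable in parse_pronunciation("K AX N . S1 S ER rho . V AX . S2 T R I"):
    print(syllable)
```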
Sample Pronunciations
The following table contains a sampling of words in US English and their phonetic spellings (pronunciations) using UPS phones. Use this table to understand which phones represent sounds in US English with which you are already familiar, and as a guide to combining phones to create syllables and words.
Note
- The words in this table are not words that require custom pronunciations.
- The phones used in pronunciations are case-sensitive and must be space-delimited.
- The use of the suprasegmental markers (S1, S2, .) is optional.
- UPS provides a compounding or tying symbol "+" that can be used to describe composite sounds from any two phones. For more information, see Compound Tying Symbol (Microsoft.Speech).
| Word | Pronunciation | Word | Pronunciation |
|---|---|---|---|
| action | S1 AE K . SH IH N | junkyard | S1 JH AH NG . K J AA R D |
| adverse | S1 AE D . V ER rho S | king | S1 K IH NG |
| analog | S1 AE . N AX . L AA G | lady | S1 L E + I . D I |
| around | AX . S1 R A + UH N D | leave | S1 L I V |
| avail | AX . S1 V E + I L | left | S1 L EH F T |
| beauty | S1 B J U . DX I | lion | S1 L A + I . AX N |
| believe | B AX . S1 L I V | little | S1 L IH . DX AX L |
| beware | B I . S1 W EH lng R | luscious | S1 L AH . SH AX S |
| bittersweet | S1 B IH . DX AX rho . S W I T | magic | S1 M AE . JH IH K |
| blood | S1 B L AH D | Mary | S1 M EH lng . R I |
| burrito | B AX . S1 R I . T O | minor | S1 M A + I . N AX rho |
| cell-phone | S1 S EH L _& S1 F O N | name | S1 N E + I M |
| coconuts | S1 K O . K AX . N AH T S | outdoors | S1 A + UH T . D AO R Z |
| collect | K AX . S1 L EH K T | Pisces | S1 P A + I . S I Z |
| comets | S1 K AA . M IH T S | quick | S1 K W IH K |
| conformity | K AX N . S1 F AO R . M IH . DX I | raindrop | S1 R E + I N . D R AA P |
| conservatory | K AX N . S1 S ER rho . V AX . S2 T R I | refreshments | R AX . S1 F R EH SH . M AX N T S |
| contemporary | S2 K AX N . S1 T EH M . P AX . R AX . R I | refused | R AX . S1 F J U Z D |
| creations | S1 K R I . E + I . SH AX N Z | revolution | R EH . V AX . S1 L U . SH AX N |
| cross | S1 K R AA S | rolling | S1 R O . L IH NG |
| deed | S1 D I D | round | S1 R A + UH N D |
| dog | S1 D AO G | saffron | S1 S AE . F R AA N |
| duckling | S1 D AH . K L IH NG | short | S1 SH AO R T |
| enthusiast | EH N . S1 TH U . Z I . AE S T | snow | S1 S N O |
| excerpt | S1 EH K . S AX rho P T | song | S1 S AO NG |
| facile | S1 F AE . S A + I L | sponge | S1 S P AH N JH |
| falling | S1 F AO L . L IH NG | strawberries | S1 S T R AO . B EH . R I Z |
| far | S1 F AA R | strollers | S1 S T R AO . L AX rho Z |
| fastball | S1 F AE S T . B AO L | subway | S1 S AH B . W E + I |
| friends | S1 F R EH N Z | thrill | S1 TH R IH L |
| furnaces | S1 F ER rho . N IH . S IH Z | toasters | S1 T O . S T AX rho Z |
| garden | S1 G AA R . D IH N | toothpaste | S1 T U TH . P E + I S T |
| great | S1 G R E + I T | towering | S1 T A + UH . AX . R IH NG |
| head | S1 H EH D | toy | S1 T AO + I |
| hothouse | S1 H AO T . H A + UH S | tractors | S1 T R AE K . T AX rho Z |
| howlers | S1 H A + UH . L AX rho Z | tragically | S1 T R AE . JH IH . K L I |
| hygienist | H A + I . S1 JH EH . N IH S T | unrest | S1 AH N . R EH S T |
| impossible | IH M . S1 P AA . S IH . B AX L | urban | S1 ER rho . B AX N |
| informant | IH N . S1 F AO R . M AX N T | vagabonds | S1 V AE . G AX . B AA N D Z |
| instruction | IH N . S1 S T R AH K . SH AX N | vanished | S1 V AE . N IH SH T |
| intruders | IH N . S1 T R U . D AX rho Z | velvet | S1 V EH L . V IH T |
| islanders | S1 A + I . L AX N . D AX rho Z | why | S1 H W A + I |
| journal | S1 JH ER rho . N AX L | why | S1 W A + I |
| jungle | S1 JH AH NG . G AX L | zippers | S1 Z IH . P AX rho Z |
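Because phones are case-sensitive and must be space-delimited, a quick check before you test a pronunciation can catch simple typing mistakes. The following Python sketch flags any symbol that does not appear in this topic. It is only an illustration: the function name `unknown_symbols` is hypothetical, the symbol sets cover just the commonly used US English phones listed above rather than the full UPS, and the symbols A, E, and _& are included only because they appear in the sample pronunciations.

```python
# Consonant and vowel phones from the tables in this topic.
CONSONANTS = {
    "B", "CH", "D", "DH", "DX", "F", "G", "H", "JH", "K", "L", "M", "N",
    "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "J", "Z", "ZH",
}
VOWELS = {
    "AA", "AE", "AH", "AO", "AOX", "AU", "AX", "AI", "EH", "EHX", "ER",
    "EI", "IH", "I", "O", "OI", "OWX", "Q", "UH", "U", "UWX",
}
# Suprasegmentals, the "rho" token used in "AX rho" and "ER rho",
# and the "+" tying symbol.
MARKERS = {"S1", "S2", ".", "lng", "rho", "+"}
# Symbols seen only in the sample pronunciations (for example "E + I",
# "A + UH", and the "_&" in "cell-phone"); included as an assumption.
SAMPLE_ONLY = {"A", "E", "_&"}

KNOWN = CONSONANTS | VOWELS | MARKERS | SAMPLE_ONLY


def unknown_symbols(pron):
    """Return the space-delimited symbols that are not in the sets above."""
    return [symbol for symbol in pron.split() if symbol not in KNOWN]


print(unknown_symbols("S1 L E + I . D I"))      # a sample pronunciation
print(unknown_symbols("s1 l e + i . d i"))      # wrong case
print(unknown_symbols("S1 LE + I . D I"))       # missing space
```

An empty list means only that every symbol is one of those listed here. It does not mean the pronunciation sounds right, so you still need to test recognition with your intended audience.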