Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

Using Custom Pronunciations

You can sometimes improve speech recognition in your application by specifying custom pronunciations for words in its vocabulary that are uncommon or pronounced in unusual ways. Candidates for custom pronunciations include proper names, place names, fictional words, slang, and words that are specific to an educational or medical discipline. Using the information and examples in this topic, you can create custom pronunciations for the specialized vocabulary in your application.

When to Create Custom Pronunciations

You create custom pronunciations to improve the accuracy of speech recognition for vocabulary in your application that the speech recognition engine does not interpret as well as expected. Typically, you will only need to create custom pronunciations for words that are not common to a language and that do not follow the typical pronunciation rules for that language's orthography. The Microsoft speech recognition engine can usually determine the correct pronunciation both of words it is familiar with and of words it has never encountered.

Speech recognition engines include a default lexicon that specifies which words can be recognized, and how each word must be pronounced to be recognized. The scope of a speech recognition engine's lexicon is typically a single language, and the lexicon contains a large number of words that the speech recognition engine can recognize in that language. This provides the speech recognition engine with a substantial native vocabulary.

In addition to its lexicon, a speech recognition engine also has an acoustic model that describes the sounds of a language, and a language model that describes how the sounds of a language can be combined into meaningful phrases. This gives the speech recognition engine an understanding of its language that transcends the words in its lexicon. The speech recognition engine uses this understanding to create pronunciations for words it encounters that are not in its lexicon, and it does this almost instantaneously.

However, if your application includes words that feature unusual spelling or atypical pronunciation of familiar spellings, then the speech recognition engine may not create the pronunciation that works best for your application. In these cases, you can specify a custom pronunciation that may improve the recognition accuracy for the specialized vocabulary in your application. It is important to test the performance of your custom pronunciations to confirm that they provide an improved speech recognition experience for your intended audience.

Methods of Incorporation

You can create custom pronunciations either inline in a grammar, or in a lexicon file that a grammar references. A lexicon is a list of words together with their pronunciations. When deciding whether to implement custom pronunciations inline in a grammar or in a linked lexicon, consider the following:

  • Custom pronunciations specified inline in grammars apply only to the single occurrence of a word in the grammar.

  • Custom pronunciations specified in lexicons apply to all occurrences of a word in a grammar.

  • A lexicon linked from a grammar is only active while the grammar is active for recognition.

When deciding which pronunciation to use for a word or phrase during speech recognition, a speech recognition engine looks for pronunciations at the following locations in order:

  1. Inline in grammar documents

  2. In lexicon files linked from a grammar document

  3. In the speech recognition engine's internal lexicon

If there are custom pronunciations specified for the same word both inline in a grammar and in a linked lexicon, the speech recognition engine uses only the inline pronunciations. Similarly, if there are custom pronunciations specified in a lexicon linked from a grammar, the speech recognition engine uses those pronunciations instead of, not in addition to, the pronunciations given in the engine's internal lexicon.

If the speech recognition engine does not find a pronunciation for a word, either in grammars or lexicons to which it currently has access, it will create a pronunciation using the rules of its language model and acoustic model.
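The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the precedence rule, not the engine's implementation; the function name, its parameters, and the placeholder pronunciation strings are hypothetical.

```python
from typing import Dict, List, Optional


def resolve_pronunciations(
    word: str,
    inline: Optional[List[str]],
    linked_lexicon: Dict[str, List[str]],
    internal_lexicon: Dict[str, List[str]],
) -> Optional[List[str]]:
    """Return the pronunciations from the first source that defines the word.

    Sources are consulted in the documented order. A hit in an earlier
    source replaces, rather than extends, anything in a later source.
    None means no source knows the word, which is the case in which the
    engine would generate a pronunciation itself.
    """
    if inline:
        return inline
    if word in linked_lexicon:
        return linked_lexicon[word]
    if word in internal_lexicon:
        return internal_lexicon[word]
    return None


# Placeholder strings stand in for real phone sequences.
linked = {"habanera": ["from linked lexicon"]}
internal = {"habanera": ["from internal lexicon"]}

print(resolve_pronunciations("habanera", ["from inline"], linked, internal))
print(resolve_pronunciations("habanera", None, linked, internal))
```

Returning a single source's list, rather than merging lists, mirrors the "instead of, not in addition to" behavior described above.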

You can determine the pronunciation that the speech recognition engine associates with a phrase using the Check Phrase tool. See Check Phrase Reference Manual. You pass in the phrase and a grammar containing the phrase, and the tool generates a result that includes the pronunciation associated with the phrase. This can help you to decide whether or not to provide a custom pronunciation for a phrase.

You specify custom pronunciations using symbols from a phonetic alphabet. The symbols in a phonetic alphabet, which are built from letters, numbers, and other characters, are known as "phones". Phones describe the sounds of speech for a particular language. Similar to the phonetic spellings used in dictionaries, the phonetic spellings that you create from phones describe how words should be pronounced for successful speech recognition in a specified language.

Creating Inline Custom Pronunciations

You can create custom pronunciations inline in XML-format grammars that are based on the Speech Recognition Grammar Specification (SRGS) Version 1.0. To add custom pronunciations inline in a grammar document, you use special attributes that were created by Microsoft for the grammar Element (Microsoft.Speech) and the token Element (Microsoft.Speech). Remember that inline custom pronunciations apply only to a single occurrence of a word. Use the following steps to create custom pronunciations inline in a grammar document:

Create Inline Custom Pronunciations

  1. Add the following declarations to the grammar element:

    • sapi:alphabet="x-microsoft-ups". This informs the grammar that you will use Microsoft's Universal Phone Set (UPS) to specify pronunciations. You will typically use UPS to specify pronunciations in US English. This attribute is case-sensitive.

    • xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions". This declares the namespace that defines Microsoft's custom attributes for the grammar and token elements.

  2. Create a token element that encloses the word for which you want to specify a custom pronunciation. Add an empty sapi:pron attribute.

    • <token sapi:pron=""> habanera </token>

  3. Using the phone tables below, look up the phones that correspond to the sound of each syllable in your word. Use the Sample Pronunciations below to understand how to combine phones to describe the sound of a syllable.

  4. In the sapi:pron attribute, enter the phones that describe the word's pronunciation. Phones are case-sensitive and must be space-delimited. Optionally use markers for syllable emphasis (S1, S2) and for separating syllables (.) to further refine the pronunciation. See Suprasegmentals later in this topic.

Here is an example of a grammar that specifies custom pronunciations inline.

<?xml version="1.0" encoding="UTF-8"?>

<grammar 
  version="1.0" mode="voice" root="sauce"
  xml:lang="en-US" tag-format="semantics/1.0" 
  sapi:alphabet="x-microsoft-ups" 
  xml:base="https://www.contoso.com/"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">
  
  <rule id="sauce" scope="public">
    <item> Please bring more </item>
    <one-of>
      <item><token sapi:display="habanero" sapi:pron="H AE . B AX . S1 N EH lng . R O"> habanero </token></item>
      <item><token sapi:display="habanera" sapi:pron="H AE . B AX . S1 N EH lng . R AX"> habanera </token></item>
    </one-of>
    <item> sauce. </item>
  </rule>

</grammar>
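When reviewing a grammar, it can help to list every token that carries an inline pronunciation. The following Python sketch does this with the standard library's xml.etree.ElementTree module. It is not part of the Microsoft Speech Platform; the function names and the shortened sample grammar are hypothetical. Attributes are matched by local name, so the sketch assumes that pron and display attributes appear only on token elements.

```python
import xml.etree.ElementTree as ET


def local_name(qualified: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on names."""
    return qualified.rsplit("}", 1)[-1]


def inline_pronunciations(grammar_xml: str):
    """Yield (word, display form, pronunciation) for each token with a pron attribute."""
    root = ET.fromstring(grammar_xml)
    for elem in root.iter():
        if local_name(elem.tag) != "token":
            continue
        attrs = {local_name(key): value for key, value in elem.attrib.items()}
        if "pron" in attrs:
            word = (elem.text or "").strip()
            # Without a display attribute, the lexical form is displayed.
            yield word, attrs.get("display", word), attrs["pron"]


SAMPLE = """
<grammar version="1.0" root="sauce" xml:lang="en-US"
  sapi:alphabet="x-microsoft-ups"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">
  <rule id="sauce" scope="public">
    <item><token sapi:display="habanera"
      sapi:pron="H AE . B AX . S1 N EH lng . R AX"> habanera </token></item>
  </rule>
</grammar>
"""

for word, display, pron in inline_pronunciations(SAMPLE):
    print(word, "|", display, "|", pron)
```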

Note

You can use the sapi:display attribute of the token element to specify the form of the word that will display in a user interface. The display form of a word is often the same as its lexical form, which is the content of the token element. However, in some languages, such as Japanese, the application may choose to have a display form that is different from the lexical form, as in the following example:

  <item>
    <token sapi:display="ストップ" sapi:pron="ストップ"> すとっぷ </token>
    <tag>cancel</tag>
  </item>

Creating Custom Pronunciations Using a PLS Lexicon

If your application uses specialized words repeatedly across multiple grammars, or if you are generating pronunciations separately from the grammar generation, you can create a lexicon that contains the words and their pronunciations. A lexicon is a separate document that you can link to one or more grammars. If the speech recognition engine loads a grammar that is linked to a lexicon, it uses the pronunciations that the lexicon contained when the grammar was loaded. Any pronunciations inline in a grammar still take precedence over pronunciations in a linked lexicon.

You author lexicons as XML documents that follow the format of the Pronunciation Lexicon Specification (PLS) Version 1.0. Use the following steps to author a PLS lexicon:

Create a Lexicon

  1. Start with a new, blank XML document.

  2. Enter the XML declaration: <?xml version="1.0" encoding="UTF-8"?>.

  3. Enter the opening tag of the lexicon element. This must declare the single language-culture of the words in the lexicon, and the phonetic alphabet used to construct the pronunciations. Here is an example of the opening tag of a lexicon element:

    <lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
      http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="x-microsoft-ups" xml:lang="en-US">
    
  4. Add a lexeme element for each word for which you want to define one or more pronunciations:

    • <lexeme></lexeme>. The parent element for defining words and their pronunciations.

  5. Within the lexeme element, add a grapheme element and a phoneme element.

    • <grapheme></grapheme>. Contains the written form of a word.

    • <phoneme></phoneme>. Contains phones that describe the pronunciation of a word.

  6. In the grapheme element, enter the word for which you want to specify a pronunciation.

  7. Using the phone tables below, look up the phones that correspond to the sound of each syllable in your word. Use the Sample Pronunciations below to understand how to combine phones to describe the sound of a syllable.

  8. In the phoneme element, enter the phones that describe the word's pronunciation. Phones are case-sensitive and must be space-delimited. Optionally include markers for syllable emphasis (S1, S2) and for separating syllables (.) to further refine the pronunciation. See Suprasegmentals later in this topic.

  9. If you want to specify multiple pronunciations for the same word, add more phoneme elements within the lexeme element. When specifying multiple pronunciations for a word, you can designate one pronunciation as preferred by adding the attribute/value pair prefer="true" to the phoneme element, for example: <phoneme prefer="true"> S1 L EH D </phoneme>.

  10. Continue adding lexeme elements for each word whose pronunciation you want to specify.

  11. Add the closing lexicon tag: </lexicon>, and save the document with a .pls extension.

You can optionally include an <example></example> element within the lexeme element that contains a sample usage of the grapheme.
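If you generate pronunciations separately from your grammars, you might prefer to write the lexicon programmatically. The steps above can be sketched in Python with the standard library's xml.etree.ElementTree module. The function name build_lexicon and its parameters are hypothetical, and the sketch assumes Python 3.8 or later (for the xml_declaration argument). It omits the optional xsi:schemaLocation attribute and marks the first pronunciation as preferred when a word has more than one.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries, alphabet="x-microsoft-ups", lang="en-US"):
    """Build a PLS document (as UTF-8 bytes) from {word: [pronunciation, ...]}."""
    # Serialize PLS elements without a prefix (xmlns="...").
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for word, pronunciations in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = word
        for index, pron in enumerate(pronunciations):
            phoneme = ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme")
            phoneme.text = pron
            if len(pronunciations) > 1 and index == 0:
                phoneme.set("prefer", "true")
    return ET.tostring(lexicon, encoding="UTF-8", xml_declaration=True)


document = build_lexicon({"Klhtr": ["K L EH . S1 T AA R"]})
print(document.decode("utf-8"))
```

To save the result, write the returned bytes to a file with a .pls extension, opened in binary mode.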

Here is an example of a completed lexicon document that specifies pronunciations for fictional names:

<?xml version="1.0" encoding="UTF-8"?>

<lexicon version="1.0" 
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
  http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
  alphabet="x-microsoft-ups" xml:lang="en-US">

  <lexeme>
    <grapheme> Klhtr </grapheme>
    <phoneme> K L EH . S1 T AA R </phoneme>
  </lexeme>

  <lexeme>
    <grapheme> Eanor </grapheme>
    <phoneme> S1 I . AX . N O R </phoneme>
  </lexeme>

  <lexeme>
    <grapheme> Puntahrik </grapheme>
    <phoneme> P UH N . S1 T AA . R IH K </phoneme>
  </lexeme>

</lexicon>
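Before you link a lexicon to a grammar, you may want to check what it actually contains. The following Python sketch reads a PLS document into a dictionary that maps each grapheme to its pronunciations, listing any pronunciation marked prefer="true" first. The function name load_lexicon and the shortened sample are hypothetical, and the sketch reflects only the PLS structure described in this topic, not how the speech recognition engine loads lexicons.

```python
import xml.etree.ElementTree as ET

NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}


def load_lexicon(pls_xml: str) -> dict:
    """Map each grapheme to its pronunciations, preferred ones first."""
    root = ET.fromstring(pls_xml)
    table = {}
    for lexeme in root.findall("pls:lexeme", NS):
        phonemes = lexeme.findall("pls:phoneme", NS)
        # Stable sort: phonemes marked prefer="true" move to the front.
        phonemes.sort(key=lambda p: p.get("prefer") != "true")
        # Collapse runs of whitespace so phones are separated by one space.
        prons = [" ".join((p.text or "").split()) for p in phonemes]
        for grapheme in lexeme.findall("pls:grapheme", NS):
            word = (grapheme.text or "").strip()
            table.setdefault(word, []).extend(prons)
    return table


SAMPLE = """
<lexicon version="1.0" alphabet="x-microsoft-ups" xml:lang="en-US"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme> Eanor </grapheme>
    <phoneme> S1 I . AX . N O R </phoneme>
  </lexeme>
</lexicon>
"""

print(load_lexicon(SAMPLE))
```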

Now that you have created your lexicon, you must link to it from your grammar, using the following steps.

  1. After the opening tag of the grammar element, and before the first rule element, enter a lexicon element: <lexicon />

  2. Add a uri attribute to the lexicon element and enter the path and name of the lexicon file. For example: <lexicon uri="c:\MyLexicon.pls" /> or <lexicon uri="https://contoso.com/lexiconstore/MyLexicon.pls" />

The following is an example of a grammar that references the lexicon of fictional names shown above:

<?xml version="1.0" encoding="UTF-8"?>

<grammar 
  version="1.0" mode="voice" root="warriors"
  xml:lang="en-US" tag-format="semantics/1.0" 
  sapi:alphabet="x-microsoft-ups" 
  xml:base="https://www.contoso.com/"
  xmlns="http://www.w3.org/2001/06/grammar"
  xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">

  <lexicon uri="c:\test\Warriors.pls" />
  
  <rule id="warriors" scope="public">
    <item> The warrior's name is </item>
    <one-of>
      <item> Klhtr </item> 
      <item> Eanor </item>  
      <item> Puntahrik </item>
    </one-of>
  </rule>

</grammar>
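The placement rule for the lexicon element (after the opening grammar tag and before the first rule) is easy to check mechanically. The following Python sketch returns the lexicon URIs that a grammar links to, and raises an error if a lexicon element follows a rule. The function name lexicon_uris and the shortened sample grammar are hypothetical; the check covers only the placement rule stated in this topic and is not a full SRGS validation.

```python
import xml.etree.ElementTree as ET


def lexicon_uris(grammar_xml: str) -> list:
    """Return the uri of each lexicon element that precedes the first rule."""
    root = ET.fromstring(grammar_xml)
    uris = []
    seen_rule = False
    for child in root:
        name = child.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' prefix
        if name == "rule":
            seen_rule = True
        elif name == "lexicon":
            if seen_rule:
                raise ValueError("lexicon element must precede the first rule")
            uris.append(child.get("uri"))
    return uris


# A raw string keeps the backslashes in the Windows path intact.
SAMPLE = r"""
<grammar version="1.0" root="warriors" xml:lang="en-US"
  xmlns="http://www.w3.org/2001/06/grammar">
  <lexicon uri="c:\test\Warriors.pls" />
  <rule id="warriors" scope="public">
    <item> The warrior's name is </item>
    <one-of>
      <item> Klhtr </item>
    </one-of>
  </rule>
</grammar>
"""

print(lexicon_uris(SAMPLE))
```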

UPS Phone Tables

The following tables contain the most commonly used phones for US English from Microsoft's Universal Phone Set (UPS).

Note

UPS phones are case-sensitive.

Consonants

The following table lists the most commonly used consonant phones for US English from Microsoft's Universal Phone Set (UPS).

UPS Phone   Example
B           big
CH          chin
D           dig
DH          then
DX          butter
F           fork
G           gut
H           help
JH          joy
K           cut
L           lid
M           mat
N           no
NG          sing
P           put
R           red
S           sit
SH          she
T           talk
TH          thin
V           vat
W           with
J           yard
Z           zap
ZH          pleasure

Vowels

The following table lists the most commonly used vowel phones for US English from Microsoft's Universal Phone Set (UPS).

UPS Phone   Example
AA          father
AE          cat
AH          cut
AO          dog
AOX         four
AU          foul
AX          ago
AX rho      minor
AI          bite
EH          pet
EHX         stairs
ER          fur
ER rho      urban
EI          ate
IH          fill
I           feel
O           go
OI          toy
OWX         boa
Q           hot
UH          book
U           too, blue
UWX         lure

Suprasegmentals

Suprasegmentals are optional markers that indicate the division of a word into syllables, identify which syllables receive emphasis (stress), and specify the length of syllables. The following suprasegmental markers of the UPS are commonly used to create pronunciations for words in US English:

UPS Phone   Description                                                         Example
S1          Indicates that the following syllable receives primary emphasis.    F AX . S1 N EH S (finesse)
S2          Indicates that the following syllable receives secondary emphasis.  K AX N . S1 S ER rho . V AX . S2 T R I (conservatory)
.           Indicates a break between syllables.                                S1 K O . K O (cocoa)
lng         Extends the length of the preceding syllable.                       B I . S1 W EH lng R (beware)
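To see how the syllable and stress markers structure a pronunciation, you can split a phone string at each syllable break and record the stress marker, if any, in each syllable. The following Python sketch does this. The function name split_syllables is hypothetical, and the sketch treats lng as an ordinary token within its syllable.

```python
def split_syllables(pron: str):
    """Return a list of (stress, phones) pairs, one pair per syllable.

    stress is 1 or 2 when the syllable contains S1 or S2, otherwise 0.
    """
    syllables, stress, phones = [], 0, []
    for token in pron.split():
        if token == ".":
            # A break closes the current syllable.
            syllables.append((stress, phones))
            stress, phones = 0, []
        elif token in ("S1", "S2"):
            stress = int(token[1])
        else:
            phones.append(token)
    if phones or stress:
        syllables.append((stress, phones))
    return syllables


for stress, phones in split_syllables("K AX N . S1 S ER rho . V AX . S2 T R I"):
    print(stress, " ".join(phones))
```

For "conservatory", this produces four syllables, with primary stress on the second and secondary stress on the fourth.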

Sample Pronunciations

The following table contains a sampling of words in US English and their phonetic spellings (pronunciations) using UPS phones. Use this table to understand which phones represent sounds in US English with which you are already familiar, and as a guide to combining phones to create syllables and words.

Note

  • The words in this table are not words that require custom pronunciations.

  • The phones used in pronunciations are case-sensitive and must be space-delimited.

  • The use of the suprasegmental markers (S1, S2, .) is optional.

  • UPS provides a compounding or tying symbol "+" that can be used to describe composite sounds from any two phones. For more information, see Compound Tying Symbol (Microsoft.Speech).

Word            Pronunciation
action          S1 AE K . SH IH N
adverse         S1 AE D . V ER rho S
analog          S1 AE . N AX . L AA G
around          AX . S1 R A + UH N D
avail           AX . S1 V E + I L
beauty          S1 B J U . DX I
believe         B AX . S1 L I V
beware          B I . S1 W EH lng R
bittersweet     S1 B IH . DX AX rho . S W I T
blood           S1 B L AH D
burrito         B AX . S1 R I . T O
cell-phone      S1 S EH L _& S1 F O N
coconuts        S1 K O . K AX . N AH T S
collect         K AX . S1 L EH K T
comets          S1 K AA . M IH T S
conformity      K AX N . S1 F AO R . M IH . DX I
conservatory    K AX N . S1 S ER rho . V AX . S2 T R I
contemporary    S2 K AX N . S1 T EH M . P AX . R AX . R I
creations       S1 K R I . E + I . SH AX N Z
cross           S1 K R AA S
deed            S1 D I D
dog             S1 D AO G
duckling        S1 D AH . K L IH NG
enthusiast      EH N . S1 TH U . Z I . AE S T
excerpt         S1 EH K . S AX rho P T
facile          S1 F AE . S A + I L
falling         S1 F AO L . L IH NG
far             S1 F AA R
fastball        S1 F AE S T . B AO L
friends         S1 F R EH N Z
furnaces        S1 F ER rho . N IH . S IH Z
garden          S1 G AA R . D IH N
great           S1 G R E + I T
head            S1 H EH D
hothouse        S1 H AO T . H A + UH S
howlers         S1 H A + UH . L AX rho Z
hygienist       H A + I . S1 JH EH . N IH S T
impossible      IH M . S1 P AA . S IH . B AX L
informant       IH N . S1 F AO R . M AX N T
instruction     IH N . S1 S T R AH K . SH AX N
intruders       IH N . S1 T R U . D AX rho Z
islanders       S1 A + I . L AX N . D AX rho Z
journal         S1 JH ER rho . N AX L
jungle          S1 JH AH NG . G AX L
junkyard        S1 JH AH NG . K J AA R D
king            S1 K IH NG
lady            S1 L E + I . D I
leave           S1 L I V
left            S1 L EH F T
lion            S1 L A + I . AX N
little          S1 L IH . DX AX L
luscious        S1 L AH . SH AX S
magic           S1 M AE . JH IH K
Mary            S1 M EH lng . R I
minor           S1 M A + I . N AX rho
name            S1 N E + I M
outdoors        S1 A + UH T . D AO R Z
Pisces          S1 P A + I . S I Z
quick           S1 K W IH K
raindrop        S1 R E + I N . D R AA P
refreshments    R AX . S1 F R EH SH . M AX N T S
refused         R AX . S1 F J U Z D
revolution      R EH . V AX . S1 L U . SH AX N
rolling         S1 R O . L IH NG
round           S1 R A + UH N D
saffron         S1 S AE . F R AA N
short           S1 SH AO R T
snow            S1 S N O
song            S1 S AO NG
sponge          S1 S P AH N JH
strawberries    S1 S T R AO . B EH . R I Z
strollers       S1 S T R AO . L AX rho Z
subway          S1 S AH B . W E + I
thrill          S1 TH R IH L
toasters        S1 T O . S T AX rho Z
toothpaste      S1 T U TH . P E + I S T
towering        S1 T A + UH . AX . R IH NG
toy             S1 T AO + I
tractors        S1 T R AE K . T AX rho Z
tragically      S1 T R AE . JH IH . K L I
unrest          S1 AH N . R EH S T
urban           S1 ER rho . B AX N
vagabonds       S1 V AE . G AX . B AA N D Z
vanished        S1 V AE . N IH SH T
velvet          S1 V EH L . V IH T
why             S1 H W A + I
why             S1 W A + I
zippers         S1 Z IH . P AX rho Z
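Because phones are case-sensitive and space-delimited, a simple check can catch typing mistakes before you test a pronunciation with the speech recognition engine. The following Python sketch compares each token against the phones and markers listed in this topic and suggests a correction when only the letter case is wrong. The names are hypothetical, and the symbol set is an assumption: it contains only the US English phones from the tables above, plus the symbols "+", "_&", "A", and "E" that appear in the sample pronunciations. It is not the complete Universal Phone Set, so an unlisted token is not necessarily invalid.

```python
CONSONANTS = "B CH D DH DX F G H JH K L M N NG P R S SH T TH V W J Z ZH".split()
VOWELS = (
    "AA AE AH AO AOX AU AX AI EH EHX ER EI IH I O OI OWX Q UH U UWX"
).split()
# Suprasegmentals, the rho modifier, and symbols seen in the sample table.
MARKERS = ["S1", "S2", ".", "lng", "rho", "+", "_&"]
COMPOUND_PARTS = ["A", "E"]  # appear only in compounds such as "A + I"

KNOWN = set(CONSONANTS) | set(VOWELS) | set(MARKERS) | set(COMPOUND_PARTS)


def check_pronunciation(pron: str):
    """Return (token, suggestion) for each token that is not in KNOWN.

    suggestion is the correctly cased symbol when the token differs from
    a known symbol only by letter case, otherwise None.
    """
    by_folded = {symbol.casefold(): symbol for symbol in KNOWN}
    problems = []
    for token in pron.split():
        if token not in KNOWN:
            problems.append((token, by_folded.get(token.casefold())))
    return problems


print(check_pronunciation("S1 M EH lng . R I"))   # no problems expected
print(check_pronunciation("s1 m eh LNG . r i"))   # every token except "." is flagged
```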

See Also

Concepts

Custom Pronunciations Support