Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

Parsing Guidelines for SAPI Speech Recognition Phone Converters (Microsoft.Speech)

SAPI does not enforce any parsing conventions on phone strings, nor does it place any restrictions on the phonetic validity of compounds or the correct usage of diacritics.

Application developers need to be aware of guidelines for writing pronunciations and using UPS phones. In addition, engine developers require some guidelines for parsing UPS phone strings.

The Compound Symbol

The compound symbol "+" is a binary symbol: it requires both a left and a right context. Any two SAPI symbols can be joined by the compound symbol, including sequences of segmental symbols, Tones (Microsoft.Speech), and Diacritics (Microsoft.Speech).

Any number of segmental symbols can be joined by the "+" symbol. Usually, this will be limited to two in the case of affricates and diphthongs. Some sounds, such as triphthongs or geminates, may require three or more.

The following are all valid phone strings, shown with the parsed meaning:

| Phone String | Parsed Phone String |
| --- | --- |
| A B C | A B C |
| A + B | AB |
| A B + C | A BC |
| A + B + C | ABC |
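The grouping rule shown in the table can be sketched in a few lines of Python. This is only an illustration, not part of SAPI: the function names `parse_phone_string` and `render` are hypothetical, symbols are assumed to be separated by whitespace, and the choice to reject a dangling "+" is a decision of this sketch (SAPI itself enforces no parsing conventions).

```python
def parse_phone_string(phone_string):
    """Group whitespace-separated symbols into units joined by '+'.

    Returns a list of groups; each group is a list of symbols that
    belong to one compound (or a single symbol on its own).
    """
    groups = []
    join_next = False  # True when the previous token was '+'
    for token in phone_string.split():
        if token == "+":
            if not groups or join_next:
                raise ValueError("'+' needs a symbol on its left")
            join_next = True
        elif join_next:
            groups[-1].append(token)  # bind to the current compound
            join_next = False
        else:
            groups.append([token])    # start a new unit
    if join_next:
        raise ValueError("'+' needs a symbol on its right")
    return groups


def render(groups):
    """Render groups the way the 'Parsed Phone String' column does."""
    return " ".join("".join(group) for group in groups)


print(render(parse_phone_string("A B + C")))    # A BC
print(render(parse_phone_string("A + B + C")))  # ABC
```

Keeping each compound as a list of symbols, rather than a joined string, leaves the engine free to decide later whether to model the compound as one unit or as its parts.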

Parsing Affricates, Diphthongs, and Nasal Vowels

When UPS contains a specific phone for a compound sound such as a nasal vowel or affricate, the SAPI phone ID of the phone provides the information required to split it up into its constituent phones, including the "+" marker.

For example, UPS contains phones for the affricate "PF" and the nasal vowel "AN". If the speech recognition (SR) engine does not model such sounds and would prefer to split them up, it can do so by parsing the SAPI phone ID. Each SAPI phone ID is a sequence of fixed-length units of four hexadecimal digits, as shown below. The phone ID for "PF" breaks down into three individual phones: "P", "+", and "F".

| UPS Phone | SAPI Phone ID |
| --- | --- |
| PF | 007003610066 |
| P + F | 0070 0361 0066 |
| AN | 00610303 |
| A nas | 0061 0303 |
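Splitting a phone ID into its units is a fixed-width slicing operation. The following Python sketch illustrates it under stated assumptions: the names `split_phone_id` and `constituent_phones` are hypothetical, and the value `0361` is taken to be the compound marker only because it is the unit that sits between "P" and "F" in the table above.

```python
import string

UNIT_LENGTH = 4            # each unit is four hexadecimal digits
COMPOUND_MARKER = "0361"   # unit between P and F in the table above


def split_phone_id(phone_id):
    """Split a SAPI phone ID into four-digit units (spaces are ignored)."""
    digits = phone_id.replace(" ", "")
    if len(digits) % UNIT_LENGTH != 0:
        raise ValueError("phone ID length must be a multiple of 4")
    if not all(ch in string.hexdigits for ch in digits):
        raise ValueError("phone ID must contain only hexadecimal digits")
    return [digits[i:i + UNIT_LENGTH] for i in range(0, len(digits), UNIT_LENGTH)]


def constituent_phones(phone_id):
    """Return the units of a phone ID, with '+' in place of the compound marker."""
    return ["+" if unit == COMPOUND_MARKER else unit
            for unit in split_phone_id(phone_id)]


print(split_phone_id("007003610066"))      # ['0070', '0361', '0066']
print(constituent_phones("007003610066"))  # ['0070', '+', '0066']
print(split_phone_id("00610303"))          # ['0061', '0303']
```

An engine that does model the affricate as a single unit can skip this step and look up the full ID directly.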

Diacritics

A diacritic cannot stand alone; it must modify the preceding segmental phone. Any number of diacritics can follow a segmental symbol. Diacritics may use the three-letter symbol or the symbolic form, if there is one. The compound marker is not required to bind diacritics: if a "+" symbol immediately precedes a diacritic, the SR engine should ignore it.

| Phone | Parsed Phone |
| --- | --- |
| A lng | Alng |
| A vls lng | Avlslng |
| A vls + lng | Avlslng |
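The diacritic rules can be sketched as a single pass over the tokens. This Python example is an illustration only: `attach_diacritics` is a hypothetical name, and the `DIACRITICS` set contains just the three-letter diacritics that appear in the examples on this page, not the full UPS inventory.

```python
# Illustrative subset only: the diacritics used in this page's examples.
DIACRITICS = {"lng", "vls", "nas"}


def attach_diacritics(phone_string):
    """Bind each diacritic to the preceding segmental symbol.

    A '+' that directly precedes a diacritic is ignored; any other '+'
    is passed through unchanged for later compound handling.
    """
    tokens = phone_string.split()
    result = []
    for i, token in enumerate(tokens):
        if token in DIACRITICS:
            if not result or result[-1] == "+":
                raise ValueError("a diacritic cannot stand alone")
            result[-1] += token
        elif token == "+" and i + 1 < len(tokens) and tokens[i + 1] in DIACRITICS:
            continue  # redundant '+' before a diacritic
        else:
            result.append(token)
    return result


print(attach_diacritics("A vls lng"))    # ['Avlslng']
print(attach_diacritics("A vls + lng"))  # ['Avlslng']
print(attach_diacritics("T lng + S"))    # ['Tlng', '+', 'S']
```

Raising an error for a stray diacritic is a choice made in this sketch; an engine could equally log the problem and skip the symbol.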

Tones

Lexical tones can follow a vowel or syllabic consonant. They can be made up of simple levels or they can be contours composed of sequences of tones. Lexical tones are described using a universal scale which has a maximum of five level contrasts. A sequence of three tones should be sufficient to describe any tone contour. The following are valid strings using tones.

| Phone String | Parsed Phone String |
| --- | --- |
| M + A T3 T5 | MA35 |
| M + A + T3 + T5 | MA35 |
| M A + T3 + T5 | MA35 |

The first example defines a syllabic unit "MA" through the use of the compound "+" marker, with a high rising 35 tone contour.

The third example attaches the tone to the vowel only. It is up to the speech recognition engine to map "M A" to the syllable "MA", depending on its internal acoustic modeling. The compound symbol "+" between the tones is redundant here.
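One way an engine might separate the tone contour from the segmental symbols is sketched below in Python. The function name `extract_tone` is hypothetical, and the sketch assumes that tone symbols are written "T1" through "T5" (following the "T3" and "T5" examples and the five-level scale described above) and that the input string describes a single syllable.

```python
import re

# Assumed token shape: 'T' followed by one level digit from 1 to 5.
TONE_TOKEN = re.compile(r"T([1-5])")


def extract_tone(phone_string):
    """Split a single-syllable phone string into (segments, contour).

    'segments' keeps the non-tone tokens, including any '+' between
    segmental symbols. A '+' that directly precedes a tone is dropped
    because it carries no extra information.
    """
    tokens = phone_string.split()
    segments = []
    levels = []
    for i, token in enumerate(tokens):
        match = TONE_TOKEN.fullmatch(token)
        if match:
            if not segments:
                raise ValueError("a tone must follow a segmental symbol")
            levels.append(match.group(1))
        elif (token == "+" and i + 1 < len(tokens)
              and TONE_TOKEN.fullmatch(tokens[i + 1])):
            continue  # redundant '+' before a tone
        else:
            segments.append(token)
    return segments, "".join(levels)


print(extract_tone("M + A T3 T5"))      # (['M', '+', 'A'], '35')
print(extract_tone("M + A + T3 + T5"))  # (['M', '+', 'A'], '35')
print(extract_tone("M A + T3 + T5"))    # (['M', 'A'], '35')
```

The returned segments preserve the difference between "M + A" and "M A", so the engine can still decide how to map them onto its own syllable units.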

Geminate Consonants

Italian geminates are sometimes described as long consonants, and they could be described using the length diacritic or the "+" symbol. The following table shows some alternative phone strings that have equivalent meaning. The examples are in Italian.

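| Compounds (alternative 1) | Compounds (alternative 2) | Compounds (alternative 3) | IPA | Example |
| --- | --- | --- | --- | --- |
| TS lng | T lng + S | TS + TS | t:.s | bozza |
| DZ lng | D lng + Z | DZ + DZ | d:.z | mezzo |
| CH lng | T lng + SH | CH + CH | t:.ʃ | braccio |
| JH lng | D lng + ZH | JH + JH | d:.ʒ | oggi |
| M lng | M + M | M + M | m.m | mamma |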

The preferred method is to use the length diacritic. To be safe, the engine developer should anticipate that any of these cases could be encountered in Italian SAPI lexicons or grammars.
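An engine that wants to accept all of these spellings could rewrite them to the preferred length-diacritic form before lookup. The Python sketch below shows one possible approach; the name `normalize_geminates` is hypothetical, the `AFFRICATES` mapping is derived only from the rows of the table above, and treating every "X + X" doubling as a geminate is a simplification that may not suit other languages or doubled vowels.

```python
# Affricate decompositions taken from the table above (not a full inventory).
AFFRICATES = {
    ("T", "S"): "TS",
    ("D", "Z"): "DZ",
    ("T", "SH"): "CH",
    ("D", "ZH"): "JH",
}


def normalize_geminates(phone_string):
    """Rewrite alternative geminate spellings to the 'X lng' form."""
    tokens = phone_string.split()
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 4]
        # Pattern 'X lng + Y' where X + Y is a known affricate.
        if (len(window) == 4 and window[1] == "lng" and window[2] == "+"
                and (window[0], window[3]) in AFFRICATES):
            out += [AFFRICATES[(window[0], window[3])], "lng"]
            i += 4
        # Pattern 'X + X': the same symbol doubled.
        elif len(window) >= 3 and window[1] == "+" and window[0] == window[2]:
            out += [window[0], "lng"]
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(normalize_geminates("T lng + S"))  # TS lng
print(normalize_geminates("TS + TS"))    # TS lng
print(normalize_geminates("M + M"))      # M lng
```

Normalizing once at load time means the rest of the engine only has to handle a single representation of each geminate.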

Diphthongs

Diphthongs are represented by placing a "+" marker between two vowels, for example: "EY + EU".

German contains several examples of centering diphthongs for which there is no single phone label in UPS. See the diphthongs table in Diphthong Vowel Phones to see how diphthongs would be represented using UPS compounds.

Note that, because of the length diacritic, there are two ways to represent the vowel in "Bär". The first attaches the length correctly to the EH vowel:

| Phone String | SAPI Phone ID |
| --- | --- |
| EH lng + AEX | 025B 02D0 0361 0250 |

However, some authors might attach the length to the schwa (the second vowel), as in the following example:

| Phone String | SAPI Phone ID |
| --- | --- |
| EH + AEX lng | 025B 0361 0250 02D0 |

The SR engine should expect to handle this kind of ambiguity when converting SAPI phones to SR engine phones.
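As an illustration of handling this ambiguity, the Python sketch below maps both spellings of the vowel in "Bär" to the same unit sequence by moving the length mark so that it follows the first vowel. Everything here is an assumption made for the example: the name `normalize_length` is hypothetical, the unit values `02D0` (length) and `0361` (compound marker) are read off the two tables above, and the policy of always attaching length to the first vowel is only one possible choice. The sketch handles a single compound with at most one length mark.

```python
LENGTH = "02D0"    # length unit, as it appears in the tables above
COMPOUND = "0361"  # compound marker unit, as it appears in the tables above


def normalize_length(phone_id):
    """Return the units of a phone ID with length placed after the first unit.

    Intended for one compound containing at most one length mark;
    IDs without both a compound marker and a length mark are returned as-is.
    """
    digits = phone_id.replace(" ", "")
    units = [digits[i:i + 4] for i in range(0, len(digits), 4)]
    if COMPOUND not in units or LENGTH not in units:
        return units
    rest = [unit for unit in units if unit != LENGTH]
    return rest[:1] + [LENGTH] + rest[1:]


first = normalize_length("025B 02D0 0361 0250")
second = normalize_length("025B 0361 0250 02D0")
print(first)            # ['025B', '02D0', '0361', '0250']
print(first == second)  # True
```

An engine with a different internal model could instead drop the length mark or keep both variants as separate pronunciations.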