Note
Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.
Parsing Guidelines for SAPI Speech Recognition Phone Converters (Microsoft.Speech)
SAPI does not enforce any parsing conventions on phone strings, nor does it place any restrictions on the phonetic validity of compounds or the correct usage of diacritics.
Application developers need to be aware of guidelines for writing pronunciations and using UPS phones. In addition, engine developers require some guidelines for parsing UPS phone strings.
The Compound Symbol
The compound phone "+" is a binary symbol: it requires both a left and a right context. Any two SAPI symbols can be joined by the compound symbol, including sequences of segmental symbols, Tones (Microsoft.Speech), and Diacritics (Microsoft.Speech).
Any number of segmental symbols can be joined by the "+" symbol. Usually, this will be limited to two in the case of affricates and diphthongs. Some sounds, such as triphthongs or geminates, may require three or more.
The following are all valid phone strings, shown with the parsed meaning:
Phone String | Parsed Phone String |
---|---|
A B C | A B C |
A + B | AB |
A B + C | A BC |
A + B + C | ABC |
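The grouping shown in the table above can be sketched in a few lines of Python. This is only an illustration, not part of SAPI: the function name `parse_phone_string` is hypothetical, and raising an error for a dangling "+" is an assumption of this sketch, since SAPI itself does not enforce any parsing convention.

```python
def parse_phone_string(phone_string):
    """Group whitespace-separated symbols that are joined by "+"."""
    groups = []        # each group is a list of symbols that form one unit
    join_next = False  # True when the previous token was "+"
    for token in phone_string.split():
        if token == "+":
            # Assumption of this sketch: reject "+" without a left context.
            if not groups or join_next:
                raise ValueError('"+" requires a left and a right context')
            join_next = True
        elif join_next:
            groups[-1].append(token)  # bind to the unit on the left
            join_next = False
        else:
            groups.append([token])    # start a new unit
    if join_next:
        raise ValueError('"+" requires a left and a right context')
    return groups


for s in ("A B C", "A + B", "A B + C", "A + B + C"):
    parsed = " ".join("".join(group) for group in parse_phone_string(s))
    print(s, "->", parsed)
# A B C -> A B C
# A + B -> AB
# A B + C -> A BC
# A + B + C -> ABC
```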
Parsing Affricates, Diphthongs, and Nasal Vowels
When UPS contains a specific phone for a compound sound such as a nasal vowel or affricate, the SAPI phone ID of the phone provides the information required to split it up into its constituent phones, including the "+" marker.
For example, UPS contains phones for the affricate "PF" and the nasal vowel "AN". If the speech recognition (SR) engine does not model such sounds and would prefer to split them up, it can do so by parsing the SAPI phone ID. Each constituent of a SAPI phone ID has a fixed length of four hexadecimal digits, as can be seen below. The phone ID for "PF" breaks down into three individual symbols: "P", "+", and "F".
UPS Phone | SAPI Phone ID |
---|---|
PF | 007003610066 |
P + F | 0070 0361 0066 |
AN | 00610303 |
A nas | 0061 0303 |
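As a rough sketch of the splitting described above, the following Python reads a phone ID as groups of four hexadecimal digits and separates the constituents at the compound marker. The function names are hypothetical. The value 0361 for the "+" marker is taken from the "P + F" row of the table, and the sketch assumes that every constituent is exactly four digits long.

```python
COMPOUND_MARKER = 0x0361  # the group between P and F in the table above


def split_phone_id(phone_id):
    """Split a phone ID into integer codes, four hex digits per code."""
    compact = "".join(phone_id.split())  # accept IDs with or without spaces
    if len(compact) % 4 != 0:
        raise ValueError("phone ID length must be a multiple of four digits")
    return [int(compact[i:i + 4], 16) for i in range(0, len(compact), 4)]


def split_compound(phone_id):
    """Return the lists of codes on either side of each compound marker."""
    parts, current = [], []
    for code in split_phone_id(phone_id):
        if code == COMPOUND_MARKER:
            parts.append(current)
            current = []
        else:
            current.append(code)
    parts.append(current)
    return parts


print(split_compound("007003610066"))  # [[112], [102]], that is 0x0070 and 0x0066
print(split_compound("00610303"))      # [[97, 771]], that is 0x0061 with 0x0303
```

An engine that does not model "PF" could then look up each returned part in its own phone table.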
Diacritics
A diacritic cannot stand alone; it must modify the preceding segmental phone. The compound marker is not required to bind diacritics, so the SR engine should ignore a "+" symbol that immediately precedes a diacritic. Any number of diacritics can follow a segmental symbol. Diacritics may use the three-letter symbol or the symbolic form, if there is one.
Phone | Parsed Phone |
---|---|
A lng | Alng |
A vls lng | Avlslng |
A vls + lng | Avlslng |
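The diacritic rules above could be applied along the following lines. This is a minimal sketch: the function name is hypothetical, and the `DIACRITICS` set contains only the three-letter diacritics that appear in this article ("lng", "vls", "nas"), so a real engine would need the full UPS list and the symbolic forms as well.

```python
# Assumption: only the diacritics named in this article are recognized.
DIACRITICS = {"lng", "vls", "nas"}


def attach_diacritics(phone_string):
    """Attach each diacritic to the preceding segment, ignoring a "+" before it."""
    tokens = phone_string.split()
    result = []  # segments (with diacritics appended) and remaining "+" markers
    for i, token in enumerate(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if token == "+":
            if following in DIACRITICS:
                continue  # "+" directly before a diacritic is ignored
            result.append(token)
        elif token in DIACRITICS:
            if not result or result[-1] == "+":
                raise ValueError("a diacritic must follow a segmental phone")
            result[-1] += token
        else:
            result.append(token)
    return result


print(attach_diacritics("A lng"))        # ['Alng']
print(attach_diacritics("A vls lng"))    # ['Avlslng']
print(attach_diacritics("A vls + lng"))  # ['Avlslng']
```

A "+" between two segmental phones is left in place so that a later pass can still interpret the compound.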
Tones
Lexical tones can follow a vowel or syllabic consonant. They can be made up of simple levels or they can be contours composed of sequences of tones. Lexical tones are described using a universal scale which has a maximum of five level contrasts. A sequence of three tones should be sufficient to describe any tone contour. The following are valid strings using tones.
Phone String | Parsed Phone String |
---|---|
M + A T3 T5 | MA35 |
M + A + T3 + T5 | MA35 |
M A + T3 + T5 | MA35 |
The first example defines a syllabic unit "MA" through the use of the compound "+" marker, with a high rising 35 tone contour.
The third example attaches the tone to the vowel only. It is up to the speech recognition engine to map "M A" to the syllable "MA", depending on its internal acoustic modeling. The compound symbol "+" between the tones is redundant here, as it is in the second example.
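One way to read tone strings like the ones above is sketched below. The function name is hypothetical, and the sketch assumes that tone symbols are written "T1" through "T5", which is inferred from the "T3" and "T5" symbols and the five-level scale mentioned in this section. Mapping a separate "M" and "A" onto one syllable is left to the engine, so the sketch keeps them as separate units.

```python
import re

# Assumption: tone symbols are "T" followed by a level from 1 to 5.
TONE = re.compile(r"T([1-5])")


def parse_tones(phone_string):
    """Return (segments, tone_contour) pairs; a "+" before a tone is ignored."""
    units = []         # each unit is [joined_segments, tone_contour]
    join_next = False  # True when the previous token was "+"
    for token in phone_string.split():
        tone = TONE.fullmatch(token)
        if token == "+":
            join_next = True
        elif tone:
            if not units:
                raise ValueError("a tone must follow a vowel or syllabic consonant")
            units[-1][1] += tone.group(1)  # extend the contour of the last unit
            join_next = False
        elif join_next and units:
            units[-1][0] += token          # segment compounded with the last unit
            join_next = False
        else:
            units.append([token, ""])      # new, so far toneless, unit
            join_next = False
    return [tuple(unit) for unit in units]


print(parse_tones("M + A T3 T5"))      # [('MA', '35')]
print(parse_tones("M + A + T3 + T5"))  # [('MA', '35')]
print(parse_tones("M A + T3 + T5"))    # [('M', ''), ('A', '35')]
```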
Geminate Consonants
Italian geminates are sometimes described as long consonants, and they could be described using the length diacritic or the "+" symbol. The following table shows some alternative phone strings that have equivalent meaning. The examples are in Italian.
Compound (alternative 1) | Compound (alternative 2) | Compound (alternative 3) | IPA | Example |
---|---|---|---|---|
TS lng | T lng + S | TS + TS | t:.s | bozza |
DZ lng | D lng + Z | DZ + DZ | d:.z | mezzo |
CH lng | T lng + SH | CH + CH | t:.ʃ | braccio |
JH lng | D lng + ZH | JH + JH | d:.ʒ | oggi |
M lng | M + M | M + M | m.m | mamma |
The preferred method is to use the length diacritic. To be safe, the engine developer should anticipate that any of these cases could be encountered in Italian SAPI lexicons or grammars.
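An engine that wants to treat these alternatives uniformly could map each one to the preferred length-diacritic form. The sketch below uses a lookup built only from the rows of the table above; the names `GEMINATE_FORMS` and `normalize_geminate` are hypothetical, and the function handles a single compound rather than a full pronunciation.

```python
# Rows copied from the table above: preferred (length diacritic) form -> alternatives.
GEMINATE_FORMS = {
    "TS lng": ("T lng + S", "TS + TS"),
    "DZ lng": ("D lng + Z", "DZ + DZ"),
    "CH lng": ("T lng + SH", "CH + CH"),
    "JH lng": ("D lng + ZH", "JH + JH"),
    "M lng": ("M + M",),
}

# Reverse lookup: alternative spelling -> preferred spelling.
PREFERRED = {alt: pref for pref, alts in GEMINATE_FORMS.items() for alt in alts}


def normalize_geminate(compound):
    """Return the length-diacritic form of a known geminate compound."""
    key = " ".join(compound.split())  # collapse runs of whitespace
    return PREFERRED.get(key, key)    # unknown compounds pass through unchanged


print(normalize_geminate("T lng + S"))  # TS lng
print(normalize_geminate("M  +  M"))    # M lng
print(normalize_geminate("TS lng"))     # TS lng
```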
Diphthongs
Diphthongs are represented by placing a "+" marker between two vowels, for example: "EY + EU".
German contains several examples of centering diphthongs for which there is no single phone label in UPS. The diphthongs table in Diphthong Vowel Phones shows how diphthongs would be represented using UPS compounds.
Note that, because of the length diacritic, there are two ways in which the vowel in Bär could be represented. The first example attaches the length correctly to the EH vowel:
Phone String | SAPI Phone ID |
---|---|
EH lng + AEX | 025B 02D0 0361 0250 |
However, it is possible that people could choose to attach the length to the schwa (the second vowel), as in the following example:
Phone String | SAPI Phone ID |
---|---|
EH + AEX lng | 025B 0361 0250 02D0 |
The SR engine should expect to handle this kind of ambiguity when converting SAPI phones to SR engine phones.
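One possible way to absorb this ambiguity is to treat length as a property of the whole diphthong, so that both ID sequences reduce to the same value. The sketch below assumes exactly that, which is a design choice of the example rather than anything SAPI prescribes. The function name is hypothetical, the codes 0361 and 02D0 are read from the two tables above, and the input is expected as space-separated groups of four hexadecimal digits.

```python
COMPOUND_MARKER = 0x0361  # the group between the two vowels in the tables above
LENGTH_MARK = 0x02D0      # the group that corresponds to "lng" in the tables above


def normalize_diphthong(phone_id):
    """Return (segment_codes, is_long), ignoring where the length mark sits."""
    codes = [int(group, 16) for group in phone_id.split()]
    segments = tuple(
        code for code in codes if code not in (COMPOUND_MARKER, LENGTH_MARK)
    )
    # Assumption: length applies to the diphthong as a whole.
    is_long = LENGTH_MARK in codes
    return segments, is_long


first = normalize_diphthong("025B 02D0 0361 0250")   # EH lng + AEX
second = normalize_diphthong("025B 0361 0250 02D0")  # EH + AEX lng
print(first == second)  # True
```

An engine that does distinguish which vowel carries the length would need a different normalization.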