How to recognize languages for which there is no recognizer

This is the first part of a two-part post in which I tackle the problem of creating grammars in a target language for which no recognition engine exists.  My goal was to create a simple GRXML grammar capable of recognizing a few phrases in Mandarin Chinese.  Along the way I ran into a number of pitfalls, and I will relate my experiences here so others can avoid them.

First, of course, the fine print.  Obviously a real Mandarin engine will perform far better than approximated phonemes, so this technique will likely not work for a full Mandarin application.  The techniques I explain here will work for small grammars in a target language for which no recognition engine exists.  The larger the grammar becomes, and the more similar-sounding phrases it contains, the less reliable this technique will be.

I should also note here that Mandarin is a poor choice for approximation because none of the supported engines has the concept of tone.  Therefore we will not be able to differentiate words that differ only in tone, since they are spelled identically once the tone is dropped.  For non-tonal languages, such as Swedish, this would not be an issue.
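To make the tone problem concrete, here is a small sketch in Python (not part of the Speech SDK; the function names are my own) that strips the four pinyin tone marks and groups syllables that become indistinguishable afterwards.  It assumes the input is pinyin written with tone diacritics.

```python
import unicodedata
from collections import defaultdict

# Combining marks used for the four pinyin tones:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(pinyin):
    """Remove pinyin tone marks, keeping other diacritics such as the dots on u-umlaut."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def group_by_toneless(words):
    """Map each toneless spelling to the list of original words that collapse onto it."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_tones(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    # Four different Mandarin words that all end up as "ma" without tones.
    for toneless, originals in group_by_toneless(["mā", "má", "mǎ", "mà", "hǎo"]).items():
        print(toneless, originals)
```

Any group with more than one entry is a set of words the approximated grammar cannot tell apart, so it is worth checking a word list this way before committing to the approach.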

The way I approximated foreign words was to create a GRXML grammar and approximate the pronunciation of each word using the pronunciation editor.  To do this, use the following steps.

1) Add a new GRXML grammar to your project.

2) Set the language of the grammar to the one closest to your target language.  For instance, if you wish to approximate Dutch, German will probably be closer than Spanish.  If you wish to approximate Catalan, Spanish will be closer than English, and so on.  For some languages you may have to experiment to see which existing recognition engine gets you the closest.

3) Create your grammar as you would for a supported language by dragging grammar components to the canvas using the handy Grammar Editor in the Speech SDK.  One pitfall to avoid, though, is that extended characters are not supported.  So if you are creating a grammar for a language such as Korean, Mandarin, Japanese, or Thai, you will need to approximate the words using Latin characters.  In general, you should use a standard transliteration that is understandable.

4) After you have created the grammar items, you will need to select each grammar item and set its pronunciation.  To do this, make sure the property window is open while a grammar item is selected and click in the pronunciation item.  You will see an ellipsis which you can double-click to open the pronunciation editor.  There you can look up pronunciations for existing words in the target language and enter your own phonemes.  I will have more to say about this in tomorrow's post.  Make sure you enter only supported phonemes, as unsupported ones will cause an error when your grammar is loaded.  You can see a list of the phonemes available for each supported recognition engine in the documents included with the SDK (go to the index tab and type "phoneme").
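Because an unsupported phoneme only shows up as an error when the grammar is loaded, it can be handy to check a grammar file beforehand.  The following is a rough Python sketch of such a check, not an SDK tool.  The function name is hypothetical, and SUPPORTED_PHONEMES is only a placeholder built from the symbols used in this post; the real list for your engine is in the SDK documentation.  I am also assuming that a semicolon inside a pron attribute separates alternative pronunciations.

```python
import xml.etree.ElementTree as ET

# Placeholder only: replace with the phoneme list documented for your engine.
SUPPORTED_PHONEMES = {"AA", "AU", "AX", "G", "H", "I", "K", "N", "SH", "W"}


def find_unsupported_phonemes(grammar_xml, supported):
    """Return (token text, phoneme) pairs for every phoneme not in `supported`."""
    root = ET.fromstring(grammar_xml)
    problems = []
    for element in root.iter():
        for attr_name, pron in element.attrib.items():
            # ElementTree stores namespaced attributes as "{uri}pron";
            # match on the local name so the namespace URI does not matter.
            if attr_name.rsplit("}", 1)[-1] != "pron":
                continue
            # Assumption: ";" separates alternative pronunciations,
            # and spaces separate the phonemes within each alternative.
            for alternative in pron.split(";"):
                for phoneme in alternative.split():
                    if phoneme not in supported:
                        problems.append(((element.text or "").strip(), phoneme))
    return problems
```

Calling find_unsupported_phonemes with the text of your grammar file and your engine's phoneme set returns an empty list when every phoneme is recognized, and otherwise tells you which word to revisit in the pronunciation editor.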

The first grammar I created basically worked, but the confidence values it returned were very low.  The following is that simple grammar.

<grammar xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions" xml:lang="EN-US" tag-format="semantics-ms/1.0" version="1.0" mode="voice" xmlns="http://www.w3.org/2001/06/grammar" sapi:alphabet="x-microsoft-ups">
 <rule id="Rule1" scope="public">
  <one-of>
   <item>
    <token sapi:pron="SH I AX">xiexie</token>
    <token sapi:pron="N I N">ni</token>
    <tag>$._value = "thanks"</tag>
   </item>
   <item>
    <token sapi:pron="W AA">wo</token>
    <token sapi:pron="K AX N;G AX N">hen</token>
    <token sapi:pron="H AU;K AU">hao</token>
    <tag>$._value = "good"</tag>
   </item>
   <item>
    <item>I am a duck</item>
    <tag>$._value = "duck"</tag>
   </item>
  </one-of>
 </rule>
</grammar>

I added the "I am a duck" phrase as a sanity check.  If you run this through a simple application in the debugger, you should get fairly decent confidence values for "wo hen hao" but rather poor confidence values for "xiexie ni".  This is because the phoneme approximations for "wo hen hao" are much closer to the real pronunciation than those for "xiexie ni".  So how can you choose the closest matching phonemes?  This will be the topic of tomorrow's post.