

How to approximate phonemes for an unsupported language

Yesterday I wrote about how to create a grammar for a language for which we do not have a recognition engine. The post ended with a question: how best to approximate the phonemes for a target word? As I mentioned yesterday, my first attempts at approximation led to very low confidence values. To improve those confidence values, I needed a better way to determine the phonemes.

I decided the best way to do this was to create a tool into which I could speak a word and that would output the phonemes that make up that word. Ideally this tool would be graphical, but I was too lazy, so I wound up writing a simple speech application to do it for me. The code below can easily be modified to use SpeechFX instead. I also limited the application to two phonemes per word. Since I was approximating Mandarin Chinese, this wasn't a problem. For other languages, you will need to modify it to support more phonemes, but keep in mind that the size of the grammar grows exponentially with the number of phonemes per word: with the 45 phonemes used below, two phonemes per word already means 2,025 entries, and three would mean 91,125. For languages such as Dutch and Swedish, it is a much better idea to break longer words into shorter pieces and determine the phonemes of each piece separately (only for the purpose of determining the phonemes, not in the final grammar).
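To make the exponential growth concrete: with n phonemes and k phonemes per word, the grammar needs n^k entries. The C# sketch below is not part of the original application; it is one way to generalize the two nested loops to any word length using a recursive Cartesian product. PhonemeSequences and GrammarSize are hypothetical names, and the three-phoneme sample list is illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: yields every sequence of `length` phonemes as a
// (Text, Pronunciation) pair, in the same order nested loops would produce.
// Text runs the phoneme names together ("SH" + "AX" -> "SHAX"); the
// pronunciation separates them with spaces ("SH AX").
static IEnumerable<(string Text, string Pronunciation)> PhonemeSequences(
    string[] phonemeSet, int length)
{
    if (length == 0)
    {
        // Base case: exactly one empty sequence.
        yield return ("", "");
        yield break;
    }

    foreach (var prefix in PhonemeSequences(phonemeSet, length - 1))
    {
        foreach (string phoneme in phonemeSet)
        {
            string pron = prefix.Pronunciation.Length == 0
                ? phoneme
                : prefix.Pronunciation + " " + phoneme;
            yield return (prefix.Text + phoneme, pron);
        }
    }
}

// Hypothetical helper: number of grammar items needed, phonemeCount^length.
static long GrammarSize(int phonemeCount, int length)
{
    long size = 1;
    for (int i = 0; i < length; i++)
    {
        size *= phonemeCount;
    }
    return size;
}

// Illustrative three-phoneme list; the post's grammar uses 45 UPS phonemes.
string[] sample = { "AA", "B", "SH" };
foreach (var (text, pronunciation) in PhonemeSequences(sample, 2))
{
    Console.WriteLine($"{text} -> {pronunciation}");
}

Console.WriteLine(GrammarSize(45, 2)); // 2025
Console.WriteLine(GrammarSize(45, 3)); // 91125
```

Each generated pair could be turned into an SrgsToken and SrgsItem exactly as in the two-phoneme loop; the recursion only replaces the fixed nesting depth, so it does nothing about the size of the resulting grammar.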

To use the code below, add a QuestionAnswerActivity to your project and name it questionAnswerActivity2. Set the prompt to anything you want, and then choose to add a dynamic grammar. Add the following code to the activity's TurnStarting event handler.

string[] phonemes = new string[] {
    "AA", "AE", "AH", "AI", "AU", "AO", "AX", "AX RA", "EH", "EH RA",
    "EI", "ER", "I", "IH", "O + U", "OI", "U", "UH", "AX L", "AX M",
    "AX N", "B", "CH", "D", "DH", "F", "G", "H", "J", "JH", "K", "L",
    "M", "N", "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Z", "ZH"};

// Create the grammar objects
SrgsDocument doc = new SrgsDocument();
doc.PhoneticAlphabet = SrgsPhoneticAlphabet.Ups;
SrgsRule rule = new SrgsRule("check");
SrgsOneOf oneOf = new SrgsOneOf();

// Iterate through every ordered pair of phonemes
// (45 phonemes, so the grammar contains 45 * 45 = 2,025 items)
int phonemeCount = phonemes.Length;
for (int firstCount = 0; firstCount < phonemeCount; firstCount++)
{
    for (int secondCount = 0; secondCount < phonemeCount; secondCount++)
    {
        // Generate the text, e.g. "SH" + "AX" gives "SHAX"
        string text = phonemes[firstCount] + phonemes[secondCount];

        // Determine the pronunciation (the two phonemes separated by a space)
        string pronunciation = phonemes[firstCount] + " " + phonemes[secondCount];

        // Add this item
        SrgsToken token = new SrgsToken(text);
        token.Pronunciation = pronunciation;
        SrgsItem newItem = new SrgsItem(token);
        oneOf.Add(newItem);
    }
}

// Finish the grammar
rule.Add(oneOf);
doc.Rules.Add(rule);
doc.Root = rule;
doc.Culture = System.Globalization.CultureInfo.GetCultureInfo("en-US");
questionAnswerActivity2.Grammars.Add(new Grammar(doc));

This code simply creates a grammar containing every possible combination of two English phonemes. Run the application in the debugger, speak the target word, and then examine the recognition result to see which phoneme pair the engine matched most closely. At some point I may publish a better version that works outside of Speech Server and covers other languages and more phonemes, but this should be enough to get you started.
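Because each token's text is just the two phoneme names run together (for example "SH" + "AX" becomes "SHAX", and "AX RA" + "B" becomes "AX RAB"), reading the result is easier if you can map the text back to its pronunciation. The C# sketch below is an assumption, not part of the post: a hypothetical BuildPronunciationLookup helper rebuilds the same text-to-pronunciation pairs as the grammar loop. Dictionary.Add throws on a duplicate key, so building the table would fail loudly if two phoneme pairs ever produced the same text. The recognizedText literal is a placeholder for whatever text your recognition result reports.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: rebuilds the same text -> pronunciation pairs that
// the grammar loop creates, so a recognized token text can be decoded.
static Dictionary<string, string> BuildPronunciationLookup(string[] phonemeSet)
{
    var lookup = new Dictionary<string, string>();
    foreach (string first in phonemeSet)
    {
        foreach (string second in phonemeSet)
        {
            // Add throws ArgumentException on a duplicate key, so an
            // ambiguous text (two pairs with the same spelling) fails loudly.
            lookup.Add(first + second, first + " " + second);
        }
    }
    return lookup;
}

// Same UPS phoneme list as the grammar code in this post.
string[] upsPhonemes = {
    "AA", "AE", "AH", "AI", "AU", "AO", "AX", "AX RA", "EH", "EH RA",
    "EI", "ER", "I", "IH", "O + U", "OI", "U", "UH", "AX L", "AX M",
    "AX N", "B", "CH", "D", "DH", "F", "G", "H", "J", "JH", "K", "L",
    "M", "N", "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Z", "ZH"};

Dictionary<string, string> textToPronunciation = BuildPronunciationLookup(upsPhonemes);

// Placeholder: substitute the text reported by your recognition result.
string recognizedText = "SHAX";
if (textToPronunciation.TryGetValue(recognizedText, out string pronunciation))
{
    Console.WriteLine($"{recognizedText} = {pronunciation}"); // SHAX = SH AX
}
else
{
    Console.WriteLine($"{recognizedText} is not in the grammar");
}
```

An alternative design is to put a visible separator into the token text itself so no lookup is needed; the sketch keeps the post's text format unchanged.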

One note on building this: you may receive a build error if your application also contains a lexicon file. To work around it, either use a different phonetic alphabet in the code above or exclude the lexicon file from the build when building the entire project.

For many words, you will need to run this several times. As I showed in yesterday's grammar, giving a word multiple pronunciations increases the confidence value. This is especially true when approximating phonemes of the target language that do not exist in the engine's language. Your desired phoneme may lie between two of the engine's phonemes, so you should include both pronunciations.
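One way to combine the results of several runs is to tally how often each phoneme pair came back and keep the ones that recur. The C# sketch below assumes you have already noted each run's result as a pronunciation string; CandidatePronunciations is a hypothetical helper, and the sample runs are made-up values for illustration, not measured data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: given the pronunciations returned over several runs
// of the same spoken word, keep each distinct pronunciation that occurred
// at least minOccurrences times, most frequent first (ties in ordinal order).
static List<string> CandidatePronunciations(IEnumerable<string> observed, int minOccurrences)
{
    return observed
        .GroupBy(p => p)
        .Where(g => g.Count() >= minOccurrences)
        .OrderByDescending(g => g.Count())
        .ThenBy(g => g.Key, StringComparer.Ordinal)
        .Select(g => g.Key)
        .ToList();
}

// Illustrative results from five runs of one word (made-up values).
string[] runs = { "SH AX", "SH AH", "SH AX", "S AX", "SH AH" };

// Prints "SH AH" and then "SH AX": both occurred twice; "S AX" occurred once.
foreach (string candidate in CandidatePronunciations(runs, 2))
{
    Console.WriteLine(candidate);
}
```

Each surviving pronunciation could then be added to the target word's grammar entry as an alternative, for example as separate SrgsItem objects in an SrgsOneOf that share the same SrgsToken text but set a different Pronunciation. The minOccurrences threshold is a judgment call; a value of 1 keeps every pronunciation you observed.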
