Creating bots in UCMA - Part II - Using grammars instead of parsing
In our last bot post, I wrote about creating a bot that accepts a message from Communicator and sends a response. The response logic was very crude: we simply looked for certain words in the message from the client and sent back an appropriate response. Obviously, as our bot becomes more sophisticated, this method will no longer suffice.
There's a wealth of information out there on natural language parsing - understanding the syntax, context, and real-world knowledge necessary to respond as if there were truly a human on the other end. However, few companies have a need for a bot that can respond to questions like "How do you feel?". Typically bots are designed to answer specific questions or direct the user through a series of questions and responses. Does this sound familiar? In practice, a lot of the things we need our bots to do are also things speech applications already do.
We can also look at this from a different perspective. Our goal is to take some response from the user and extract a meaning specific enough for our bot to give a reasonable reply. Isn't this exactly what speech applications do? But wait, you might say - speech applications accept speech as input and produce speech as output, while bots use text. This is true, but as you'll see there is an easy workaround.
At first glance, prompts would seem to be far simpler in bots than in speech applications. As we will see later, this is not necessarily the case; for a simple bot, however, it is absolutely true. In speech applications we need to worry about piecing together recorded prompts and making sure TTS pronounces text correctly. In bots, we are simply outputting a string.
Even recognition is, from a coding point of view, not much different than it is in speech applications. We will use the System.Speech namespace, which ships with .NET 3.0. It is currently not possible to use the Speech Server APIs in a bot, but the truth is we do not need them; System.Speech has everything we require. The object we are interested in is SpeechRecognizer, and the method we are interested in is EmulateRecognizeAsync. This method accepts a text string and runs it through the recognizer just as if the text had been spoken.
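To see what this looks like in isolation, here is a minimal sketch (mine, not from the attached project) that builds a throwaway grammar in code and emulates recognition of a text string. It uses SpeechRecognitionEngine, the in-process recognizer in System.Speech, so it can run standalone; the bot itself will use the shared SpeechRecognizer, but both expose the same emulation methods.

using System;
using System.Speech.Recognition;

class EmulationSketch
{
    static void Main()
    {
        // In-process recognizer so the sketch runs on its own;
        // the bot uses the shared SpeechRecognizer instead
        SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine();
        recognizer.SetInputToNull(); // no audio input needed - we only emulate

        // "I would like a <size> pizza", built in code purely for illustration
        Choices sizes = new Choices("small", "medium", "large");
        GrammarBuilder builder = new GrammarBuilder("I would like a");
        builder.Append(new SemanticResultKey("PizzaSize", sizes));
        builder.Append("pizza");
        recognizer.LoadGrammar(new Grammar(builder));

        // Text goes in where audio normally would
        RecognitionResult result = recognizer.EmulateRecognize("I would like a large pizza");
        if (result != null)
        {
            Console.WriteLine("Size: {0}", result.Semantics["PizzaSize"].Value);
        }
    }
}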
This means that we only need to create a grammar to recognize the user's response. As those of you who have used Speech Server know, there are three types of grammars available.
GRXML - This format is backed by a W3C standard (SRGS) and has been supported since the original release of Speech Server. Most other voice platforms support this standard as well. The library grammars that ship with Speech Server are GRXML grammars. These grammars are best for parsing structured input - for instance dates, social security numbers, etc. Note that for performance reasons it is possible to compile a .grxml grammar into a binary .cfg format.
.gbuilder grammars - These grammars are new in Speech Server 2007 and have their own tool called Conversational Grammar Builder. This tool actually allows us to build two different types of grammars.
Simple grammars - These are grammars of the form <prefix> keyword <suffix>. For instance, "I would like a large pizza." where "I would like a" is the prefix, "pizza" is the suffix, and "large" is the keyword we are looking for. This is the most common type of grammar, and the Conversational Grammar Builder allows us to create these grammars much more quickly than using the Grammar Editor tool or hand-coding a .grxml grammar.
HMIHY grammars - These grammars take a response and place it in one of several buckets. The typical scenario is a help desk where you need to route the user to the appropriate type of help (HMIHY stands for How May I Help You). To achieve this, you provide a list of training sentences for each bucket, and when you compile the .gbuilder file to a .cfg (unlike .grxml grammars, .gbuilder files must be compiled) the engine is trained on your sentences. The more sentences you add, the better the recognition will be. I have mentioned these grammars before in my posts on answering machine detection.
Dynamic grammars - This is basically a set of classes that allow you to dynamically build .grxml grammars. They are most useful when you need to build a grammar based on information in a database or provided by the user.
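Since the Speech Server dynamic grammar classes are off limits here, it is worth noting that System.Speech has an analogous set in the System.Speech.Recognition.SrgsGrammar namespace. The following is only a rough sketch of the idea; the rule name, phrase, and city list are made up for illustration.

using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

class DynamicGrammarSketch
{
    // Builds a single-rule grammar from values that could just as easily
    // have come from a database or from the user
    static Grammar BuildCityGrammar(string[] cities)
    {
        SrgsRule rule = new SrgsRule("Cities");
        rule.Elements.Add(new SrgsItem("I want to fly to"));
        rule.Elements.Add(new SrgsOneOf(cities));

        SrgsDocument document = new SrgsDocument();
        document.Rules.Add(rule);
        document.Root = rule;

        return new Grammar(document);
    }
}

A grammar built this way loads exactly like a file-based one, for example recognizer.LoadGrammar(BuildCityGrammar(new string[] { "Seattle", "Boston" })).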
For bots, HMIHY grammars are the most useful because user responses tend to be very broad. The advantage of the HMIHY grammar is that the user does not need to respond with something we have preprogrammed - as long as the response is close enough to one of our training sentences it will work.
The SpeechRecognizer object can read both .grxml and .cfg grammars. Because .gbuilder files compile into .cfg grammars, SpeechRecognizer can understand both. I have placed the grammar file in the attached project for those interested in looking at it. Our nonsense application can now perform two functions.
1) Users can request to fly between two destinations.
2) Users can order a pizza, along with size and type.
For simplicity's sake, when flying the user must specify both the origin and the destination, and when ordering a pizza the user must specify both the size and the type. Let's start by continuing with the application from Part I (or you can download the code for today's part). First, change the code in HandleMessageReceived to the following.
RecordProgress("Received message {0}", message);
try
{
string grammarPath = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "Grammar1_EN-US.cfg");
SpeechRecognizer recognizer = new SpeechRecognizer();
recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
recognizer.SpeechRecognitionRejected +=new EventHandler<SpeechRecognitionRejectedEventArgs>(recognizer_SpeechRecognitionRejected);
recognizer.LoadGrammar(new Grammar(grammarPath, "WhatToDo"));
recognizer.EmulateRecognizeAsync(message);
}
catch (Exception ex)
{
Error(ex.ToString());
}
OK, I admit I was lazy in catching the general Exception. We will change that in a later post. Also keep in mind that this code does not scale well; it would be better to have the grammar preloaded. Note also that we use the synchronous version of LoadGrammar where we should really be using LoadGrammarAsync. Other than that the code is straightforward: we create a SpeechRecognizer instance, load the grammar, and emulate the response from the user.
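As a rough idea of what preloading might look like (the _recognizer field and InitializeRecognizer method are names I made up, not part of the sample), the recognizer and grammar could be created once when the bot starts and reused for every message. Two caveats: EmulateRecognizeAsync should not be called before the grammar has finished loading, and the self-detaching rejection handler shown below would need rethinking with a long-lived recognizer.

// Created once, for example when the bot is initialized
private SpeechRecognizer _recognizer;

private void InitializeRecognizer()
{
    string grammarPath = Path.Combine(
        Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location),
        "Grammar1_EN-US.cfg");

    _recognizer = new SpeechRecognizer();
    _recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
    _recognizer.SpeechRecognitionRejected += new EventHandler<SpeechRecognitionRejectedEventArgs>(recognizer_SpeechRecognitionRejected);

    // Asynchronous load; wait for it to complete before the first emulation
    _recognizer.LoadGrammarAsync(new Grammar(grammarPath, "WhatToDo"));
}

// HandleMessageReceived then shrinks to little more than:
//     _recognizer.EmulateRecognizeAsync(message);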
The following is our SpeechRecognitionRejected handler.
/// <summary>
/// Called when we could not understand what the user said
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
void recognizer_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
{
    SpeechRecognizer recognizer = (SpeechRecognizer)sender;
    recognizer.SpeechRecognitionRejected -= recognizer_SpeechRecognitionRejected;
    SendResponse("I'm sorry. I did not understand what you said.");
}
We detach from the event because it is possible to receive multiple events and we do not want to output the same message multiple times.
The following is our SpeechRecognized handler.
/// <summary>
/// Called when we have recognized something
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    try
    {
        RecordProgress(e.Result.ConstructSmlFromSemantics().CreateNavigator().OuterXml);

        if (e.Result.Semantics.ContainsKey("PizzaSize"))
        {
            // User wants to buy a pizza
            string pizzaSize = e.Result.Semantics["PizzaSize"].Value.ToString();
            string pizzaType = e.Result.Semantics["PizzaType"].Value.ToString();
            string response = String.Format(CultureInfo.CurrentUICulture,
                "One {0} {1} pizza coming up!",
                pizzaSize,
                pizzaType);
            SendResponse(response);
        }
        else
        {
            // User wants to book a flight
            string origin = e.Result.Semantics["Origin"].Value.ToString();
            string destination = e.Result.Semantics["Destination"].Value.ToString();
            string response = String.Format(CultureInfo.CurrentUICulture,
                "I too would like to one day fly from {0} to {1}",
                origin,
                destination);
            SendResponse(response);
        }
    }
    catch (Exception ex)
    {
        Error(ex.ToString());
    }
}
As you can see, we check the semantic results to determine whether the user is ordering a pizza or booking a flight. Of course, our application is not very helpful and keeps no history, but we have certainly advanced far beyond simple string matching.
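One small hardening suggestion, not in the attached project: the else branch above assumes the Origin and Destination keys are always present, and a grammar change could break that and land us in the catch block. A hypothetical helper like the following makes the lookup explicit:

/// <summary>
/// Hypothetical helper: returns the semantic value for a key, or null if the
/// grammar did not produce that key
/// </summary>
private static string GetSemantic(RecognitionResult result, string key)
{
    if (result.Semantics.ContainsKey(key) && result.Semantics[key].Value != null)
    {
        return result.Semantics[key].Value.ToString();
    }
    return null;
}

The flight branch could then check GetSemantic(e.Result, "Origin") and GetSemantic(e.Result, "Destination") for null and fall back to the same "did not understand" message the rejection handler uses.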