Unity3d - Using LUIS for voice activated commands
Overview
As a companion to the wiki post Azure Cognitive Services - Bing Speech API and Language Understanding Intelligent Service (LUIS), this post illustrates how LUIS can be called from Unity3d (Unity). It sets up a Unity project using free assets from the Unity Asset Store, captures commands with the microphone, uses the Azure Bing Speech API to convert the spoken words to text, and uses LUIS to translate that text into in-game commands.
Unity3d
Unity is a game development platform that can be used to develop both 3D and 2D games and supports building for a large variety of targets, including Android, Windows, Xbox, and iOS. Unity supports JavaScript and C# as development languages, but an important fact to emphasize is that the engine uses the Mono implementation of the C# compiler to build the game.
Unity Project
The scope of Unity is vast, and it is a challenge to provide an example project with game objects more interesting than cubes and spheres. This post therefore highlights some free assets and code snippets used to make the example a little more engaging.
Asset Store
The Unity Asset Store is a great place to gain inspiration as well as to find high-quality assets for use in projects. For this post, two assets were used: the first is a background to represent deep space (Star Nest Skybox), and the second is a 3D model of a starship (Stratos Class Cruiser). Both were imported into a blank project using the Asset Store.
Setting up the scene
To save time, the Example scene of the Star Nest Skybox was used as a starting point. This has the camera and directional light setup as well as the SkyboxSwitcher script for switching between different star backgrounds:
Visual stimuli
In the scene, a simple text overlay indicates when the user is issuing a command. To keep the example simple, issuing a command is controlled by the 'c' keyboard key: press 'c' to start recording and 'c' again to stop. Imagine a Star Trek episode with Captain Kirk hitting a button on his throne-like command chair. This was done by adding a simple text object to the scene:
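For example, the overlay's visibility could be toggled from code when recording starts and stops. This is a minimal sketch; the RecordingIndicator class and recordingText field names are illustrative assumptions, not part of the sample project:

```csharp
using UnityEngine;
using UnityEngine.UI;

public class RecordingIndicator : MonoBehaviour
{
    public Text recordingText; // assigned in the Inspector; name is illustrative

    public void SetRecording(bool recording)
    {
        // show the overlay only while a command is being captured
        recordingText.enabled = recording;
    }
}
```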
Star Cruiser
Bringing the starship into the scene is as simple as dragging a prefab from the StratosClassCruiser/Prefabs folder into the scene. To give the illusion of size, the camera was positioned so only a portion of the ship was visible; when moved forward, the ship flies away from the viewer.
The settings are shown below including a new script, VoiceListener, that will be used to capture the voice commands:
Capturing Voice
Unity has built-in support for accessing the system's microphone devices. The first step is to check that a microphone is available, which is performed in the Start() method:
// setup microphone
if (Microphone.devices.Length <= 0)
{
    Debug.LogError("Microphone not available!");
}
Next, when the 'c' key is first pressed, recording begins. The following call loops the recording, keeping the last 10 seconds of audio in the AudioClip _commandAudio:
_commandAudio = Microphone.Start(null, true, 10, 44100);
When the 'c' key is pressed again, the recording is stopped using Microphone.End():
Microphone.End(null);
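Putting the two calls together, the toggle logic in the VoiceListener script might look like the following sketch. The _recording flag and the SendCommandAudio helper are assumptions for illustration; the actual project wires this up in its own way:

```csharp
// Sketch of the 'c' key toggle; _recording and the hypothetical
// SendCommandAudio helper are illustrative assumptions.
private AudioClip _commandAudio;
private bool _recording;

void Update()
{
    if (Input.GetKeyDown(KeyCode.C))
    {
        if (!_recording)
        {
            // loop and keep the last 10 seconds at 44.1 kHz
            _commandAudio = Microphone.Start(null, true, 10, 44100);
        }
        else
        {
            Microphone.End(null);
            SendCommandAudio(_commandAudio); // convert to .wav and submit
        }
        _recording = !_recording;
    }
}
```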
Converting to .wav format
The basis for transforming the AudioClip into a .wav file was derived from Calvin Rien's forum post, which in turn was derived from Gregorio Zanon's script. This conversion is required in order to submit the audio to the Bing Speech API. The details can be found in the Unity project.
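For reference, the core of such a conversion is writing a 44-byte RIFF/WAVE header followed by the samples converted from Unity's float range [-1, 1] to 16-bit PCM. The following is a minimal sketch of that general approach, independent of the linked scripts (the WavEncoder class name is an assumption):

```csharp
using System;
using System.IO;

public static class WavEncoder
{
    // Convert float samples in [-1, 1] to a 16-bit PCM .wav byte array.
    public static byte[] Encode(float[] samples, int sampleRate, int channels)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            int dataSize = samples.Length * 2; // 2 bytes per 16-bit sample
            writer.Write(new[] { 'R', 'I', 'F', 'F' });
            writer.Write(36 + dataSize);             // RIFF chunk size
            writer.Write(new[] { 'W', 'A', 'V', 'E' });
            writer.Write(new[] { 'f', 'm', 't', ' ' });
            writer.Write(16);                        // fmt chunk size
            writer.Write((short)1);                  // PCM format
            writer.Write((short)channels);
            writer.Write(sampleRate);
            writer.Write(sampleRate * channels * 2); // byte rate
            writer.Write((short)(channels * 2));     // block align
            writer.Write((short)16);                 // bits per sample
            writer.Write(new[] { 'd', 'a', 't', 'a' });
            writer.Write(dataSize);
            foreach (var sample in samples)
            {
                // clamp and scale each sample to a signed 16-bit value
                writer.Write((short)(Math.Max(-1f, Math.Min(1f, sample)) * short.MaxValue));
            }
            return stream.ToArray();
        }
    }
}
```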
Convert to Text
The next step is to convert the recorded audio to a string, using the Bing Speech API. Note that taking the same C# code used in the TechNet wiki post Azure Cognitive Services - Bing Speech API and Language Understanding Intelligent Service (LUIS) results in an authentication failure. This was not explored further; instead, the call was converted to use the Unity classes provided for HTTP communication.
As the text result should be retrieved asynchronously, the retrieval is done in a coroutine using UnityWebRequest. An interesting aspect of the method is the use of both an UploadHandler to send the audio and a DownloadHandler to retrieve the result.
IEnumerator GetSpeechText(byte[] payload)
{
    // POST the .wav payload to the Bing Speech API and buffer the JSON response
    UploadHandler uploader = new UploadHandlerRaw(payload);
    UnityWebRequest wr = new UnityWebRequest();
    wr.downloadHandler = new DownloadHandlerBuffer();
    wr.url = "https://speech.platform.bing.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US&format=detailed";
    wr.method = UnityWebRequest.kHttpVerbPOST;
    wr.SetRequestHeader("Ocp-Apim-Subscription-Key", "[subscriptionkey]");
    wr.SetRequestHeader("Content-Type", "audio/wav; codec=audio/pcm; samplerate=16000");
    wr.SetRequestHeader("Accept", "application/json;text/xml");
    wr.chunkedTransfer = true;
    wr.uploadHandler = uploader;

    yield return wr.SendWebRequest();

    if (wr.responseCode == 200)
    {
        // queue the recognized text so Update() can act on it
        string results = Encoding.UTF8.GetString(wr.downloadHandler.data);
        _commands.Enqueue(results);
        Debug.Log(results);
    }
    else
    {
        Debug.LogWarningFormat("Speech services returned a {0} code.", wr.responseCode);
    }
}
If the call is successful, the recognized text is added to a queue. In the Update() method, these commands are handled by dequeuing them and starting a new coroutine:
if (_commands.Count > 0)
{
    StartCoroutine(DetermineCommand(_commands.Dequeue()));
}
Translating from Text to Command
As the command could be "All ahead full", "Turn on Engines", "Mr. Sulu, warp factor 5", or whatever intent is set up in LUIS, the translated text is sent to LUIS to determine the command. This is also performed using UnityWebRequest and a DownloadHandlerBuffer:
IEnumerator DetermineCommand(string command)
{
    if (string.IsNullOrEmpty(command))
    {
        yield break;
    }

    // GET the LUIS endpoint with the recognized text as the query
    var commandUrl = WWW.EscapeURL(command);
    UnityWebRequest wr = new UnityWebRequest();
    wr.downloadHandler = new DownloadHandlerBuffer();
    wr.url = string.Format("https://southeastasia.api.cognitive.microsoft.com/luis/v2.0/apps/{0}?q={1}&timezoneOffset=0&verbose=false&spellCheck=false&staging=false", "theLUISAPP", commandUrl);
    wr.SetRequestHeader("Ocp-Apim-Subscription-Key", "[subscriptionkey]");

    yield return wr.SendWebRequest();

    var commandResult = CommandResult.CreateFromJSON(Encoding.UTF8.GetString(wr.downloadHandler.data));
    Debug.Log(commandResult);
    if (commandResult != null && commandResult.topScoringIntent != null)
    {
        if (commandResult.topScoringIntent.intent == "EngineOn")
        {
            _enginesOn = true;
        }
    }
}
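The CommandResult type referenced above is not shown in the snippet. One way to implement it is with Unity's JsonUtility and [Serializable] classes mirroring the LUIS v2 response shape; the field names below follow the LUIS JSON, but this class layout is an illustrative sketch rather than the project's exact code:

```csharp
using System;
using UnityEngine;

// Sketch of a LUIS v2 response wrapper; field names match the LUIS JSON,
// the class layout itself is an illustrative assumption.
[Serializable]
public class CommandResult
{
    public string query;
    public TopScoringIntent topScoringIntent;

    // Deserialize the LUIS JSON response with Unity's built-in JsonUtility.
    public static CommandResult CreateFromJSON(string json)
    {
        return JsonUtility.FromJson<CommandResult>(json);
    }
}

[Serializable]
public class TopScoringIntent
{
    public string intent;
    public float score;
}
```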
If the result matches the EngineOn intent, the _enginesOn boolean is set. For this illustration, this has the simple effect of moving the GameObject forward and is performed in the Update() method:
if (_enginesOn)
{
    var position = transform.position;
    transform.position = new Vector3(position.x, position.y + .5f, position.z + 2.5f);
}
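Note that moving a fixed amount per Update() call is frame-rate dependent. A common refinement is to scale the movement by Time.deltaTime so the ship travels at a constant speed regardless of frame rate; the speed values below are illustrative:

```csharp
if (_enginesOn)
{
    // units per second rather than units per frame (values are illustrative)
    var velocity = new Vector3(0f, 15f, 75f);
    transform.position += velocity * Time.deltaTime;
}
```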
Conclusion
Combining AI with gaming has been happening for a while now, and using hosted services makes a lot of sense for scalability, global coverage, and the simplicity of getting up and running for both indie devs and professional studios. The example shown here is simple, and in all likelihood controlling the movement of a ship would be easier with the arrow or WASD keys.
But imagine a more complex scenario: "Lock phasers on target alpha, strength to stun" or "All ahead full to Alpha Centauri in the Gamma Quadrant". It is a fair guess that many players of loot-grabbing games on consoles would have loved a voice-controlled inventory system: "Sell to a merchant all ammo where the inventory is over 10 and not used by any of my guns."
The source can be found at Azure Cognitive Services - Bing Speech API & LUIS.