Microsoft HoloLens 2 MR headset and Unity Visual Scripting illustration

More about Embodiment of AI

Alex Anikiev

--


Recent promising successes of LLMs (Large Language Models) and other impressive advancements in Cognitive Services such as speech and computer vision suggest growing expectations for the next generation of AI agents. In the previous article here we highlighted an Embodied AI use case and shared a point of view on building an entire solution using the Azure AI Platform. In our opinion, Embodied AI has all the necessary attributes to become the next generation of intuitive interfaces between humans and AI. Thus, this article is a logical continuation of the previous work, covering more advanced Embodied AI topics from both the developer and the artist & designer points of view for rapidly building scalable and extendable Embodied AI solutions.

Embodied AI Experience

From the solution architecture perspective we might break the whole system (the experience) into a set of sub-systems (components) as suggested below:

Embodied AI Experience building blocks

These are typical building blocks for building an Embodied AI system nowadays using a suite of Cognitive Services. In the reference article here we built our foundational Dialog Management sub-system for orchestrating dialog flows based on Power Virtual Agents (PVA) and Bot Framework capabilities. The resulting Bot App was deployed separately in Azure (or on the Edge as a container), and the Unity App integrated with it over WSS by means of the DialogServiceConnector class. This raised a logical question: could we build our conversational flows inside of Unity instead, and what implications would this have for the overall solution? We are determined to get the answer to this question in this article.

Note: There have been efforts to come up with a universal descriptive language for behavioral interactions, for example, BML (Behavior Markup Language) as described here.

One more important consideration for deploying Embodied AI system(s) in production is their quality assessment. Specifically, we’ll consider the quality of input needed to produce a quality output (to avoid the so-called “garbage-in, garbage-out” scenario). As illustrated below, there are several steps we have to take carefully to properly process the user’s spoken input:

At every step in this process we risk losing and/or misinterpreting the received information, which might impact our ability to deliver a delightful and helpful overall experience, unless we take some precautionary measures to improve the quality of our hardware (microphone, etc.) and software (custom trained models, etc.).

With the suggested solution architecture in mind, let’s review the different components of the solution in more detail. We start with the Character for Embodied AI.

Character

Creating a Character may involve multiple packages, including the 3D design software of your choice, with the ultimate goal of bringing your character over to the game engine of your choice. In this article we focus on using Autodesk Maya to create a character and Unity for building the experience app. For more details you may want to watch this video on the Artist Workflow with Maya and Unity.

Typically you Model -> Rig -> Animate your character in Maya before bringing it over to Unity and taking advantage of Unity’s Mecanim Animation System to trigger the right animations and behaviors at the right time.

Below is an illustration of an animation created in Maya with minimal effort:

Sample animation in Maya

Maya, like other 3D design software packages, comes with a robust set of features to facilitate modeling, rigging and animating your characters. For example, Maya provides the concept of animation layers, which allows you to animate your character based on a sub-set of key frames to speed up your progress.

Animating characters in Maya

You can conveniently define positioning for key frames while disabling and/or enabling individual animation layers.

Animation layers in Maya

A similar layering architecture is also available in game engines such as Unity to facilitate the creation of complex applications.

Once you’re done with the Model -> Rig -> Animate workflow for your character, it’s time to bring it over to the Game Engine (Unity).

Below we illustrate how, using the cross-package FBX format, you can export your work from Maya and import it into Unity.

Simple process of exporting FBX from Maya and importing FBX into Unity

At a more detailed level, the following visual illustrates the process of character creation, rigging and animating in Maya, importing the result into Unity, and then creating an Animation Controller for the animation orchestration logic in Unity:

Detailed process of exporting FBX from Maya and importing FBX into Unity

Practically, you will end up with multiple animations for your project. Thus, you can choose to export/import them individually or as a combined FBX as illustrated below:

Exporting animations from Maya into FBX(s)

At this point you should have your character and desired animations already in Unity, and you are ready to organize your conversational flows (Dialog Management System). In this article we use the Unity Visual Scripting capability to orchestrate conversational flows.

Note: It’s considerably more complicated to create high-quality humanoid-style or human model animations because of the complexity of the underlying model (joints, etc.). That’s why it makes sense to leverage ML & AI for creating natural character poses, as described in this new experimental development from Unity.

Unity Visual Scripting (Dialog Management System)

Visual scripting in Unity empowers creators to develop gameplay mechanics or interaction logic using a visual, graph-based system instead of writing lines of traditional code. You can find more information about Unity Visual Scripting here.

Using Unity Visual Scripting Script graphs and State graph(s) allows us to implement our Dialog Management System inside of Unity itself.

One important quality of Unity Visual Scripting is that it’s very extensible. Please find more information about Unity Visual Scripting Extensibility here. Essentially, you have a choice of 1) leveraging pre-built Visual Scripting nodes to compose your flows (you may also want to leverage sub-graphs to package certain pieces for reuse), 2) developing your own custom nodes and/or events using C# (you may also want to package your C# code in nodes differently depending on the desired level of business logic encapsulation), 3) interoperating with your C# scripts directly in your flows, or a combination of these options.

Practically, you may still have a Manager class (MonoBehaviour) attached to the Character, with access to the Update loop there. This is especially handy when you have other activities constantly running on background threads, for example, Body Tracking using Azure Kinect.

Let’s consider a concrete example of building a custom C# node for Character Response (Text to Speech = TTS) which encapsulates the business logic for Speaking. Assume we chose to create a custom node called CharacterResponse which takes an input (Text or Ssml to be spoken via the SpeechSynthesizer class) and starts the speech synthesis process. For better code encapsulation we may want to declare & instantiate the SpeechSynthesizer and subscribe to its events inside our Manager class (MonoBehaviour). Thus, the CharacterResponse custom C# node takes an input, delegates control to the Manager class to synthesize speech and reports process completion. Then we can store the speech synthesis status in an Application-level variable and introduce a CharacterWaitResponse (Coroutine) sub-graph which, based on the value of that variable, waits until the process has completed before passing control to the next node.

Using atomic (single function) C# custom nodes for Speech Synthesis
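For illustration, a minimal sketch of such a CharacterResponse custom C# node might look like the following (how the Manager MonoBehaviour is located and its Speak method signature are assumptions for this sketch):

using Unity.VisualScripting;
using UnityEngine;

// A minimal sketch of the CharacterResponse custom node described above.
// The Manager MonoBehaviour and its Speak method are assumptions: the Manager
// owns the SpeechSynthesizer, subscribes to its events and raises the custom
// event (for the given GUID) once synthesis has completed.
[UnitTitle("Character Response")]
[UnitCategory("EmbodiedAI")]
public class CharacterResponse : Unit
{
    [DoNotSerialize] public ControlInput enter;
    [DoNotSerialize] public ControlOutput exit;
    [DoNotSerialize] public ValueInput text;     // Text or SSML to be spoken
    [DoNotSerialize] public ValueInput eventId;  // GUID used as the "glue" for CharacterWaitResponse

    protected override void Definition()
    {
        enter = ControlInput("enter", Speak);
        exit = ControlOutput("exit");
        text = ValueInput<string>("text", string.Empty);
        eventId = ValueInput<string>("eventId", string.Empty);
        Succession(enter, exit);
    }

    private ControlOutput Speak(Flow flow)
    {
        // Delegate speech synthesis to the Manager; when it completes, the Manager
        // triggers the custom event that the CharacterWaitResponse sub-graph listens for.
        var manager = Object.FindObjectOfType<Manager>(); // assumed Manager MonoBehaviour
        manager?.Speak(flow.GetValue<string>(text), flow.GetValue<string>(eventId));
        return exit;
    }
}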

Now we may want to wrap these 2 nodes (custom node + sub-graph) into their own CharacterResponse sub-graph to fully encapsulate speech synthesis logic into a single node (sub-graph).

Note: Please note that the CharacterResponse custom C# node is NOT directly connected to the CharacterWaitResponse sub-graph: there is no flow connection, only a data connection passing a GUID. This is because the CharacterWaitResponse sub-graph is triggered by a custom event (Custom Event Trigger) raised inside the CharacterResponse custom C# node (Custom Event) for the specific GUID (the glue).

Encapsulating custom node + sub-graph into another higher level sub-graph

This illustrates a modular pattern of packaging nodes inside other nodes for reuse. However, there’s a caveat related to this type of packaging. Namely, if we auto-generate GUIDs inside the CharacterResponse custom C# node and then package that node inside the CharacterResponse sub-graph, all the generated GUIDs will be the same, which breaks the connections between the CharacterResponse custom C# nodes and the CharacterWaitResponse sub-graphs inside the CharacterResponse sub-graph.

Note: To avoid this exact issue, in the illustration above we pass the EventID (GUID) from the outside into the CharacterResponse sub-graph instead of generating GUIDs inside of it.

Nodes packaging specifics and issues with auto-generated values

Another typical requirement is to handle collection-based (multiple) inputs and multiple outputs for your custom C# node. Unity Visual Scripting offers several options in this regard, including object-based and collection-based inputs, etc., or you may choose to implement a serialized collection-based input with a custom drawer in the Inspector (more on this topic later in this article).

Options for multiple inputs for custom C# nodes and sub-graphs

While leveraging atomic (single function) C# custom nodes and sub-graphs and connecting nodes via the Messaging pattern (Custom Event and Custom Event Trigger nodes) is one option to package flow logic, you may want to encapsulate more C# code behind C# custom nodes to take some burden away from the Unity Visual Scripting Canvas.

The following visual illustrates the overall thinking process from rapidly-prototyping design flows using only sub-graphs to encapsulating more business logic into custom C# nodes:

Thinking process while building re-usable components in UVS

There are several code patterns we’d like to highlight on this topic, and they are related to the use of Tasks and Coroutines in Unity.

First, let’s implement an all-in-one UserInput (Listen) custom C# node for Listening using the SpeechRecognizer class as shown below:
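A deliberately naive sketch of such an all-in-one node might look like this (key and region are placeholders; the blocking call is for brevity only):

using Microsoft.CognitiveServices.Speech;
using Unity.VisualScripting;

// A naive "all-in-one" sketch: the node owns the SpeechRecognizer and blocks
// until a single utterance is recognized. Key/region values are placeholders.
[UnitTitle("User Input (Listen, all-in-one)")]
public class UserInput : Unit
{
    [DoNotSerialize] public ControlInput enter;
    [DoNotSerialize] public ControlOutput exit;
    [DoNotSerialize] public ValueOutput recognizedText;

    private string lastResult = string.Empty;

    protected override void Definition()
    {
        enter = ControlInput("enter", Listen);
        exit = ControlOutput("exit");
        recognizedText = ValueOutput<string>("recognizedText", flow => lastResult);
        Succession(enter, exit);
    }

    private ControlOutput Listen(Flow flow)
    {
        var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>"); // placeholders
        using (var recognizer = new SpeechRecognizer(config))
        {
            // Blocking call for simplicity only; it will stall the main thread.
            var result = recognizer.RecognizeOnceAsync().GetAwaiter().GetResult();
            lastResult = result.Reason == ResultReason.RecognizedSpeech ? result.Text : string.Empty;
        }
        return exit;
    }
}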

This implementation encapsulates everything for Listening, including the SpeechRecognizer declaration/instantiation, its event subscriptions and obtaining the result. However, in practice, for better encapsulation you may want to offload most of this plumbing logic to your Manager class (MonoBehaviour) while focusing only on the Listening function inside your custom C# node and waiting until listening is completed to pass control further. In doing so, you have 2 main options for your custom C# node input(s): ControlInput or ControlInputCoroutine.

Below is an example of a custom C# node which has a ControlInputCoroutine input trigger and executes Coroutine logic implemented in the Manager class (MonoBehaviour):
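A minimal sketch of such a node might look like this (the Manager MonoBehaviour with its StartListening method and ListeningCompleted flag are assumptions for the sketch):

using System.Collections;
using Unity.VisualScripting;
using UnityEngine;

// A minimal sketch of a custom node with a coroutine-based input trigger.
// The Manager MonoBehaviour, StartListening and ListeningCompleted are assumptions.
[UnitTitle("User Input (Listen, Coroutine)")]
public class UserInputCoroutine : Unit
{
    [DoNotSerialize] public ControlInput enter;
    [DoNotSerialize] public ControlOutput exit;

    protected override void Definition()
    {
        // ControlInputCoroutine lets the node yield across frames while waiting.
        enter = ControlInputCoroutine("enter", Listen);
        exit = ControlOutput("exit");
        Succession(enter, exit);
    }

    private IEnumerator Listen(Flow flow)
    {
        var manager = Object.FindObjectOfType<Manager>(); // assumed Manager MonoBehaviour
        manager.StartListening();                         // kicks off speech recognition in the Manager
        while (!manager.ListeningCompleted)               // flag set by the Manager when recognition ends
        {
            yield return null;                            // wait a frame
        }
        yield return exit;                                // pass control to the next node
    }
}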

Coroutines in Unity do not return values (unless modified). That’s why, to communicate the Listening result back to the flow when the Listening process is completed, we might use an Application-level variable which is set in the Manager class inside the SpeechRecognizer’s Recognized event subscription, as shown below:
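A minimal sketch of the Manager side might look like this (the variable name UserInputText, the key/region placeholders and the buffering via the Update loop are assumptions; the buffering is there because the recognizer events fire on a background thread):

using Microsoft.CognitiveServices.Speech;
using Unity.VisualScripting;
using UnityEngine;

// A minimal sketch of the Manager (MonoBehaviour) side: the recognizer's
// Recognized event stores the result, and the Update loop publishes it into
// an Application-level Visual Scripting variable. Names are illustrative.
public class Manager : MonoBehaviour
{
    private SpeechRecognizer recognizer;
    private volatile string pendingResult;
    public bool ListeningCompleted { get; private set; }

    private void Start()
    {
        var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>"); // placeholders
        recognizer = new SpeechRecognizer(config);
        // The Recognized event fires on a background thread, so only buffer the value here.
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                pendingResult = e.Result.Text;
            }
        };
    }

    public async void StartListening()
    {
        ListeningCompleted = false;
        await recognizer.RecognizeOnceAsync();
        ListeningCompleted = true;
    }

    private void Update()
    {
        // Publish the buffered result to an Application-level variable on the main thread.
        if (pendingResult != null)
        {
            Variables.Application.Set("UserInputText", pendingResult);
            pendingResult = null;
        }
    }
}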

Now, to return the Listening result directly from the called method, we may want to use a Task instead of a Coroutine as shown below:
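A minimal sketch of the Task-based variant might look like this (key/region are placeholders; the CancellationToken would typically be cancelled by the Manager when the app quits):

using System.Threading;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

// A minimal sketch of the Task-based variant: the result is returned directly
// to the caller, with a CancellationToken for the case where the Unity app
// stops while the task is still running. Key/region are placeholders.
public static class ListeningTasks
{
    public static async Task<string> ListenAsync(CancellationToken cancellationToken)
    {
        var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
        using (var recognizer = new SpeechRecognizer(config))
        {
            var recognition = recognizer.RecognizeOnceAsync();
            // If cancellation wins the race (e.g. the app is quitting), give up on the result.
            var finished = await Task.WhenAny(recognition, Task.Delay(Timeout.Infinite, cancellationToken));
            if (finished != recognition)
            {
                return string.Empty;
            }
            var result = await recognition;
            return result.Reason == ResultReason.RecognizedSpeech ? result.Text : string.Empty;
        }
    }
}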

This pattern allows us to communicate the result directly back to the calling method without any extra middle-man variables involved. There’s one caveat though: while Coroutines are fully controlled by Unity, Tasks are not (they execute on their own thread), thus you may want to explicitly implement cancellation logic in the Task for scenarios where the Unity app stops while the task is still executing.

Because pretty much everything is a Web API (in the Cloud, or a micro-service on the Edge) nowadays, these 2 code patterns (Coroutines & Tasks) set us up well for success when developing custom C# nodes.

So far we’ve been focusing on low-level Script graphs for conversational flows and developing well-encapsulated custom C# nodes; however, Unity Visual Scripting offers something great at a higher level of abstraction called State graphs.

State graphs allow you to orchestrate business logic across different Script graphs, which results in an elegant and modular composition of your conversational logic. Below is an illustration of a State graph which consists of dedicated Script graphs for Greeting, Farewell, Idle, Unknown, Help, Fun & Hub. The central part of this architecture is the Hub, which recognizes the user’s higher-level intent and dispatches control to the right Script graph to fulfill a specific goal; once the goal has been fulfilled, control may be given back to the Hub to continue the conversation.

State Graph with main orchestration logic (in Hub Script Graph)

Using the Messaging pattern we can define and organize transitions between Script graphs in the State graph as shown below:

State Graph transitions with or without memory

Now it feels like we have a good handle on our Dialog Management System, doesn’t it? However, there are some additional considerations we wanted to highlight. Specifically, when we leveraged Bot Framework as our conversational backend, we enjoyed a built-in LUIS integration for intent recognition. Now, as we use the SpeechSynthesizer and SpeechRecognizer classes explicitly, we have to build the integration with LUIS ourselves. To do so, we might implement another custom C# node for Intent Recognition based on a Web API call to LUIS/CLU (see the sketch after the illustration below). Then, in the Unity Visual Scripting Canvas, we’d need to properly handle the different obtained results (LUIS/CLU domains such as string, int, confirmation, etc.). Oftentimes, when using LUIS/CLU, you may also want to leverage Entities (in addition to Intents). The following illustration highlights how you may want to use an Application-level variable in Unity Visual Scripting to aid with interpretation of the user’s Intent based on Entities extracted from the provided Utterance.

Script Graphs handling LUIS intents and entities
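As a rough sketch of the Web API call itself (the prediction URL shape, query parameters and response parsing below are illustrative assumptions; the exact request format depends on whether you target LUIS or CLU and should be taken from the respective documentation):

using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

// A minimal sketch of an intent-recognition call done directly from Unity.
// Endpoint, app id, key and the naive response parsing are placeholders/assumptions.
public class IntentRecognition : MonoBehaviour
{
    [SerializeField] private string predictionEndpoint = "https://<your-resource>.cognitiveservices.azure.com";
    [SerializeField] private string appId = "<your-app-id>";
    [SerializeField] private string subscriptionKey = "<your-key>";

    public IEnumerator RecognizeIntent(string utterance, System.Action<string> onTopIntent)
    {
        var url = $"{predictionEndpoint}/luis/prediction/v3.0/apps/{appId}/slots/production/predict" +
                  $"?subscription-key={subscriptionKey}&query={UnityWebRequest.EscapeURL(utterance)}";
        using (var request = UnityWebRequest.Get(url))
        {
            yield return request.SendWebRequest();
            if (request.result == UnityWebRequest.Result.Success)
            {
                // Naive parsing for the sketch; a JSON library would be used in practice.
                var json = request.downloadHandler.text;
                var marker = "\"topIntent\":\"";
                var start = json.IndexOf(marker);
                var topIntent = start >= 0
                    ? json.Substring(start + marker.Length, json.IndexOf('"', start + marker.Length) - start - marker.Length)
                    : "None";
                onTopIntent?.Invoke(topIntent);
            }
            else
            {
                onTopIntent?.Invoke("None");
            }
        }
    }
}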

The last hurdle (we wish :)) when developing Unity Visual Scripting custom C# nodes is finding relevant code examples for the requirements of your project. While the official documentation here provides some examples (and there’s a GitHub C# reference here), this might not be enough. What we found useful is to debug (in Visual Studio) a node that does something similar to what you are looking for; this way you can easily retrieve the reference code implementation of that node. Please see the illustration below, which demonstrates how to set a breakpoint and use the call stack to get to the code you are looking for:

Debugging Visual Scripting nodes

Now, equipped with this knowledge, let’s see how we may implement a custom C# node which takes a variable list of data inputs. The idea is to implement a Mad Lib style node which processes multiple inputs and combines them into one output (for example, Text or Ssml in and Text or Ssml out). The reference node for this task is Custom Event Trigger, which takes a variable number of inputs:

Custom Event Trigger reference node compared to some out-of-the-box options for MadLib

Having access to the reference implementation of the Custom Event Trigger node, we are able to quickly develop our own custom C# node to our spec.

MadLib custom C# node

Ultimately, the inputs for this node might be either Text or Ssml (as shown below):

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-JennyNeural">Hello World!</voice></speak>

Here’s the reference implementation of the TriggerCustomEvent node in out-of-the-box Unity Visual Scripting:

For the sake of simplicity, the implementation of the MadLib String node below only concatenates strings:
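A minimal sketch of such a node, borrowing the variable-argument pattern from the Trigger Custom Event reference implementation, might look like this (the header-inspectable argument count and its clamping range are assumptions based on that reference):

using System.Collections.Generic;
using System.Text;
using Unity.VisualScripting;
using UnityEngine;

// A minimal sketch of a MadLib-style node: the number of string inputs is set
// in the node header, and the inputs are simply concatenated for this example.
[UnitTitle("MadLib String")]
[UnitCategory("EmbodiedAI")]
public class MadLibString : Unit
{
    [SerializeAs(nameof(argumentCount))]
    private int _argumentCount;

    [DoNotSerialize]
    [Inspectable, UnitHeaderInspectable("Arguments")]
    public int argumentCount
    {
        get => _argumentCount;
        set => _argumentCount = Mathf.Clamp(value, 0, 10);
    }

    [DoNotSerialize] public ControlInput enter;
    [DoNotSerialize] public ControlOutput exit;
    [DoNotSerialize] public List<ValueInput> arguments { get; private set; }
    [DoNotSerialize] public ValueOutput result;

    protected override void Definition()
    {
        enter = ControlInput("enter", flow => exit);
        exit = ControlOutput("exit");

        arguments = new List<ValueInput>();
        for (var i = 0; i < argumentCount; i++)
        {
            var argument = ValueInput<string>("argument_" + i, string.Empty);
            arguments.Add(argument);
            Requirement(argument, enter);
        }

        result = ValueOutput<string>("result", Combine);
        Succession(enter, exit);
    }

    private string Combine(Flow flow)
    {
        // Only concatenates the inputs for simplicity (Text or SSML fragments).
        var builder = new StringBuilder();
        foreach (var argument in arguments)
        {
            builder.Append(flow.GetValue<string>(argument));
        }
        return builder.ToString();
    }
}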

Okay! Huh, now we’ve covered the Visual Scripting part. Let’s continue with some advanced topics. Next up is Interrupts.

Interrupts

Why are interrupts important? As described in the Best practices for Bots here, and following common sense, interrupts allow us to build more intuitive, immersive and user-friendly experiences. The illustration below draws parallels between how interrupts are supported in Bot Framework and which capabilities of the Azure Speech SDK can be used to power a robust interrupt sub-system.

Azure Speech SDK capabilities for building an interrupt sub-system & Bot Framework interrupts reference

There may be different interrupt types which you may want to enable in your solution, including User-triggered interrupts or System/Character-triggered interrupts. The illustration below outlines some solution architecture considerations for building an interrupt sub-system while taking into account complexity, performance, cost, etc.

Solution architecture considerations for interrupts sub-system

You can find more information about Azure Speech SDK Keyword Recognition here and Speaker Recognition here.
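As one possible approach to a user-triggered (“barge-in”) interrupt, a keyword recognizer can run while the character speaks and stop the ongoing synthesis when the keyword is detected. Below is a minimal sketch, assuming a keyword model file trained in Speech Studio:

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using System.Threading.Tasks;

// A minimal sketch of a user-triggered ("barge-in") interrupt: a keyword
// recognizer runs while the character is speaking, and stops the ongoing
// synthesis when the keyword is detected. File name and keyword are placeholders.
public static class InterruptExample
{
    public static async Task SpeakWithInterrupt(SpeechSynthesizer synthesizer, string ssml)
    {
        var keywordModel = KeywordRecognitionModel.FromFile("keyword.table"); // model trained in Speech Studio
        using (var audioConfig = AudioConfig.FromDefaultMicrophoneInput())
        using (var keywordRecognizer = new KeywordRecognizer(audioConfig))
        {
            var speaking = synthesizer.SpeakSsmlAsync(ssml);
            var interrupted = keywordRecognizer.RecognizeOnceAsync(keywordModel);

            var first = await Task.WhenAny(speaking, interrupted);
            if (first == interrupted)
            {
                // The user said the keyword: stop the character mid-sentence.
                await synthesizer.StopSpeakingAsync();
            }
            await keywordRecognizer.StopRecognitionAsync();
        }
    }
}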

Note: The current system for Keyword recognition is designed to detect a keyword or phrase preceded by a short amount of silence. Detecting a keyword in the middle of a sentence or utterance is not supported.

Another area which helps to make the experience more intuitive is Lip Sync and Visemes.

Visemes

A viseme is the visual description of a phoneme in spoken language. It defines the position of the face and mouth while a person is speaking. Each viseme depicts the key facial poses for a specific set of phonemes. You can use visemes to control the movement of 2D and 3D avatar models, so that the facial positions are best aligned with synthetic speech. Please find more information about Visemes here.

Below is an example implementation of Visemes in Unity using blend-tree:

Unity blend-tree for visemes and Phonemes to Visemes mapping

Depending on how many mouth blend-shapes you created, you may map phonemes to visemes differently. When speech is synthesized via the SpeechSynthesizer class, you can subscribe to viseme events as described here.

Visemes events received in chunks

Below is an illustration of how mouth shapes change over time to correspond to the synthesized speech:

Sample synthesized speech and visemes timeline

In essence, to implement visemes you want to lerp from one shape to another as the speech is being synthesized:
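A minimal sketch of that idea, assuming one mouth blend shape per viseme id and buffering the background-thread viseme events for the Update loop, might look like this:

using System.Collections.Concurrent;
using Microsoft.CognitiveServices.Speech;
using UnityEngine;

// A minimal sketch of viseme-driven lip sync: viseme events are buffered on a
// background thread and the Update loop lerps the mouth blend shape toward the
// target weight. The viseme-to-blend-shape mapping below is an assumption.
public class VisemeLipSync : MonoBehaviour
{
    [SerializeField] private SkinnedMeshRenderer faceRenderer;
    [SerializeField] private float lerpSpeed = 10f;

    private readonly ConcurrentQueue<uint> visemeQueue = new ConcurrentQueue<uint>();
    private int targetBlendShapeIndex;
    private float currentWeight;

    public void Subscribe(SpeechSynthesizer synthesizer)
    {
        // Called once after the SpeechSynthesizer is created (e.g. in the Manager).
        synthesizer.VisemeReceived += (s, e) => visemeQueue.Enqueue(e.VisemeId);
    }

    private void Update()
    {
        if (visemeQueue.TryDequeue(out var visemeId))
        {
            // Hypothetical mapping: assume one mouth blend shape per viseme id.
            var next = (int)visemeId % faceRenderer.sharedMesh.blendShapeCount;
            if (next != targetBlendShapeIndex)
            {
                faceRenderer.SetBlendShapeWeight(targetBlendShapeIndex, 0f); // release the previous shape
                targetBlendShapeIndex = next;
                currentWeight = 0f;
            }
        }

        // Lerp from the current mouth shape toward the target as speech plays.
        currentWeight = Mathf.Lerp(currentWeight, 100f, Time.deltaTime * lerpSpeed);
        faceRenderer.SetBlendShapeWeight(targetBlendShapeIndex, currentWeight);
    }
}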

One important consideration: if you want to handle offsets precisely and have behaviors associated with those offsets, you may want to collect the visemes first, generate an audio clip in Unity, and then play the audio clip of synthesized speech while triggering the associated events based on offsets. Also, by playing the synthesized speech as an audio clip, the sound playback is fully under Unity’s control.

We’ve already covered quite a bit of Body Tracking, including using GPUs, etc., in the previous article here. However, in this article we’ll provide a quick example of gesture detection.

Body Tracking

Computer Vision is yet another important aspect of making the experience intuitive. And being able to recognize what the user is doing in front of a device and react to it properly will likely go a long way towards making the experience relevant and memorable.

To implement Body Tracking for our scenario we leverage the Azure Kinect Body Tracking SDK, which has a flipped axis orientation and a pre-defined set of joints as shown below:

Body Tracking SDK coordinate system and joints system

While using the foundational Body Tracking template from here, you may inspect the internal structure of the Skeleton data as shown below:

Skeleton data in Body Tracking template

Then, if we want to implement gesture detection, for example for waving, we have a choice of using joint 3D positions or joint rotations (quaternions) as shown below:

Results

This reference paper explains the math for calculating the elbow angle to detect a waving gesture.

Below is a sample code listing for joint angle calculation for gesture detection:
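A minimal sketch of such a calculation, using the right shoulder-elbow-wrist joints, might look like this (the angle thresholds and the waving heuristic are illustrative assumptions):

using System;
using System.Numerics;
using Microsoft.Azure.Kinect.BodyTracking;

// A minimal sketch of a static joint-angle check for a waving gesture, using
// the right shoulder-elbow-wrist joints from the Body Tracking SDK skeleton.
// The angle thresholds are illustrative assumptions.
public static class GestureDetection
{
    public static float ElbowAngleDegrees(Skeleton skeleton)
    {
        var shoulder = skeleton.GetJoint(JointId.ShoulderRight).Position;
        var elbow = skeleton.GetJoint(JointId.ElbowRight).Position;
        var wrist = skeleton.GetJoint(JointId.WristRight).Position;

        // Angle between the upper arm (elbow->shoulder) and forearm (elbow->wrist) vectors.
        var upperArm = Vector3.Normalize(shoulder - elbow);
        var forearm = Vector3.Normalize(wrist - elbow);
        var cosine = Math.Clamp(Vector3.Dot(upperArm, forearm), -1f, 1f);
        return (float)(Math.Acos(cosine) * 180.0 / Math.PI);
    }

    public static bool LooksLikeWaving(Skeleton skeleton)
    {
        var elbowAngle = ElbowAngleDegrees(skeleton);
        var wristAboveElbow = skeleton.GetJoint(JointId.WristRight).Position.Y <
                              skeleton.GetJoint(JointId.ElbowRight).Position.Y; // Y axis points down in the SDK
        // A bent elbow with the hand raised is treated as a waving candidate frame.
        return wristAboveElbow && elbowAngle > 60f && elbowAngle < 150f;
    }
}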

This sample implementation is based on static joint angle(s); however, for a more robust dynamic gesture (or pose) detection implementation you may want to consider using Dynamic Time Warping (DTW). A well-known application of DTW is automatic speech recognition, where it copes with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications.

For a sample of DTW code, what if we ask ChatGPT to help? Here’s what we got from ChatGPT as a result:

Note: Many people have already attempted to use ChatGPT for code generation, which has already led to StackOverflow temporarily banning ChatGPT-generated answers on their platform, as people try to turn “chit chat” into a “cheat sheet” :). More on this topic is here.

The general idea is that DTW allows you to estimate a distance between 2 time-series sequences, which allows you to apply this algorithm dynamically over a period of time. That would also mean that you may consider using joint 3D positions for DTW instead of calculated joint angles. The complication with this approach, though, is that you need much more data for comparison (ground truth data points for the desired gestures/poses).
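As a rough illustration of the idea (a minimal sketch, not the ChatGPT output referenced above), a classic DTW distance between two sequences of joint positions can be computed as follows:

using System;
using System.Numerics;

// A minimal sketch of a classic Dynamic Time Warping distance between two
// sequences of 3D joint positions (e.g. a recorded "ground truth" gesture and
// a live one).
public static class Dtw
{
    public static float Distance(Vector3[] a, Vector3[] b)
    {
        int n = a.Length, m = b.Length;
        var cost = new float[n + 1, m + 1];

        // Initialize borders with infinity so the warping path must start at (1, 1).
        for (int i = 0; i <= n; i++) cost[i, 0] = float.PositiveInfinity;
        for (int j = 0; j <= m; j++) cost[0, j] = float.PositiveInfinity;
        cost[0, 0] = 0f;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                var d = Vector3.Distance(a[i - 1], b[j - 1]);
                // Classic recurrence: extend the cheapest of match, insertion, deletion.
                cost[i, j] = d + Math.Min(cost[i - 1, j - 1], Math.Min(cost[i - 1, j], cost[i, j - 1]));
            }
        }
        return cost[n, m];
    }
}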

We’ve just provided a quick GPT teaser in this section. Let’s get to it in the next section.

OpenAI

As we’ve already mentioned, LLMs (Large Language Models), Generative AI and multi-modal models are booming these days. The capabilities (both commercial and open source) being released nowadays truly enable us to build the next generation of experiences. However, before we can see wide adoption of these technologies in production, there’s still a long way to go in addressing issues with legislation (copyrights, etc.), moderation of content, and ethical & responsible use, just to name a few.

Despite all the hurdles, we are still excited to bring these technologies to life today within safety guardrails and for a select number of curated use cases. Probably everybody has already seen the recent announcement by OpenAI about the “New GPT-3 model: text-davinci-003”. And everybody has played with the OpenAI playground GPT-3 model(s) here.

Note: Please note that you can leverage deployed GPT-3 model(s) on Microsoft Azure Platform as described here.

For the purposes of this article we’re interested in conversational AI models such as text-davinci-002 and text-davinci-003. Below is a quick illustration of how to consume the stateless Web API for a text-davinci model and implement memory (by sending the history with each subsequent request):

GPT3 Playground (default chat)
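A minimal sketch of that idea in C# might look like this (model name, max_tokens, stop sequence and the use of System.Text.Json are illustrative choices; in Unity you might swap in a different JSON library):

using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// A minimal sketch: the Completions endpoint is stateless, so "memory" is kept
// by resending the accumulated conversation as part of the prompt each time.
public class DavinciChat
{
    private static readonly HttpClient http = new HttpClient();
    private readonly StringBuilder history = new StringBuilder();
    private readonly string apiKey;

    public DavinciChat(string apiKey) => this.apiKey = apiKey;

    public async Task<string> SendAsync(string userMessage)
    {
        history.AppendLine($"Human: {userMessage}");
        history.AppendLine("AI:");

        var payload = JsonSerializer.Serialize(new
        {
            model = "text-davinci-003",
            prompt = history.ToString(),   // the whole conversation so far = the "memory"
            max_tokens = 150,
            stop = new[] { "Human:" }
        });

        var request = new HttpRequestMessage(HttpMethod.Post, "https://api.openai.com/v1/completions")
        {
            Content = new StringContent(payload, Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        var response = await http.SendAsync(request);
        var json = await response.Content.ReadAsStringAsync();

        // Take the text of the first completion choice and append it to the history.
        using var doc = JsonDocument.Parse(json);
        var reply = doc.RootElement.GetProperty("choices")[0].GetProperty("text").GetString()?.Trim() ?? string.Empty;
        history.AppendLine(reply);
        return reply;
    }
}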

Here’s a closer look at Completions endpoint and OpenAI Python library:

Completions endpoint and OpenAI Python library

Notably, while these models are obviously awesome, they still suffer from certain issues when applied to a specific use case. For that exact reason, the OpenAI folks implemented improvements to the originally released models, described here and called InstructGPT. You can find more information on GitHub here.

In essence, as shown below, the areas of improvement included 1) Helpfulness, 2) Truthfulness and 3) Harmlessness, to better align these models with human intent while following specific instructions:

InstructGPT traits

The GPT family of models has so many potential applications, and as described here, “the implementation is a key” to applying them to different use cases. For example, the recently released ChatGPT implementation is very useful for conversational AI scenarios. Please find more information about ChatGPT here.

ChatGPT’s “stateful” Web API (with conversation_id, message_id(s), etc. assigned to your chat session) facilitates implementation of continuous conversational (chat) scenarios.

ChatGPT Free Research Preview

To fine-tune GPT model(s) for your scenarios it is important to set the overall context and do proper Prompt Engineering. OpenAI provides guidance on Prompt Engineering on their website; for example, the guidance for the Text Completion endpoint is provided here. More information about using OpenAI APIs for different use cases can be found on GitHub in this Cookbook here.

In addition to Large Language Models, Generative AI extends to images, speech, etc., and to multi-modal model(s) covering the whole spectrum of senses for building the next generation of immersive experiences.

DALL·E, described here, is one of those models. DALL·E 2 is a new AI system that can create realistic images and art from a description in natural language. Below is an illustration with images generated by Mini DALL·E for the “Embodied Artificial Intelligence” text prompt:

Mini DALL·E generated images

Mini DALL·E is available here. And here you can find DALL·E 2 released by OpenAI. A similar text prompt in DALL·E 2 yields the following images:

DALL·E 2 generated images

Actually, this time we applied a modified text prompt to illustrate how you can control image attributes by means of the text prompt. As you can see, similar Prompt Engineering techniques apply to image generation using a text prompt. More Prompt Engineering guidance for DALL·E can be found here.

There’s no limit to perfection, and there’s more to be covered on the topic of building a robust and scalable Embodied AI solution, for example, hardware considerations (say, improving microphone quality for quality speech recognition) or leveraging a CMS (Content Management System) for centralized creative content management. But this might be a story for another day.

In Closing

Building immersive and intuitive experiences with Embodied AI is now more possible and achievable than ever before. The progress powering the future of The Metaverse will continue, opening even more doors for building the next generation of experiences. We believe that the Microsoft AI Platform is strongly positioned to support the future of The Metaverse. And in this article we provided yet another illustration of how developers, artists and designers can collaborate to build robust and scalable Embodied AI solutions for Hybrid Cloud on the Microsoft AI Platform.

Disclaimer

Opinions expressed are solely of the author and do not express the views and opinions of author’s current employer, Microsoft.
