Microsoft HoloLens 2 MR headset and references to digital avatars in The Metaverse

Thoughts about Embodiment of AI

Alex Anikiev

--

Business

Rapid innovation, incubation of new ideas and turning them into a repeatable and scalable business is top of mind for many small startups and large corporations across different industries. Leveraging AI to achieve your goals is imperative these days. Based on our previous experience building products and delivering projects using AI, we’d like to highlight 3 important Engineering aspects for rapidly building a successful AI solution for a well-defined business case in the modern world: 1) Solution Templates; 2) Design Patterns; 3) Trusted & Reliable Partner for AI Platform.

3 Engineering aspects for rapidly building a successful AI solution

First off, everything requires a solid foundation, and in this article we’ll focus on leveraging a Trusted & Reliable AI Platform and the Suite of Cognitive Services from Microsoft Azure. Then, starting with an appropriate Solution Template(s) will get you going on the right foot without reinventing the wheel. Specifically, the Azure Samples GitHub page here is a great place to start. Finally, standardizing the development process around proven Design Pattern(s) will ensure Engineering excellence for the solution you build, allow your Engineers to speak the same language and be extremely productive from the get-go. Namely, the Azure Engineering Center here is a great resource for you to check out.

Rapid innovation leveraging AI is a vital component of success for small and large companies nowadays and allows them not only to overcome the difficult economic times we live in, but also to succeed in highly competitive markets.

Undeniably, some of the most amazing innovations are currently happening around The Metaverse. The Metaverse is still being defined, but some of its qualities and attributes may already be distilled into the following list: virtual worlds, 3D, real-time rendered, interoperable network, massively scaled, persistence, synchronous, unlimited users and individual presence. Based on this list, in his book “The Metaverse” Matthew Ball crafted the following definition of The Metaverse: “A massively scaled and interoperable network of real-time rendered 3D virtual worlds that can be experienced synchronously and persistently by an effectively unlimited number of users with an individual sense of presence, and with continuity of data, such as identity, history, entitlements, objects, communications, and payments”. Ultimately The Metaverse will likely be consumed in many different ways including applications in the Industrial Metaverse, the Enterprise Metaverse, etc. However, one ubiquitous need which will likely emerge big (and is already emerging fast) across different Metaverse applications is the use of digital Avatars and the Embodiment of AI.

Embodied AI may take different forms, from cartoonish characters just acting funny to entertain You to super realistic humanoids interacting with You in a natural and intuitive way, just like humans do. We also think that Embodied AI is destined to become the ultimate computer interface for Us in The Metaverse.

Note: For the cover of this article we borrowed some visuals from this publication which provides a point of view on the collaborative aspect of The Metaverse. Also for more context about rapid prototyping and simulations please check out our thoughts in the article here.

Scenario

For the purposes of this article we chose a fairly generic Embodied AI use case: a character created and rigged in a 3D modeling package such as Autodesk Maya, then exported from there as an asset in a compatible format (for example, FBX) to be imported into one of the game engines such as Unity or Unreal for bringing it to life. At the minimum, we want our character to be able to Move itself, See us, Hear us, Understand us and Speak with us in a meaningful way.

For our solution we are going to use the Unity game engine for orchestration of character behaviors and mainly focus on vision (computer vision) and speech (STT, TTS) capabilities using the Azure Cognitive Services Suite. For the sake of simplicity we deliberately keep the following out of scope in this article: advanced hardware considerations, more sophisticated use of OpenAI, Edge deployment aspects, security/scalability and other production-grade requirements. These topics will be highlighted in our future articles though.

Note: The challenges with orchestration for multimodal interactive systems are not brand new, thus please consider checking out The Platform for Situated Intelligence here: https://github.com/microsoft/psi (and a recent showcase here) for additional inspiration and context.

If you need to learn more about Unity (2021.3) please review the official documentation here. We also find this comparison between Unity and Unreal features and terminology quite insightful, succinct and helpful.

Let’s get to the Architecture and Code now. We hope you enjoy the ride!

The first area we cover will be Computer Vision. We want our character to See us and Understand what we are doing in front of it.

Vision (Computer Vision :))

To enable our character’s sight we use the Azure Kinect DK: https://azure.microsoft.com/en-us/products/kinect-dk/

Azure Kinect is a cutting-edge spatial computing developer kit with sophisticated computer vision and speech models, advanced AI sensors, and a range of powerful SDKs that can be connected to Azure Cognitive Services. There’re 2 SDKs for Azure Kinect we are particularly interested in: 1) the Sensor SDK leveraging the RGB camera for still pictures; 2) the Body Tracking SDK leveraging the Depth camera for skeletal tracking.
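
As a quick taste of the Sensor SDK, here’s a minimal sketch of grabbing a single color frame in C# (assuming the Microsoft.Azure.Kinect.Sensor NuGet package; the camera configuration values below are just one reasonable example):

    // Minimal sketch: grab one color frame ("still picture") via the Azure Kinect Sensor SDK
    using System;
    using Microsoft.Azure.Kinect.Sensor;

    class ColorFrameGrabber
    {
        static void Main()
        {
            using (Device device = Device.Open(0))
            {
                device.StartCameras(new DeviceConfiguration
                {
                    ColorFormat = ImageFormat.ColorBGRA32,
                    ColorResolution = ColorResolution.R720p,
                    DepthMode = DepthMode.NFOV_Unbinned,
                    CameraFPS = FPS.FPS30
                });

                using (Capture capture = device.GetCapture())
                {
                    Image color = capture.Color;   // BGRA32 pixels of the still picture
                    Console.WriteLine($"Captured a {color.WidthPixels}x{color.HeightPixels} color image");
                }

                device.StopCameras();
            }
        }
    }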

This is how the Azure Kinect 3D visualization tool looks after you install the Azure Kinect Body Tracking SDK on Windows and physically connect the device. We’re looking good over there, right? :)

Azure Kinect (and its Body Tracking SDK)

Now we’ll further focus on Azure Kinect Body Tracking in Unity. Please consider this Unity template to get started: https://github.com/microsoft/Azure-Kinect-Samples/tree/master/body-tracking-samples/sample_unity_bodytracking

Azure Kinect Body Tracking GitHub Unity Sample

To properly set up Azure Kinect Body Tracking please consult this article.

Note: The Body Tracking SDK requires an NVIDIA GPU installed in the host PC if you want to leverage the GPU instead of the CPU. The recommended body tracking host PC configuration is described on the system requirements page.

In the following code snippet we provide a simple example of how the Azure Kinect Body Tracking SDK may be leveraged to enable our character to Track us (See us and Understand what we are doing). Specifically, as a character we might be interested in knowing if there’s a person in front of us or not; if so, whether this person is paying attention or not, and what some of the anthropometric characteristics of this person are so we can better assist them:
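
A condensed sketch along those lines might look as follows (assuming the Microsoft.Azure.Kinect.BodyTracking C# wrapper; the joints, thresholds and the yaw formula are illustrative and depend on the camera coordinate system):

    // Hedged sketch: is there a person in front of our character, and are they paying attention?
    using System;
    using System.Numerics;
    using Microsoft.Azure.Kinect.BodyTracking;

    static class PersonTracking
    {
        public static void AssessPerson(Frame frame)
        {
            if (frame.NumberOfBodies == 0)
            {
                Console.WriteLine("Nobody in front of the character");
                return;
            }

            Skeleton skeleton = frame.GetBodySkeleton(0);
            Joint head = skeleton.GetJoint(JointId.Head);
            Joint foot = skeleton.GetJoint(JointId.FootLeft);

            // Rough anthropometry: head-to-foot distance (depth camera units are millimeters)
            float approxHeightMeters = Math.Abs(head.Position.Y - foot.Position.Y) / 1000f;

            // Rough gaze assessment: derive yaw from the head joint orientation and treat
            // "roughly facing the camera" as paying attention (the 30-degree threshold is arbitrary,
            // and the exact axis mapping depends on the depth camera coordinate system)
            Quaternion q = head.Quaternion;
            double yaw = Math.Atan2(2.0 * (q.W * q.Y + q.X * q.Z),
                                    1.0 - 2.0 * (q.X * q.X + q.Y * q.Y)) * 180.0 / Math.PI;
            bool payingAttention = Math.Abs(yaw) < 30.0;

            Console.WriteLine($"Person detected, ~{approxHeightMeters:F2} m tall, attentive: {payingAttention}");
        }
    }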

To further clarify the example above, the next visual illustrates some calculations we might be interested in doing by leveraging pitch, yaw and roll to assess the person’s gaze (paying attention or not) based on different joints (for example, head or nose).

Gaze assessment using Azure Kinect Body Tracking SDK

From the performance perspective, this Unity Body Tracking sample runs the body tracking routine in a separate background thread, but it would also be great to leverage the GPU for that instead of the CPU. The general instructions for enabling CUDA support are described here in the “For CUDA” section, and the Processing Mode is set to GPU by default (here), which you may want to change to CUDA. For better clarity we illustrate how to enable CUDA for this Unity Body Tracking sample in more detail below:

To begin with, you reinstall the NuGet packages for the project, either via the Terminal by running a command or conveniently in Visual Studio by pressing a button.

Then you install the NVIDIA CUDA Toolkit.

Then you install NVIDIA cuDNN (or rather, download a zip archive and extract it into a folder).

Then you make sure that you merge all DLLs appropriately

We merged the cuDNN DLLs into the NVIDIA GPU Computing Toolkit folder which corresponds to the CUDA_PATH environment variable.

Also, according to this GitHub issue, you will want to copy the DLLs highlighted below into the root of your project and into Assets > Plugins.

Here’s the list of packages with their versions for compatibility reference

Finally, you restart your computer and run the sample using CUDA Processing Mode (here). BAM! Now it is running much faster on GPU!
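
In code, the switch essentially boils down to the processing mode passed into the tracker configuration (a hedged sketch based on the Body Tracking SDK 1.1.x C# wrapper; ‘calibration’ is assumed to come from the device):

    // Hedged sketch: asking the body tracker to use CUDA instead of the default GPU mode
    // ('calibration' is assumed to come from device.GetCalibration(...))
    var trackerConfiguration = new TrackerConfiguration
    {
        ProcessingMode = TrackerProcessingMode.Cuda,   // the sample's default is TrackerProcessingMode.Gpu
        SensorOrientation = SensorOrientation.Default
    };
    Tracker tracker = Tracker.Create(calibration, trackerConfiguration);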

Azure Kinect Body Tracking in Unity with CUDA support

Note: For the above experiment we used an NVIDIA GeForce RTX GPU.

Next we will cover Speech to enable our character to Hear us and Speak to us

Speech

To enable our character’s hearing & speaking we use the Azure Speech SDK for Unity: https://aka.ms/csspeech/unitypackage

The Speech SDK (software development kit) exposes many of the Speech service capabilities, so you can develop speech-enabled applications. There’re 2 capabilities in the Azure Speech SDK we are particularly interested in: 1) Speech-To-Text (or Custom Speech-To-Text); 2) Text-To-Speech (Neural Text-To-Speech).

Here: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/voice-assistants is a great reference about voice assistant scenarios to review as we progress forward. By using voice assistants with the Speech service, developers can create natural, human-like, conversational interfaces for their applications and experiences.

Technically, by using the SpeechSynthesizer and SpeechRecognizer classes we may enable our character to Speak to us (TTS) and Hear us (STT). There’re Unity templates which illustrate this, as presented below.

Azure Speech SDK GitHub Unity Samples
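
For orientation, the core of those samples boils down to something like this minimal sketch (the subscription key, region and voice name are placeholders you would substitute with your own):

    // Minimal sketch of Hear (STT) and Speak (TTS) with the Azure Speech SDK in Unity
    using Microsoft.CognitiveServices.Speech;
    using UnityEngine;

    public class HearAndSpeak : MonoBehaviour
    {
        public async void RunOnce()
        {
            var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourRegion");

            // Hear us: single-shot speech-to-text
            using (var recognizer = new SpeechRecognizer(config))
            {
                var result = await recognizer.RecognizeOnceAsync();
                Debug.Log($"Recognized: {result.Text}");
            }

            // Speak to us: text-to-speech with a neural voice
            config.SpeechSynthesisVoiceName = "en-US-JennyNeural";
            using (var synthesizer = new SpeechSynthesizer(config))
            {
                await synthesizer.SpeakTextAsync("Hello! Nice to meet you.");
            }
        }
    }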

But there’s also another Unity template which deserves our attention: https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/unity which allows you to enable PVA (Power Virtual Agents) functionality with speech in Unity.

Azure Speech SDK with Bot Framework GitHub Unity Sample

The interesting part about this Unity template is that it leverages the DialogServiceConnector class, which can be used in conjunction with Bot Framework (and Bot Framework Composer) to build more sophisticated conversational flows and effectively allows the character to Speak and Hear, all through the same WSS connection to a deployed Bot App.
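
At its core, the wiring with DialogServiceConnector might look roughly like this (a hedged sketch; the key and region are placeholders and the Bot App is assumed to be already registered with the Direct Line Speech channel):

    // Hedged sketch: Hear and Speak over a single WSS connection to a Bot App
    using Microsoft.CognitiveServices.Speech.Dialog;
    using UnityEngine;

    public class BotConnection : MonoBehaviour
    {
        private DialogServiceConnector connector;

        public async void ConnectAndListen()
        {
            var botConfig = BotFrameworkConfig.FromSubscription("YourSubscriptionKey", "YourRegion");
            connector = new DialogServiceConnector(botConfig);

            // Bot responses (activities) arrive here; audio is included when TTS is generated server-side
            connector.ActivityReceived += (sender, e) =>
            {
                Debug.Log($"Activity received, has audio: {e.HasAudio}");
                Debug.Log(e.Activity);   // raw Bot Framework activity JSON
            };

            await connector.ConnectAsync();
            await connector.ListenOnceAsync();   // capture one utterance and send it to the bot
        }
    }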

The setup for a Bot App to leverage Azure Cognitive Service for Speech via the Direct Line Speech channel is illustrated in this convenient one-slide summary below:

Bot Framework with Direct Line Speech channel tutorial

Note: Please consider using Single tenant in the App Registration and check the Default bot checkbox in the Direct Line Speech channel configuration.

There’re some other useful Unity templates you may want to review to beef up your knowledge of Azure Speech SDK in Unity

Another tangible gain you get from the Azure Speech SDK Samples is that you quickly get to learn how to wire up different events for various classes. For example, the SpeechRecognizer class allows for 2 recognition modes: 1) Single-shot recognition (RecognizeOnceAsync); 2) Continuous recognition (StartContinuousRecognitionAsync & StopContinuousRecognitionAsync), which may both be useful in different scenarios. Specifically, you may need to listen while your character is speaking by leveraging continuous recognition in the background. When you listen once, you will be able to catch a single utterance (or none) until a pause/silence is detected, or up to a maximum of 15 seconds of speech.
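
For instance, wiring up continuous recognition in the background looks roughly like this (a sketch; where exactly you plug the recognized text into your character logic is up to you):

    // Sketch: continuous recognition in the background with SpeechRecognizer events
    using Microsoft.CognitiveServices.Speech;
    using UnityEngine;

    public class ContinuousListener : MonoBehaviour
    {
        private SpeechRecognizer recognizer;

        public async void StartListening()
        {
            var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourRegion");
            recognizer = new SpeechRecognizer(config);

            recognizer.Recognizing += (s, e) => Debug.Log($"Partial: {e.Result.Text}");
            recognizer.Recognized += (s, e) =>
            {
                if (e.Result.Reason == ResultReason.RecognizedSpeech)
                    Debug.Log($"Final: {e.Result.Text}");
            };
            recognizer.Canceled += (s, e) => Debug.Log($"Canceled: {e.Reason}");

            await recognizer.StartContinuousRecognitionAsync();
        }

        public async void StopListening()
        {
            await recognizer.StopContinuousRecognitionAsync();
        }
    }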

Azure Speech SDK Samples — Class events (Focused view: Recognizer & Synthesizer)

Note: Currently, when leveraging the DialogServiceConnector class, once you start using one method of recognition (say, listen once), you won’t be able to switch to the other one (listen continuously) at run-time.

Below is a broader view of the Azure Speech SDK Samples, their main classes and the event subscription implementations you can quickly learn from:

Azure Speech SDK Samples — Class events (Broader view)

The PVA (Power Virtual Agents) canvas may not be enough for designing sophisticated conversational flows, and in this case you may consider leveraging Bot Framework Composer, which is a part of PVA’s official extensibility story. Bot Framework Composer can be installed on your local machine per the instructions here, and it is convenient for designing, debugging and deploying Bot App(s).

As a quick Bot Framework Composer 101, here we’ll just highlight that Bot Framework Composer comes with pre-defined templates leveraging LUIS (Language Understanding Intelligent Service) and QnA (Question and Answer) Maker.

Bot Framework Composer Templates for LUIS and QnA Maker

Note: QnA Maker has recently been deprecated in favor of its successor, Custom Question Answering, which is now a part of Language Studio.

Now that we have figured out the plumbing for conversational AI, we also want our character to Speak with us emotionally and have a mood, just like humans do. For these purposes we can leverage Azure Speech Service Neural Voices and Speaking Styles.

Azure Speech Service Neural Voices in Bot Framework Composer

Note: For now we’ll defer the discussion about Lip Syncing to later in this article

Depending on your needs and architecture, there may be multiple options to implement support for Speaking Styles in Bot Framework Composer, as presented below:

Using Speaking Styles in Bot Framework Composer

The most important part is to decide whether you generate speech on the Server or on the Client. If you generate speech on the Server, then the Client receives a stream of bytes (already generated speech). And if you choose to send plain text (or SSML (Speech Synthesis Markup Language)) from the Server to the Client, then the Client needs to generate speech from that text by itself. There’re pros & cons for each approach depending on your needs and architecture.

Below is an illustration of how we could adjust pitch, rate, volume and contour for a specific utterance for which speech is generated on the Server side:

How to adjust speech generated on the Server side
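
For reference, the corresponding SSML might look roughly like this (a sketch; the voice name, speaking style and prosody values are illustrative). Whether this markup is embedded in a Composer response on the Server or passed to SpeakSsmlAsync on the Client, the shape is the same:

    // Sketch: adjusting pitch, rate, volume and contour (and a speaking style) via SSML;
    // the voice name and the values below are illustrative only
    string ssml = @"
    <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
           xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
      <voice name='en-US-JennyNeural'>
        <mstts:express-as style='cheerful'>
          <prosody pitch='+5%' rate='-10%' volume='+20%' contour='(0%,+20Hz) (50%,-10Hz) (100%,+5Hz)'>
            Happy to see you again!
          </prosody>
        </mstts:express-as>
      </voice>
    </speak>";

    await synthesizer.SpeakSsmlAsync(ssml);   // 'synthesizer' is a SpeechSynthesizer instance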

Please consider checking out Speech Studio portal and its Audio Content Creation feature which facilitates production of creative content using SSML (Speech Synthesis Markup Language).

Once your creative designers produce the desired conversational flows, the next question is whether we can enrich those flows and bring an element of surprise into conversations so that our character may say the same thing in different ways and “never” sound the same, just like humans do. To achieve this you may consider building a Content Enrichment Pipeline leveraging Azure OpenAI Service (with the family of models behind it, including GPT).

Content Enrichment Pipeline using Azure OpenAI Service

While you may be tempted to leverage Azure OpenAI Service for real-time request-response conversation within the provided context (and if you haven’t used the service before, you would certainly be surprised by the quality of the intelligence you can achieve with it :)), a more practical option might be to leverage it for “offline” content generation to produce variations of the existing content (for example, via paraphrasing). This way you will still have a Human-in-the-Loop to proof-read and approve (cherry-pick) the generated content, and you won’t need to spend a lot of effort on real-time content moderation.
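
As an illustration, such an “offline” enrichment step could be a small script calling the Azure OpenAI completions REST endpoint to paraphrase existing lines (a hedged sketch; the resource name, deployment name and api-version are placeholders you would substitute with your own):

    // Hedged sketch: "offline" paraphrasing of existing content via the Azure OpenAI REST API
    using System;
    using System.Net.Http;
    using System.Text;
    using System.Text.Json;
    using System.Threading.Tasks;

    public static class ContentEnrichment
    {
        public static async Task<string> ParaphraseAsync(string line)
        {
            using var http = new HttpClient();
            http.DefaultRequestHeaders.Add("api-key", Environment.GetEnvironmentVariable("AOAI_KEY"));

            var body = JsonSerializer.Serialize(new
            {
                prompt = $"Paraphrase the following line in 3 different ways:\n{line}",
                max_tokens = 200,
                temperature = 0.8
            });

            var response = await http.PostAsync(
                "https://your-resource.openai.azure.com/openai/deployments/your-deployment/completions?api-version=2022-12-01",
                new StringContent(body, Encoding.UTF8, "application/json"));

            // The generated variations come back in choices[].text and still go through
            // Human-in-the-Loop review before they make it into the conversational flow
            return await response.Content.ReadAsStringAsync();
        }
    }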

All of the above gets us pretty close to bringing our character to life in an exciting and meaningful way. However, there’re some last mile considerations we need to account for before the solution is truly useful and reliable. One of them is to reassess the quality of speech recognition and how relevant it is to our industry domain. Specifically, if we are not able to clearly recognize what a person is communicating to our character, it will be difficult for the character to assist this person in a meaningful way (the Garbage-in, Garbage-out principle). To properly fine-tune our STT (Speech-to-Text) capabilities for a specific problem domain you may consider training Custom Speech-to-Text models in the Cloud and deploying the corresponding endpoints in the Cloud, or bringing those trained models for inference on the Edge. Specifically, this is how you can “adjust” STT to properly recognize concrete terminology for a specific industry.

Training Custom Speech-to-Text models in the Cloud for Hybrid deployment
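
On the Client side, pointing the Speech SDK at a trained Custom Speech model then comes down to referencing its deployed endpoint (a sketch; the endpoint id is a placeholder from your own Speech Studio deployment):

    // Sketch: using a deployed Custom Speech-to-Text model instead of the baseline model
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourRegion");
    config.EndpointId = "your-custom-speech-endpoint-id";   // from the Custom Speech deployment in Speech Studio

    var recognizer = new SpeechRecognizer(config);   // now biased towards your domain terminology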

At this point our character is quite capable of surviving in the modern world with vision, hearing & speaking :). And it’s time to put everything together to allow it to Move, Act & React, etc. (and enjoy its physical presence in digital life, and/or vice versa :))

Unity

To bring our character to life we’ll build a Unity App to orchestrate our character’s behaviors. To implement such an orchestrator we can leverage the State Machine design pattern. At the center of this orchestrator we have a MonoBehaviour script attached to a Game Object (our character) which will coordinate transitions from one state to another.

Sample implementation of the State Machine Design Pattern using MonoBehaviours in Unity

In this sample implementation we use the DialogServiceConnector class to connect to the conversational AI logic deployed as a Bot App, and the Unity Animator class to trigger the right behaviors at the right time.
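
A minimal sketch of such a state machine orchestrator might look as follows (the state names, Animator triggers and transition logic are illustrative, not the exact implementation from the figure above):

    // Hedged sketch of a State Machine orchestrator as a MonoBehaviour
    using UnityEngine;

    public class CharacterOrchestrator : MonoBehaviour
    {
        private enum State { Idle, Listening, Thinking, Speaking }

        [SerializeField] private Animator animator;
        private State current = State.Idle;

        void Update()
        {
            switch (current)
            {
                case State.Idle:
                    // e.g. Azure Kinect reports a person paying attention -> start listening
                    break;
                case State.Listening:
                    // e.g. DialogServiceConnector.ListenOnceAsync() completed -> wait for bot reply
                    break;
                case State.Thinking:
                    // e.g. ActivityReceived fired -> start speaking the response
                    break;
                case State.Speaking:
                    // e.g. synthesis finished -> go back to Idle or Listening
                    break;
            }
        }

        private void TransitionTo(State next, string animatorTrigger)
        {
            current = next;
            animator.SetTrigger(animatorTrigger);   // drive the right animation for the new state
        }
    }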

There’s a lot of literature available on the topic of architectures for embodied conversational characters (for example, here). The science behind human interactions and the anatomy of a dialog has also been well described over the years. By and large, a typical dialog is composed of a set of speaking and listening turns. However, compared to the original Unity Virtual Assistant template here, we may still want to extend it to handle more sophisticated dialog logic, such as consecutive speaking and listening turns, to better match real-life experiences.

Handling consecutive speaking and listening turns in Unity Virtual Assistant Template

The standard Unity Virtual Assistant template may be modified appropriately here and here to support the desired logic for consecutive turns

Note: Consecutive Bot responses will be submitted from the Bot App to the Client App as soon as they are ready (but will still arrive in the proper sequence), thus queuing logic is required to properly process them in the Client App (Unity).
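
A sketch of such queuing logic in Unity (the ActivityReceived callback fires on a background thread, so we only enqueue there and drain the queue on the main thread in Update):

    // Sketch: queue consecutive bot responses arriving over WSS and process them
    // one by one on the Unity main thread
    using System.Collections.Concurrent;
    using UnityEngine;

    public class BotResponseQueue : MonoBehaviour
    {
        private readonly ConcurrentQueue<string> pendingActivities = new ConcurrentQueue<string>();
        private bool busySpeaking = false;

        // Called from DialogServiceConnector.ActivityReceived (background thread)
        public void OnActivityReceived(string activityJson)
        {
            pendingActivities.Enqueue(activityJson);
        }

        void Update()
        {
            // Only start the next turn when the previous one has finished playing
            if (!busySpeaking && pendingActivities.TryDequeue(out string activityJson))
            {
                busySpeaking = true;
                // ... play audio / synthesize speech for this activity,
                //     then set busySpeaking = false when done ...
            }
        }
    }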

Another exciting way to make our character more vivid is to use Lip Syncing. The Azure Speech SDK allows you to implement Lip Syncing via Visemes, as described here.

Lip Syncing (Pitch & Volume) using Blend Tree in Unity

There are multiple ways to implement visemes: 1) by leveraging up to 21 SVG shapes as described here, or 2) by leveraging up to 55 3D blend shapes as described here (these can be used to “light up” the avatar’s whole face, not only to do lip syncing).
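
Either way, subscribing to viseme events on the Client is the common starting point (a sketch; how you map viseme ids onto your character’s blend shapes or Blend Tree is rig-specific):

    // Sketch: receiving viseme events from the Speech SDK to drive lip sync
    // ('synthesizer' is a SpeechSynthesizer instance)
    synthesizer.VisemeReceived += (s, e) =>
    {
        // e.VisemeId tells which mouth pose to show, e.AudioOffset tells when (in 100-ns ticks)
        Debug.Log($"Viseme {e.VisemeId} at {e.AudioOffset / 10000} ms");
        // ... schedule the corresponding blend shape / Blend Tree pose on the character ...
    };

    await synthesizer.SpeakTextAsync("Let's sync those lips!");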

And just like in the previous section, we might need to account for some last mile considerations as well. One of them is about what exactly we communicate from the Server (Bot App) to the Client (Unity App) and what transport we want to use for the communication. Speaking about the transport options, we started with Direct Line Speech over WSS (Secure Web Sockets) with the DialogServiceConnector class based on the standard Virtual Assistant Unity template, but technically, if we submit text (or SSML) from the Server to the Client, we might also consider Direct Line, which also supports WSS (with the DirectLineClient class).

Direct Line Speech vs Direct Line transport

Another more nuanced consideration is about the quality of generated speech. Currently the best audio quality you can produce in Audio Content Creation using SSML in Speech Studio is 48 kHz. And in the standard Virtual Assistant Unity template the audio quality for the received stream of bytes (WAV) is expected to be 16 kHz (sampleRate) by default, as described here. Thus if the quality of generated speech is important for your scenario (you want the maximum quality of 48 kHz) and you plan to leverage Visemes for Lip Sync, which are only available via the SpeechSynthesizer class VisemeReceived event subscription on the Client, you may want to consider generating speech on the Client (Unity App) instead of the Server (Bot App).

Generating speech on the Server (Bot App) vs Generating speech on the Client (Unity App)

By now our character is even more capable, with abilities to See, Hear, Speak and Move. This is certainly great, but from the conversational AI perspective so far we have always assumed that a person interacting with the character provides verbal feedback. What if we want to enable non-verbal feedback (nods, grimaces, gestures, etc.) from a person? The challenge will be to incorporate it cohesively into our conversational AI flow in the Bot App.

Here’s a quick mock-up for a “Game of hands” where we would like a person to raise their hand(s) in response to the character’s prompt. This is what the decision-making tree would look like if we can capture non-verbal feedback.

“Game of hands” mock-up

There’re a few things we need to be mindful of to be able to incorporate this game into a conversational AI flow using the Bot App. One of them is that we’d like our conversational flow to be cohesive (and single), otherwise we might end up with multiple disjoint flows like the ones below.

Game of hands disjoint flows based on separate dedicated intents

This approach might work, but it is extremely hard to maintain, debug and manage because it introduces a lot of smaller disconnected sub-flows. Using recognizable intents in this case is not ideal either, because we’ll be sending concrete events (we don’t need to “guess”/recognize them) once we track the person’s activity via Azure Kinect in the Unity App; that’s why we can switch to using Event activities, as presented below.

Recognizable intents vs Event activities in Bot Framework Composer
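
From the Unity App side, sending such a concrete event to the Bot App might look like this (a sketch; the event name and payload are illustrative, and ‘connector’ is the DialogServiceConnector instance from earlier):

    // Sketch: sending an Event activity (no intent recognition needed) from Unity to the Bot App
    string eventActivity = @"{
      ""type"": ""event"",
      ""name"": ""HandsRaised"",
      ""value"": { ""leftHand"": true, ""rightHand"": false }
    }";

    await connector.SendActivityAsync(eventActivity);   // 'connector' is the DialogServiceConnector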

The final piece of this puzzle will be to find the right prompt type to be able to communicate the person’s activity back to the conversational flow in the Bot App, a kind of request which will wait for us to submit a non-verbal response.

Types of prompts in Bot Framework Composer including User input (Attachment)

Most of the prompts in the list above are based on verbal responses (in the context of voice assistants), however we may use the Attachment(s) type, which expects a non-verbal response and may be associated with any kind of object including, for example, JSON. Upon a closer look below we can see that we could submit an object as a part of the response from the Unity App back to the Bot App to fulfill the request.

Sample implementation of non-verbal feedback using User input (Attachment) type

This is what it looks like behind the scenes if you test it in the Bot Framework Emulator. You can submit a custom activity (JSON-based) as illustrated below.

Handling Attachments in Bot Framework Emulator

Similarly, you can submit a custom activity (JSON-based) in the Windows Voice Assistant Client app when testing your voice-enabled Bot App.

Submitting custom activities in Windows Voice Assistant Client

Ultimately, to illustrate the concept of handling non-verbal feedback in the Bot App, we wrote a small test client app (a simple Console app).

The main idea is to simulate a non-verbal interaction with the voice-enabled Bot App by submitting requests and receiving responses. We still use the DialogServiceConnector class to communicate with the Bot App over WSS, just like in the standard Unity Virtual Assistant template.
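
A condensed sketch of what such a console test client might look like (the key, region and attachment payload are placeholders; the real app obviously does a bit more):

    // Hedged sketch: a console test client simulating non-verbal feedback
    // by submitting an activity with a JSON attachment to the voice-enabled Bot App
    using System;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech.Dialog;

    class Program
    {
        static async Task Main()
        {
            var config = BotFrameworkConfig.FromSubscription("YourSubscriptionKey", "YourRegion");
            var connector = new DialogServiceConnector(config);

            connector.ActivityReceived += (s, e) => Console.WriteLine($"Bot: {e.Activity}");
            await connector.ConnectAsync();

            // Simulate the non-verbal response (e.g. a raised hand detected by Azure Kinect)
            // as a message activity carrying a JSON attachment, fulfilling the Attachment prompt
            string activity = @"{
              ""type"": ""message"",
              ""attachments"": [ {
                ""contentType"": ""application/json"",
                ""content"": { ""handRaised"": true }
              } ]
            }";
            await connector.SendActivityAsync(activity);

            Console.ReadLine();   // keep the app alive to receive the bot's next turn
        }
    }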

As a result, we are able to successfully handle non-verbal feedback received from the Client App and take it forward in a cohesive (single) conversational flow without breaking it into parts.

Handling non-verbal feedback in Bot App using Console App

Obviously, there’s more to this story. And we’ll save those details for the next publication. One thing we’ll mention is that using PVA (Power Virtual Agents)/Bot Framework Composer is not the only way to organize conversational flows in your AI Apps using the Azure Speech SDK. If you are able to design your conversational flows elsewhere or inside Unity itself, you will always be able to use Azure Speech SDK classes such as SpeechRecognizer, SpeechSynthesizer, etc. and integrate with LUIS or QnA Maker as necessary in the Client.

In Closing

We are genuinely excited about the future of Embodied AI and its role in The Metaverse. In future articles we plan to cover more advanced topics about Embodied AI from both the developer and the artist & designer points of view, and how the Microsoft AI Platform is strongly positioned to support the future of The Metaverse.

Disclaimer

Opinions expressed are solely those of the author and do not express the views and opinions of the author’s current employer, Microsoft.

Books you might enjoy reading

You might enjoy reading the following books you have probably noticed on the front cover of this article:

  • “Design Patterns: Elements of reusable object-oriented software” by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides (1994) (link)
  • “The Metaverse: And how it will revolutionize everything” by Matthew Ball (2022) (link)
  • “Unity Game Development Cookbook: Essential for every game” by Paris Buttfield-Addison, Jon Manning & Tim Nugent (2019) (link)
  • “C# 10 in a nutshell: The definitive reference” by Joseph Albahari (2022) (link)
  • “The Standard: The ultimate guide to building enterprise-level systems from idea to product” by Hassan Habib (2022) (link)

Thank You

[In the near future] Please check out The Standard in Ukrainian here: https://github.com/hassanhabib/The-Standard-Ukrainian (WIP, still working on that PR there :))

PS. Please consider supporting Ukraine 🇺🇦 Armed Forces ❤️ here: https://bank.gov.ua/en/about/support-the-armed-forces
