The ecosystem for AI-powered Copilot apps is already vast and fast evolving day by day

Thoughts about Multi-Modality for Generative AI

Alex Anikiev
26 min read · Aug 23, 2023

Business

Copilots! A new class of AI-powered applications has rapidly emerged and is evolving fast, fundamentally transforming industries and businesses worldwide. So you want to build a copilot for your business, or you are building “the next big thing” app or service? That’s already exciting! There’s a lot to think and talk about, for sure. In this article we share practical thoughts and useful experiences for building successful copilot apps using generative AI, based on more than a year of focusing on this space.

To begin, we want to highlight a few content-rich and thought-provoking events from earlier this year which help us better understand the generative AI space and the challenges and opportunities ahead:

Noteworthy events in Generative AI space in 2023

Please take the time to listen to thought leaders like Kevin Scott, Andrej Karpathy, Greg Brockman and others.

List of abovementioned noteworthy events:

Sanity Check

Before we dive deeper into our opinionated view of the opportunities for using generative AI and building copilots, it’s always good to pause and think about the main challenges associated with generative AI. By now you can find a plethora of research and analysis on the challenges of GPT-like models (for example, the “Sparks of Artificial General Intelligence: Early experiments with GPT-4” paper), however we find that the most easily understandable and intuitive reference is the “Aligning language models to follow instructions” paper by OpenAI. It takes us to the roots of the problem, when the first GPT-3 models had just been created and InstructGPT models were introduced shortly after to better align these models with human instructions.

Important characteristics of LLMs

This allows us to formulate these important characteristics of LLMs:

  • Helpfulness: It should help the user solve their task
  • Truthfulness: It should not fabricate information or mislead the user
  • Harmlessness: It should not cause physical, psychological or social harm to people or the environment

One of the terms often used with regard to LLM truthfulness and output appropriateness is “hallucinations”. Hallucinations are instances when an LLM “imagines” or “fabricates” information that doesn’t directly correspond to the provided input. Thus the goal is to make sure that the LLM makes up facts (“hallucinates”) as rarely as possible and generates appropriate outputs more consistently.

ChatGPT tips

Another important piece of foundational knowledge is how to properly delineate the horizontal capabilities of LLMs from their vertical applications. When we first got access to, for example, the GPT-3 text-davinci-001 model and the OpenAI completions endpoint (and/or the Azure OpenAI completions endpoint), we leveraged a model trained on a vast array of data for a generic domain (without one or a few concrete specialties). Later improvements led to the creation of the GPT-3.5 and GPT-4 models, however they are still generic domain models (horizontal capability). It was just a matter of time before companies, organizations and developers started building solutions leveraging LLMs for their specific or specialized use cases and domains, which quickly led to the realization that to succeed they would need to integrate their own domain-specific business data into the mix along with LLMs (vertical solutions). Below we provide a schematic illustration of how LLMs have been evolving from generic domain to specific/specialized domain applications, all the way along the ongoing, long-lasting pursuit of AGI (Artificial General Intelligence).

Evolution of LLMs from horizontal capability to vertical solutions all the way to future AGI ambition

We’ve already started seeing a trend toward smaller LLMs and toward their vertical application in various industries such as manufacturing, healthcare, retail, professional services, public sector, etc.

Search + Generative AI = Better together

After OpenAI ChatGPT was released it took the world by storm, because people simply loved the new conversational way of getting questions answered, finding creative inspiration or learning something new, all of which boosted their productivity. However, long before OpenAI ChatGPT was released you had likely already mastered some well-established Search approaches, whether internet search over open internet data or enterprise applications connected to business data. It’s also important to note that OpenAI ChatGPT is currently based on a training dataset as of September 2021, so it won’t know about events after this cut-off date until the model gets retrained on a fresher dataset.

We have developed and deployed large and impactful Search solutions in recent years using well-established Search patterns such as Knowledge Mining. Namely, we published articles on Medium here and here about our “Enriched Search Experience” reference solution architecture, which is fundamentally based on multi-modality (text, images, multimedia) and leverages multi-modal search in Hybrid Cloud (Cloud and Edge). Its solution architecture is also Hybrid because we leveraged a combination of intelligent services to handle different modalities and made the experience consistent with the help of an orchestrator.

Taking this knowledge into consideration, we believe that there’s a strong “better together” story and value proposition when Generative AI Copilot apps (focusing on content generation) are combined with Knowledge Mining approaches (based on search and focusing on content retrieval) to enable a new category of RAG (Retrieval Augmented Generation) capabilities and apps.

Better together story for Knowledge Mining app and Copilot apps

At the center of any Knowledge Mining solution typically stands a Search index. For implementing a Search index in the Cloud it would be logical to take advantage of Azure Cognitive Search. For a similar capability on the Edge we might use the latest licensed version of ElasticSearch or an earlier OSS version of ElasticSearch. Azure Cognitive Search and its evolution, Azure Semantic Search (leveraging vectors for semantic similarity), could help with the text modality, while Azure AI Video Indexer with its separate Search index could help with multimedia (audio and video modalities), and Azure Custom Vision and/or Azure Machine Learning custom models could help with images (image modality). Using this combination of capabilities we end up with multiple Search indexes for different modalities, so the role of the orchestrator (Orchestration layer) is critical to ensure a consistent experience.

Evolution of Azure Cognitive Search (Full-text, Semantic, Vector) and other adjacent capabilities

To reduce the complexity of such a solution, is there a way to unify the representation of data with different modalities? Yes, this can be done with the help of the recently introduced Azure Vector Search, which employs the concept of vector embeddings. The main idea is that data with different modalities (text, images, multimedia) can all be represented as vectors in a multi-dimensional space. For example, for a specific piece of content such as a PDF document with text and images we generate vector embeddings using, say, the text-embedding-ada-002 model, which turns the document into an array of numerical vectors. Then, when a search query is received, we also generate vector embeddings for it. Finally, we perform the search using the vector representations of the content and the query. Please note that unless your content changes, you only ingest it (and generate vector embeddings for it) into the Vector store once. For the much smaller queries, however, you generate vector embeddings on the fly for each search.

Azure AI Studio base models including text-embedding-ada-002
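
The ingest-once, embed-each-query flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Azure Vector Search is implemented: the `embed` function is a toy bag-of-words stand-in for a real embedding model such as text-embedding-ada-002, and the document names and texts are made up for the example.

```python
import math

# Hypothetical corpus; in a real solution this would be your ingested content.
DOCUMENTS = {
    "doc1": "azure vector search uses embeddings",
    "doc2": "kubernetes runs containers",
    "doc3": "search index for enterprise data",
}

# Toy vocabulary built from the corpus (assumption: a real embedding model
# needs no such step and produces dense vectors instead).
VOCABULARY = sorted({word for text in DOCUMENTS.values() for word in text.lower().split()})


def embed(text):
    """Toy stand-in for an embedding model: one word count per vocabulary term."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Ingestion happens once: every document is embedded and stored.
INDEX = {doc_id: embed(text) for doc_id, text in DOCUMENTS.items()}


def search(query, top_k=2):
    """Embed the query on the fly and rank stored vectors by similarity."""
    query_vector = embed(query)
    scored = [(doc_id, cosine_similarity(query_vector, vector)) for doc_id, vector in INDEX.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


results = search("vector embeddings search")
print(results)  # doc1 ranks first, doc3 second
```

Swapping `embed` for calls to a real embedding model would leave the rest of the flow unchanged, which is the point of the unified vector representation.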

Note: Please take a moment and watch this episode of Azure Friday which introduces Azure Vector Search.

However, Vector Search by itself is not a panacea. That’s why Azure Vector Search allows you to implement hybrid scenarios, combining the outputs of Vector Search with Traditional Search to get the best results for your use cases.
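
To make the idea of combining two result lists concrete, here is a minimal sketch that merges a keyword ranking and a vector ranking with Reciprocal Rank Fusion (RRF). RRF is used here as one well-known fusion technique chosen for illustration; the text above does not specify which method a given service uses, and the document IDs and the constant `k=60` are assumptions of this example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; ties are broken by document ID for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a traditional (keyword) search and a vector search.
keyword_results = ["doc3", "doc1", "doc2"]
vector_results = ["doc1", "doc4", "doc3"]

fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # ['doc1', 'doc3', 'doc4', 'doc2']
```

Because only ranks are used, the two searches do not need comparable scoring scales, which is what makes this kind of fusion convenient for hybrid scenarios.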

Note: On this very topic you might also enjoy “Vector Search is not enough” Build session by Elastic which covers many aspects of using Vector Search at scale, its enterprise readiness and flexibility of use for real-world scenarios.

We’ve already mentioned how vector embeddings are the central idea in multi-modal Vector Search. For the very same reasons vector embeddings are very important for multi-modal LLMs.

Note: Vectorization and Embeddings are also quite central in classic NLP (Natural Language Processing). Please review the material we posted here and here years ago.

When it comes to customization of LLMs there are two main approaches:

  • Using Embeddings
  • Fine-tuning
Approaches for customizing LLMs

Typically when you are thinking about customizing an LLM, what you really want is to incorporate your data to provide more relevant context and new knowledge to the standard LLM. When using the Fine-tuning approach you effectively end up with a model copy which you retrain and have to redeploy. This is an extensive approach which typically means extra effort for retraining and extra cost for LLM hosting. When using the Embeddings approach you essentially expand your prompt (the overall context) with additional information selected with the help of vector embeddings. The difference between expanding your prompt with all of your additional information as text versus using vector embeddings is that embeddings are a much more efficient and compressed representation than text for finding what is relevant, so only the most relevant pieces need to be added to the prompt. You can only provide a limited amount of text in your prompt because of the per-model input token restrictions described here; tokens translate into a number of characters, which is why your prompts can’t be infinitely large. For example, the GPT-4 model supports 8192 max input tokens and the GPT-4-32k model supports up to 32768 tokens. The GPT-35-turbo model supports 4096 max input tokens and the GPT-35-turbo-16k model supports up to 16384 tokens.
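
The token limits above translate directly into how much retrieved content fits into a prompt. The sketch below packs ranked text chunks into a prompt until a model's budget is used up. It is an illustration under assumptions: the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer, the amount reserved for the answer is arbitrary, and the function names are hypothetical.

```python
# Max input tokens per model, as quoted in the text above.
MAX_INPUT_TOKENS = {
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
    "gpt-35-turbo": 4096,
    "gpt-35-turbo-16k": 16384,
}


def estimate_tokens(text):
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def build_prompt(question, ranked_chunks, model, reserved_for_answer=500):
    """Add the most relevant chunks first and stop when the budget runs out."""
    budget = MAX_INPUT_TOKENS[model] - reserved_for_answer - estimate_tokens(question)
    selected = []
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    context = "\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {question}", len(selected)


# Ten hypothetical chunks of about 1000 tokens each, already ranked by relevance.
chunks = ["a" * 4000 for _ in range(10)]

_, used_small = build_prompt("What is RAG?", chunks, "gpt-35-turbo")
_, used_large = build_prompt("What is RAG?", chunks, "gpt-4")
print(used_small, used_large)  # 3 7
```

In a real application a proper tokenizer should replace `estimate_tokens`, since the character-based estimate can be noticeably off for code or non-English text.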

Now to close the logical loop, if we come back to the topic of using your own data (for example, Search index) with LLMs, there’s a great capability in Azure AI Studio which allows you to leverage your own data with Chat endpoint (currently in Preview) as shown below:

Bring your own data (Preview)

Specifically, you can integrate an LLM with external data sources holding additional knowledge, such as Azure Cognitive Search (Search index), Azure Blob Storage or uploaded files:

Bring your own data (Preview) from Azure Cognitive Search

In this section we dove deeper into how to combine well-established Search patterns with Generative AI emerging patterns for building robust and successful solutions.

Copilot stack x Copilot ecosystem = AI advantage

Now let’s talk about how to take full AI advantage and get the most out of it in the modern day when virtually everybody is doing AI.

We believe that “Copilot stack x Copilot ecosystem = AI advantage” statement coined by Microsoft’s Satya Nadella really captures the essence of the successful approach.

If we quickly look at the recent releases of Generative AI technology and its rapid evolution, we can clearly see how the Generative AI ecosystem has been growing before our eyes. Features which were custom built after the initial release of the first Generative AI models (for example, memory) are being added as platform capabilities to ease application development. Creative integrations with great existing products have led to innovation breakthroughs such as Bing Chat, Microsoft 365 Copilot, Windows Copilot, Dynamics 365 Copilot, etc. Developers, for example, have already been using GitHub Copilot for a while, and now we are getting the even more powerful GitHub Copilot X to boost developer productivity further.

This wave of innovation has created a flourishing ecosystem with many Generative AI Apps and Plugins already available. Following the principles “everything is an app” and “everything is a Web API”, we are observing rapid creation and adoption of vertical solutions and industry applications leveraging Generative AI technology, which all contribute to the ambitious pursuit of AGI (Artificial General Intelligence).

Core horizontal capability to Vertical industry solutions leveraging Generative AI technology

AI Orchestration

To support sustainable growth of this ecosystem we need a solid foundational platform (and/or multiple available options) that developers can build upon while focusing on business problems and not solely on plumbing. Much like in the Microservices world, where everybody keeps benefiting from foundational platforms like Kubernetes and Dapr (Distributed Application Runtime), the organizations and companies leading the way in the Generative AI space offer robust and convenient platform capabilities to speed up the development of modern applications which leverage Generative AI.

In this section we will focus our attention on Microsoft’s Semantic Kernel and the LangChain project.

Semantic Kernel allows you to integrate cutting-edge LLM technology quickly and easily into your apps and provides AI Orchestration layer capabilities. Semantic Kernel is an open-source SDK that lets you easily combine AI services like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python and more.

By using multiple AI models, plugins, and memory all together within Semantic Kernel, you can create sophisticated pipelines that allow AI to automate complex tasks for users.

Microsoft Semantic Kernel capabilities
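
To give a feel for what an AI Orchestration layer does when it chains plugins and memory into a pipeline, here is a deliberately simplified sketch in plain Python. It does not use the Semantic Kernel SDK or mirror its API; the `Orchestrator` class, the plugin functions and the stubbed model call are all hypothetical names invented for this illustration.

```python
class Orchestrator:
    """Toy orchestration layer: named plugins, shared memory, sequential pipelines."""

    def __init__(self):
        self.plugins = {}
        self.memory = []  # records each step's output so later steps can inspect it

    def register(self, name, func):
        self.plugins[name] = func

    def run(self, steps, user_input):
        value = user_input
        for name in steps:
            value = self.plugins[name](value, self.memory)
            self.memory.append((name, value))
        return value


def retrieve(text, memory):
    """Hypothetical retrieval plugin backed by a tiny in-memory knowledge base."""
    knowledge = {"copilot": "A copilot is an AI-powered assistant app."}
    facts = [fact for key, fact in knowledge.items() if key in text.lower()]
    return {"question": text, "facts": facts}


def build_prompt(state, memory):
    """Turns retrieved facts and the question into a single prompt string."""
    context = " ".join(state["facts"]) or "No context found."
    return f"Context: {context}\nQuestion: {state['question']}"


def fake_llm(prompt, memory):
    """Stand-in for a model call (assumption): it only echoes the context line."""
    return "Answer based on -> " + prompt.splitlines()[0]


orchestrator = Orchestrator()
orchestrator.register("retrieve", retrieve)
orchestrator.register("build_prompt", build_prompt)
orchestrator.register("llm", fake_llm)

answer = orchestrator.run(["retrieve", "build_prompt", "llm"], "What is a copilot?")
print(answer)
```

Real orchestration SDKs add much more on top of this shape, such as planners, connectors and persistent memory, but the core pattern of composing small steps around a model call is the same.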

As described here, Semantic Kernel is Microsoft’s contribution to the Generative AI space and is designed to support enterprise app developers who want to integrate AI into their existing apps. At the same time, OSS projects like LangChain have emerged to simplify the creation of AI apps as well.

LangChain technically provides a similar set of core platform capabilities compared to Semantic Kernel. The decision about which foundational platform to leverage as the AI Orchestration layer for your product or service depends on the overall solution architecture. There’s also a possibility to combine both worlds (à la Microservices style) at the expense of increased solution complexity.

LangChain capabilities

Azure Machine Learning

Azure Machine Learning is already a very well established platform for MLOps. And not surprisingly Azure Machine Learning has a say in Generative AI area. Traditionally Azure Machine Learning (AML) has been largely about ML pipelines: Data pipelines, Training pipelines, Deployment pipelines, etc. You may find this article interesting on the topic of AML which goes into details of running successful MLOps project(s).

With the rise of Generative AI, Azure Machine Learning has gained a number of exciting new capabilities, including Prompt flow and the ability to help with RAG (Retrieval Augmented Generation) data-driven scenarios. These are exactly the capabilities we’ll highlight in this section.

AML Prompt flow is a development tool designed to streamline the entire development cycle of AI applications powered by Large Language Models (LLMs). It provides a comprehensive solution that simplifies the process of prototyping, experimenting, iterating, and deploying your AI applications.

Let’s look at AML Prompt flow closer and make some notes.

AML Studio portal

In AML Studio portal you can create multiple types of Prompt flows:

  • Standard flow: Harness the power of LLMs, customized Python code, Serp API, and more to craft your tailored prompt flow. Test the flow using custom datasets and seamlessly deploy as an endpoint for easy integration.
  • Chat flow: On top of the standard flow, this option provides the chat history support and user-friendly chat interface in the authoring/debugging UI.
  • Evaluation flow: Create an evaluation flow to measure how well the output matches the expected criteria and goals.
Types of flows in AML Prompt flow
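
The idea behind an Evaluation flow, measuring how well outputs match expected results, can be sketched without any tooling. The example below is an assumption-laden illustration and not the AML Prompt flow API: `toy_flow` stands in for a deployed flow, the dataset is made up, and exact match after normalization is just one simple metric among many.

```python
def normalize(text):
    """Lower-cases and collapses whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(output, expected):
    return normalize(output) == normalize(expected)


def evaluate(flow, dataset):
    """Runs the flow on every row and returns the share of exact matches."""
    results = [exact_match(flow(row["input"]), row["expected"]) for row in dataset]
    return sum(results) / len(results)


def toy_flow(question):
    """Hypothetical stand-in for a flow that would normally call an LLM."""
    answers = {
        "capital of france": "Paris",
        "capital of japan": "Tokyo",
        "capital of italy": "Rome",
    }
    return answers.get(normalize(question).rstrip("?"), "I don't know")


dataset = [
    {"input": "Capital of France?", "expected": "paris"},
    {"input": "Capital of  Japan?", "expected": "Tokyo"},
    {"input": "Capital of Italy?", "expected": "Rome"},
    {"input": "Capital of Peru?", "expected": "Lima"},
]

accuracy = evaluate(toy_flow, dataset)
print(accuracy)  # 0.75
```

For generative outputs, exact match is usually too strict, so in practice it tends to be complemented by similarity-based or model-graded metrics.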

Similar to other Azure Cognitive Services portals and the visual capabilities for AML pipelines, AML Prompt flow also provides a convenient visual interface for visualizing complex flows, easy debugging and rapid experimentation. Below we depict a Standard flow example:

Visual designer in AML Prompt flow — Standard flow

There are some additional features available in Chat flow to support convenient chat interfaces:

Visual designer in AML Prompt flow — Chat flow

Rapid experimenting and prototyping for Prompt Engineering is one of the focuses of AML Prompt flow which enables great developer productivity:

AML Prompt flow — Prompt Engineering

First class integration with different LLMs, robust Prompt Engineering features, first class support of Python and integrations with many external data sources and Web APIs make Prompt flow a very compelling tool in your Generative AI toolbox. Note integrations with Vector Index Lookup and FAISS Index Lookup which are very handy for building multi-modal applications.

AML Prompt flow features

AML Prompt flow also allows you to go all the way to deploying your flows as Web APIs for consumption, just like you would deploy custom AI/ML models with MLOps.

AML Prompt flow deployment capabilities

The decision about whether you want to use Prompt flow as a backbone for your AI Orchestration layer, solely for Prompt Engineering purposes, or in combination with other components depends on the overall solution architecture. But you certainly have multiple options available for consideration. As highlighted earlier, you can leverage Prompt flow to supercharge RAG (Retrieval Augmented Generation) scenarios while using your own data alongside LLMs as described here. This once again underscores the importance of embeddings and vector representation of data for enabling multi-modal use cases.

While DevOps, DevSecOps and MLOps are already well-established Engineering disciplines for software and AI/ML model development, there are other Engineering disciplines rapidly emerging in the context of LLMs, namely LMOps (Language Models Ops) and FMOps (Foundational Models Ops). FMOps is a critical aspect of managing the performance and quality of OpenAI-based systems throughout their lifecycle. Conceptually it consists of several steps to ensure effective system operation and improvement. The evaluation framework plays an important role in the experimentation phase, facilitating rapid experimentation and providing valuable insights.

Note: Also please pay attention to the notion of MLLMs (Multi-modal LLMs) becoming ubiquitous nowadays.

Reference Solution Architecture

After reviewing the Generative AI landscape with its different options for Orchestrators, the role Azure Machine Learning (AML) may play in your solution, and the importance of embeddings and vector representation of data for solution multi-modality, we can summarize our opinionated view of the Reference Solution Architecture for Copilots.

Starting from classic Software Engineering practices (Front-end, Back-end and Data layers) for clarity and simplicity, we can expand our Back-end layer into AI Orchestration and AI Infrastructure layers in the context of Generative AI. Foundational models are already provided by Azure OpenAI Service for building enterprise-grade solutions. However, as we discussed earlier, there’s some decision making to do to choose the optimal technology stack for the AI Orchestration layer depending on your overall solution architecture.

Reference Solution Architecture for Copilots

Note: Although you may consider leveraging foundational models from OpenAI directly, there are numerous benefits to leveraging Azure OpenAI Service to harness the full power of the Azure Cloud platform, including security, performance and a vast array of enterprise-grade capabilities for the success of your solution.

As the Generative AI ecosystem rapidly evolves with more and more platform capabilities, useful frameworks and handy libraries, the LLMs themselves are evolving too. There is some confusion about this topic, especially considering the number of different models in circulation. The good news is that, as LLMs get adopted and more and more vertical solutions see the light of day, an organic standardization process is taking place to determine a set of foundational models for enterprise-grade use (while research progress and experimentation obviously can’t and shouldn’t be stopped, with existing models being enhanced and new models being created constantly). To illustrate this, let’s look at the current Azure OpenAI Service models and the Azure OpenAI Service legacy models, for example for embeddings, as presented below:

Azure OpenAI Studio Current models vs Legacy models

Also companies and enterprises building their Generative AI solutions using certain LLMs can sustainably manage their lifecycle with the help of FMOps (Foundational Models Ops) Engineering practices as appropriate.

Text Modality quick notes: ChatGPT

Text modality is naturally the best understood domain, especially if you come from a background of building Enterprise Search solutions or implementing Knowledge Mining use cases. In this section we’ll only touch upon some practical aspects of leveraging the text modality of LLMs, using the example of OpenAI ChatGPT and Azure OpenAI ChatGPT capabilities.

OpenAI ChatGPT: “get instant answers, find creative inspiration, and learn something new” as it’s described on the OpenAI web site. Much like with Azure OpenAI Service, by and large the OpenAI portal allows you to manage endpoints and consume capabilities as Web APIs.

OpenAI portal endpoints management

There are a number of ways you may want to consume this capability as a Web API when you integrate it into your solution using the programming language of your choice:

OpenAI ChatGPT playground

For visual reference, below we also present the UX of the Chat playground in the Azure OpenAI Studio portal:

Azure OpenAI ChatGPT playground

You are also in full control of how exactly you’d like to consume this Web API and what programming language you’d like to use:

Azure OpenAI ChatGPT playground — Sample code

As you can see, everything is designed to be intuitive and to allow you to start developing your apps quickly without wasting any time.

Image Modality closer look: DALLE-2

Image modality is typically the next one that people attempt to master after text. It’s a little less understood at the moment, but it’s certainly very important for enabling multi-modality use cases. Some industries are logically even more dependent on image modality than on text, for example creative industries, digital art, etc. There are also many revolutionary Generative AI image modality applications for commercial industries.

OpenAI DALLE2 is an AI system that can create realistic images and art from a description in natural language (text -> image).

OpenAI DALLE2 has the following important features:

  • Image generation: DALLE2 can create original, realistic images and art from a text description. It can combine concepts, attributes and styles.
  • Outpainting: DALLE2 can expand images beyond what’s in the original canvas, creating expansive new compositions.
  • Inpainting: DALLE2 can make realistic edits to existing images from a natural language caption. It can add and remove elements while taking shadows, reflections and textures into account.
  • Variations: DALLE2 can take an image and create different variations of it inspired by the original.
OpenAI DALLE2 features

This is what the Azure OpenAI DALLE playground looks like:

Azure OpenAI DALLE playground

Similar to the Completions and ChatGPT endpoints, the DALLE capability can be consumed as a Web API, with sample code snippets generated for you:

Azure OpenAI DALLE playground — Sample code

You can also find similar documentation for consuming DALLE capability as Web API in OpenAI documentation as shown below:

OpenAI DALLE documentation
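
As a rough illustration of consuming the image capability as a Web API, the sketch below builds (but does not send) a request to the OpenAI image generation endpoint using only the Python standard library. The endpoint URL and the `prompt`, `n` and `size` parameters reflect our understanding of the public OpenAI Images API at the time of writing and may change; the Azure OpenAI endpoint has a different URL and authentication scheme. The helper name and the placeholder API key are assumptions of this example.

```python
import json
import urllib.request

# Assumption: OpenAI image generation endpoint as publicly documented at the time of writing.
OPENAI_IMAGES_URL = "https://api.openai.com/v1/images/generations"


def build_image_request(prompt, api_key, n=1, size="1024x1024"):
    """Builds a POST request asking for n images of the given size for the prompt."""
    payload = {"prompt": prompt, "n": n, "size": size}
    return urllib.request.Request(
        OPENAI_IMAGES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_image_request(
    "An armchair in the shape of an avocado",
    api_key="YOUR_API_KEY",  # placeholder, replace with a real key
)

# Sending is left commented out because it needs a valid key and network access.
# with urllib.request.urlopen(request) as response:
#     image_urls = [item["url"] for item in json.load(response)["data"]]
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any cost is incurred.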

Just like when mastering text modality using the Completions and ChatGPT endpoints, Prompt Engineering is a big deal when using DALLE to generate images from text. Let’s take a look at the featured prompts on the OpenAI DALLE page to get a better idea of effective prompting strategies:

var Tu = [{
prompt: "3D render of a cute tropical fish in an aquarium on a dark blue background, digital art",
large: !0
}, {
prompt: "An armchair in the shape of an avocado",
large: !1
}, {
prompt: "An expressive oil painting of a basketball player dunking, depicted as an explosion of a nebula",
large: !1
}, {
prompt: "A photo of a white fur monster standing in a purple room",
large: !1
}, {
prompt: "An oil painting by Matisse of a humanoid robot playing chess",
large: !1
}, {
prompt: "A photo of a silhouette of a person in a color lit desert at night",
large: !1
}, {
prompt: "A blue orange sliced in half laying on a blue floor in front of a blue wall",
large: !1
}, {
prompt: "A 3D render of an astronaut walking in a green desert",
large: !1
}, {
prompt: "A futuristic neon lit cyborg face",
large: !1
}, {
prompt: "A computer from the 90s in the style of vaporwave",
large: !1
}, {
prompt: "A van Gogh style painting of an American football player",
large: !0
}, {
prompt: "A cartoon of a monkey in space",
large: !1
}, {
prompt: "A plush toy robot sitting against a yellow wall",
large: !1
}, {
prompt: "A bowl of soup that is also a portal to another dimension, digital art",
large: !1
}, {
prompt: '"A sea otter with a pearl earring" by Johannes Vermeer',
large: !0
}, {
prompt: "A hand drawn sketch of a Porsche 911",
large: !1
}, {
prompt: "High quality photo of a monkey astronaut",
large: !1
}, {
prompt: "A cyberpunk monster in a control room",
large: !1
}, {
prompt: "A photo of Michelangelo's sculpture of David wearing headphones djing",
large: !1
}, {
prompt: "An abstract painting of artificial intelligence",
large: !1
}, {
prompt: "An Andy Warhol style painting of a french bulldog wearing sunglasses",
large: !1
}, {
prompt: "A photo of a Samoyed dog with its tongue out hugging a white Siamese cat",
large: !1
}, {
prompt: "A photo of a teddy bear on a skateboard in Times Square",
large: !1
}, {
prompt: "An abstract oil painting of a river",
large: !1
}, {
prompt: "A centered explosion of colorful powder on a black background",
large: !0
}, {
prompt: "A futuristic cyborg poster hanging in a neon lit subway station",
large: !1
}, {
prompt: "An oil pastel drawing of an annoyed cat in a spaceship",
large: !1
}, {
prompt: "A sunlit indoor lounge area with a pool with clear water and another pool with translucent pastel pink water, next to a big window, digital art",
large: !1
}, {
prompt: "A synthwave style sunset above the reflecting water of the sea, digital art",
large: !0
}, {
prompt: "A handpalm with a tree growing on top of it",
large: !1
}, {
prompt: "A cartoon of a cat catching a mouse",
large: !1
}, {
prompt: "A pencil and watercolor drawing of a bright city in the future with flying cars",
large: !1
}, {
prompt: "A Formula 1 car driving on a neon road",
large: !1
}, {
prompt: "3D render of a pink balloon dog in a violet room",
large: !1
}, {
prompt: "A photograph of a sunflower with sunglasses on in the middle of the flower in a field on a bright sunny day",
large: !1
}, {
prompt: "Two futuristic towers with a skybridge covered in lush foliage, digital art",
large: !1
}, {
prompt: "A hand-drawn sailboat circled by birds on the sea at sunrise",
large: !1
}, {
prompt: "A Shiba Inu dog wearing a beret and black turtleneck",
large: !1
}, {
prompt: "A comic book cover of a superhero wearing headphones",
large: !1
}, {
prompt: "An abstract visual of artificial intelligence",
large: !1
}, {
prompt: "A cat riding a motorcycle",
large: !1
}, {
prompt: "A 3D render of a rainbow colored hot air balloon flying above a reflective lake",
large: !1
}];

For completeness, it’s also helpful to look at the “Surprise me” prompts on the OpenAI DALLE page:

var Ku = [{
prompt: "A blue orange sliced in half laying on a blue floor in front of a blue wall",
tip: "Describe the context in which an item appears."
}, {
prompt: "A fortune-telling shiba inu reading your fate in a giant hamburger, digital art",
tip: "Add \u201cdigital art\u201d for striking and high-quality images."
}, {
prompt: "A photo of a white fur monster standing in a purple room",
tip: "Describe the material of an object or character."
}, {
prompt: "An astronaut lounging in a tropical resort in space, pixel art",
tip: "Mention styles like \u201cpixel art.\u201d"
}, {
prompt: "Panda mad scientist mixing sparkling chemicals, digital art",
tip: "Add \u201cdigital art\u201d for striking and high-quality images."
}, {
prompt: "An oil painting by Matisse of a humanoid robot playing chess",
tip: "Ask for images in the style of your favorite artist."
}, {
prompt: "A 3D render of an astronaut walking in a green desert",
tip: "Ask for 3D renders."
}, {
prompt: "An oil pastel drawing of an annoyed cat in a spaceship",
tip: "Ask for mediums like \u201coil pastel\u201d or \u201cpencil and watercolor.\u201d"
}, {
prompt: "An armchair in the shape of an avocado",
tip: "Ask for abstract or implausible images."
}, {
prompt: "An oil painting portrait of a capybara wearing medieval royal robes and an ornate crown on a dark background",
tip: "Add more specific details to get exactly what you want."
}, {
prompt: "A pencil and watercolor drawing of a bright city in the future with flying cars",
tip: "A single word like \u201cbright\u201d or \u201cdark\u201d can have a big impact."
}, {
prompt: "A bowl of soup that is also a portal to another dimension, digital art",
tip: "Ask for abstract or implausible images."
}, {
prompt: "A stained glass window depicting a robot",
tip: "Ask for contexts like \u201cstained glass window\u201d or \u201calbum art cover.\u201d"
}, {
prompt: "Synthwave sports car",
tip: "Mention styles like \u201csynthwave\u201d or \u201ccyberpunk.\u201d"
}, {
prompt: "An expressive oil painting of a basketball player dunking, depicted as an explosion of a nebula",
tip: "Combine interesting concepts."
}, {
prompt: "A photo of a teddy bear on a skateboard in Times Square",
tip: "Include a location as context for the image."
}, {
prompt: "A futuristic cyborg poster hanging in a neon lit subway station",
tip: "Describe the lighting to improve aesthetics."
}, {
prompt: "A van Gogh style painting of an American football player",
tip: "Ask for images in the style of your favorite artist."
}, {
prompt: "A cyberpunk illustration of the San Francisco Golden Gate Bridge, digital art",
tip: "Add \u201c, digital art\u201d for striking and high-quality images."
}, {
prompt: "3D render of a pink balloon dog in a violet room",
tip: "Ask for 3D renders."
}, {
prompt: "Teddy bears shopping for groceries in Japan, ukiyo-e",
tip: "Mention art styles like \u201cukiyo-e\u201d or \u201cImpressionist.\u201d"
}, {
prompt: "A bowl of soup that looks like a monster, knitted out of wool",
tip: "Ask for abstract or implausible images."
}, {
prompt: "Abstract pencil and watercolor art of a lonely robot holding a balloon",
tip: "Add more specific details to get exactly what you want."
}, {
prompt: '"A sea otter with a pearl earring" by Johannes Vermeer',
tip: "Ask for images in the style of your favorite artist."
}, {
prompt: "A stern-looking owl dressed as a librarian, digital art",
tip: "Add \u201cdigital art\u201d for striking and high-quality images."
}, {
prompt: "A macro 35mm photograph of two mice in Hawaii, they're each wearing tiny swimsuits and are carrying tiny surf boards, digital art",
tip: "Ask for photography styles like \u201cmacro 35mm film.\u201d"
}, {
prompt: "3D render of a cute tropical fish in an aquarium on a dark blue background, digital art",
tip: "Ask for 3D renders."
}, {
prompt: "A stained glass window depicting a hamburger and french fries",
tip: "Ask for contexts like \u201cstained glass window\u201d or \u201calbum art cover.\u201d"
}, {
prompt: "Teddy bears shopping for groceries, one-line drawing",
tip: "Mention styles like \u201cone-line drawing.\u201d"
}, {
prompt: "Crayon drawing of six cute colorful monsters with ice cream cone bodies on dark blue paper",
tip: "Ask for mediums like \u201ccrayon on dark paper\u201d or \u201cembroidered canvas.\u201d"
}, {
prompt: "A photo of a Samoyed dog with its tongue out hugging a white Siamese cat",
tip: "Add more specific details to get exactly what you want."
}, {
prompt: "A cat submarine chimera, digital art",
tip: "Add \u201cdigital art\u201d for striking and high-quality images."
}, {
prompt: "An astronaut lounging in a tropical resort in space, vaporwave",
tip: "Mention styles like \u201cvaporwave.\u201d"
}, {
prompt: "A painting of a fox in the style of Starry Night",
tip: "Ask for images in the style of your favorite artist."
}, {
prompt: "Photograph of an astronaut riding a horse",
tip: "Ask for abstract or implausible images."
}];

Now that we’ve seen some of the suggested prompts and have an idea of what to ask DALLE and how, let’s pick one of them and try to replicate the result “as advertised”. Specifically, we’ll go with the “3D render of a cute tropical fish in an aquarium on a dark blue background, digital art” example.

3D render of a cute tropical fish in an aquarium on a dark blue background, digital art
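Picking an entry out of a prompt list like the one above can be sketched in a few lines of JavaScript. This is only an illustration: `samplePrompts` is a three-item excerpt of that list, and `findByPrompt` is a hypothetical helper name introduced here, not something taken from the OpenAI site.

```javascript
// Excerpt of the suggested prompts shown above (not the full list).
const samplePrompts = [
  {
    prompt: "Synthwave sports car",
    tip: "Mention styles like \u201csynthwave\u201d or \u201ccyberpunk.\u201d"
  },
  {
    prompt: "3D render of a cute tropical fish in an aquarium on a dark blue background, digital art",
    tip: "Ask for 3D renders."
  },
  {
    prompt: "Photograph of an astronaut riding a horse",
    tip: "Ask for abstract or implausible images."
  }
];

// Hypothetical helper: return the first entry whose prompt contains
// the given text (case-insensitive), or undefined if nothing matches.
function findByPrompt(entries, text) {
  const needle = text.toLowerCase();
  return entries.find((entry) => entry.prompt.toLowerCase().includes(needle));
}

const selected = findByPrompt(samplePrompts, "tropical fish");
console.log(selected.prompt);
console.log(selected.tip);
```

Keeping the tip next to the prompt makes it easy to remember why a given prompt was chosen when you come back to it later.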

What’s important to understand is that generative models such as LLMs and DALLE are probabilistic (non-deterministic) systems: when you run the process multiple times for the same input, you may receive different outputs. For example, we ran our selected prompt twice, as depicted above, but only settled on one of the results after the third try, as depicted below.

More generation attempts
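The “try again until you like it” workflow can be sketched as a small loop that keeps every attempt so the candidates can be compared side by side. This is a minimal sketch under stated assumptions: `collectAttempts` and `fakeGenerateImage` are hypothetical names, and the generator is passed in as a plain async function so that no particular image API is implied.

```javascript
// Hypothetical helper: call an image generator several times for the same
// prompt and collect every result. `generateImage` is an assumption: any
// async function (prompt) => result, for example a thin wrapper around
// whichever image service you use.
async function collectAttempts(generateImage, prompt, attempts) {
  const results = [];
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const image = await generateImage(prompt);
    results.push({ attempt, prompt, image });
  }
  return results;
}

// Stand-in generator for demonstration only: it returns a random label
// instead of calling a real service, mimicking non-deterministic output.
async function fakeGenerateImage(prompt) {
  return `image-${Math.random().toString(36).slice(2, 8)}`;
}

const fishPrompt =
  "3D render of a cute tropical fish in an aquarium on a dark blue background, digital art";

collectAttempts(fakeGenerateImage, fishPrompt, 3).then((results) => {
  // Same prompt, three attempts, three (most likely different) results.
  console.log(results);
});
```

Because every attempt is kept, choosing “the third try” becomes a matter of picking an item from the list rather than regenerating and hoping to see the same image again.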

The result that we liked is not exactly what was highlighted on the OpenAI web site, and that’s okay. That’s the beauty of generative models: you can always produce something new and better, and there’s no limit to perfection.

Now let’s look at how we can produce variations from the original with DALLE.

DALLE variations

The same principle applies here as well: you may need to try several times to get the result you are looking for.

DALLE variations results

Finally, let’s look into Inpainting and Generation frames. In this example we want to modify the original image (we erased a part of it) and add a coral reef at the bottom of the fish tank.

DALLE Inpainting
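If you drive inpainting from code rather than from the web UI, the inputs boil down to an image, a mask marking the erased area, and a prompt. The sketch below only assembles and validates such a request; it does not call any service. The field names (`image`, `mask`, `prompt`, `n`, `size`) mirror the parameters of OpenAI’s image edit endpoint as we understand them, while the helper name, the file names and the exact limits are assumptions to check against the current documentation.

```javascript
// Sizes accepted for DALLE 2 image edits, to the best of our knowledge.
const ALLOWED_SIZES = ["256x256", "512x512", "1024x1024"];

// Hypothetical helper: build the parameters of an inpainting request.
// In a real call the two paths would be opened as files before sending.
function buildInpaintingRequest({ imagePath, maskPath, prompt, n = 4, size = "1024x1024" }) {
  if (!prompt || prompt.trim() === "") {
    throw new Error("prompt is required");
  }
  if (!ALLOWED_SIZES.includes(size)) {
    throw new Error(`unsupported size: ${size}`);
  }
  if (!Number.isInteger(n) || n < 1 || n > 10) {
    throw new Error("n must be an integer from 1 to 10");
  }
  return { image: imagePath, mask: maskPath, prompt: prompt.trim(), n, size };
}

// Hypothetical file names; the prompt describes the whole desired image,
// not only the erased region.
const coralReefRequest = buildInpaintingRequest({
  imagePath: "fish.png",
  maskPath: "fish-mask.png",
  prompt: "3D render of a cute tropical fish in an aquarium with a coral reef at the bottom, digital art"
});
console.log(coralReefRequest);
```

Asking for several candidates at once (`n` greater than 1) fits the iterative workflow described here, since you rarely keep the first result.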

It may also take several iterations to get to the result you like. In fact, the cute little coral reef in the far-right image does look pretty.

DALLE Inpainting

Alternatively, we may want to substitute the whole “character” in the scene, for example, to put a shark in there.

DALLE Inpainting cont’ed

After looking at the several options provided, we can choose the one that we like.

DALLE Inpainting cont’ed results

The bottom line is that through iteration and experimentation, equipped with effective prompting strategies, you can achieve your goals in generating images with DALLE for your use cases. For better explainability and interpretability of the results, you may want to keep a change log, together with the associated image for each step, as you make incremental improvements.
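A change log of this kind can be as simple as an append-only list with one entry per step. The sketch below is one possible shape, not a prescribed format: the `ImageChangeLog` class, its field names and the image file names are all hypothetical.

```javascript
// Hypothetical change log: one entry per generation, variation or inpainting step.
class ImageChangeLog {
  constructor() {
    this.entries = [];
  }

  // Record what was done, with which prompt, and where the resulting image lives.
  record(action, prompt, imageRef, note = "") {
    const entry = {
      step: this.entries.length + 1,
      action,
      prompt,
      imageRef,
      note,
      recordedAt: new Date().toISOString()
    };
    this.entries.push(entry);
    return entry;
  }

  // One readable line per step, in the order the steps were taken.
  history() {
    return this.entries.map((e) => `${e.step}. [${e.action}] ${e.prompt} -> ${e.imageRef}`);
  }
}

const changeLog = new ImageChangeLog();
changeLog.record("generate", "3D render of a cute tropical fish", "fish-v1.png", "third attempt");
changeLog.record("inpaint", "add a coral reef at the bottom", "fish-v2.png");
changeLog.record("inpaint", "replace the fish with a shark", "fish-v3.png");
console.log(changeLog.history().join("\n"));
```

With the prompt and the image reference stored side by side, you can later explain how a final image came to be and roll back to any intermediate step.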

Multi-Modal GPT-4: Text + Image => Text & Beyond

Text and image modalities, if supported at the same time, already give us a multi-modal capability (text -> image, image -> text, text + image -> image, text + image -> text, text + image -> text + image). However, everybody is hoping and waiting for more, for example, scenarios employing even more modalities such as multimedia (video and audio). In the OpenAI GPT-4 Developer Livestream we saw the future capabilities of GPT-4 multi-modality, specifically text + image -> text, as highlighted by Greg Brockman.

OpenAI GPT-4 is OpenAI’s most advanced system today, producing safer and more useful responses.

OpenAI GPT-4 has the following important features:

  • Creativity: GPT-4 is more creative and collaborative than ever before. It can generate, edit and iterate with users on creative and technical writing tasks, such as composing songs, writing screenplays, or learning a user’s writing style.
  • Visual input: GPT-4 can accept images as inputs and generate captions, classifications and analyses.
  • Longer context: GPT-4 is capable of handling over 25,000 words of text, allowing for use cases like long-form content creation, extended conversations and document search and analysis.
  • Advanced reasoning: With broad general knowledge and domain expertise, GPT-4 can follow complex instructions in natural language and solve difficult problems with accuracy.
OpenAI GPT-4 features
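To get a feel for where a figure like “over 25,000 words” may come from, here is a back-of-the-envelope conversion from tokens to words. Both inputs are assumptions on our side rather than statements from the feature list: a 32K-token (32,768) context window for the largest GPT-4 variant, and the common rule of thumb that one token is roughly three quarters of an English word.

```javascript
// Assumption: context window of the 32K GPT-4 variant, in tokens.
const CONTEXT_TOKENS = 32768;
// Assumption: rule of thumb for English text, about 0.75 words per token.
const WORDS_PER_TOKEN = 0.75;

const approxWords = Math.round(CONTEXT_TOKENS * WORDS_PER_TOKEN);
console.log(approxWords); // 24576
```

The estimate lands in the same ballpark as the advertised number; the real ratio depends on the language and on the kind of text, so treat it as an approximation only.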

Today you can start using GPT-4 by subscribing to the premium (Plus) offering from OpenAI or by onboarding your organization or company to Azure OpenAI GPT-4 (currently via a waitlist) here.

OpenAI ChatGPT featuring GPT-3 and GPT-4

However, currently (as of summer 2023) we are still waiting for OpenAI multi-modal GPT-4 (text + image -> text, and possibly more) to become available for use.

OpenAI ChatGPT featuring GPT-4

Using ASCII art doesn’t count even though it’s super cool :)

OpenAI ChatGPT featuring GPT-4

Exciting times ahead!

In Closing

Thank you for taking this journey with us until the end! In closing, we’d like to re-emphasize the importance of leveraging your AI advantage by using a complete, robust and reliable AI platform for Copilots. We believe that the Microsoft Azure AI Platform is great for building successful next-generation, AI-powered applications and Copilots powered by Generative AI! Please share your experiences in building Copilot apps!

Copilot stack x Copilot ecosystem = AI advantage

And last but not least: life is short, write good code and do it responsibly. Microsoft is committed to making sure AI systems are developed responsibly and in ways that warrant people’s trust.

Below are the Microsoft Responsible AI Principles:

  • Fairness: How might an AI system allocate opportunities, resources, or information in ways that are fair to the humans who use it?
  • Reliability & Safety: How might the system function well for people across different use conditions and contexts, including ones it was not originally intended for?
  • Privacy & Security: How might the system be designed to support privacy and security?
  • Inclusiveness: How might the system be designed to be inclusive of people of all abilities?
  • Transparency: How might people misunderstand, misuse, or incorrectly estimate the capabilities of the system?
  • Accountability: How can we create oversight so that humans can be accountable and in control?
Microsoft Responsible AI Principles

These are all great questions to ask when you are building your awesome Copilot apps. Please learn more about the Microsoft Responsible AI principles and approach here and here.

Books you might like reading
