Generative AI: The Idea Behind ChatGPT, DALL-E, Midjourney and More

The world of art, communication, and how we perceive reality is rapidly transforming. If we look back at the history of human innovation, we might consider the invention of the wheel or the discovery of electricity as monumental leaps. Today, a new revolution is taking place—bridging the divide between human creativity and machine computation. That is Generative AI.

Generative models have blurred the line between humans and machines. With the advent of models like GPT-4, which employs transformer modules, we have stepped closer to natural and context-rich language generation. These advances have fueled applications in document creation, chatbot dialogue systems, and even synthetic music composition.

Recent Big Tech decisions underscore the significance of generative AI. Microsoft is already discontinuing its Cortana app this month to prioritize newer generative AI innovations, like Bing Chat. Apple has also dedicated a significant portion of its $22.6 billion R&D budget to generative AI, as indicated by CEO Tim Cook.

A New Era of Models: Generative Vs. Discriminative

The story of Generative AI is not only about its applications but fundamentally about its inner workings. In the artificial intelligence ecosystem, two broad classes of models exist: discriminative and generative.

Discriminative models are what most people encounter in daily life. These algorithms take input data, such as a text or an image, and pair it with a target output, like a word translation or medical diagnosis. They're about mapping and prediction.

Generative models, on the other hand, are creators. They don't just interpret or predict; they generate new, complex outputs from vectors of numbers that often aren't even related to real-world values.
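To make the distinction concrete, here is a minimal sketch on made-up one-dimensional data. The discriminative part only learns a boundary between two classes, while the generative part fits a distribution and samples new points from it. The data, the threshold rule, and the function names are all illustrative assumptions, not something taken from a real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: two classes drawn from different Gaussians (made up for illustration).
x0 = rng.normal(loc=-2.0, scale=1.0, size=500)  # class 0
x1 = rng.normal(loc=2.0, scale=1.0, size=500)   # class 1

# Discriminative view: learn only a decision rule that maps an input to a label.
threshold = (x0.mean() + x1.mean()) / 2.0

def predict_label(x):
    """Return 1 if x falls on the class-1 side of the boundary, else 0."""
    return int(x > threshold)

# Generative view: model how class-1 data is distributed, then sample new data.
mu1, sigma1 = x1.mean(), x1.std()

def generate_class1(n):
    """Draw n new points from the Gaussian fitted to class 1."""
    return rng.normal(loc=mu1, scale=sigma1, size=n)

new_points = generate_class1(5)
```

The discriminative rule can only answer "which class?", whereas the fitted Gaussian can produce data that never appeared in the training set.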


The Technologies Behind Generative Models

Generative models owe their existence to deep neural networks, layered structures loosely inspired by the human brain. Optimized to capture the multifaceted variations in data, these networks serve as the backbone of numerous generative models.

A prime example is the Generative Adversarial Network (GAN), in which two neural networks, the generator and the discriminator, compete: the generator tries to produce samples that pass for real data, while the discriminator tries to tell real from generated, and each improves by learning from the other. From paintings to style transfer, from music composition to game-playing, these models are evolving and expanding in ways previously unimaginable.
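The competition inside a GAN can be sketched through its two loss functions. The snippet below assumes the discriminator outputs a probability that a sample is real, and uses the common non-saturating generator loss. It leaves out the networks and the training loop entirely, so treat it as an illustration of the objective rather than a working GAN.

```python
import numpy as np

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real samples scored as 1 and generated ones as 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    """The generator wants its samples scored as 1 (non-saturating variant)."""
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs for a batch of real and generated samples.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.2])
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

In this example the discriminator is winning: its loss is low, and the generator's loss is high, which is the signal that pushes the generator to improve.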

This doesn't stop with GANs. Variational Autoencoders (VAEs) are another pivotal player in the generative model field. VAEs stand out for their ability to create new images from seemingly random numbers. How? A VAE learns to compress data into a compact latent space and to decode it back. Sampling a random latent vector and passing it through the decoder gives birth to art that mirrors the complexities of human aesthetics.
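The generation step of a VAE can be sketched in a few lines. The decoder below is a single random linear layer standing in for a trained network, and the sizes are arbitrary assumptions, so the output is structured noise rather than a meaningful picture. The point is only the data flow: random latent vector in, image-shaped array out.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # size of the latent vector (assumed for illustration)
IMAGE_SIDE = 4   # the toy "image" is 4x4 pixels

# Stand-in for trained decoder weights; a real VAE learns these from data.
W = rng.normal(size=(LATENT_DIM, IMAGE_SIDE * IMAGE_SIDE))
b = np.zeros(IMAGE_SIDE * IMAGE_SIDE)

def decode(z):
    """Map a latent vector to pixel intensities in (0, 1)."""
    logits = z @ W + b
    pixels = 1.0 / (1.0 + np.exp(-logits))
    return pixels.reshape(IMAGE_SIDE, IMAGE_SIDE)

# Generation: sample "seemingly random numbers" and decode them.
z = rng.standard_normal(LATENT_DIM)
image = decode(z)
```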

Generative AI Types: Text to Text, Text to Image

Transformers & LLM

The paper “Attention Is All You Need” by Google Brain marked a shift in the way we think about text modeling. Instead of recurrent or convolutional architectures like Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs), the Transformer model relies entirely on attention, which essentially means focusing on different parts of the input text depending on the context. One of the main benefits of this is the ease of parallelization. Unlike RNNs, which process text sequentially and are therefore harder to scale, Transformers can process all parts of the text simultaneously, making training faster and more efficient on large datasets.

Transformer-model architecture

In a long text, not every word or sentence you read has the same importance. Some parts demand more attention based on the context. This ability to shift our focus based on relevance is what the attention mechanism mimics.

To understand this, think of a sentence: “Unite AI Publish AI and Robotics news.” Predicting the next word requires an understanding of what matters most in the previous context. The term “Robotics” suggests the next word could relate to a specific advancement or event in the robotics field, while “Publish” indicates that the following context may delve into a recent publication or article.

Self-Attention Illustration: the self-attention mechanism explained on a demo sentence

Attention mechanisms in Transformers are designed to achieve this selective focus. They gauge the importance of different parts of the input text and decide where to “look” when generating a response. This is a departure from older architectures like RNNs that tried to cram the essence of all input text into a single ‘state' or ‘memory'.

The workings of attention can be likened to a key-value retrieval system. In trying to predict the next word in a sentence, each preceding word offers a “key” suggesting its potential relevance. The better a key matches the current context (the “query”), the more weight that word's “value” receives in the prediction.
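The key-value analogy maps directly onto scaled dot-product attention. Here is a minimal NumPy sketch with hand-picked toy vectors; in a real Transformer the queries, keys, and values come from learned projections of the token embeddings, which are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Scaled dot-product attention: weight each value by how well its key matches the query."""
    d_k = keys.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)   # one score per (query, key) pair
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights

# Three preceding "words", each with a 2-D key and a 2-D value (toy numbers).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]])
query = np.array([[5.0, 0.0]])               # matches the first key best

output, weights = attention(query, keys, values)
```

Because the query lines up with the first key, almost all of the weight lands on the first value, and the output sits close to it.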

These deep learning models have been integrated into a wide range of applications, from Google's search engine enhancements with BERT to GitHub's Copilot, which harnesses the capability of Large Language Models (LLMs) to turn simple code snippets into fully functional source code.

Large Language Models (LLMs) like GPT-4, Bard, and LLaMA are colossal constructs designed to decipher and generate human language, code, and more. Their immense size, ranging from billions to trillions of parameters, is one of their defining features. These LLMs are fed copious amounts of text data, enabling them to grasp the intricacies of human language. A striking characteristic of these models is their aptitude for “few-shot” learning. Unlike conventional models, which need vast amounts of task-specific training data, LLMs can generalize from a very limited number of examples (or “shots”).
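In practice, "few-shot" usually means placing a handful of worked examples directly in the prompt. The helper below is a hypothetical sketch of how such a prompt might be assembled for a sentiment task. The `Review:`/`Sentiment:` layout and the function name are assumptions for illustration, and no model is actually called.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt with a few labeled examples followed by the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("The battery lasts all day.", "positive"),
     ("It broke after a week.", "negative")],
    "Setup was quick and painless.",
)
print(prompt)
```

Two examples ("shots") are often enough for a large model to pick up the pattern, with no change to its weights.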

State of Large Language Models (LLMs) as of mid-2023

| Model Name | Developer | Parameters | Availability and Access | Notable Features & Remarks |
|---|---|---|---|---|
| GPT-4 | OpenAI | 1.5 trillion (unconfirmed; not disclosed by OpenAI) | Not open source, API access only | Impressive performance on a variety of tasks; can process images and text; maximum input length of 32,768 tokens |
| GPT-3 | OpenAI | 175 billion | Not open source, API access only | Demonstrated few-shot and zero-shot learning capabilities. Performs text completion in natural language. |
| BLOOM | BigScience | 176 billion | Downloadable model, hosted API available | Multilingual LLM developed by a global collaboration. Supports 13 programming languages. |
| LaMDA | Google | 137 billion | Not open source, no API or download | Trained on dialogue; could learn to talk about virtually anything |
| MT-NLG | Nvidia/Microsoft | 530 billion | API access by application | Utilizes the transformer-based Megatron architecture for various NLP tasks. |
| LLaMA | Meta AI | 7 billion to 65 billion | Downloadable by application | Intended to democratize AI by offering access to those in research, government, and academia. |

How Are LLMs Used?

LLMs can be used in multiple ways, including:

  1. Direct Utilization: Simply using a pre-trained LLM for text generation or processing. For instance, using GPT-4 to write a blog post without any additional fine-tuning.
  2. Fine-Tuning: Adapting a pre-trained LLM for a specific task, a method known as transfer learning. An example would be customizing T5 to generate summaries for documents in a specific industry.
  3. Information Retrieval: Using LLMs, such as BERT or GPT, as part of larger architectures to develop systems that can fetch and categorize information.

ChatGPT Fine-Tuning Architecture

Multi-head Attention: Why One When You Can Have Many?

Relying on a single attention mechanism can be limiting. Different words or sequences in a text can have varied types of relevance or associations. This is where multi-head attention comes in. Instead of one set of attention weights, multi-head attention employs multiple sets, allowing the model to capture a richer variety of relationships in the input text. Each attention “head” can focus on different parts or aspects of the input, and their combined knowledge is used for the final prediction.
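A rough NumPy sketch of multi-head self-attention is shown below. The projection matrices are random stand-ins for learned weights, and the sizes (5 tokens, model width 8, 2 heads) are arbitrary assumptions. Masking, biases, and layer normalization are left out to keep the shape of the idea visible.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Run several attention heads in parallel and merge their outputs."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                     # one weight set per head
    heads = weights @ v                                    # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Each head gets its own slice of the projected vectors, so the two heads end up with different attention patterns over the same five tokens before their results are concatenated and mixed by `w_o`.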

ChatGPT: The Most Popular Generative AI Tool

The original GPT, introduced in 2018, had 12 layers, 12 attention heads, and roughly 120 million parameters, and was trained primarily on a dataset called BookCorpus. This was an impressive start, offering a glimpse into the future of language models.

GPT-2, unveiled in 2019, boasted a four-fold increase in layers, to 48. Significantly, its parameter count skyrocketed to 1.5 billion. This enhanced version was trained on WebText, a dataset of 40GB of text gathered from web pages linked on Reddit.

GPT-3, launched in May 2020, had 96 layers, 96 attention heads, and a massive parameter count of 175 billion. What set GPT-3 apart was its diverse training data, encompassing CommonCrawl, WebText, English Wikipedia, book corpora, and other sources, for a combined total of 570 GB.
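The jump in parameter counts follows mostly from depth and width. A common back-of-the-envelope estimate for a decoder-only Transformer is about 12 × layers × width², ignoring embeddings. The sketch below applies it using the layer counts quoted above. The hidden widths (768, 1600, 12288) are not stated in this article and are taken as assumptions from the published model configurations.

```python
def approx_transformer_params(n_layers, d_model):
    """Rough estimate: 4*d^2 for the attention projections plus 8*d^2 for a 4x-wide MLP, per layer."""
    return 12 * n_layers * d_model ** 2

# (layers, hidden width); the widths are assumptions, not figures from the article.
configs = {
    "GPT (2018)": (12, 768),
    "GPT-2": (48, 1600),
    "GPT-3": (96, 12288),
}

for name, (n_layers, d_model) in configs.items():
    estimate = approx_transformer_params(n_layers, d_model)
    print(f"{name}: about {estimate / 1e9:.2f} billion parameters (embeddings excluded)")
```

The estimate lands close to the quoted figures for GPT-2 and GPT-3. For the original GPT it comes out lower, because token embeddings make up a sizable share of such a small model.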

The intricacies of ChatGPT's workings remain a closely guarded secret. However, a process termed “reinforcement learning from human feedback” (RLHF) is known to be pivotal. Originating from OpenAI's earlier InstructGPT work, this technique was instrumental in honing the GPT-3.5 model to be more aligned with written instructions.

ChatGPT's training comprises a three-tiered approach:

  1. Supervised fine-tuning: Involves curating human-written conversational inputs and outputs to refine the underlying GPT-3.5 model.
  2. Reward modeling: Humans rank various model outputs based on quality, helping train a reward model that scores each output considering the conversation's context.
  3. Reinforcement learning: The conversational context serves as a backdrop where the underlying model proposes a response. This response is assessed by the reward model, and the process is optimized using an algorithm named proximal policy optimization (PPO).
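Steps 2 and 3 above can be sketched through their objective functions. The snippet assumes a pairwise ranking loss for the reward model, a formulation commonly used in published RLHF work, along with the standard PPO clipped surrogate. ChatGPT's exact training details are not public, so this illustrates the general technique, not OpenAI's implementation.

```python
import numpy as np

def reward_ranking_loss(r_preferred, r_rejected):
    """Pairwise loss -log(sigmoid(r_preferred - r_rejected)), averaged over pairs."""
    diff = np.asarray(r_preferred, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-diff))))

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO surrogate: limit how much one update can raise or lower a response's probability."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))

# Reward model scores for a response humans preferred vs. one they rejected (toy numbers).
print(reward_ranking_loss([2.0], [0.0]))   # small loss: the ranking agrees with humans
print(reward_ranking_loss([0.0], [2.0]))   # large loss: the ranking is inverted

# A policy update that would nearly triple a response's probability gets clipped.
print(ppo_clipped_objective(logp_new=[1.0], logp_old=[0.0], advantages=[1.0]))
```

The clipping is what keeps the fine-tuned model from drifting too far from its previous behavior in a single step, even when the reward model strongly favors a response.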

For those just dipping their toes into ChatGPT, a comprehensive starting guide is available on Unite.AI. If you're looking to delve deeper into prompt engineering with ChatGPT, we also have an advanced guide that sheds light on the latest state-of-the-art prompt techniques, available at “ChatGPT & Advanced Prompt Engineering: Driving the AI Evolution”.

Diffusion & Multimodal Models

While models like VAEs and GANs generate their outputs in a single pass, and are therefore locked into whatever they produce, diffusion models introduce the concept of “iterative refinement”. They repeatedly circle back, correcting mistakes from previous steps and gradually producing a more polished result.

Central to diffusion models is the art of “corruption” and “refinement”. In their training phase, a typical image is progressively corrupted by adding varying levels of noise. This noisy version is then fed to the model, which attempts to ‘denoise' or ‘de-corrupt' it. Through multiple rounds of this, the model becomes adept at restoration, understanding both subtle and significant aberrations.
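The corruption step has a simple closed form in the widely used DDPM formulation: a noisy sample at step t is a weighted mix of the clean image and Gaussian noise. The sketch below assumes a linear noise schedule with 1,000 steps, a common default but still an assumption here, and uses a random array in place of a real image.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_STEPS = 1000
betas = np.linspace(1e-4, 0.02, NUM_STEPS)   # assumed linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)     # fraction of signal surviving up to step t

def corrupt(x0, t):
    """Jump straight to noise level t: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    a_t = alphas_cumprod[t]
    x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * noise
    return x_t, noise

image = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a training image
slightly_noisy, _ = corrupt(image, 10)
very_noisy, target_noise = corrupt(image, 999)
```

During training, the model would receive `very_noisy` and the step index, and be asked to predict `target_noise`; doing this across many images and noise levels is what teaches it to restore.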

Image Generated from Midjourney

Generating new images after training is intriguing. The process starts from a completely random input, which is repeatedly refined using the model's predictions. The aim is a clean image in as few steps as possible. The level of corruption is controlled through a “noise schedule”, a mechanism that governs how much noise is applied at different stages. A scheduler, as seen in libraries like “diffusers”, dictates the nature of these noisy renditions based on established algorithms.
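The refinement loop can be sketched with a deterministic, DDIM-style update. Since there is no trained network here, the example cheats: an "oracle" that already knows the single target image plays the role of the noise-prediction model. The schedule, the step count, and the oracle are all assumptions chosen to make the loop easy to follow, and the code does not use the diffusers API.

```python
import numpy as np

rng = np.random.default_rng(0)

betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)

target = np.array([[0.5, -0.5], [1.0, 0.0]])  # pretend this is the clean image

def oracle_noise_prediction(x_t, t):
    """Stand-in for a trained network: recovers the noise using the known target."""
    a_t = alphas_cumprod[t]
    return (x_t - np.sqrt(a_t) * target) / np.sqrt(1.0 - a_t)

def sample(num_inference_steps=20):
    """Start from pure noise and refine it over a short list of timesteps."""
    timesteps = np.linspace(999, 0, num_inference_steps).round().astype(int)
    x = rng.standard_normal(target.shape)
    for i, t in enumerate(timesteps):
        eps = oracle_noise_prediction(x, t)
        a_t = alphas_cumprod[t]
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # current guess of the clean image
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps  # step to a lower noise level
    return x

result = sample()
```

With a real model, the noise prediction is imperfect at every step, which is why the number of steps and the choice of scheduler affect image quality.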

An essential architectural backbone for many diffusion models is the UNet—a convolutional neural network tailored for tasks requiring outputs mirroring the spatial dimension of inputs. It's a blend of downsampling and upsampling layers, intricately connected to retain high-resolution data, pivotal for image-related outputs.
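The shape bookkeeping behind a UNet can be shown without any learned layers. In the sketch below, average pooling and nearest-neighbour upsampling stand in for the convolutional blocks, and a single skip connection carries the full-resolution input across the bottleneck. This is an assumption-laden toy that only demonstrates how resolution is lost and then restored.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling on a (channels, height, width) array."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour 2x upsampling on a (channels, height, width) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def tiny_unet_pass(x):
    """Down, then up, then concatenate the skip connection along the channel axis."""
    skip = x                        # full-resolution features kept aside
    bottleneck = downsample(x)      # coarse, low-resolution summary
    restored = upsample(bottleneck) # back to the input's spatial size, but blurry
    return np.concatenate([restored, skip], axis=0)

x = np.arange(32, dtype=float).reshape(2, 4, 4)
out = tiny_unet_pass(x)
print(x.shape, downsample(x).shape, out.shape)
```

The output has the same height and width as the input, and the skip connection is what lets fine detail survive the trip through the low-resolution bottleneck.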

Delving deeper into the realm of generative models, OpenAI's DALL-E 2 emerges as a shining example of the fusion of textual and visual AI capabilities. It employs a three-part architecture:

  1. Text Encoder: It transforms the text prompt into a conceptual embedding within a latent space. This model doesn't start from ground zero. It leans on OpenAI's Contrastive Language–Image Pre-training (CLIP) model as its foundation. CLIP serves as a bridge between visual and textual data by learning visual concepts using natural language. Through a mechanism known as contrastive learning, it identifies and matches images with their corresponding textual descriptions.
  2. The Prior: The text embedding derived from the encoder is then converted into an image embedding. DALL-E 2 tested both autoregressive and diffusion methods for this task, with the latter showcasing superior results. Autoregressive models, as seen in Transformers and PixelCNN, generate outputs in sequences. On the other hand, diffusion models, like the one used in DALL-E 2, transform random noise into predicted image embeddings with the help of text embeddings.
  3. The Decoder: The climax of the process, this part generates the final visual output based on the text prompt and the image embedding from the prior phase. DALL-E 2's decoder owes its architecture to another model, GLIDE, which can also produce realistic images from textual cues.

Simplified Architecture of DALL-E Model
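The contrastive learning behind CLIP, mentioned in the text encoder step, can be sketched as a symmetric cross-entropy over an image-text similarity matrix. The embeddings below are toy one-hot vectors instead of encoder outputs, and the temperature of 0.07 is an assumed value, so read this as an illustration of the objective, not CLIP's actual code.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Pull matching image/text pairs together and push mismatched pairs apart."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature     # cosine similarities, sharpened
    idx = np.arange(len(logits))                      # pair i matches caption i
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> correct text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> correct image
    return float((loss_i2t + loss_t2i) / 2.0)

images = np.eye(4)                        # toy embeddings for 4 images
captions_matched = np.eye(4)              # each caption aligned with its image
captions_shuffled = np.roll(np.eye(4), 1, axis=0)  # every caption paired with the wrong image

print(clip_style_loss(images, captions_matched))
print(clip_style_loss(images, captions_shuffled))
```

The loss is near zero when every image is most similar to its own caption and large when the pairs are scrambled, which is the pressure that aligns the two embedding spaces.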

Python users interested in LangChain should check out our detailed tutorial covering everything from the fundamentals to advanced techniques.

Applications of Generative AI

Textual Domains

Beginning with text, Generative AI has been fundamentally altered by chatbots like ChatGPT. Relying heavily on Natural Language Processing (NLP) and large language models (LLMs), these chatbots can perform tasks ranging from code generation and language translation to summarization and sentiment analysis. ChatGPT, for instance, has seen widespread adoption, becoming a staple for millions. This is further augmented by conversational AI platforms, grounded in LLMs like GPT-4, PaLM, and BLOOM, that produce text, assist in programming, and even offer mathematical reasoning.

From a commercial perspective, these models are becoming invaluable. Businesses employ them for a myriad of operations, including risk management, inventory optimization, and demand forecasting. Some notable examples include Bing AI, Google's Bard, and the ChatGPT API.


The world of images has seen dramatic transformations with Generative AI, particularly since DALL-E 2's introduction in 2022. This technology, which can generate images from textual prompts, has both artistic and professional implications. Midjourney, for instance, uses text-to-image generation to produce impressively realistic images. A recent post demystifies Midjourney in a detailed guide, covering both the platform and its prompt engineering intricacies. Furthermore, platforms like Alpaca AI and Photoroom AI utilize Generative AI for advanced image editing functionalities such as background removal, object deletion, and even face restoration.

Video Production

Video production, while still in its nascent stage in the realm of Generative AI, is showcasing promising advancements. Platforms like Imagen Video, Meta's Make-A-Video, and Runway Gen-2 are pushing the boundaries of what's possible, even if truly realistic outputs are still on the horizon. These models offer substantial utility for creating digital human videos, with applications like Synthesia and SuperCreator leading the charge. Notably, Tavus AI offers a unique selling proposition by personalizing videos for individual audience members, a boon for businesses.

Code Creation

Coding, an indispensable aspect of our digital world, hasn't remained untouched by Generative AI. Although ChatGPT is a favored tool, several other AI applications have been developed for coding purposes. These platforms, such as GitHub Copilot, AlphaCode, and CodeComplete, serve as coding assistants and can even produce code from text prompts. What's intriguing is the adaptability of these tools. Codex, the driving force behind GitHub Copilot, can be tailored to an individual's coding style, underscoring the personalization potential of Generative AI.


Blending human creativity with machine computation, Generative AI has evolved into an invaluable tool, with platforms like ChatGPT and DALL-E 2 pushing the boundaries of what's conceivable. From crafting textual content to sculpting visual masterpieces, its applications are vast and varied.

As with any technology, ethical implications are paramount. While Generative AI promises boundless creativity, it's crucial to employ it responsibly, being aware of potential biases and the power of data manipulation.

With tools like ChatGPT becoming more accessible, now is the perfect time to test the waters and experiment. Whether you're an artist, coder, or tech enthusiast, the realm of Generative AI is rife with possibilities waiting to be explored. The revolution is not on the horizon; it's here and now. So, dive in!

The post Generative AI: The Idea Behind ChatGPT, DALL-E, Midjourney and More appeared first on Unite.AI.
