Mastering Generative AI with Diffusion Models: NVIDIA’s Cutting-Edge Course
● Generative AI in the ’70s, ’80s, and ’90s:
● DeepDream:
Concept of DeepDream
DeepDream essentially turns the network’s focus outward. Here's how the
process unfolds:
1. Input an Image: A base image is fed into a pre-trained neural network.
Popular models like Inception are often used.
2. Choose a Layer: A specific layer of the network is selected. Each layer
corresponds to a different level of abstraction: early layers respond to edges
and textures, while deeper layers respond to more complex patterns and object parts.
Visual Effects
● Art: DeepDream creates unique, artistic visuals and has inspired various
creative projects.
● Visualization: It helps researchers and engineers understand what
features a neural network has learned.
● Entertainment: Surreal images generated by DeepDream are often shared
for their aesthetic and intriguing qualities.
=> Run a model trained on ImageNet (e.g., Inception) in reverse to exaggerate the
features the model uses to classify an image, as sketched below.
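A minimal sketch of this idea in PyTorch (an illustrative assumption, not the course’s code): pick a layer of a pretrained network and run gradient ascent on the input image to amplify that layer’s activations. VGG16 is used here only because its feature extractor is a plain nn.Sequential; the notes mention Inception.

```python
import torch
import torchvision.models as models

# Pretrained convolutional stack; weights stay frozen, only the image is updated.
features = models.vgg16(weights="DEFAULT").features.eval()
for p in features.parameters():
    p.requires_grad_(False)

def deepdream_step(image: torch.Tensor, layer_idx: int = 20, lr: float = 0.01) -> torch.Tensor:
    """One gradient-ascent step that exaggerates the chosen layer's features."""
    image = image.clone().requires_grad_(True)
    activations = features[: layer_idx + 1](image)  # forward pass up to the chosen layer
    activations.norm().backward()                   # "how strongly does this layer fire?"
    with torch.no_grad():
        # Normalized gradient ascent on the pixels (the "run it in reverse" direction).
        image += lr * image.grad / (image.grad.abs().mean() + 1e-8)
    return image.detach()

# Usage: start from a real photo as a (1, 3, H, W) tensor normalized like ImageNet
# inputs, and call deepdream_step repeatedly to "dream" the image.
```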
Key Concepts:
Adversarial Game:
Training Process:
1. The Generator generates fake data (e.g., images) from random noise.
2. The Discriminator evaluates both real data and fake data, classifying
them as real or fake.
3. The Generator is penalized if the Discriminator correctly classifies its
output as fake.
4. Both the Generator and Discriminator are updated using
backpropagation based on their respective losses:
○ The Generator tries to minimize its loss by producing more
realistic data.
○ The Discriminator tries to minimize its loss by correctly
classifying real and fake data.
The ultimate goal is for the Generator to create data that is so realistic that the
Discriminator can no longer reliably distinguish between real and fake data. At
this point, the Generator has successfully learned to generate high-quality data.
This adversarial process drives the improvement of both networks, with the
Generator producing more realistic outputs and the Discriminator refining its
ability to tell real from fake.
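A minimal training-step sketch of this adversarial game in PyTorch; the generator G, the discriminator D, and the data shapes are illustrative placeholders, not the course’s models:

```python
import torch
import torch.nn as nn

latent_dim = 100
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real):
    batch = real.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Discriminator: classify real data as real and generated data as fake.
    fake = G(torch.randn(batch, latent_dim)).detach()   # no gradients into G here
    d_loss = bce(D(real), real_labels) + bce(D(fake), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator: try to make the discriminator label its fakes as real.
    g_loss = bce(D(G(torch.randn(batch, latent_dim))), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```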
○ The architecture is shaped like a "U," with the encoder forming the
left half and the decoder forming the right half, connected by a
bottleneck in the middle.
4. Fully Convolutional:
○ Unlike classification networks that produce a single output, U-Net
is fully convolutional and outputs a segmentation map of the same
size as the input image.
Workflow of U-Net:
Applications:
Advantages:
● Efficient Use of Data: U-Net works well with limited training data by
leveraging data augmentation and skip connections.
● Precise Localization: Skip connections help preserve spatial context and
fine details in segmentation tasks.
● Versatility: It is adaptable to 2D and 3D image segmentation tasks.
Challenges:
Overall, U-Net has become a cornerstone model for image segmentation tasks
due to its elegant design and robust performance.
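A minimal one-level U-Net sketch in PyTorch, assuming the standard encoder / bottleneck / decoder layout with a concatenation skip connection; real U-Nets stack several such levels:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=1):
        super().__init__()
        self.enc = conv_block(in_ch, 32)            # encoder: left half of the "U"
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)        # bottom of the "U"
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)               # decoder: right half of the "U"
        self.head = nn.Conv2d(32, out_ch, 1)        # per-pixel output (segmentation map)

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.down(e))
        d = self.up(b)
        d = torch.cat([e, d], dim=1)                # skip connection: concat encoder features
        return self.head(self.dec(d))

# Fully convolutional: the output keeps the input's spatial size.
print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)   # torch.Size([1, 1, 64, 64])
```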
○ Encoder:
■ Captures hierarchical features by downsampling the input.
○ Bottleneck:
■ A compact representation of the input, capturing its abstract
features.
○ Decoder:
■ Upsamples features and reconstructs the image, applying
skip connections to merge details from the encoder.
● Pix2Pix:
○ A famous image-to-image translation framework where a U-Net
generator maps an input image (e.g., grayscale or edge maps) to an
output image (e.g., color or filled regions).
○ The discriminator evaluates whether the generated output matches
the input conditions.
● Super-Resolution:
Applications:
1. Dimensionality:
1. Autoencoders:
Skip connections are a neural network design technique where features from
earlier layers (encoder) are directly copied and concatenated or added to
corresponding layers in the later part of the network (decoder). These
connections "skip" one or more intermediate layers, allowing the network to pass
detailed spatial or low-level information directly to the deeper layers.
In the context of U-Nets, skip connections link the encoder and decoder
layers at the same spatial resolution to preserve important features lost
during downsampling. This design helps the model retain both low-level details
and high-level context.
How Skip Connections Work in U-Net
○ At each level, the feature maps from the encoder are copied and
passed to the corresponding decoder layer (at the same spatial
resolution).
○ These encoder features are concatenated or added to the
upsampled decoder features before further processing.
○ This allows the decoder to reuse spatial details from the encoder,
improving reconstruction.
1. Encoder:
Let F_enc denote the encoder feature map and F_dec the upsampled decoder feature
map at the same spatial resolution. The skip connection can merge them in two
ways (a short sketch follows these options):
1. Concatenation:
F_out = Concat(F_enc, F_dec)
This is common in U-Net, where features are stacked channel-wise.
2. Addition:
F_out = F_enc + F_dec
This approach is seen in residual networks (ResNets).
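A short PyTorch sketch of these two merge styles (tensor shapes are illustrative):

```python
import torch

f_enc = torch.randn(1, 32, 64, 64)   # encoder features at some resolution
f_dec = torch.randn(1, 32, 64, 64)   # upsampled decoder features, same resolution

# U-Net style: channel-wise concatenation (channel count doubles, 32 -> 64).
f_out_concat = torch.cat([f_enc, f_dec], dim=1)

# ResNet style: element-wise addition (channel count stays at 32).
f_out_add = f_enc + f_dec

print(f_out_concat.shape, f_out_add.shape)   # (1, 64, 64, 64) and (1, 32, 64, 64)
```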
1. ResNets:
○ Skip connections are used to bypass entire blocks, improving
gradient flow and convergence.
2. DenseNets:
○ Features from all earlier layers are connected to every subsequent
layer for efficient feature reuse.
Summary
- Transposed Convolution: one way to picture it is to insert 2 rows of zeros between
each image row and 2 columns of zeros between each image column, then apply a
regular convolution; this upsamples the feature map (example below).
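A small PyTorch example: a stride-3 transposed convolution corresponds (roughly) to inserting 2 zeros between input elements and then running a regular convolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)                  # small 4x4 feature map
up = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3, stride=3)
y = up(x)
print(y.shape)   # torch.Size([1, 1, 12, 12]) -- spatial size grows with the stride
```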
● Diffusion Models
The goal of a diffusion model is to generate data (like images, audio, or text) by
simulating a process that gradually transforms random noise into structured
data. This is achieved by learning to reverse a noising process (diffusion) that
turns clean data into noise, and then gradually recovers the original data from
noise. Diffusion models are widely used in generative tasks due to their ability to
produce high-quality, realistic outputs.
● Starting with Clean Data: Begin with a clean data sample x_0 (e.g., an
image).
● Gradual Noise Addition: Over several steps, noise is progressively
added to the data, turning it into random noise over time.
● Mathematical Process: The forward process adds Gaussian noise at each step,
described by q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) · x_{t-1}, β_t I), where β_t comes
from the noise schedule.
● Goal: The model learns to reverse the noising process, turning noisy data
x_t back into less noisy data x_{t-1} over multiple steps.
● Neural Network Model: A neural network (often U-Net) predicts the
clean data at each step based on the noisy input and timestep.
● Training: The model is trained to minimize the error between its prediction
(the clean data or, more commonly, the added noise) and the true target; a
minimal training sketch appears after this outline.
3. Noise Schedule: The sequence β_1, …, β_T controls how much noise is added at
each forward step (commonly a linear or cosine schedule).
4. Architecture Details:
● U-Net Backbone: The neural network for reverse diffusion typically uses a
U-Net-style architecture, with encoder-decoder layers and skip
connections for better information flow.
● Attention Mechanisms: In some advanced models, attention layers are
added to help focus on important features during denoising.
5. Loss Function: Typically a mean squared error between the model’s prediction
(the added noise, or the clean data) and the true target at the sampled timestep.
● The generated output from a diffusion model is not an exact replica of any
original image but a new, synthetic image that shares similar
characteristics to the training data. The output is generated by starting
with random noise and iteratively denoising it.
● The model learns to produce new data that resembles the original data
distribution but is not identical to any specific training example.
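Putting the forward process, the noise schedule, the U-Net denoiser, and the loss together, here is a minimal DDPM-style training-step sketch in PyTorch. It assumes the common noise-prediction parameterization and a simple linear beta schedule; `model` stands for the U-Net denoiser and its `model(x_t, t)` signature is an assumption, not course code.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule β_1..β_T
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative products ᾱ_t

def training_step(model, x0):
    """x0: a batch of clean images (B, C, H, W). Returns the denoising loss."""
    batch = x0.size(0)
    t = torch.randint(0, T, (batch,))             # random timestep per sample
    eps = torch.randn_like(x0)                    # the noise the model must recover
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    # Closed-form forward process: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    eps_pred = model(x_t, t)                      # U-Net predicts the added noise
    return F.mse_loss(eps_pred, eps)              # simple MSE loss on the noise
```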
Summary:
3. Masked Convolutions: Since each pixel should only depend on the
previously generated pixels (and not the future pixels), PixelCNN uses
masked convolutions. These convolutions ensure that the network can
only "see" pixels that have already been generated when predicting the
next pixel in the sequence. The mask is applied so that the convolutional
filter never peeks at pixels that haven't been generated yet (see the sketch
after this list).
4. Pixel Value Prediction: For each pixel, the model predicts a probability
distribution for its possible values. Typically, this could be done for each
color channel (RGB), and the model samples from the distribution for each
pixel.
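A sketch of a PixelCNN-style masked convolution in PyTorch: a type-"A" mask zeroes the center pixel and everything after it in raster order, so a pixel’s prediction never sees itself or future pixels (type "B", used in later layers, keeps the center). The layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        k = self.kernel_size[0]
        mask = torch.ones_like(self.weight)
        # Zero out weights to the right of the center (and the center itself for type "A").
        mask[:, :, k // 2, k // 2 + (mask_type == "B"):] = 0
        # Zero out all rows below the center.
        mask[:, :, k // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask        # hide "future" pixels before convolving
        return super().forward(x)

# First layer uses mask "A"; subsequent layers use mask "B".
layer = MaskedConv2d("A", in_channels=1, out_channels=16, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))       # same spatial size, causal receptive field
```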
Variants of PixelCNN:
Use Cases:
● Strengths:
- GANs:
- Difficult to optimize due to unstable training dynamics
-AutoRegressive Models
The model estimates a probability distribution for each pixel value (e.g., a
distribution over 256 possible intensity levels for grayscale or 256 levels
per channel for RGB).
-RGB Channels:
● For RGB images, the model treats the channels (Red, Green, Blue)
as dependent.
● It first generates the value for the Red channel, then conditions on
it to generate the Green channel, and finally conditions on both to
generate the Blue channel: p(x_i) = p(R_i) · p(G_i | R_i) · p(B_i | R_i, G_i).
● For each pixel xi, the model produces a probability distribution over
the possible intensity levels (0–255 for each channel).
● The distribution may look something like this (for a single channel):
-This means the model picks the intensity level k with the highest
probability, effectively fixing the pixel to its most likely value.
● After the RGB values are determined, the pixel x_i is fixed in the
output image. This value is now part of the context for generating
the next pixel.
- Key Points
● For the Red (R) channel, the model outputs 256 feature maps.
Each feature map corresponds to one possible intensity value for the
pixel x_i, ranging from 0 to 255.
● These feature maps are processed to form a probability distribution:
P(R = k) = p(k) for k = 0, 1, …, 255, where p(k) represents the probability of
the Red channel having intensity k at the current pixel x_i.
● The 256 feature-map values for the Red channel are combined and normalized
with a softmax function so that the probabilities sum to 1 (see the short
sketch after this list).
● The model then moves on to the next pixel and repeats the process.
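A tiny sketch of that normalization-and-selection step in PyTorch (the logits are illustrative):

```python
import torch

logits = torch.randn(256)                  # one score per possible intensity 0..255
probs = torch.softmax(logits, dim=0)       # p(0), p(1), ..., p(255), summing to 1

red_sampled = torch.multinomial(probs, 1)  # sample an intensity level, or...
red_greedy = probs.argmax()                # ...pick the most likely level
```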
-Diffusion Process:
* Normal Distribution :
Key Contributions
Another approach is for the model to directly predict the original clean
data x_0 from x_t. This can be done by parameterizing the reverse process to
directly reconstruct x_0:
However, this is less commonly used than noise prediction because noise
prediction often leads to more stable training and better results.
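For reference, the standard DDPM identity that links the two parameterizations (a well-known relationship, added here rather than taken from the course notes): the closed-form forward process

x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε

implies that a model predicting the noise ε_θ(x_t, t) implicitly predicts the clean sample as

x̂_0 = (x_t − √(1 − ᾱ_t) · ε_θ(x_t, t)) / √(ᾱ_t).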
Where:
Steps:
1. Efficiency:
Architecture of LDMs
1. Autoencoder:
○ Compresses input data x into latent space z via z = Encoder(x).
○ Reconstructs x from z via x ≈ Decoder(z).
2. Latent Diffusion Model:
○ Performs the diffusion process in the latent space z, predicting
noise or clean latent values during the reverse process (a sampling
sketch follows this list).
3. Conditioning Mechanisms:
○ Optional input conditioning (e.g., text embeddings) can be
concatenated with latent inputs or injected into the model
using cross-attention mechanisms.
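A sketch of how these pieces fit together at generation time, in PyTorch. `denoiser`, `decoder`, and `text_encoder` are placeholders for the U-Net, the autoencoder’s decoder, and the conditioning model; the DDPM-style update is one common sampler choice, not necessarily the course’s:

```python
import torch

@torch.no_grad()
def generate(denoiser, decoder, text_encoder, prompt, latent_shape, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    cond = text_encoder(prompt)               # e.g., text embeddings for cross-attention
    z = torch.randn(latent_shape)             # start from pure noise in *latent* space
    for t in reversed(range(T)):
        eps = denoiser(z, t, cond)            # predicted noise in the latent
        # DDPM mean update: remove the predicted noise contribution at step t.
        z = (z - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)   # add sampling noise
    return decoder(z)                         # map the clean latent back to pixel space
```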
1. Autoencoders:
Normalization :
● Why? If the numbers are too large or vary too much, training
becomes unstable.
● Normalization adjusts these numbers to be more consistent (e.g.,
closer to 0 and evenly spread).
Types of Normalization
Batch Normalization (BN)
What it does: Normalizes each neuron’s activations using the mean and variance
computed across the examples in the batch.
How it works:
1. Imagine you have a batch of 16 images (data points), and the layer
has 10 neurons.
2. For each neuron, BN computes:
○ The mean and variance of the activations across the 16
images (the batch).
○ Example: For neuron 1, it calculates the mean/variance of its
16 outputs.
● Works well with large batch sizes.
● Common in convolutional and fully connected layers.
Key Limitation: BN relies on batch statistics, so it performs poorly with very
small batch sizes, where those statistics become noisy (a short example follows).
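A small PyTorch check of what BN computes, using the 16-examples-by-10-neurons setup above:

```python
import torch
import torch.nn as nn

x = torch.randn(16, 10)              # batch of 16 examples, layer with 10 neurons
bn = nn.BatchNorm1d(10)              # fresh module: gamma=1, beta=0, training mode
y = bn(x)

# The same computation done by hand: per-neuron mean/variance over the batch.
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)
print(torch.allclose(y, y_manual, atol=1e-5))   # True
```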
What it does:
How it works:
Group Normalization (GN)
What it does: Splits a layer’s channels/neurons into groups and normalizes within
each group for every data point, independently of the batch size.
How it works:
1. Imagine the layer has 16 neurons, and you divide them into 4 groups
(each with 4 neurons).
2. For each group, GN computes:
○ The mean and variance across the activations in that group
for a single data point.
Key Limitation: The number of groups is an extra hyperparameter to choose (a short
example follows).
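A matching PyTorch example for GN; the 16-channel / 4-group split mirrors the description above, and the other shapes are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 16, 8, 8)                       # (batch, channels, H, W)
gn = nn.GroupNorm(num_groups=4, num_channels=16)   # 4 groups of 4 channels each
y = gn(x)                                          # statistics are per sample, per group
# Works the same with batch size 1, which is where BN struggles.
```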
Both ReLU and GELU are activation functions used in neural networks.
They transform the output of a neuron to introduce non-linearity, allowing
the network to learn complex patterns.
Definition:
ReLU outputs the input directly if it’s positive; otherwise, it outputs zero.
Mathematically:
ReLU(x)=max(0,x)
Graph:
Advantages:
Disadvantages:
1. Dying Neurons: If x≤0, the gradient is zero, and the neuron can
become inactive (i.e., it always outputs 0).
2. Not Smooth: The function has a sharp corner at x = 0, which may
cause optimization challenges.
GELU (Gaussian Error Linear Unit)
Definition:
GELU weights the input by the probability that a standard normal variable falls below it. Mathematically:
GELU(x) = x · Φ(x), where Φ is the standard normal CDF.
Graph:
Advantages:
1. Smooth Transitions: Unlike ReLU, GELU does not have sharp
changes, which makes optimization smoother.
2. Adaptive Non-linearity: The behavior dynamically scales between
linear and nonlinear, depending on the input.
3. Better Performance in Some Models: Particularly effective in
transformers and attention-based architectures (e.g., BERT, GPT).
Disadvantages: GELU is slightly more expensive to compute than ReLU, since it
involves the Gaussian CDF (or an approximation of it). A short numeric comparison
of the two activations follows.
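A quick numeric comparison in PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))   # [0.0000, 0.0000, 0.0000, 0.5000, 2.0000] -- hard cutoff at zero
print(F.gelu(x))   # ≈ [-0.0455, -0.1543, 0.0000, 0.3457, 1.9545] -- smooth, lets
                   #   small negative values pass through slightly
```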
Link : https://distill.pub/2016/deconv-checkerboard/
https://ar5iv.labs.arxiv.org/html/1907.065157
-When we look very closely at images generated by neural networks, we often see a
strange checkerboard pattern of artifacts. These artifacts appear to be caused by
deconvolutions. We demonstrate that replacing deconvolution with a
"resize-convolution" causes these artifacts to disappear in a variety of contexts.
-When we have neural networks generate images, we often have them
build them up from low resolution, high-level descriptions. This allows
the network to describe the rough image and then fill in the details.
In order to do this, we need some way to go from a lower resolution image to a higher
one. We generally do this with the deconvolution operation. Roughly, deconvolution
layers allow the model to use every point in the small image to “paint” a square in the
larger one.
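A minimal sketch of the resize-convolution alternative in PyTorch: upsample with plain interpolation first, then apply a normal convolution (layer sizes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResizeConv(nn.Module):
    """Nearest-neighbor upsampling followed by a standard convolution."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        return self.conv(x)

y = ResizeConv(16, 8)(torch.randn(1, 16, 32, 32))   # -> shape (1, 8, 64, 64)
```

Because every output pixel is covered evenly, this avoids the uneven kernel overlap that produces the checkerboard artifacts.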
How It Works:
For example:
Advantages:
Disadvantages:
1. Positional Embeddings
2. Time Embeddings (see the sketch after this list)
3. Word Embeddings
4. Image Embeddings
5. Graph Embeddings
Encodes nodes, edges, or entire graphs into vectors for graph-based tasks.
7. Entity Embeddings
8. Spatial/Geographical Embeddings
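Since diffusion models must tell the denoiser which timestep it is working on, the time embedding is often a sinusoidal, positional-style embedding. A minimal sketch, assuming that common design (not taken from the course):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Map integer timesteps of shape (B,) to sinusoidal embeddings of shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)  # geometric frequencies
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 500]))   # -> shape (3, 128)
```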
Summary
Model Definition
Key Features
● Natural Language Supervision: It uses captions and descriptions
as an alternative to traditional labeled datasets.
● Zero-shot Learning: Eliminates the need for fine-tuning on specific
downstream tasks.
● Scalability: Demonstrates strong performance across over 30
computer vision benchmarks.
Training Methodology
Applications
Example:
Example:
4. Efficiency in Preprocessing
● CLIP embeddings align image and text data into the same feature
space.
● This alignment enables seamless interaction between modalities (see the
retrieval sketch after this list):
○ Text-to-Image: Find images matching a given text description.
○ Image-to-Text: Find descriptions matching a given image.
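A sketch of text-to-image retrieval in the shared embedding space, assuming the open-source `clip` package from OpenAI’s CLIP repository and a few placeholder image files:

```python
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_paths = ["car1.jpg", "dog.jpg", "beach.jpg"]      # placeholder files
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
text = clip.tokenize(["a red sports car"]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(images)
    txt_emb = model.encode_text(text)
    # Normalize, then rank images by cosine similarity to the text query.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (txt_emb @ img_emb.T).squeeze(0)

best = scores.argmax().item()
print(image_paths[best], scores[best].item())
```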
6. Flexibility in Querying
Example:
Real-World Applications
1. Zero-Shot Learning
Example:
Applications:
● Search engines: Query "a red sports car" to find matching images.
● Image captioning: Match image embeddings to textual descriptions.
3. Multimodal Understanding
Example:
6. Democratizing AI