CAP6412
Advanced Computer Vision

Mubarak Shah
[email protected]
HEC-245

Lecture 5: Diffusion Models - Part II


Diffusion models in vision: A survey
https://arxiv.org/pdf/2209.04747.pdf

Alin Croitoru (University of Bucharest, Romania), Vlad Hondru (University of Bucharest, Romania),
Radu Tudor Ionescu (University of Bucharest, Romania), Mubarak Shah (University of Central Florida, US)
High-level overview
• Diffusion models are probabilistic models used for image generation
• They generate images by reversing a process that gradually degrades the data
• They consist of two processes:
  o The forward process: the data is progressively destroyed by adding noise across multiple time steps
  o The reverse process: a neural network sequentially removes the noise to recover the original data

(Diagram: the forward process maps the data distribution to a standard Gaussian; the reverse process maps the standard Gaussian back to the data distribution.)
High-level overview

• Three categories:

 Denoising Diffusion Probabilistic Models (DDPM)

 Noise Conditioned Score Networks (NCSN)

 Stochastic Differential Equations (SDE)


Denoising Diffusion Probabilistic Models (DDPMs)

Forward process (left to right):

$x_0 \rightarrow x_1 \rightarrow \dots \rightarrow x_{T-1} \rightarrow x_T$, where $x_0 \sim p(x_0)$ (the data distribution) and $x_T \sim \mathcal{N}(0, I)$.
Denoising Diffusion Probabilistic Models (DDPMs)

Reverse process (right to left):

$x_0 \leftarrow x_1 \leftarrow \dots \leftarrow x_{T-1} \leftarrow x_T$, where $x_0 \sim p(x_0)$ and $x_T \sim \mathcal{N}(0, I)$.


Denoising Diffusion Probabilistic Models (DDPMs)

Forward process (iterative) - the image is gradually replaced with noise:

$x_t \sim p(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad \beta_t \ll 1, \quad t = 1, \dots, T$
Denoising Diffusion Probabilistic Models (DDPMs)

Forward process, sampled in one shot from $x_0$. Notation: $\alpha_t = 1 - \beta_t$, $\hat{\beta}_t = \prod_{i=1}^{t} \alpha_i$.

$x_t \sim p(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\hat{\beta}_t}\, x_0,\ (1-\hat{\beta}_t)\, I\right)$
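For concreteness, here is a minimal NumPy sketch of this one-shot forward sampling. The linear $\beta$ schedule and T = 1000 are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule (illustrative)
alphas = 1.0 - betas
beta_hat = np.cumprod(alphas)           # beta_hat_t = prod_{i<=t} alpha_i

def forward_sample(x0, t, rng=np.random.default_rng()):
    """Draw x_t ~ N(sqrt(beta_hat_t) * x0, (1 - beta_hat_t) I) in one shot (t in 1..T)."""
    z = rng.standard_normal(x0.shape)
    x_t = np.sqrt(beta_hat[t - 1]) * x0 + np.sqrt(1.0 - beta_hat[t - 1]) * z
    return x_t, z
```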
DDPMs. Training objective
Remember the reverse process: each denoising step is approximated by a neural network with weights $\theta$:

$p(x_{t-1} \mid x_t) \approx p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$
DDPMs. Training objective
Simplification: fix the variance to $\sigma_t^2 I$ instead of learning it, and predict/learn only the mean $\mu_\theta$ with the neural network:

$p(x_{t-1} \mid x_t) \approx p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$
DDPMs. Training objective
A U-Net-like neural network takes the noisy image $x_t$ and the time step $t$ as input and predicts $\mu_\theta(x_t, t)$; the next, less noisy image is then sampled as $x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$.
DDPMs. Training Algorithm

Training objective, with $\hat{\beta}_t = \prod_{i=1}^{t} \alpha_i$:

$\min_\theta\ \mathcal{L}, \qquad \mathcal{L} = \mathbb{E}_{x_0 \sim p(x_0),\, z_t \sim \mathcal{N}(0, I),\, t}\ \frac{1}{T}\, \big\| z_t - z_\theta(x_t, t) \big\|_2^2$

Training algorithm:

Repeat
  $x_0 \sim p(x_0)$   % sample an image from the data set
  $t \sim \mathcal{U}(\{1, \dots, T\})$   % randomly choose a time step t of the forward process
  $z_t \sim \mathcal{N}(0, I)$   % sample the noise z_t
  $x_t = \sqrt{\hat{\beta}_t}\, x_0 + \sqrt{1 - \hat{\beta}_t}\, z_t$   % get the noisy image
  $\theta = \theta - lr \cdot \nabla_\theta \mathcal{L}$   % update the neural network weights
Until convergence
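A minimal PyTorch-style sketch of one step of this loop follows. The noise_model (a U-Net in practice), its (x_t, t) call signature, and the optimizer are placeholder assumptions.

```python
import torch

def ddpm_training_step(noise_model, optimizer, x0, beta_hat):
    """One DDPM training step: predict the noise z_t that was added to x_0 at a random step t."""
    b, T = x0.shape[0], beta_hat.shape[0]
    t = torch.randint(1, T + 1, (b,))                           # t ~ U{1, ..., T}
    z = torch.randn_like(x0)                                    # z_t ~ N(0, I)
    bh = beta_hat[t - 1].view(b, *([1] * (x0.dim() - 1)))       # broadcast beta_hat_t over image dims
    x_t = bh.sqrt() * x0 + (1.0 - bh).sqrt() * z                # one-shot forward sample
    loss = ((z - noise_model(x_t, t)) ** 2).mean()              # || z_t - z_theta(x_t, t) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```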
DDPMs. Sampling

• Pass the current noisy image $x_t$, along with $t$, to the neural network to obtain the noise prediction $z_\theta(x_t, t)$.

• From this prediction, compute the mean of the Gaussian distribution and sample the image for the next iteration:

$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\hat{\beta}_t}}\, z_\theta(x_t, t)\right), \qquad x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2\, I\right)$
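Putting the two steps into a loop, a minimal PyTorch sketch of DDPM sampling; the noise_model signature, passing betas as a 1-D tensor, and the choice $\sigma_t^2 = \beta_t$ are placeholder assumptions.

```python
import torch

@torch.no_grad()
def ddpm_sample(noise_model, shape, betas):
    """Reverse process: start from pure noise and denoise for t = T, ..., 1."""
    alphas = 1.0 - betas
    beta_hat = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                      # x_T ~ N(0, I)
    for t in range(betas.shape[0], 0, -1):
        z_pred = noise_model(x, torch.full((shape[0],), t))     # z_theta(x_t, t)
        mean = (x - (1.0 - alphas[t - 1]) / torch.sqrt(1.0 - beta_hat[t - 1]) * z_pred) \
               / torch.sqrt(alphas[t - 1])
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t - 1]) * noise             # sigma_t^2 = beta_t (assumed choice)
    return x
```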
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Score Function

• The score is the direction in which we need to change the input x so that its density increases.

• Example figure: the score field of a mixture of two Gaussians in 2D.

(Image credit: Yang Song, https://yang-song.net/blog/2021/score/)
Score Function
• Second formulation of diffusion models

• Based on the Langevin dynamics method:

  o Starts from a random sample
  o Applies iterative updates with the score function to modify the sample
  o The result has a higher chance of being a sample from the true distribution p(x)
Naïve score-based model

• Score: the gradient of the log probability density with respect to the input, $\nabla_x \log p(x)$

• Langevin dynamics update:

$x_{i+1} = x_i + \frac{\gamma}{2}\, \nabla_x \log p(x_i) + \sqrt{\gamma}\cdot \omega_i$

  o Step size $\gamma$ – controls the magnitude of the update in the direction of the score
  o Score $\nabla_x \log p(x_i)$ – estimated by the score network
  o Noise $\omega_i$ – random Gaussian noise $\mathcal{N}(0, I)$
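As a minimal sketch, one Langevin update in NumPy; score_fn stands in for the learned score network and is an assumption here (any callable returning $\nabla_x \log p(x)$ works).

```python
import numpy as np

def langevin_step(x, score_fn, gamma, rng=np.random.default_rng()):
    """One Langevin update: x <- x + (gamma / 2) * score(x) + sqrt(gamma) * omega."""
    omega = rng.standard_normal(x.shape)                        # omega ~ N(0, I)
    return x + 0.5 * gamma * score_fn(x) + np.sqrt(gamma) * omega
```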


Naïve score-based model

• The score is approximated with a neural network $s_\theta(x)$

• The score network is trained using score matching:

$\mathbb{E}_{x \sim p(x)} \big\| s_\theta(x) - \nabla_x \log p(x) \big\|_2^2$

• Denoising score matching:

  o Add small noise to each sample of the data: $x_\sigma \sim \mathcal{N}(x_\sigma;\ x,\ \sigma^2 \cdot I) = p_\sigma(x_\sigma \mid x)$

  o Objective: $\mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{x_\sigma \sim p_\sigma(x_\sigma \mid x)} \big\| s_\theta(x_\sigma) - \nabla_{x_\sigma} \log p_\sigma(x_\sigma \mid x) \big\|_2^2$

  o After training: $s_\theta(x) \approx \nabla_x \log p(x)$
Naïve score-based model. Problems

• Manifold hypothesis: real data resides on low-dimensional manifolds

• The score is undefined outside these low-dimensional manifolds

• Data being concentrated in certain regions results in further issues:

  o The score is incorrectly estimated within the low-density regions

  o Langevin dynamics never converges to the high-density regions

(Image credit: Yang Song, https://yang-song.net/blog/2021/score/)

Noise Conditioned Score Network (NCSNs)
• Solution:

  o Perturb the data with random Gaussian noise at different scales

  o Learn score estimates of the noisy distributions via a single score network

(Image credit: Yang Song, https://yang-song.net/blog/2021/score/)


Noise Conditioned Score Network (NCSNs)
• Given a sequence of Gaussian noise scales $\sigma_1 < \sigma_2 < \dots < \sigma_T$ such that:

  o $p_{\sigma_1}(x) \approx p(x_0)$ – approximately the true data distribution

  o $p_{\sigma_T}(x) \approx \mathcal{N}(0, I)$ – almost equal to the standard Gaussian distribution

• And the forward process, i.e. the noise perturbation, given by:

$p_{\sigma_t}(x_t \mid x) = \mathcal{N}(x_t;\ x,\ \sigma_t^2 \cdot I) = \frac{1}{\sigma_t \sqrt{2\pi}} \cdot \exp\!\left(\frac{-1}{2} \cdot \frac{(x_t - x)^2}{\sigma_t^2}\right)$

• The gradient (score) of this conditional can be written as:

$\nabla_{x_t} \log p_{\sigma_t}(x_t \mid x) = -\, \frac{x_t - x}{\sigma_t^2}$
Noise Conditioned Score Network (NCSNs)

• Training the NCSN with denoising score matching, the following objective is minimized:

$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \lambda(\sigma_t)\, \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(x_t \mid x)} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x}{\sigma_t^2} \right\|_2^2$

where $\lambda(\sigma_t)$ is a weighting function.
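A sketch of this weighted denoising score-matching loss in PyTorch; the placeholder score_model(x_t, sigma) and the choice $\lambda(\sigma) = \sigma^2$ are assumptions (the weighting is not fixed by the slide).

```python
import torch

def ncsn_loss(score_model, x0, sigmas):
    """Weighted denoising score matching across noise scales sigma_1 < ... < sigma_T."""
    b = x0.shape[0]
    idx = torch.randint(0, sigmas.shape[0], (b,))               # pick one noise scale per sample
    sigma = sigmas[idx].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = x0 + sigma * noise                                    # x_t ~ N(x_0, sigma^2 I)
    target = -(x_t - x0) / sigma ** 2                           # = -noise / sigma
    s = score_model(x_t, sigma)                                 # s_theta(x_t, sigma)
    return (sigma ** 2 * (s - target) ** 2).mean()              # lambda(sigma) = sigma^2 (assumed)
```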
Noise Conditioned Score Network (NCSNs). Sampling
Annealed Langevin dynamics

Parameters:
  N – number of Langevin dynamics iterations per noise scale
  $\sigma_1 < \dots < \sigma_T$ – noise scales
  $\gamma_t$ – update magnitude (step size) at scale $\sigma_t$

Algorithm:

  $x_0 \sim \mathcal{N}(0, I)$   % sample standard Gaussian noise
  for $t = T, \dots, 1$ do:   % start from the largest noise scale, denoted by the time step
    for $i = 1, \dots, N$ do:   % for N iterations execute the Langevin dynamics updates
      $z \sim \mathcal{N}(0, I)$   % get noise
      $x_i = x_{i-1} + \frac{\gamma_t}{2}\, s_\theta(x_{i-1}, \sigma_t) + \sqrt{\gamma_t}\cdot z$   % update
    $x_0 \leftarrow x_N$   % the next scale starts from the last sample
  return $x_N$
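The same sampler as a NumPy sketch; score_fn(x, sigma), the number of steps, the base step size gamma, and the per-scale step-size rule are placeholder assumptions.

```python
import numpy as np

def annealed_langevin_sample(score_fn, shape, sigmas, n_steps=100, gamma=2e-5,
                             rng=np.random.default_rng()):
    """Run Langevin dynamics at each noise scale, from the largest scale down to the smallest."""
    x = rng.standard_normal(shape)                              # start from standard Gaussian noise
    sigma_min = min(sigmas)
    for sigma in sorted(sigmas, reverse=True):                  # largest noise scale first
        step = gamma * sigma ** 2 / sigma_min ** 2              # per-scale step size (assumed rule)
        for _ in range(n_steps):
            z = rng.standard_normal(shape)                      # fresh Gaussian noise
            x = x + 0.5 * step * score_fn(x, sigma) + np.sqrt(step) * z
    return x
```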
DDPM vs NCSN. Losses

DDPM: $\mathcal{L}_{simple} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\, z_t \sim \mathcal{N}(0,I)} \big\| z_\theta(x_t, t) - z_t \big\|_2^2$

NCSN: $\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \lambda(\sigma_t)\, \mathbb{E}_{x_0 \sim p(x_0),\, x_t \sim p_{\sigma_t}(x_t \mid x_0)} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x_0}{\sigma_t^2} \right\|_2^2$

• In the DDPM loss, the weighting function is missing because better sample quality is obtained when $\lambda$ is set to 1.

• In the NCSN loss, we can rewrite the noise as $z = \frac{x_t - x_0}{\sigma_t}$, so $s_\theta$ learns to approximate a scaled negative noise, $-\frac{z}{\sigma_t}$.


DDPM vs NCSN. Sampling
DDPM: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\hat{\beta}_t}}\, z_\theta(x_t, t)\right) + \sqrt{\beta_t}\cdot z$

• Iterative updates are based on subtracting some form of noise from the noisy image.

NCSN: $x_i = x_{i-1} + \frac{\gamma_t}{2}\, s_\theta(x_{i-1}, \sigma_t) + \sqrt{\gamma_t}\cdot z$

• The same holds for NCSN, because $s_\theta(x_t, \sigma_t)$ approximates the negative of the noise: comparing the two losses,

$\min_\theta\ \mathbb{E}\, \big\| z_t - z_\theta(x_t, t) \big\|_2^2 \quad \text{vs.} \quad \mathcal{L} = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E} \left\| s_\theta(x_t, \sigma_t) + \frac{x_t - x_0}{\sigma_t^2} \right\|_2^2, \qquad z = \frac{x_t - x_0}{\sigma_t}$

• Therefore, the generative processes defined by NCSN and DDPM are very similar.
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Stochastic Differential Equations (SDEs)

• A generalized framework that covers the previous two methods

• Here, the diffusion process is continuous in time and is given by an SDE

• It works on the same principle:

  o Gradually transform the data distribution p(x0) into noise

  o Reverse the process to recover the original data distribution


Stochastic Differential Equations (SDEs)

• The forward diffusion process is represented by the following SDE:

$\frac{\partial x}{\partial t} = f(x, t) + \sigma(t)\, \omega_t \quad \Longleftrightarrow \quad \partial x = f(x, t)\, \partial t + \sigma(t) \cdot \partial \omega$

  o $f(x, t)$ – drift coefficient: gradually nullifies the data $x_0$
  o $\sigma(t)$ – diffusion coefficient: controls how much Gaussian noise is added
  o $\partial \omega$ – white Gaussian noise, a notation for $\mathcal{N}(0, \partial t)$
Stochastic Differential Equations (SDEs)

• The reverse-time SDE is defined as:

$\partial x = \big[f(x, t) - \sigma(t)^2 \cdot \nabla_x \log p_t(x)\big]\, \partial t + \sigma(t) \cdot \partial \omega$

• The training objective is similar to NCSN, but adapted for continuous time:

$\mathcal{L}^* = \mathbb{E}_{t}\, \lambda(t)\, \mathbb{E}_{p(x_0)}\, \mathbb{E}_{p(x_t \mid x_0)} \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \big\|_2^2$

• The score function is used in the reverse-time SDE:

  o A neural network is employed to estimate the score function.
  o A numerical SDE solver is then used to generate samples.
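For intuition, a minimal Euler-Maruyama sketch of sampling with the reverse-time SDE; f, sigma, and score_fn are placeholder callables for the drift, the diffusion coefficient, and the learned score.

```python
import numpy as np

def reverse_sde_sample(score_fn, f, sigma, x_T, T=1.0, n_steps=1000,
                       rng=np.random.default_rng()):
    """Euler-Maruyama integration of dx = [f(x,t) - sigma(t)^2 * score(x,t)] dt + sigma(t) dw,
    run backwards from t = T to t = 0."""
    dt = T / n_steps
    x = x_T
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = f(x, t) - sigma(t) ** 2 * score_fn(x, t)
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x - drift * dt + sigma(t) * dw                      # minus sign: we step backwards in time
    return x
```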
Stochastic Differential Equations (SDEs). NCSN

• The forward process of NCSN:

$x_t \sim \mathcal{N}\!\left(x_t;\ x_{t-1},\ (\sigma_t^2 - \sigma_{t-1}^2) \cdot I\right) \;\Rightarrow\; x_t = x_{t-1} + \sqrt{\sigma_t^2 - \sigma_{t-1}^2}\cdot z$

• We can reformulate the above expression to look like a discretization of an SDE:

$x_t - x_{t-1} = \sqrt{\frac{\sigma_t^2 - \sigma_{t-1}^2}{t - (t-1)}}\cdot z$

• Translating the above discretization into the continuous case:

$\partial x = \sqrt{\frac{\partial \sigma^2(t)}{\partial t}}\ \partial \omega(t)$

which matches the general forward SDE $\partial x = f(x, t)\, \partial t + \sigma(t) \cdot \partial \omega$ with $f(x, t) = 0$ and $\sigma(t) = \sqrt{\frac{\partial \sigma^2(t)}{\partial t}}$.
Stochastic Differential Equations (SDEs). DDPM

• The forward process of DDPM:

$x_t \sim \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \;\Rightarrow\; x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\cdot z$

• If we consider a time step size $\Delta t = \frac{1}{T}$ instead of 1, and define $\beta(t)\Delta t = \beta_t$:

$x_t = \sqrt{1 - \beta(t)\Delta t}\ x_{t-\Delta t} + \sqrt{\beta(t)\Delta t}\cdot z$

• Using the Taylor expansion $\sqrt{1 - \beta(t)\Delta t} \approx 1 - \frac{\beta(t)\Delta t}{2}$:

$x_t \approx \left(1 - \frac{\beta(t)\Delta t}{2}\right) x_{t-\Delta t} + \sqrt{\beta(t)\Delta t}\cdot z$

$x_t \approx x_{t-\Delta t} - \frac{\beta(t)\Delta t}{2}\, x_{t-\Delta t} + \sqrt{\beta(t)\Delta t}\cdot z \;\Longleftrightarrow\; x_t - x_{t-\Delta t} = -\frac{\beta(t)\Delta t}{2}\, x_t + \sqrt{\beta(t)\Delta t}\cdot z$

• For the continuous case, the above becomes:

$\partial x = -\frac{1}{2}\, \beta(t)\, x\, \partial t + \sqrt{\beta(t)}\, \partial \omega(t)$
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Conditional generation.
Diffusion models estimate the score function $\nabla_{x_t} \log p_t(x_t)$ to sample from a distribution $p(x)$.
Sampling from $p(x \mid y)$ requires the score function of this conditional density, $\nabla_{x_t} \log p_t(x_t \mid y)$, where $y$ is the condition.

Solution 1. Conditional training: train the model with an additional input $y$ to estimate $\nabla_{x_t} \log p_t(x_t \mid y)$ directly:

$s_\theta(x_t, t, y) \approx \nabla_{x_t} \log p_t(x_t \mid y)$
Conditional generation. Classifier Guidance
Diffusion models estimate the score function $\nabla_{x_t} \log p_t(x_t)$ to sample from a distribution $p(x)$.
Sampling from $p(x \mid y)$ requires the score function of this conditional density, $\nabla_{x_t} \log p_t(x_t \mid y)$.

Solution 2. Classifier guidance:

Bayes rule:
$p_t(x_t \mid y) = \frac{p_t(y \mid x_t) \cdot p_t(x_t)}{p_t(y)} \Longleftrightarrow$
Logarithm:
$\log p_t(x_t \mid y) = \log p_t(y \mid x_t) + \log p_t(x_t) - \log p_t(y) \Longleftrightarrow$
Gradient (the last term vanishes, since $p_t(y)$ does not depend on $x_t$):
$\nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t) - \nabla_{x_t} \log p_t(y) \Longleftrightarrow$

$\nabla_{x_t} \log p_t(x_t \mid y) = \underbrace{\nabla_{x_t} \log p_t(y \mid x_t)}_{\text{classifier}} + \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\text{unconditional diffusion model}}$
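A sketch of how the two terms are combined at sampling time; uncond_score and classifier_log_prob are placeholder callables (the latter a noise-aware classifier returning $\log p_t(y \mid x_t)$).

```python
import torch

def classifier_guided_score(x_t, t, y, uncond_score, classifier_log_prob, s=1.0):
    """Guided score: s * grad_x log p_t(y | x_t) + grad_x log p_t(x_t)."""
    x_in = x_t.detach().requires_grad_(True)
    log_p = classifier_log_prob(x_in, t, y).sum()               # log p_t(y | x_t), summed over the batch
    grad_cls = torch.autograd.grad(log_p, x_in)[0]              # grad_x log p_t(y | x_t)
    return s * grad_cls + uncond_score(x_t, t)                  # classifier term + unconditional score
```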
Conditional generation. Classifier Guidance
Solution 2. Classifier guidance, with a guidance weight $s$:

$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)$

(Figure: generated samples with $s = 1$ vs. $s = 10$.)

Problems:
• Good gradient estimates are needed at each step of the denoising process.
• The classifier must be robust to the noise added to the image.
• This requires training the classifier on noisy data, which can be problematic.
Conditional generation. Classifier-free Guidance
Solution 3. Classifier-free guidance

Start from the classifier-guidance expression: $\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)$

Bayes rule:
$p_t(y \mid x_t) = \frac{p_t(x_t \mid y) \cdot p_t(y)}{p_t(x_t)}$

Logarithm:
$\log p_t(y \mid x_t) = \log p_t(x_t \mid y) - \log p_t(x_t) + \log p_t(y)$

Gradient (as before, $\nabla_{x_t} \log p_t(y) = 0$):
$\nabla_{x_t} \log p_t(y \mid x_t) = \nabla_{x_t} \log p_t(x_t \mid y) - \nabla_{x_t} \log p_t(x_t)$

Substituting into the classifier-guidance expression above:
$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \big(\nabla_{x_t} \log p_t(x_t \mid y) - \nabla_{x_t} \log p_t(x_t)\big) + \nabla_{x_t} \log p_t(x_t)$

$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(x_t \mid y) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$

Both terms on the right are learned by a single model.
Conditional generation. Classifier-free Guidance

The score network receives the condition $y$ as an extra input: $s_\theta(x_t, t, y) \approx \nabla_{x_t} \log p_t(x_t \mid y)$.

During training, the condition is randomly replaced with a null token $\varnothing$, so a single network provides both scores: $s_\theta(x_t, t, y/\varnothing) \approx \nabla_{x_t} \log p_t(x_t \mid y)$ or $\nabla_{x_t} \log p_t(x_t \mid \varnothing) = \nabla_{x_t} \log p_t(x_t)$.
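A sketch of the classifier-free guided score at sampling time; score_model, null_token, and the default guidance weight are placeholder assumptions.

```python
import torch

def cfg_score(x_t, t, y, score_model, null_token, s=3.0):
    """Classifier-free guidance: s * conditional score + (1 - s) * unconditional score."""
    cond = score_model(x_t, t, y)                               # approximates grad log p_t(x_t | y)
    uncond = score_model(x_t, t, null_token)                    # approximates grad log p_t(x_t)
    return s * cond + (1.0 - s) * uncond
```

At training time, y is replaced with null_token for a fraction of the samples, so the same network also learns the unconditional score.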
CLIP guidance
What is a CLIP model?

• An image encoder $f(\cdot)$ and a text encoder $g(\cdot)$ trained with a contrastive cross-entropy loss over image-text pairs

• At the optimum of this loss, the similarity $f(x) \cdot g(c)$ recovers, up to terms that do not depend on $x$, the ratio $\log \frac{p(x \mid c)}{p(x)}$; its gradient with respect to $x$ can therefore play the role of the classifier term in guidance

Slide from: Denoising Diffusion-based Generative Modeling: Foundations and Applications (Karsten Kreis, Ruiqi Gao, Arash Vahdat).

Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
CLIP guidance
Replace the classifier in classifier guidance with a CLIP model. Starting from the classifier-free guidance expression and using $\nabla_{x_t} \log p(c) = 0$:

$\nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(x_t \mid y) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$

$\nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \log p_t(x_t \mid c) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$

$\nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \big(\log p_t(x_t \mid c) - \log p(c)\big) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$

The CLIP similarity replaces the term in parentheses:

$\nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \big(f(x) \cdot g(c)\big) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)$

Slide from: Denoising Diffusion-based Generative Modeling: Foundations and Applications (Karsten Kreis, Ruiqi Gao, Arash Vahdat).

Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
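To connect the final formula above to code, a sketch of CLIP-guided scoring; score_model, clip_image_encoder, and the text embedding c_emb = g(c) are placeholder assumptions (a real implementation would use a noise-aware CLIP, as in GLIDE).

```python
import torch

def clip_guided_score(x_t, t, c_emb, score_model, clip_image_encoder, s=3.0):
    """Replace the classifier term with the gradient of the CLIP similarity f(x) . g(c)."""
    x_in = x_t.detach().requires_grad_(True)
    sim = (clip_image_encoder(x_in) * c_emb).sum()              # f(x_t) . g(c), summed over the batch
    grad_sim = torch.autograd.grad(sim, x_in)[0]                # grad_x (f(x) . g(c))
    return s * grad_sim + (1.0 - s) * score_model(x_t, t)       # matches the last formula above
```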
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Research directions
Unconditional image generation:
• Sampling efficiency
• Image quality
Conditional image generation:
• Text-to-image generation
Complex tasks in computer vision:
• Image editing, even based on text
• Super-resolution
• Image segmentation
• Anomaly detection in medical images
• Video generation
Thank you!

Survey: https://arxiv.org/abs/2209.04747
GitHub: https://github.com/CroitoruAlin/Diffusion-Models-in-Vision-A-Survey
