
SYS863-01

Deep Generative Modeling: Theory and Applications

Mohammadhadi (Hadi) Shateri


Email: [email protected]
Generative Models
Generative Model

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 2
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:


Consider a D-dimensional random variable X ∈ ℝ^D with probability distribution p(x) = p(x_1, x_2, …, x_D).
Using the chain rule of probability we can write:

p(x) = p(x_1, x_2, …, x_D) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) … p(x_D|x_{1:D-1}) = ∏_{d=1}^{D} p(x_d|x_{1:d-1})

where we define p(x_1|x_{1:0}) = p(x_1).

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 3
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:


Example: consider the binarized MNIST dataset, for which X ∈ {0,1}^784, or in other words X_d ∈ {0,1}, d = 1, 2, …, 784.

p(x) = p(x_1, x_2, …, x_28, x_29, …, x_784)

[Figure: two samples/realizations of the random variable X, shown as binarized MNIST digits.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 4
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

Example (cont.): the pixels are ordered starting from the top-left pixel, moving one pixel to the right each time until reaching the end of that row, then going to the next row and repeating to the end. This is called the raster scan order of an image.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 5-9
Generative Models
Autoregressive (AR)

Note: the raster scan order of an image is the ordering of pixels by rows (row by row, and pixel by pixel within each row); thus, considering the chain rule, every pixel depends on all the pixels above it and to its left.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 10
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

Consider a D-dimensional random variable X ∈ ℝ^D with probability distribution p(x) = p(x_1, x_2, …, x_D).
Using the chain rule of probability we can write:

p(x) = p(x_1, x_2, …, x_D) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) … p(x_D|x_{1:D-1}) = ∏_{d=1}^{D} p(x_d|x_{1:d-1})

where we define p(x_1|x_{1:0}) = p(x_1).

Autoregressive generative models aim to learn the probability distribution p(x) = p(x_1, x_2, …, x_D) by modeling each of the conditional probabilities in the chain rule p(x) = ∏_{d=1}^{D} p(x_d|x_{1:d-1}).

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 11
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

Example: we are given the binarized MNIST dataset (denoted by the random variable X ∈ {0,1}^784), which includes images of handwritten digits of size 28×28 where each pixel can either be 0 (black) or 1 (white).

p(x) = p(x_1, x_2, …, x_784) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) … p(x_d|x_1, x_2, …, x_{d-1}) … p(x_784|x_{1:783})

Now we need to model (effectively) each conditional probability.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 12
Generative Models
Autoregressive (AR)

Example: (Cont.)

p(x) = p(x_1, x_2, …, x_784) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) … p(x_d|x_1, x_2, …, x_{d-1}) … p(x_784|x_{1:783})

Each conditional distribution is modeled by a parametrized function, and then the given training data are used to learn the optimum values of the parameters.

We will use the approach presented in Brendan J. Frey, Graphical Models for Machine Learning and Digital Communication, MIT Press, 1998.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 13
Generative Models
Autoregressive (AR)

Example: (Cont.)

p(x) = p(x_1, x_2, …, x_784) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) … p(x_d|x_1, x_2, …, x_{d-1}) … p(x_784|x_{1:783})

p(x_1; w^(1)):                   p(x_1 = 1; w^(1)) = w^(1),   p(x_1 = 0; w^(1)) = 1 − w^(1),   with 0 ≤ w^(1) ≤ 1

p(x_2|x_1; w^(2)):               p(x_2 = 1|x_1; w^(2)) = σ(w_0^(2) + w_1^(2) x_1)

p(x_3|x_1, x_2; w^(3)):          p(x_3 = 1|x_1, x_2; w^(3)) = σ(w_0^(3) + w_1^(3) x_1 + w_2^(3) x_2)

p(x_d|x_1, …, x_{d-1}; w^(d)):   p(x_d = 1|x_{1:d-1}; w^(d)) = σ(w_0^(d) + Σ_{i=1}^{d-1} w_i^(d) x_i)

p(x_784|x_{1:783}; w^(784)):     p(x_784 = 1|x_{1:783}; w^(784)) = σ(w_0^(784) + Σ_{i=1}^{783} w_i^(784) x_i)

Sigmoid (logistic) function: σ(z) = 1 / (1 + e^{−z})

Note: Each conditional distribution is modeled by a parametrized function (here logistic regression), i.e., a logistic regression is used to predict the next pixel given all the previous pixels. This is what we mean by an autoregressive model.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 14-19
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Fully Visible Sigmoid Belief Network

Modelling these conditional distributions p(x) = p(x_1, x_2, …, x_D) = ∏_{d=1}^{D} p(x_d|x_{1:d-1}) using parametrized sigmoids is called a Fully Visible Sigmoid Belief Network (FVSBN).

[Figure: FVSBN as a directed model; each output x̂_d is computed from the inputs x_1, …, x_{d-1}.]

❑ How many parameters does the model have? Assume X ∈ {0,1}^D.

There is 1 parameter for p(x_1), 2 for p(x_2|x_1), …, and D parameters for p(x_D|x_{1:D-1}); thus #parameters = 1 + 2 + ⋯ + D = D(D + 1)/2.
Note: This is a function of D^2.
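For example, for binarized MNIST with D = 784 this gives 784 × 785 / 2 = 307,720 parameters, i.e., one (growing) logistic-regression model per pixel.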

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 20
Generative Models
Autoregressive (AR)

Example: Assume we are given binarized 2×2 images (denoted by the random variable X ∈ {0,1}^4), i.e., each pixel can either be 0 or 1. The probability distribution p(x) is modelled by parametrized sigmoid functions as below:

p(x) = p(x_1, x_2, x_3, x_4) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3)

p(x_1 = 1; w^(1)) = w^(1)
p(x_2 = 1|x_1; w^(2)) = σ(w_0^(2) + w_1^(2) x_1)
p(x_3 = 1|x_1, x_2; w^(3)) = σ(w_0^(3) + w_1^(3) x_1 + w_2^(3) x_2)
p(x_4 = 1|x_1, x_2, x_3; w^(4)) = σ(w_0^(4) + w_1^(4) x_1 + w_2^(4) x_2 + w_3^(4) x_3)

Q1: Find the likelihood for a given sample x = (x_1, x_2, x_3, x_4) = (1, 1, 0, 0).

Q2: Assume the optimum parameters are given as w^(1*) = 0.9, w^(2*) = [−0.5, 0.1], w^(3*) = [1, 0.5, −1], w^(4*) = [−2, 1.5, 0.5, 1].
How can a new sample x̂ = (x̂_1, x̂_2, x̂_3, x̂_4) be generated?
Note: in Python, to sample a random value from a Bernoulli distribution with parameter α we can use numpy.random.choice([0, 1], p=[1 − α, α]).

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 21
Generative Models
Autoregressive (AR)

Solution:

p(x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 0) = p(x_1 = 1) p(x_2 = 1|x_1 = 1) p(x_3 = 0|x_1 = 1, x_2 = 1) p(x_4 = 0|x_1 = 1, x_2 = 1, x_3 = 0)

= w^(1) · σ(w_0^(2) + w_1^(2)·1) · [1 − σ(w_0^(3) + w_1^(3)·1 + w_2^(3)·1)] · [1 − σ(w_0^(4) + w_1^(4)·1 + w_2^(4)·1 + w_3^(4)·0)]

Note: The optimum values of these parameters (w^(1), w^(2), w^(3), w^(4)) are found by maximizing the (log-)likelihood over the training data samples, which is equivalent to minimizing the negative (log-)likelihood.

• Sample x̂_1 ~ p(x_1; w^(1)) randomly (e.g., numpy.random.choice([0, 1], p=[1 − w^(1), w^(1)])).
• Sample x̂_2 ~ p(x_2|x_1 = x̂_1; w^(2)) randomly.
• Sample x̂_3 ~ p(x_3|x_1 = x̂_1, x_2 = x̂_2; w^(3)) randomly.
• Sample x̂_4 ~ p(x_4|x_1 = x̂_1, x_2 = x̂_2, x_3 = x̂_3; w^(4)) randomly.
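As a minimal Python/NumPy sketch of Q1 and Q2 (not from the original slides; variable names are illustrative and the sampling follows the numpy.random.choice note above):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Optimum parameters from Q2; the first entry of each weight vector is the bias w_0.
w1 = 0.9
w2 = np.array([-0.5, 0.1])
w3 = np.array([1.0, 0.5, -1.0])
w4 = np.array([-2.0, 1.5, 0.5, 1.0])

def p_one(w, prefix):
    # p(x_d = 1 | x_{1:d-1}) = sigma(w_0 + sum_i w_i x_i)
    return sigmoid(w[0] + np.dot(w[1:], prefix))

# Q1: likelihood of x = (1, 1, 0, 0).
x = np.array([1, 1, 0, 0])
p1 = [w1, p_one(w2, x[:1]), p_one(w3, x[:2]), p_one(w4, x[:3])]
likelihood = np.prod([p if xd == 1 else 1 - p for p, xd in zip(p1, x)])
print("p(1,1,0,0) =", likelihood)

# Q2: generate a new sample pixel by pixel (ancestral sampling).
x_hat = [np.random.choice([0, 1], p=[1 - w1, w1])]            # x_hat_1 ~ p(x_1)
for w in [w2, w3, w4]:
    alpha = p_one(w, np.array(x_hat))                         # p(x_d = 1 | sampled prefix)
    x_hat.append(np.random.choice([0, 1], p=[1 - alpha, alpha]))
print("sampled 2x2 image:", x_hat)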

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 22
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Fully Visible Sigmoid Belief Network

Here is a result of an FVSBN trained on the Caltech 101 Silhouettes dataset. The result is from Gan et al., PMLR 2015.

From Figure 4 of article Gan et al., PMLR 2015: Training data (on left) and synthesized samples (on right).
MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 23
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)

The FVSBN can be improved by using a neural network; in addition, weight sharing can be used. The resulting model is called the Neural Autoregressive Density Estimator, or NADE (Larochelle et al., JMLR 2011).

p(x) = p(x_1, x_2, …, x_D) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) … p(x_D|x_{1:D-1}) = ∏_{d=1}^{D} p(x_d|x_{1:d-1})

[Figure: a neural network takes x = (x_1, x_2, x_3, …, x_D) as input and returns p(x_d|x_{1:d-1}) for each d; the d-th output should be a function of only the previous d − 1 inputs.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 24
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)


The FVSBN can be improved by using a neural network; in addition, weight sharing can be used. The resulting model is called the Neural Autoregressive Density Estimator, or NADE (Larochelle et al., JMLR 2011).

Let's see how a simple NADE can be formulated. The goal is to use a neural network to model all of the conditional probabilities in p(x) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) … p(x_D|x_{1:D-1}). This means we need a neural network that receives D inputs x_1, x_2, …, x_D and returns D outputs (each corresponding to one of the terms in the joint factorization). BUT we need to be careful, since the d-th output should be a function of only the previous d − 1 inputs (why? because the d-th term in the factorization is p(x_d|x_{1:d-1})). The next thing we should be careful about is that we cannot simply use a fully connected layer. Why? Because in a fully connected layer every node in the hidden layer sees all the inputs, which means all the nodes in the subsequent layers, up to the output layer, are functions of all the inputs. We must avoid this. BUT how? NADE has a solution for that …

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 25
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)

[Figure: the input x is masked (Hadamard/element-wise product with the mask m_d = (1, 1, …, 1, 0, …, 0), whose first d − 1 entries are 1), passed through a shared weight matrix W to give the hidden representation h_d, and then through V_d to produce p(x_d|x_{1:d-1}).]

h_d = σ(W(x ⊙ m_d) + b)

p(x_d|x_{1:d-1}) = σ(V_d^T h_d + c_d)

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 26
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)

First, associated to each output, a hidden representation is computed using only the relevant inputs (e.g., by masking the inputs):

h_d = σ(W(x ⊙ m_d) + b)

Then, the outputs are computed as follows:

p(x_d|x_{1:d-1}) = σ(V_d^T h_d + c_d)

where
• W is an M × D matrix (shared across all outputs),
• x = (x_1, …, x_D)^T is a D × 1 vector,
• m_d is a D × 1 mask vector whose first (d − 1) elements are all 1 and the rest are 0,
• b is an M × 1 vector (shared),
• V_d is an M × 1 vector, and c_d is a scalar.

[Figure: NADE architecture with M hidden units h_1, …, h_D and outputs p(x_1), p(x_2|x_1), p(x_3|x_1, x_2), …, p(x_D|x_{1:D-1}) computed through V_1, …, V_D.]

Note: The number of parameters is a function of DM and not D^2!
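A minimal NumPy sketch of this computation (shapes and names are illustrative; the original NADE also reuses partial sums across d for efficiency, which is omitted here):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def nade_conditionals(x, W, b, V, c):
    # Returns p(x_d = 1 | x_{1:d-1}) for d = 1..D.
    # W: (M, D) shared matrix, b: (M,) shared bias, V: (D, M) with row d = V_d^T, c: (D,).
    D = x.shape[0]
    probs = np.empty(D)
    for d in range(1, D + 1):
        m_d = np.zeros(D)
        m_d[: d - 1] = 1.0                        # mask keeps only x_1, ..., x_{d-1}
        h_d = sigmoid(W @ (x * m_d) + b)          # hidden representation for output d
        probs[d - 1] = sigmoid(V[d - 1] @ h_d + c[d - 1])
    return probs

# Toy usage with D = 4 pixels and M = 3 hidden units.
rng = np.random.default_rng(0)
D, M = 4, 3
W, b = rng.normal(size=(M, D)), rng.normal(size=M)
V, c = rng.normal(size=(D, M)), rng.normal(size=D)
x = np.array([1.0, 0.0, 1.0, 1.0])
print(nade_conditionals(x, W, b, V, c))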

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 27
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)

How can NADE be trained? Assume a set of N training data samples S = {x^(i)}_{i=1}^{N} is given.

Training is done by maximum likelihood, or equivalently by minimizing the negative log-likelihood (NLL):

L(θ) = E_X[NLL] = E_X[−log p(X)] = (1/N) Σ_{i=1}^{N} −log p(x^(i)) = (1/N) Σ_{i=1}^{N} Σ_{d=1}^{D} −log p(x^(i)_d | x^(i)_{1:d-1})

where θ = {W, b, V_1, c_1, V_2, c_2, …, V_D, c_D} includes the parameters of the NADE that we aim to learn.
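Continuing the NumPy sketch from the previous slide, the per-sample NLL and the averaged loss over a small batch could look like this (again only a sketch with illustrative names, reusing nade_conditionals, rng, and the toy parameters defined above):

import numpy as np

def nade_nll(x, W, b, V, c):
    # Negative log-likelihood of one binary sample under the NADE sketch above.
    p1 = nade_conditionals(x, W, b, V, c)         # p(x_d = 1 | x_{1:d-1})
    p_obs = np.where(x == 1, p1, 1.0 - p1)        # probability assigned to the observed x_d
    return -np.sum(np.log(p_obs))

# Average NLL over a toy batch X of shape (N, D).
X = rng.integers(0, 2, size=(5, D)).astype(float)
loss = np.mean([nade_nll(xi, W, b, V, c) for xi in X])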

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 28
Generative Models
Autoregressive (AR)

Example: Consider a simple case where X ∈ {0,1}^3, i.e., binary sequences of length D = 3. Compute the loss function of the NADE for a training instance x^(1) = (x^(1)_1, x^(1)_2, x^(1)_3) = (1, 0, 0).

L(θ) = E_X[NLL] = E_X[−log p(X)] = (1/N) Σ_{i=1}^{N} −log p(x^(i)) = (1/N) Σ_{i=1}^{N} Σ_{d=1}^{D} −log p(x^(i)_d | x^(i)_{1:d-1})

Solution:

L(θ)|_{x=x^(1)} = Σ_{d=1}^{3} −log p(x^(1)_d | x^(1)_{1:d-1}) = −log p(x^(1)_1 = 1) − log p(x^(1)_2 = 0 | x^(1)_1) − log p(x^(1)_3 = 0 | x^(1)_1, x^(1)_2)

= −log σ(V_1^T h_1 + c_1) − log[1 − σ(V_2^T h_2 + c_2)] − log[1 − σ(V_3^T h_3 + c_3)]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 29
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)

Looking more closely at the loss function of the NADE …

L(θ) = E_X[NLL] = E_X[−log p(X)] = (1/N) Σ_{i=1}^{N} −log p(x^(i)) = (1/N) Σ_{i=1}^{N} Σ_{d=1}^{D} −log p(x^(i)_d | x^(i)_{1:d-1})      (each term is the cross-entropy calculated for pixel d)

Note that this is nothing but the (averaged) summation of cross-entropies. In other words, a cross-entropy is applied to each output (which is binary), and then the (averaged) summation of these cross-entropies is computed.

For example, assume that for a given training instance we have x_3 = 0, which means the true probability distribution (at the third output) is p = (p(x_3 = 0|x_1, x_2), p(x_3 = 1|x_1, x_2)) = (1, 0), while the predicted one is q = (1 − σ(V_3^T h_3 + c_3), σ(V_3^T h_3 + c_3)). So the cross-entropy would be:

CE(p, q) = −1 × log[1 − σ(V_3^T h_3 + c_3)] − 0 × log σ(V_3^T h_3 + c_3)

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 30
Generative Models
Autoregressive (AR)

Example: Consider a simple case where X ∈ {0,1}^3, i.e., binary sequences of length D = 3. Compute the loss function of the NADE for a training instance x^(1) = (x^(1)_1, x^(1)_2, x^(1)_3) = (1, 0, 0).

L(θ) = E_X[NLL] = E_X[−log p(X)] = (1/N) Σ_{i=1}^{N} −log p(x^(i)) = (1/N) Σ_{i=1}^{N} Σ_{d=1}^{D} −log p(x^(i)_d | x^(i)_{1:d-1})

Method 1: (negative log-likelihood)

L(θ)|_{x=x^(1)} = Σ_{d=1}^{3} −log p(x^(1)_d | x^(1)_{1:d-1}) = −log p(x^(1)_1 = 1) − log p(x^(1)_2 = 0 | x^(1)_1) − log p(x^(1)_3 = 0 | x^(1)_1, x^(1)_2)

= −log σ(V_1^T h_1 + c_1) − log[1 − σ(V_2^T h_2 + c_2)] − log[1 − σ(V_3^T h_3 + c_3)]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 31
Generative Models
Autoregressive (AR)

Example: Consider a simple case where X ∈ {0,1}^3, i.e., binary sequences of length D = 3. Compute the loss function of the NADE for a training instance x^(1) = (x^(1)_1, x^(1)_2, x^(1)_3) = (1, 0, 0).

L(θ) = E_X[NLL] = E_X[−log p(X)] = (1/N) Σ_{i=1}^{N} −log p(x^(i)) = (1/N) Σ_{i=1}^{N} Σ_{d=1}^{D} −log p(x^(i)_d | x^(i)_{1:d-1})

Method 2: (cross-entropy)

x^(1)_1 = 1 → p = (0, 1), q = (1 − σ(V_1^T h_1 + c_1), σ(V_1^T h_1 + c_1)) → CE_1(p, q) = −log σ(V_1^T h_1 + c_1)

x^(1)_2 = 0 → p = (1, 0), q = (1 − σ(V_2^T h_2 + c_2), σ(V_2^T h_2 + c_2)) → CE_2(p, q) = −log[1 − σ(V_2^T h_2 + c_2)]

x^(1)_3 = 0 → p = (1, 0), q = (1 − σ(V_3^T h_3 + c_3), σ(V_3^T h_3 + c_3)) → CE_3(p, q) = −log[1 − σ(V_3^T h_3 + c_3)]

⇒ L(θ)|_{x=x^(1)} = CE_1(p, q) + CE_2(p, q) + CE_3(p, q)

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 32
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)

To summarize, for each d = 1, 2, …, D:

h_d = σ(W(x ⊙ m_d) + b)                  (W is an M × D matrix; b and h_d are M × 1 vectors)
p(x_d|x_{1:d-1}) = σ(V_d^T h_d + c_d)

L(θ) = E_X[NLL] = E_X[−log p(X)] = E_X[ Σ_{d=1}^{D} −log p(X_d|X_{1:d-1}) ]      (each term is the cross-entropy calculated for pixel d)

[Figure: NADE with M hidden units; the outputs p(x_1), p(x_2|x_1), p(x_3|x_1, x_2), …, p(x_D|x_{1:D-1}) are computed through V_1, …, V_D.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 33
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Neural Autoregressive Density Estimator (NADE)


Here is a result of the Neural Autoregressive Density Estimator (NADE) from Larochelle et al., JMLR 2011, where a hidden layer of size 500 (M = 500) is used.

From Figure 4 of Larochelle et al., JMLR 2011: result of NADE with one hidden layer of size 500 with sigmoid activation on the binarized MNIST dataset. (Left) generated data and (right) training data. See more details and more examples in the original article.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 34
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

So far, we have seen the Fully Visible Sigmoid Belief Network (FVSBN) and the Neural Autoregressive Density Estimator (NADE) for modeling p(x) = p(x_1, x_2, …, x_D), the probability distribution of D-dimensional data X ∈ {0,1}^D.

Question 1: Can we apply these models to non-binary discrete random variables? For example X ∈ {0, 1, 2, …, 255}^D.
- Yes, we can simply model this multinomial distribution using a Softmax activation at the output (instead of the Sigmoid function).

For each d = 1, 2, …, D:

h_d = σ(W(x ⊙ m_d) + b)                         (W is an M × D matrix)
p(x_d|x_{1:d-1}) = Softmax(V_d^T h_d + c_d)     (V_d is now an M × 256 matrix and c_d a 256-dimensional vector)

L(θ) = E_X[NLL] = E_X[−log p(X)] = E_X[ Σ_{d=1}^{D} −log p(X_d|X_{1:d-1}) ]      (cross-entropy calculated for pixel d)

[Figure: the same NADE architecture with M hidden units, now with a 256-way Softmax at each output.]
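A small sketch of the 256-way output (illustrative only; h_d comes from the same masked hidden layer as before):

import numpy as np

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# With pixel values in {0, ..., 255}, V_d is (M, 256) and c_d has length 256, so
# p(x_d = k | x_{1:d-1}) = softmax(V_d.T @ h_d + c_d)[k].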
MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 35
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

Question 2: What about continuous random variables? e.g., speech signals, where X ∈ ℝ^D.
- In this case, we can assume p(x_d|x_{1:d-1}) is a mixture of K Gaussians, i.e., p(x_d|x_{1:d-1}) = Σ_{j=1}^{K} π_{d,j} N(x_d; μ_{d,j}, σ²_{d,j}), where the parameters π_{d,j}, μ_{d,j}, σ²_{d,j} are computed from x_{1:d-1} by a neural network. This model is called the Real-valued Neural Autoregressive Density Estimator (RNADE). For more details please see the article by Uria et al., NeurIPS 2013.

For each d = 1, 2, …, D:

h_d = σ(W(x ⊙ m_d) + b)                                   (W is an M × D matrix)
π_d = softmax(V_πd^T h_d + c_πd)
μ_d = V_μd^T h_d + c_μd                                   (V_πd, V_μd, and V_σd are M × K)
log σ_d = V_σd^T h_d + c_σd                               (c_πd, c_μd, and c_σd are K × 1)

p(x_d|x_{1:d-1}) = Σ_{j=1}^{K} π_{d,j} N(x_d; μ_{d,j}, σ²_{d,j})

L(θ) = E_X[NLL] = E_X[−log p(X)] = E_X[ Σ_{d=1}^{D} −log p(X_d|X_{1:d-1}) ]

[Figure: NADE-style architecture with M hidden units; for each d the network outputs the mixture parameters π_d, μ_d, σ_d through V_πd, V_μd, V_σd.]
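A minimal sketch of one RNADE conditional, assuming the hidden representation h_d has already been computed as above (shapes and names are illustrative):

import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def rnade_conditional(x_d, h_d, V_pi, c_pi, V_mu, c_mu, V_sigma, c_sigma):
    # p(x_d | x_{1:d-1}) as a K-component Gaussian mixture; V_* are (M, K), c_* are (K,).
    logits = V_pi.T @ h_d + c_pi
    pi = np.exp(logits - logits.max())
    pi = pi / pi.sum()                               # mixture weights (softmax)
    mu = V_mu.T @ h_d + c_mu                         # component means
    sigma = np.exp(V_sigma.T @ h_d + c_sigma)        # component std devs (log-parametrized)
    return np.sum(pi * gaussian_pdf(x_d, mu, sigma ** 2))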

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 36
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Masked Autoencoder for Distribution Estimation (MADE)

On the surface, both the Fully Visible Sigmoid Belief Network (FVSBN) and the Neural Autoregressive Density Estimator (NADE) look very similar to an autoencoder (going from D inputs to D outputs). In the work by Germain et al., ICML 2015, it was shown how autoencoder neural networks can be modified to enable generation. The main idea involves masking the autoencoder parameters to enforce the autoregressive constraints. The resulting model is named the Masked Autoencoder for Distribution Estimation, or MADE.

Plain autoencoder:                          MADE (⊙ is the Hadamard/element-wise product, M^(·) are binary masks):
h^[1] = f^[1](b^[1] + W^[1] x)              h^[1] = f^[1](b^[1] + (W^[1] ⊙ M^(1)) x)
h^[2] = f^[2](b^[2] + W^[2] h^[1])          h^[2] = f^[2](b^[2] + (W^[2] ⊙ M^(2)) h^[1])
x̂ = σ(c + V h^[2])                          x̂ = σ(c + (V ⊙ M^V) h^[2]), with outputs p(x_1), p(x_2|x_1), p(x_3|x_1, x_2)

[Figure: a three-input autoencoder with outputs x̂_1, x̂_2, x̂_3 (left) and its masked MADE counterpart (right), with hidden-unit labels 2, 1, 1, 2 in the first hidden layer and 1, 1, 2, 2 in the second.]
MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 37
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Masked Autoencoder for Distribution Estimation (MADE)

The main question is how the masks can be formed to ensure the autoregressive constraints. Below this is discussed based on the work by Germain et al., ICML 2015.

o Assume an ordering of the D-dimensional data x, e.g., x_2, x_1, x_3, …, and keep it for both the input and output nodes (in the example below: x_1, x_2, x_3).
o For the hidden layers, assign each node a number between 1 and D − 1.
o Each node at the output can be connected to the nodes at the previous layer with a (strictly) smaller node number.
o Each node at the hidden layers can be connected to the nodes at the previous layer with a smaller or equal node number.

For the network shown (input/output labels 1, 2, 3; first hidden layer labels 2, 1, 1, 2; second hidden layer labels 1, 1, 2, 2), the resulting masks are:

M^(1) = [1 1 0; 1 0 0; 1 0 0; 1 1 0],   M^(2) = [0 1 1 0; 0 1 1 0; 1 1 1 1; 1 1 1 1],   M^V = [0 0 0 0; 1 1 0 0; 1 1 1 1]

h^[1] = f^[1](b^[1] + (W^[1] ⊙ M^(1)) x),   h^[2] = f^[2](b^[2] + (W^[2] ⊙ M^(2)) h^[1]),   x̂ = σ(c + (V ⊙ M^V) h^[2])

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 38
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Masked Autoencoder for Distribution Estimation (MADE)

➢ Training is done as for a fully connected network, except that the masks are applied (to ensure the autoregressive constraints).
➢ The loss function is the same as before for the Neural Autoregressive Density Estimator (NADE).

M^(1) = [1 1 0; 1 0 0; 1 0 0; 1 1 0],   M^(2) = [0 1 1 0; 0 1 1 0; 1 1 1 1; 1 1 1 1],   M^V = [0 0 0 0; 1 1 0 0; 1 1 1 1]

[Figure: the unmasked autoencoder with outputs x̂_1, x̂_2, x̂_3 (left) and the masked MADE network with outputs p(x_1), p(x_2|x_1), p(x_3|x_1, x_2) (right); applying the masks W^[l] ⊙ M^(l) is equivalent to removing the forbidden connections.]
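A minimal sketch of this mask construction for the natural ordering x_1, …, x_D (the hidden-unit numbers are drawn at random here, which is one possible choice; names and shapes are illustrative, with the convention mask[out, in]):

import numpy as np

def made_masks(D, hidden_sizes, rng=np.random.default_rng(0)):
    # Label inputs/outputs 1..D and each hidden unit with a number in {1, ..., D-1}.
    labels = [np.arange(1, D + 1)]
    for H in hidden_sizes:
        labels.append(rng.integers(1, D, size=H))
    labels.append(np.arange(1, D + 1))

    masks = []
    for layer in range(1, len(labels)):
        prev, cur = labels[layer - 1], labels[layer]
        if layer == len(labels) - 1:
            masks.append((cur[:, None] > prev[None, :]).astype(float))   # output: strictly smaller label
        else:
            masks.append((cur[:, None] >= prev[None, :]).astype(float))  # hidden: smaller or equal label
    return masks

M1, M2, MV = made_masks(D=3, hidden_sizes=[4, 4])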

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 39
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Masked Autoencoder for Distribution Estimation (MADE)

In the masked autoencoder for distribution estimation (MADE), generation can be done (similar to before) as follows:

1. Sample a value of x_1 from p(x_1).
2. Feed x_1 to the network and compute p(x_2|x_1). Then sample x_2 from p(x_2|x_1).
3. Feed x_1 and x_2 to the network and compute p(x_3|x_1, x_2). Then sample x_3 from p(x_3|x_1, x_2).

[Figure: the same masked network evaluated three times, once per sampling step.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 40
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Let's turn to another modeling approach, based on CNNs.
Can we do this sequential modeling using convolutional layers (Conv 1D)?

RECAP: a 1D convolution is an operation applied to a 1D input where each output is a weighted sum of the nearby inputs.

Let's assume a 1D input x = [x_1, x_2, x_3, x_4, x_5, x_6, x_7] and a 1D convolution filter of size three, w = [w_1, w_2, w_3]^T.

[Figure: the filter slides over the input; with filter/kernel size 3, the first output is z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3.]

Note that normally there is a bias term (i.e., z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3 + b), which for simplicity is ignored here.
MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 41
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: (RECAP, cont.)

Sliding the filter one position to the right gives the next output, z_2 = w_1 x_2 + w_2 x_3 + w_3 x_4, and so on along the input.
MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 42
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: Let's turn to another modeling approach, based on CNNs.
Can we do this sequential modeling using convolutional layers (Conv 1D)?

Yes, we can, but we need to ensure the operation is done in a causal manner. Why? This is the condition enforced by the autoregressive factorization of the data probability distribution:

p(x) = p(x_1, x_2, …, x_D) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) … p(x_D|x_{1:D-1}) = ∏_{d=1}^{D} p(x_d|x_{1:d-1})

The prediction of pixel/timestep d (made from x_{1:d-1}) cannot depend on the current or future pixels/timesteps x_d, x_{d+1}, …, x_D.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 43
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

What is a causal 1D convolution (Causal Conv 1D)? A regular Conv 1D with proper padding (and masking).

[Figure: a regular Conv 1D with filter/kernel size 2 (top) versus a causal Conv 1D (bottom); in the causal version the input is left-padded with zeros, 0 0 x_1 x_2 x_3 x_4 x_5, so that each output x̂_t at t = 1, …, 5 only sees inputs from before time t.]

Q1. If x_5 is not used at the input, then where are we using it?

Q2. In this model the receptive field is 2; in other words, the output at timestep t depends on the inputs at t − 1 and t − 2. How can we increase the receptive field to have a better model?
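A minimal NumPy sketch of the causal Conv 1D shown above (kernel size 2, left zero-padding; names are illustrative):

import numpy as np

def causal_conv1d(x, w):
    # Strictly causal 1D convolution as in the figure: the output x_hat_t is a weighted
    # sum of the k previous inputs x_{t-k}, ..., x_{t-1} only, never of x_t itself.
    k = len(w)
    x_padded = np.concatenate([np.zeros(k), x])           # left-pad with k zeros
    return np.array([np.dot(w, x_padded[t:t + k]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25])                                 # filter/kernel size 2
print(causal_conv1d(x, w))                                # x_hat_t depends on x_{t-2} and x_{t-1}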

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 44
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

What is a causal 1D convolution (Causal Conv 1D)? A regular Conv 1D with proper padding (and masking).

[Figure: a second causal Conv 1D layer (filter/kernel size 2) stacked on top of the first; the left-padded input 0 0 x_1 x_2 x_3 x_4 x_5 feeds the first layer, whose (again zero-padded) outputs feed the second layer producing x̂_1, …, x̂_5 at t = 1, …, 5.]

With this second layer the receptive field is now three!

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 45-46
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

Adding more layers (and/or increasing the filter size) increases the receptive field: for L layers with the same kernel size k, the receptive field is L(k − 1) + 1 (i.e., #layers + 1 for kernel size 2). But this increases the computational cost. Is there another solution for increasing the receptive field?

[Figure: a stack of non-dilated causal Conv 1D layers (input, three hidden layers, output x̂_T), after Oord et al., 2016; the output at time T only sees a narrow window of past inputs ending at x_{T-1}.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 47
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

One can use dilated convolutions to increase the receptive field of the convolution.

[Figure: a stack of dilated causal convolutions with dilations 1, 2, 4, 8 from the input (up to x_{T-1}) through three hidden layers to the output x̂_T, after Oord et al., 2016.]
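For example, with kernel size 2 and dilations 1, 2, 4, 8 as in the figure, the receptive field is 1 + (1 + 2 + 4 + 8) = 16 timesteps, whereas four non-dilated layers of the same kernel size only reach 5.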

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 48
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

A Softmax activation is used at the output to model the conditional distribution p(x_t|x_{1:t-1}) (even for audio, Oord et al., 2016).

[Figure: the same dilated causal convolution stack (dilations 1, 2, 4, 8) with input up to x_{t-1} and output p(x_t|x_{1:t-1}), after Oord et al., 2016.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 49
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

Two additional tricks improve the strength of the causal Conv 1D (Oord et al., 2016):

➢ Gated activation unit (with learnable filter weights W_{f,k} and W_{g,k} for layer k):

z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)

[Figure: the input x to layer k passes through two causal Conv 1D layers whose outputs are combined by the tanh/σ gate.]
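A small sketch of the gated activation unit, reusing the causal_conv1d sketch from earlier (the filter values here are illustrative):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gated_activation(x, w_f, w_g, conv):
    # z = tanh(W_f * x) ⊙ sigma(W_g * x), where '*' is a causal convolution
    # and ⊙ is the element-wise (Hadamard) product.
    return np.tanh(conv(x, w_f)) * sigmoid(conv(x, w_g))

# Example: z = gated_activation(x, np.array([0.5, 0.25]), np.array([0.1, -0.3]), causal_conv1d)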

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 50
Generative Models
Autoregressive (AR)

Autoregressive Generative Models:

Two additional tricks improve the strength of the causal Conv 1D (Oord et al., 2016):

➢ Skip connections (residual blocks, He et al., 2015): a residual/skip/shortcut connection adds the input of layer k to its gated output.

z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)

[Figure: the input x to layer k passes through the gated causal Conv 1D block, and the result is summed with x to form the output of the layer.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 51
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: WaveNet

WaveNet was developed by DeepMind (Oord et al., 2016) as a deep generative model for audio and other sequential data. Its strength lies in its ability to generate high-quality, realistic audio and its potential for various applications, such as speech synthesis, music generation, and audio effects processing:

o Changing speaker gender

o Adding emotion, accent, etc.

o Generating music (trained on a dataset of classical piano music)

https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 52
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: PixelCNN

The idea of causal one-dimensional convolution (causal Conv 1D) can be extended to Conv 2D by masking the 2D filter weights (to ensure causality). This leads to causal Conv 2D, or masked convolution (Oord et al., NeurIPS 2016).

[Figure: a 3×3 kernel [w_11 … w_33] applied to the input image is multiplied element-wise (⊙) by the mask [1 1 1; 1 0 0; 0 0 0], so the feature-map value x̂_d at pixel x_d only sees the pixels above it and to its left.]

Note that masked convolution was introduced a few months before the causal Conv 1D of WaveNet (by the same authors), so it is more precise to say that causal Conv 1D was introduced based on masked convolution.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 53
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: PixelCNN

PixelCNN, proposed by Oord et al., NeurIPS 2016, is an autoregressive generative model for image data: it generates images pixel by pixel, conditioning on the previously generated pixels. The key idea behind PixelCNN is to model the conditional probability distribution of each pixel given the previously generated pixels (in its receptive field). This is achieved by using masked convolutions.

At the output layer, a Softmax gives a multinomial distribution over the 256 possible pixel values {0, …, 255}.

Mask type I:  [1 1 1; 1 0 0; 0 0 0]  is applied only to the first convolutional layer (the centre pixel itself is excluded) to ensure the autoregressive constraints.
Mask type II: [1 1 1; 1 1 0; 0 0 0]  is applied to all the subsequent convolutional layers (the centre is allowed, since after the first layer it already carries information from previous pixels only).

[Figure: input → first hidden layer (mask type I) → further hidden layers (mask type II) → output layer with a 256-way Softmax per pixel.]
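A minimal sketch that builds the two masks shown above for an odd kernel size (type I excludes the centre pixel, type II allows it; names are illustrative):

import numpy as np

def pixelcnn_mask(kernel_size, mask_type):
    # Binary mask for a (kernel_size x kernel_size) kernel in raster-scan order.
    k = kernel_size
    mask = np.ones((k, k))
    centre = k // 2
    start = centre + (1 if mask_type == "II" else 0)   # type I also blocks the centre itself
    mask[centre, start:] = 0                           # block the centre row from 'start' onwards
    mask[centre + 1:, :] = 0                           # block all rows below the centre
    return mask

print(pixelcnn_mask(3, "I"))    # [[1 1 1], [1 0 0], [0 0 0]]
print(pixelcnn_mask(3, "II"))   # [[1 1 1], [1 1 0], [0 0 0]]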

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 54
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: PixelCNN

Note: The receptive field can be increased by adding more hidden layers. However, due to the masked convolution, this can lead to blind spots, which means some previous pixels never contribute to the output pixel.

In other words, there is a blind spot for the considered output pixel, and it remains no matter how many layers we add. In the small example from the slides, the network ends up modeling p(x_13 | x_{1:9}, x_{11:12}) and not p(x_13 | x_{1:12})!

[Figure: blind spots (marked ×) above and to the right of the target pixel when modeling position t with a 3×3 kernel.]

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 55
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: PixelCNN

Note 1: To prevent blind spots, each convolution is split into a vertical and a horizontal stack, where the vertical stack looks at the rows above the target pixel while the horizontal stack looks at the pixels to the left of the target pixel in the same row. For more details see the original article, Oord et al., NeurIPS 2016.

Note 2: In implementations of PixelCNN, a gated non-linearity/activation (rooted in LSTM gating) is used to improve the performance. In addition, to improve gradient flow and speed up training, residual connections are often used in PixelCNN models. These connections bypass one or more layers in the model and allow gradients to flow directly to earlier layers.

Note 3: In the case of colored images, each pixel's color channels are modeled sequentially, with the B channel conditioned on (R, G) and the G channel conditioned solely on R. Thus, PixelCNN takes images of size N × N × 3 at its input and returns as output predictions of size N × N × 3 × 256.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 56
Generative Models
Autoregressive (AR)

Autoregressive Generative Models: PixelCNN

Note: There are various extensions of PixelCNN, such as PixelCNN++ (Salimans et al., ICLR 2017) and PixelRNN (Oord et al., ICML 2016), which are not covered in this course.

MH. Shateri, SYS863-01: Deep Generative Modeling: Theory and Applications (Summer 2023) 57
