Autoregressive Models
Note: the raster-scan order of an image orders its pixels by rows (row by row, and pixel by pixel within each row); thus, under the chain-rule factorization, every pixel depends on all the pixels above it and to its left.
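As a quick illustration (not from the slides; the array values are just pixel indices), here is how a 2×2 image is flattened in raster-scan order and what the conditioning set of each pixel looks like under the chain rule:

```python
import numpy as np

H, W = 2, 2
image = np.arange(1, H * W + 1).reshape(H, W)  # pixel indices 1..4
flat = image.reshape(-1)                       # raster-scan (row-major) flattening

for d in range(1, len(flat) + 1):
    context = flat[: d - 1]                    # pixels x_1, ..., x_{d-1}
    print(f"p(x_{d} | x_{{1:{d-1}}}) conditions on pixels {list(context)}")
```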
Example: (Cont.)
Each conditional distribution is modeled by a parametrized function, and the given training data are then used to learn the optimal values of the parameters.
We will use the approach presented in Brendan J. Frey, Graphical Models for Machine Learning and Digital Communication, MIT Press, 1998.
Example: (Cont.)
p(x_2 | x_1; w^(2)):   p(x_2 = 1 | x_1; w^(2)) = σ(w_0^(2) + w_1^(2) x_1)
Note: each conditional distribution is modeled by a parametrized function (here, logistic regression); that is, a logistic regression is used to predict the next pixel given all the previous pixels. This is what we mean by an autoregressive model.
Sigmoid (logistic) function: σ(z) = 1 / (1 + e^(−z))
[Figure: the four pixels x_1, x_2, x_3, x_4 of a 2×2 image in raster-scan order.]
Example: Assume we are given binarized 2×2 images, represented by a random variable X ∈ {0,1}^4, i.e., each pixel is either 0 or 1. The probability distribution p(x) is modeled by parameterized Sigmoid functions as follows:
p(x_1 = 1; w^(1)) = w^(1)
p(x_2 = 1 | x_1; w^(2)) = σ(w_0^(2) + w_1^(2) x_1)
p(x_3 = 1 | x_1, x_2; w^(3)) = σ(w_0^(3) + w_1^(3) x_1 + w_2^(3) x_2)
p(x_4 = 1 | x_1, x_2, x_3; w^(4)) = σ(w_0^(4) + w_1^(4) x_1 + w_2^(4) x_2 + w_3^(4) x_3)
Solution:
Note: the optimal values of these parameters (w^(1), w^(2), w^(3), w^(4)) are found by maximizing the (log-)likelihood over the training data samples, which is equivalent to minimizing the negative log-likelihood.
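To make the optimization target concrete, here is a minimal sketch of the negative log-likelihood of one 2×2 image under the FVSBN conditionals above; the parameter values are illustrative placeholders, not learned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters: w1 is a scalar probability; w2, w3, w4 are weight
# vectors (bias first) for the conditionals p(x_d = 1 | x_1:d-1).
w1 = 0.5
w2 = np.array([0.0, 0.1])
w3 = np.array([0.0, 0.1, -0.2])
w4 = np.array([0.0, 0.1, -0.2, 0.3])

def neg_log_likelihood(x):
    """x is a length-4 binary vector (pixels in raster-scan order)."""
    p = [w1,
         sigmoid(w2[0] + w2[1] * x[0]),
         sigmoid(w3[0] + w3[1] * x[0] + w3[2] * x[1]),
         sigmoid(w4[0] + w4[1] * x[0] + w4[2] * x[1] + w4[3] * x[2])]
    # Bernoulli log-likelihood of each pixel under its conditional
    ll = sum(np.log(pd) if xd == 1 else np.log(1 - pd) for xd, pd in zip(x, p))
    return -ll

print(neg_log_likelihood(np.array([1, 0, 0, 1])))  # loss for one training image
```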
Here is a result of an FVSBN trained on the Caltech 101 Silhouettes dataset; the result is from Gan et al., PMLR 2015.
From Figure 4 of Gan et al., PMLR 2015: training data (left) and synthesized samples (right).
[Figure: a neural network with inputs x_1, x_2, ..., x_D and D outputs.]
The d-th output should be a function of only the previous d − 1 inputs.
Let's see how a simple NADE can be formulated. The goal is to use a neural network to model all of the conditional probabilities in p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) p(x_4 | x_1, x_2, x_3) ... p(x_D | x_{1:D−1}). This means we need a neural network that receives D inputs x_1, x_2, ..., x_D and returns D outputs (each corresponding to one of the terms in the joint factorization). BUT we need to be careful, since the d-th output should be a function of only the previous d − 1 inputs (why? Because the d-th term in the factorization is p(x_d | x_{1:d−1})). The next thing to be careful about is that we cannot simply use a fully connected layer. Why? Because in a fully connected layer every node in the hidden layer sees all the inputs, which means all the nodes in the subsequent layers, up to the output layer, are functions of all the inputs. We must prevent this. BUT how? NADE has a solution for that ...
[Figure: NADE computes the d-th conditional by masking the input, x ⊙ m_d, passing it through a shared hidden layer to obtain h_d, and mapping h_d to the output p(x_d | x_{1:d−1}).]
h_d = σ(W (x ⊙ m_d) + b)
p(x_d | x_{1:d−1}) = σ(V_d^T h_d + c_d)
m_d = (1, 1, ..., 1, 0, ..., 0)^T, a mask whose first d − 1 entries are 1 and whose remaining entries are 0.
Then, the outputs are computed as follows:
h_d = σ(W (x ⊙ m_d) + b)
p(x_d | x_{1:d−1}) = σ(V_d^T h_d + c_d)
where:
- W is an M × D matrix (shared across all d);
- x = (x_1, ..., x_D)^T is a D × 1 vector;
- m_d is a D × 1 mask vector whose first d − 1 elements are 1 and the rest are 0;
- b is an M × 1 vector (shared);
- V_d is an M × 1 vector and c_d is a scalar.
[Figure: the NADE architecture, with M hidden units per h_d and outputs p(x_1), p(x_2 | x_1), p(x_3 | x_1, x_2), ..., p(x_D | x_{1:D−1}).]
Note: the number of parameters scales with DM, not D^2!
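A minimal numpy sketch of these equations follows; the dimensions D and M and the random parameter values are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, M = 4, 8                               # input dimension and number of hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(M, D))    # shared M x D weight matrix
b = np.zeros(M)                           # shared M x 1 bias
V = rng.normal(scale=0.1, size=(D, M))    # V_d is row d (each an M x 1 vector)
c = np.zeros(D)                           # c_d is a scalar per output

def nade_conditionals(x):
    """Return p(x_d = 1 | x_1:d-1) for d = 1..D, for a binary input x."""
    probs = np.empty(D)
    for d in range(1, D + 1):
        m_d = np.zeros(D)
        m_d[: d - 1] = 1.0                            # mask: first d-1 entries are 1
        h_d = sigmoid(W @ (x * m_d) + b)              # h_d = sigma(W (x ⊙ m_d) + b)
        probs[d - 1] = sigmoid(V[d - 1] @ h_d + c[d - 1])
    return probs

print(nade_conditionals(np.array([1.0, 0.0, 1.0, 1.0])))
```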
ℒ(θ) = E_X[NLL] = E_X[−log p(X)] ≈ (1/N) Σ_{i=1}^N (−log p(x^(i))) = −(1/N) Σ_{i=1}^N Σ_{d=1}^D log p(x_d^(i) | x_{1:d−1}^(i))
where θ = {W, b, V_1, c_1, V_2, c_2, ..., V_D, c_D} contains the parameters of the NADE that we aim to learn.
Example: Consider a simple case where X ∈ {0,1}^3, i.e., binary sequences of length D = 3. Compute the loss function of the NADE for a training instance x^(1) = (x_1^(1), x_2^(1), x_3^(1)) = (1, 0, 0).
Solution:
ℒ(θ) = −Σ_{d=1}^3 log p(x_d^(1) | x_{1:d−1}^(1)) = −log p(x_1^(1) = 1) − log p(x_2^(1) = 0 | x_1^(1) = 1) − log p(x_3^(1) = 0 | x_1^(1) = 1, x_2^(1) = 0)
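As a numeric sanity check of this expansion (the three conditional probabilities below are made-up values, not outputs of a trained NADE):

```python
import numpy as np

# Suppose (illustratively) the NADE outputs these conditionals for x = (1, 0, 0):
p1 = 0.6   # p(x_1 = 1)
p2 = 0.3   # p(x_2 = 1 | x_1 = 1)
p3 = 0.2   # p(x_3 = 1 | x_1 = 1, x_2 = 0)

# Loss for this instance: -log p(x1=1) - log p(x2=0|x1) - log p(x3=0|x1,x2)
loss = -np.log(p1) - np.log(1 - p2) - np.log(1 - p3)
print(loss)
```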
Each term −log p(x_d^(i) | x_{1:d−1}^(i)) in the loss ℒ(θ) above is the cross-entropy calculated for pixel d.
Note that this is nothing but the (averaged) sum of cross-entropies. In other words, a cross-entropy is applied to each (binary) output, and then the (averaged) sum of these cross-entropies is computed.
For example, assume that for a given training instance we have x_3 = 0, which means the true probability distribution at the third output is p = (p(x_3 = 0 | x_1, x_2), p(x_3 = 1 | x_1, x_2)) = (1, 0), while the predicted one is q = (1 − σ(V_3^T h_3 + c_3), σ(V_3^T h_3 + c_3)). So the cross-entropy is:
CE(p, q) = −1 × log(1 − σ(V_3^T h_3 + c_3)) − 0 × log σ(V_3^T h_3 + c_3)
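The same d = 3 term, written as a two-class cross-entropy (the value of σ(V_3^T h_3 + c_3) is an illustrative placeholder):

```python
import numpy as np

q3 = 0.2                                 # illustrative value of sigma(V3^T h3 + c3)
p = np.array([1.0, 0.0])                 # true distribution at output 3 when x3 = 0
q = np.array([1.0 - q3, q3])             # predicted distribution
cross_entropy = -np.sum(p * np.log(q))   # = -log(1 - q3), the d = 3 term of the NLL
print(cross_entropy)
```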
Method 2 (cross-entropy): compute the binary cross-entropy at each of the three outputs for the example above and sum them; this gives the same value as the direct expansion of the negative log-likelihood.
To summarize:
For each d = 1, 2, ..., D:
h_d = σ(W (x ⊙ m_d) + b),   with W an M × D matrix (shared) and b an M × 1 vector,
p(x_d | x_{1:d−1}) = σ(V_d^T h_d + c_d),   with V_d an M × 1 vector and c_d a scalar.
ℒ(θ) = E_X[NLL] = E_X[−log p(X)] = E_X[−Σ_{d=1}^D log p(X_d | X_{1:d−1})], i.e., the sum over d of the cross-entropy calculated for pixel d.
[Figure: the NADE architecture with M hidden units per h_d and outputs p(x_1), p(x_2 | x_1), ..., p(x_D | x_{1:D−1}).]
Question 1: Can we apply these models to non-binary discrete random variables, for example X ∈ {0, 1, 2, ..., 255}^D?
- Yes: we can model this multinomial (categorical) distribution by using a Softmax activation at the output instead of the Sigmoid.
[Figure: the same NADE architecture, with each output p(x_d | x_{1:d−1}) now a Softmax over the possible pixel values; the loss is still ℒ(θ) = E_X[−Σ_{d=1}^D log p(X_d | X_{1:d−1})], the sum of the cross-entropies calculated for each pixel d.]
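A minimal sketch of one such Softmax output head for 256 pixel values; the shapes, weights, and the observed pixel value are illustrative assumptions:

```python
import numpy as np

K, M = 256, 8                                  # number of pixel values, hidden units
rng = np.random.default_rng(0)
V_d = rng.normal(scale=0.1, size=(K, M))       # output weights for dimension d
c_d = np.zeros(K)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h_d = rng.normal(size=M)                       # hidden activation for dimension d
p_d = softmax(V_d @ h_d + c_d)                 # p(x_d = k | x_1:d-1), k = 0..255

x_d = 37                                       # observed pixel value (illustrative)
nll_term = -np.log(p_d[x_d])                   # cross-entropy term for pixel d
print(nll_term)
```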
[Figure: for continuous data, each output head of the network produces distribution parameters via separate vectors V_d^π, V_d^μ, V_d^σ (mixture weights, means, and standard deviations); the loss is still ℒ(θ) = E_X[NLL] = E_X[−Σ_{d=1}^D log p(X_d | X_{1:d−1})].]
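Assuming the heads V_d^π, V_d^μ, V_d^σ parameterize a mixture of Gaussians for each continuous output (as in RNADE), a minimal sketch of the per-dimension density could look like the following; every name and shape here is an assumption, not something given on the slide:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

M, K = 8, 3                                     # hidden units, mixture components
rng = np.random.default_rng(0)
V_pi, V_mu, V_sigma = (rng.normal(scale=0.1, size=(K, M)) for _ in range(3))
h_d = rng.normal(size=M)                        # hidden activation for dimension d

pi    = softmax(V_pi @ h_d)                     # mixture weights
mu    = V_mu @ h_d                              # component means
sigma = np.exp(V_sigma @ h_d)                   # component std devs (kept positive)

def p_xd(x):
    # p(x_d | x_1:d-1) as a mixture of K Gaussians
    return np.sum(pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
                  / (sigma * np.sqrt(2 * np.pi)))

nll_term = -np.log(p_xd(0.5))
print(nll_term)
```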
x̂ = σ(c + V h^[2])   (unmasked output layer)
x̂ = σ(c + (V ⊙ M^V) h^[2])   (masked output layer)
o Assume an ordering of the D-dimensional data x, e.g., x_2, x_1, x_3, ..., and keep it for both the input and output nodes (in the example below: x_1, x_2, x_3).
o For the hidden layers, assign each node a number between 1 and D − 1.
o Each node at the output layer can be connected only to nodes in the previous layer with a (strictly) smaller number.
o Each node in the hidden layers can be connected only to nodes in the previous layer with a smaller or equal number.
The layers are then computed with element-wise (Hadamard) masking of the weights:
h^[1] = f^[1](b^[1] + (W^[1] ⊙ M^[1]) x)
h^[2] = f^[2](b^[2] + (W^[2] ⊙ M^[2]) h^[1])
x̂ = σ(c + (V ⊙ M^V) h^[2])
[Figure: a three-input MADE (inputs x_1, x_2, x_3; outputs p(x_1), p(x_2 | x_1), p(x_3 | x_1, x_2)) with two hidden layers whose units are numbered 1 or 2; the masks M^[1], M^[2], M^V zero out every connection that would violate the rules above.]
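A minimal sketch of building such masks from the assigned node numbers; the degree assignments below are illustrative and not necessarily those used in the figure:

```python
import numpy as np

D = 3
deg_in = np.array([1, 2, 3])             # input/output ordering x1, x2, x3
deg_h1 = np.array([1, 1, 2, 2])          # hidden-layer numbers in {1, ..., D-1} (assumed)
deg_h2 = np.array([1, 2, 2, 1])          # (assumed)

# Hidden layers: connect to previous-layer nodes with smaller OR EQUAL number.
M1 = (deg_h1[:, None] >= deg_in[None, :]).astype(float)   # shape (4, 3)
M2 = (deg_h2[:, None] >= deg_h1[None, :]).astype(float)   # shape (4, 4)
# Output layer: connect only to previous-layer nodes with a STRICTLY smaller number.
MV = (deg_in[:, None] > deg_h2[None, :]).astype(float)    # shape (3, 4)

# The masked layers are then W[1] ⊙ M1, W[2] ⊙ M2, V ⊙ MV (element-wise),
# exactly as in h[1] = f(b[1] + (W[1] ⊙ M[1]) x), etc.
print(M1, M2, MV, sep="\n")
```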
➢ Training is done as for a fully connected network, except that the masks are applied (to ensure the autoregressive constraints).
➢ The loss function is the same as before for the Neural Autoregressive Density Estimator (NADE).
[Figure: the same three-input network, showing the masked weights W^[1] ⊙ M^[1], W^[2] ⊙ M^[2], and V ⊙ M^V and the resulting outputs p(x_1), p(x_2 | x_1), p(x_3 | x_1, x_2); wherever a mask entry is 1 the weight is kept, and wherever it is 0 the connection is removed.]
[Figure: three copies of the masked network, one per sampling step.]
Sample a value of x_1 from p(x_1). Feed x_1 to the network and compute p(x_2 | x_1); then sample x_2 from p(x_2 | x_1). Feed x_1 and x_2 to the network and compute p(x_3 | x_1, x_2); then sample x_3 from p(x_3 | x_1, x_2).
Autoregressive Generative Models: let's turn to another modeling approach, based on CNNs.
Can we do this sequential modeling using convolutional layers (Conv 1D)?
RECAP: a 1D convolution is an operation applied to a 1D input where each output is a weighted sum of the nearby inputs.
Let's assume a 1D input x = [x_1, x_2, x_3, x_4, x_5, x_6, x_7] and a 1D convolution filter of size three, w = [w_1, w_2, w_3]^T.
[Figure: the filter (kernel size 3) slides over the input.]
z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3
Note that normally there is also a bias term (i.e., z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3 + b), which is ignored here for simplicity.
Sliding the filter down by one position gives z_2 = w_1 x_2 + w_2 x_3 + w_3 x_4, and so on for the remaining outputs.
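A small numpy check of these outputs (the input and filter values are arbitrary):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7.])   # x1..x7
w = np.array([0.5, -1.0, 2.0])               # w1, w2, w3 (bias ignored, as on the slide)

# z_i = w1 * x_i + w2 * x_{i+1} + w3 * x_{i+2}
z = np.array([w @ x[i:i + 3] for i in range(len(x) - 2)])
print(z[0], z[1])                            # z1 and z2 as in the slides
```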
Can we do this sequential modeling using convolutional layers (Conv 1D)? Yes, we can, but we need to ensure the operation is done in a causal manner. Why? Because this is the condition enforced by the autoregressive representation of the data probability distribution: the output at position d may depend only on the inputs before it.
[Figure: a causal Conv 1D layer; the input x_1 ... x_5 is left-padded with zeros and produces outputs x̂_1 ... x̂_5, each depending only on earlier inputs.]
Q2: In this model the receptive field is 2; in other words, the output at timestep t depends on the inputs at t − 1 and t − 2. How can we increase the receptive field to obtain a better model?
[Figure: the causal Conv 1D construction with filter/kernel size 2: the input x_1 ... x_5 is left-padded with zeros, and each output x̂_t is computed only from the padded inputs preceding position t.]
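A minimal sketch of a causal Conv 1D obtained by left-padding the input with zeros, so that each output depends only on strictly earlier inputs; the kernel size and weights are illustrative:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])        # x1..x5
w = np.array([0.3, 0.7])                  # kernel of size 2 (illustrative weights)

# Left-pad with 2 zeros so that output x̂_t sees only x_{t-2} and x_{t-1},
# i.e. strictly earlier inputs, as the autoregressive factorization requires.
xp = np.concatenate([np.zeros(2), x])
x_hat = np.array([w @ xp[t:t + 2] for t in range(len(x))])
print(x_hat)                              # x̂_1 ... x̂_5
```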
[Figure (Oord et al., 2016): a stack of causal convolutional layers; the input sequence up to x_{T−1} passes through several hidden layers to produce the output x̂_T.]
[Figure: a stack of dilated causal convolutional layers with dilations 1, 2, 4, 8 (input up to x_{T−1} → hidden layers → output x̂_T); increasing the dilation at each layer enlarges the receptive field.]
[Figure: the same dilated stack, with the output at timestep t now interpreted as the conditional p(x_t | x_{1:t−1}), computed from the inputs up to x_{t−1}.]
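A minimal sketch of a stack of dilated causal convolutions with kernel size 2 and dilations 1, 2, 4, 8, as in the figure; the weights, input, and single-channel simplification are assumptions:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Kernel size 2: y[t] = w[0] * x[t - dilation] + w[1] * x[t], left-padded with
    zeros, so each output depends only on the current and earlier positions."""
    xp = np.concatenate([np.zeros(dilation), x])
    return np.array([w[0] * xp[t] + w[1] * xp[t + dilation] for t in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=16)                    # input sequence (single channel)
h = x
for dilation in (1, 2, 4, 8):              # receptive field grows to 1+1+2+4+8 = 16
    w = rng.normal(size=2)
    h = np.tanh(dilated_causal_conv1d(h, w, dilation))
print(h.shape)
```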
[Figure: a WaveNet layer with a residual/skip (shortcut) connection around it.]
The gated activation is z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x), where x is the input to the layer, ∗ denotes convolution, ⊙ is element-wise multiplication, f and g index the filter and the gate, and k indexes the layer.
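A minimal sketch of this gated activation with a shortcut connection, using a single-channel causal convolution for simplicity; filter values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def causal_conv1d(x, w):
    """Kernel size 2, left-padded: y[t] depends only on x[t-1] and x[t]."""
    xp = np.concatenate([[0.0], x])
    return np.array([w[0] * xp[t] + w[1] * xp[t + 1] for t in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=16)                        # input to the layer
w_f, w_g = rng.normal(size=2), rng.normal(size=2)

z = np.tanh(causal_conv1d(x, w_f)) * sigmoid(causal_conv1d(x, w_g))  # gated activation
out = x + z                                    # residual / shortcut connection
print(out.shape)
```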
https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio
[Figure: a convolution kernel multiplied element-wise by a mask.]
Note that masked convolution was introduced a few months before causal Conv 1D (by the same authors), and it is more precise to say that causal Conv 1D was introduced based on masked convolution.
Mask type I (applied only to the first convolutional layer, i.e., the one reading the input, to ensure the autoregressive constraints):
1 1 1
1 0 0
0 0 0
Mask type II (applied to all the subsequent convolutional layers, from the first hidden layer onward):
1 1 1
1 1 0
0 0 0
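A minimal sketch of constructing these two masks programmatically for an odd kernel size; the function name is an illustrative helper, not from the PixelCNN paper:

```python
import numpy as np

def pixelcnn_mask(kernel_size, mask_type):
    """Mask type 'I' excludes the centre pixel (first layer); type 'II' includes it."""
    k = kernel_size
    mask = np.zeros((k, k))
    mask[: k // 2, :] = 1                    # all rows above the centre
    mask[k // 2, : k // 2] = 1               # pixels to the left of the centre
    if mask_type == "II":
        mask[k // 2, k // 2] = 1             # keep the centre pixel itself
    return mask

print(pixelcnn_mask(3, "I"))                 # [[1 1 1], [1 0 0], [0 0 0]]
print(pixelcnn_mask(3, "II"))                # [[1 1 1], [1 1 0], [0 0 0]]
```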
In other words:
[Figure: the crossed-out (×) connections are removed by the masks, so the prediction for position d in the output layer never receives information from the current pixel x_d, only from the pixels that precede it.]
Note 2: In implementations of PixelCNN, a gated non-linearity/activation (rooted in the LSTM) is used to improve performance. In addition, to improve gradient flow and speed up training, residual connections are often used in PixelCNN models; these connections bypass one or more layers and allow gradients to flow directly to earlier layers.
Note 3: In the case of colored images, each pixel's color channels are modeled sequentially: the G channel is conditioned solely on R, and the B channel is conditioned on (R, G). Thus, PixelCNN takes images of size N × N × 3 as input and returns as output predictions of size N × N × 3 × 256.