An autoencoder is a type of neural network that encodes an input into a hidden representation and then decodes the hidden representation to reconstruct the input. If an undercomplete autoencoder (where the hidden dimension is less than the input dimension) can perfectly reconstruct the input from the hidden representation, it means the hidden representation captures all the important characteristics of the input. This is analogous to principal component analysis. The document then discusses regularization techniques for autoencoders like denoising, sparse, and contractive autoencoders.
CS7015 (Deep Learning) : Lecture 7

Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module 7.1: Introduction to Autoencoders
[Figure: an autoencoder with input xi at the bottom, hidden layer h connected to it by weights W, and reconstruction x̂i at the top connected to h by weights W∗]

An autoencoder is a special type of feedforward neural network which does the following:

Encodes its input xi into a hidden representation h
h = g(W xi + b)

Decodes the input again from this hidden representation
x̂i = f(W∗ h + c)

The model is trained to minimize a certain loss function which will ensure that x̂i is close to xi (we will see some such loss functions soon).
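To make these two equations concrete, here is a minimal NumPy sketch of a single forward pass. This is only an illustration, not code from the lecture: the dimensions, the choice of the sigmoid for both g and f, and all variable names are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 5, 3                                   # illustrative input and hidden sizes
W = rng.normal(scale=0.1, size=(k, n))        # encoder weights
b = np.zeros(k)                               # encoder bias
W_star = rng.normal(scale=0.1, size=(n, k))   # decoder weights
c = np.zeros(n)                               # decoder bias

def encode(x):
    # h = g(W x + b), with g chosen as the sigmoid
    return sigmoid(W @ x + b)

def decode(h):
    # x_hat = f(W* h + c); here f is also a sigmoid (suitable for binary inputs)
    return sigmoid(W_star @ h + c)

x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
x_hat = decode(encode(x))
```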
Let us consider the case where dim(h) < dim(xi).

If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?

h is a loss-free encoding of xi. It captures all the important characteristics of xi.

Do you see an analogy with PCA?

An autoencoder where dim(h) < dim(xi) is called an undercomplete autoencoder.
Let us consider the case when dim(h) ≥ dim(xi).

In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i.

Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.

An autoencoder where dim(h) ≥ dim(xi) is called an overcomplete autoencoder.
The Road Ahead
Choice of f(xi) and g(xi)
Choice of loss function
Suppose all our inputs are binary (each xij ∈ {0, 1}), e.g. xi = (0, 1, 1, 0, 1).

Which of the following functions would be most apt for the decoder?
x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

Logistic, as it naturally restricts all outputs to be between 0 and 1.

g is typically chosen as the sigmoid function.
Suppose all our inputs are real valued (each xij ∈ R), e.g. xi = (0.25, 0.5, 1.25, 3.5, 4.5).

Which of the following functions would be most apt for the decoder?
x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

What will logistic and tanh do? They will restrict the reconstructed x̂i to lie in [0, 1] or [−1, 1], whereas we want x̂i ∈ Rn. So the linear decoder x̂i = W∗ h + c is the most apt choice here.

Again, g is typically chosen as the sigmoid function.
Consider the case when the inputs are real valued.

The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible. This can be formalized using the following objective function:

min_{W, W∗, c, b}  (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)²

i.e.,  min_{W, W∗, c, b}  (1/m) Σ_{i=1..m} (x̂i − xi)^T (x̂i − xi)

We can then train the autoencoder just like a regular feedforward network using backpropagation.

All we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W, which we will see now.
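As a quick illustration, the reconstruction objective can be written in a couple of lines of NumPy. This is a sketch under the assumption that the inputs are stacked into a batch matrix X of shape m × n and the reconstructions into X_hat of the same shape; the names are placeholders, not from the lecture.

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    """Squared reconstruction error: (1/m) * sum_i sum_j (x_hat_ij - x_ij)^2."""
    m = X.shape[0]
    return np.sum((X_hat - X) ** 2) / m
```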
L(θ) = (x̂i − xi)^T (x̂i − xi)

[Figure: the autoencoder drawn as a two-layer network: h0 = xi, pre-activation a1, hidden layer h1 (weights W), pre-activation a2, output h2 = x̂i (weights W∗)]

Note that the loss function is shown for only one training example.

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗
∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate the expressions in the boxes when we learnt backpropagation.

∂L(θ)/∂h2 = ∂L(θ)/∂x̂i = ∇x̂i {(x̂i − xi)^T (x̂i − xi)} = 2(x̂i − xi)
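For concreteness, here is a small NumPy sketch of these gradients for one training example, assuming a sigmoid encoder and a linear decoder (so that ∂h2/∂a2 is the identity). The function and variable names, and the choice of a linear decoder, are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def squared_loss_grads(x, W, b, W_star, c):
    # Forward pass: h1 = sigmoid(W x + b), x_hat = W* h1 + c (linear decoder)
    h1 = sigmoid(W @ x + b)
    x_hat = W_star @ h1 + c

    # dL/dx_hat = 2 (x_hat - x) for L = (x_hat - x)^T (x_hat - x)
    dL_da2 = 2.0 * (x_hat - x)

    # Decoder gradients: a2 = W* h1 + c
    dW_star = np.outer(dL_da2, h1)
    dc = dL_da2

    # Backpropagate into the hidden layer and through the sigmoid
    dL_dh1 = W_star.T @ dL_da2
    dL_da1 = dL_dh1 * h1 * (1.0 - h1)
    dW = np.outer(dL_da1, x)
    db = dL_da1
    return dW, db, dW_star, dc
```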
Consider the case when the inputs are binary.

We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities.

For a single n-dimensional i-th input we can use the following (cross-entropy) loss function:

min {− Σ_{j=1..n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))}

What value of x̂ij will minimize this function? If xij = 1? If xij = 0? Indeed, the above function will be minimized when x̂ij = xij!

Again, all we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W to use backpropagation.
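A minimal sketch of this cross-entropy loss for one binary input vector and its reconstruction. The small epsilon used to keep the logarithms finite, and the names, are assumptions of this illustration, not part of the lecture.

```python
import numpy as np

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """- sum_j [ x_j log(x_hat_j) + (1 - x_j) log(1 - x_hat_j) ]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
```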
L(θ) = − Σ_{j=1..n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗
∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate the expressions in the square boxes when we learnt backpropagation.

The first two terms on the RHS can be computed as:

∂L(θ)/∂h2j = − xij/x̂ij + (1 − xij)/(1 − x̂ij)
∂h2j/∂a2j = σ(a2j)(1 − σ(a2j))

where ∂L(θ)/∂h2 = [∂L(θ)/∂h21, ∂L(θ)/∂h22, . . . , ∂L(θ)/∂h2n]^T
Module 7.2: Link between PCA and Autoencoders
[Figure: left, the autoencoder xi → h → x̂i; right, PCA projecting the data x onto principal directions u1, u2, with P^T X^T X P = D]

We will now see that the encoder part of an autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use squared error loss function
normalize the inputs to

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)
First let us consider the implication of normalizing the inputs to

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)

The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean).

Let X′ be this zero mean data matrix; then what the above normalization gives us is X = (1/√m) X′.

Now X^T X = (1/m) (X′)^T X′ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
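This claim is easy to check numerically. The following sketch uses random data purely for illustration; the variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 4
X_raw = rng.normal(size=(m, n))

# Normalize: subtract the per-dimension mean and scale by 1/sqrt(m)
X = (X_raw - X_raw.mean(axis=0)) / np.sqrt(m)

# X^T X now equals the (biased) covariance matrix of the original data
cov = np.cov(X_raw, rowvar=False, bias=True)
print(np.allclose(X.T @ X, cov))   # True
```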
First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function

(1/m) Σ_{i=1..m} Σ_{j=1..n} (xij − x̂ij)²

is obtained when we use a linear encoder.
min_θ Σ_{i=1..m} Σ_{j=1..n} (xij − x̂ij)²    (1)

This is equivalent to

min_{W∗, H} (‖X − HW∗‖_F)²,   where ‖A‖_F = √(Σ_{i=1..m} Σ_{j=1..n} a²ij)

(just writing expression (1) in matrix form and using the definition of ‖A‖_F; we are ignoring the biases)

From SVD we know that the optimal solution to the above problem is given by

HW∗ = U_{·,≤k} Σ_{k,k} V^T_{·,≤k}

By matching variables one possible solution is

H = U_{·,≤k} Σ_{k,k}
W∗ = V^T_{·,≤k}
We will now show that H is a linear encoding and find an expression for the encoder weights W.

H = U_{·,≤k} Σ_{k,k}
  = (XX^T)(XX^T)^{-1} U_{·,≤k} Σ_{k,k}                  (pre-multiplying (XX^T)(XX^T)^{-1} = I)
  = (XVΣ^T U^T)(UΣV^T VΣ^T U^T)^{-1} U_{·,≤k} Σ_{k,k}   (using X = UΣV^T)
  = XVΣ^T U^T (UΣΣ^T U^T)^{-1} U_{·,≤k} Σ_{k,k}         (V^T V = I)
  = XVΣ^T U^T U(ΣΣ^T)^{-1} U^T U_{·,≤k} Σ_{k,k}         ((ABC)^{-1} = C^{-1} B^{-1} A^{-1})
  = XVΣ^T (ΣΣ^T)^{-1} U^T U_{·,≤k} Σ_{k,k}              (U^T U = I)
  = XVΣ^T (Σ^T)^{-1} Σ^{-1} U^T U_{·,≤k} Σ_{k,k}        ((AB)^{-1} = B^{-1} A^{-1})
  = XVΣ^{-1} I_{·,≤k} Σ_{k,k}                           (U^T U_{·,≤k} = I_{·,≤k})
  = XV I_{·,≤k}                                         (Σ^{-1} I_{·,≤k} = I_{·,≤k} Σ^{-1}_{k,k})
H = XV_{·,≤k}

Thus H is a linear transformation of X and W = V_{·,≤k}.
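The conclusion H = XV_{·,≤k} is easy to verify numerically with an SVD. This is a sketch: the random data and the choice of k are arbitrary and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 6, 2
X = rng.normal(size=(m, n))
X = (X - X.mean(axis=0)) / np.sqrt(m)      # the normalization from the previous slides

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# H = U_{.,<=k} Sigma_{k,k} and H = X V_{.,<=k} give the same encoding
H_svd = U[:, :k] * S[:k]
H_lin = X @ Vt[:k].T
print(np.allclose(H_svd, H_lin))   # True
```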
We have encoder W = V_{·,≤k}.

From SVD, we know that V is the matrix of eigenvectors of X^T X.

From PCA, we know that P is the matrix of the eigenvectors of the covariance matrix.

We saw earlier that, if the entries of X are normalized by

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)

then X^T X is indeed the covariance matrix.

Thus, the encoder matrix for the linear autoencoder (W) and the projection matrix (P) for PCA could indeed be the same. Hence proved.
Remember
The encoder of a linear autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)
Module 7.3: Regularization in autoencoders (Motivation)
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders.

Here, (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i.

To avoid poor generalization, we need to introduce regularization.
The simplest solution is to add an L2-regularization term to the objective function:

min_{θ, w, w∗, b, c}  (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)² + λ‖θ‖²

This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters).
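As a small illustration of how little this changes the training code, the regularized gradient is just the usual gradient plus λW. This is a sketch: dW, W and lam are placeholder names, and the factor of 2 from differentiating ‖W‖² is assumed to be folded into λ, as is common.

```python
def l2_regularized_grad(dW, W, lam):
    # dW: gradient of the reconstruction loss w.r.t. W (from backpropagation)
    # lam: regularization strength lambda
    return dW + lam * W
```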
Another trick is to tie the weights of the encoder and decoder, i.e., W∗ = W^T.

This effectively reduces the capacity of the autoencoder and acts as a regularizer.
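A minimal sketch of a tied-weight forward pass: only one weight matrix W is stored and the decoder reuses its transpose. The sigmoid encoder, linear decoder and all names are assumptions of this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tied_autoencoder_forward(x, W, b, c):
    h = sigmoid(W @ x + b)      # encoder: h = g(W x + b)
    x_hat = W.T @ h + c         # decoder reuses the same matrix: x_hat = W^T h + c
    return h, x_hat
```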
Module 7.4: Denoising Autoencoders
A denoising autoencoder simply corrupts the input data using a probabilistic process P(x̃ij | xij) before feeding it to the network.

A simple P(x̃ij | xij) used in practice is the following:

P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q

In other words, with probability q the input is flipped to 0 and with probability (1 − q) it is retained as it is.
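A sketch of this corruption process in NumPy; the masking probability q and the names are illustrative.

```python
import numpy as np

def corrupt(x, q, rng=np.random.default_rng()):
    """With probability q set each component to 0, otherwise keep it unchanged."""
    mask = rng.random(x.shape) >= q   # True where the value is retained
    return x * mask
```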
How does this help?

This helps because the objective is still to reconstruct the original (uncorrupted) xi:

arg min_θ  (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)²

It no longer makes sense for the model to copy the corrupted x̃i into h(x̃i) and then into x̂i (the objective function will not be minimized by doing so).

Instead the model will now have to capture the characteristics of the data correctly.

For example, it will have to learn to reconstruct a corrupted xij correctly by relying on its interactions with other elements of xi.
We will now see a practical application in which AEs are used and then compare denoising autoencoders with regular autoencoders.
Task: Hand-written digit recognition on MNIST. |xi| = 784 = 28 × 28.

[Figure: MNIST data and the basic approach, where the raw 28 × 28 pixels are used directly as input features to a classifier over the digits 0, 1, 2, 3, . . . , 9]

[Figure: the AE approach: first learn the important characteristics of the data as a hidden representation h ∈ Rd, and then train a classifier for the digits on top of this hidden representation]
We will now see a way of visualizing AEs and use this visualization to compare different AEs.
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration xi.

For example,

h1 = σ(W1^T xi)   [ignoring the bias b]

where W1 is the trained vector of weights connecting the input to the first hidden neuron.

What values of xi will cause h1 to be maximum (or maximally activated)?

Suppose we assume that our inputs are normalized so that ‖xi‖ = 1. Then we want

max_{xi} {W1^T xi}   s.t. ‖xi‖² = xi^T xi = 1

Solution: xi = W1 / √(W1^T W1)
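A small sketch of computing and checking this maximally-activating input for one hidden neuron; the random weight vector is used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=784)              # weights into the first hidden neuron

# Unit-norm input that maximally activates this neuron
x_star = W1 / np.sqrt(W1 @ W1)

# Any other unit-norm input gives a smaller (or equal) pre-activation W1^T x
x_other = rng.normal(size=784)
x_other /= np.linalg.norm(x_other)
print(W1 @ x_star >= W1 @ x_other)     # True
```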
Thus the inputs

xi = W1/√(W1^T W1), W2/√(W2^T W2), . . . , Wn/√(Wn^T Wn)

will respectively cause hidden neurons 1 to n to maximally fire.

Let us plot these images (xi's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and different denoising autoencoders.

These xi's are computed by the above formula using the weights (W1, W2, . . . , Wk) learned by the respective autoencoders.
[Figure: filters of a vanilla AE (no noise), a 25% denoising AE (q = 0.25) and a 50% denoising AE (q = 0.5)]

The vanilla AE does not learn many meaningful patterns.

The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or an '8' or a '9').

As the noise increases the filters become wider because the neuron has to rely on more adjacent pixels to feel confident about a stroke.
We saw one form of P(x̃ij | xij) which flips a fraction q of the inputs to zero.

Another way of corrupting the inputs is to add Gaussian noise to the input:

x̃ij = xij + N(0, 1)

We will now use such a denoising AE on a different dataset and see its performance.
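A sketch of this additive Gaussian corruption (unit-variance noise as on the slide; the names are illustrative).

```python
import numpy as np

def corrupt_gaussian(x, rng=np.random.default_rng()):
    """x_tilde_ij = x_ij + N(0, 1): add independent standard Gaussian noise to each component."""
    return x + rng.standard_normal(x.shape)
```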
[Figure: the data, the filters learned by the denoising AE, and the filters learned with weight decay]

The hidden neurons essentially behave like edge detectors.

PCA does not give such edge detectors.
Module 7.5: Sparse Autoencoders
A hidden neuron with sigmoid activation will have values between 0 and 1.

We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.

A sparse autoencoder tries to ensure that the neuron is inactive most of the time.
The average value of the activation of a neuron l is given by

ρ̂l = (1/m) Σ_{i=1..m} h(xi)l

If the neuron l is sparse (i.e., mostly inactive) then ρ̂l → 0.

A sparse autoencoder uses a sparsity parameter ρ (typically very close to 0, say 0.005) and tries to enforce the constraint ρ̂l = ρ.

One way of ensuring this is to add the following term to the objective function:

Ω(θ) = Σ_{l=1..k} [ ρ log(ρ/ρ̂l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂l)) ]

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
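A minimal sketch of computing ρ̂ and this penalty for a batch of hidden activations H of shape m × k. The clipping epsilon and the names are assumptions of this illustration.

```python
import numpy as np

def sparsity_penalty(H, rho=0.005, eps=1e-12):
    """Omega = sum_l [ rho*log(rho/rho_hat_l) + (1-rho)*log((1-rho)/(1-rho_hat_l)) ]."""
    rho_hat = H.mean(axis=0)                    # average activation of each hidden neuron
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)  # keep the logarithms finite
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
```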
[Figure: plot of Ω(θ) as a function of ρ̂l for ρ = 0.2, with its minimum at ρ̂l = 0.2]

The function will reach its minimum value(s) when ρ̂l = ρ.
Now,

    L̂(θ) = L(θ) + Ω(θ)

L(θ) is the squared error loss or cross entropy loss and Ω(θ) is the sparsity constraint.
We already know how to calculate ∂L(θ)/∂W. Let us see how to calculate ∂Ω(θ)/∂W.

    Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]

This can be re-written as

    Ω(θ) = Σ_{l=1}^{k} [ ρ log ρ − ρ log ρ̂_l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂_l) ]

By the chain rule:

    ∂Ω(θ)/∂W = ∂Ω(θ)/∂ρ̂ · ∂ρ̂/∂W

    ∂Ω(θ)/∂ρ̂ = [ ∂Ω(θ)/∂ρ̂_1 , ∂Ω(θ)/∂ρ̂_2 , ... , ∂Ω(θ)/∂ρ̂_k ]^T

For each neuron l ∈ 1 ... k in the hidden layer, we have

    ∂Ω(θ)/∂ρ̂_l = −ρ/ρ̂_l + (1 − ρ)/(1 − ρ̂_l)

and

    ∂ρ̂_l/∂W = x_i (g′(W^T x_i + b))^T   (see the derivation below)

Finally,

    ∂L̂(θ)/∂W = ∂L(θ)/∂W + ∂Ω(θ)/∂W

(and we know how to calculate both terms on the R.H.S.)
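To make the chain rule concrete, here is a minimal NumPy sketch. It assumes a sigmoid encoder h = g(W^T x + b) with W of shape n × k, and averages ρ̂ over a batch of m examples as in its definition; all names and shapes are illustrative, not the lecture's reference implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, k = 100, 8, 32                         # examples, input dim, hidden dim (hypothetical)
X = rng.normal(size=(m, n))
W = rng.normal(scale=0.1, size=(n, k))
b = np.zeros(k)
rho = 0.005

H = sigmoid(X @ W + b)                       # hidden activations, shape (m, k)
rho_hat = H.mean(axis=0)                     # average activation rho_hat_l, shape (k,)

# dOmega/drho_hat_l = -rho/rho_hat_l + (1 - rho)/(1 - rho_hat_l)
dOmega_drho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)

# drho_hat_l/dW_jl = (1/m) sum_i g'(W_{:,l}^T x_i + b_l) x_ij ; for the sigmoid, g' = h(1 - h)
drho_dW = X.T @ (H * (1 - H)) / m            # shape (n, k)

# Chain rule: rho_hat_l depends only on column l of W, so column l is scaled by dOmega/drho_hat_l
dOmega_dW = drho_dW * dOmega_drho[None, :]   # shape (n, k), same shape as W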
Derivation

    ∂ρ̂/∂W = [ ∂ρ̂_1/∂W   ∂ρ̂_2/∂W   ...   ∂ρ̂_k/∂W ]

For each element in the above equation we can calculate ∂ρ̂_l/∂W (the partial derivative of a scalar w.r.t. a matrix, which is itself a matrix). For a single element W_jl of the matrix:

    ∂ρ̂_l/∂W_jl = ∂[ (1/m) Σ_{i=1}^{m} g(W_{:,l}^T x_i + b_l) ] / ∂W_jl
                = (1/m) Σ_{i=1}^{m} ∂[ g(W_{:,l}^T x_i + b_l) ] / ∂W_jl
                = (1/m) Σ_{i=1}^{m} g′(W_{:,l}^T x_i + b_l) x_ij

So in matrix notation we can write it as

    ∂ρ̂_l/∂W = x_i (g′(W^T x_i + b))^T

Here the matrix on the right collects, in its (j, l) entry, the single-example contribution g′(W_{:,l}^T x_i + b_l) x_ij to ∂ρ̂_l/∂W_jl; averaging these contributions over the m training examples gives the derivative derived above.
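The formula can be checked numerically against a finite difference, under the same assumptions and illustrative shapes as the earlier sketch (sigmoid g, W of shape n × k).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, n, k = 50, 6, 10
X = rng.normal(size=(m, n))
W = rng.normal(scale=0.1, size=(n, k))
b = np.zeros(k)

def rho_hat(W):
    # rho_hat_l = (1/m) sum_i g(W_{:,l}^T x_i + b_l)
    return sigmoid(X @ W + b).mean(axis=0)

H = sigmoid(X @ W + b)
analytic = X.T @ (H * (1 - H)) / m           # (j, l) entry = d rho_hat_l / d W_jl

j, l, eps = 2, 3, 1e-6
W_pert = W.copy()
W_pert[j, l] += eps
numeric = (rho_hat(W_pert)[l] - rho_hat(W)[l]) / eps
print(abs(numeric - analytic[j, l]))         # should be very close to zero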
Module 7.6: Contractive Autoencoders

A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.

It does so by adding the following regularization term to the loss function

    Ω(θ) = ‖J_x(h)‖²_F

where J_x(h) is the Jacobian of the encoder.

Let us see what it looks like.
If the input has n dimensions and the hidden layer has k dimensions then

    J_x(h) = [ ∂h_1/∂x_1  ...  ∂h_1/∂x_n
               ∂h_2/∂x_1  ...  ∂h_2/∂x_n
                  ...              ...
               ∂h_k/∂x_1  ...  ∂h_k/∂x_n ]

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the l-th neuron with a small variation in the j-th input.

    ‖J_x(h)‖²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)²
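If the encoder is a sigmoid layer h = g(W^T x + b), as in the earlier sparse-autoencoder sketches (an assumption; the lecture keeps g generic, and the shapes below are illustrative), each Jacobian entry has the closed form ∂h_l/∂x_j = h_l (1 − h_l) W_jl, so the penalty can be computed either from the explicit Jacobian or directly.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n, k = 8, 32                          # input and hidden dimensions (hypothetical)
x = rng.normal(size=n)
W = rng.normal(scale=0.1, size=(n, k))
b = np.zeros(k)

h = sigmoid(W.T @ x + b)              # hidden representation, shape (k,)

# Explicit Jacobian: J[l, j] = dh_l/dx_j = h_l (1 - h_l) W_jl
J = (h * (1 - h))[:, None] * W.T      # shape (k, n)
penalty = np.sum(J ** 2)              # ||J_x(h)||_F^2

# Equivalent closed form: sum_l h_l^2 (1 - h_l)^2 ||W_{:,l}||^2
penalty_closed = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))
print(np.isclose(penalty, penalty_closed))   # True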
What is the intuition behind this?

    ‖J_x(h)‖²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)²

Consider ∂h_1/∂x_1. What does it mean if ∂h_1/∂x_1 = 0?

It means that this neuron is not very sensitive to variations in the input x_1.

But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?
Indeed it does and that’s the idea n X
k 
X ∂hl 2
kJx (h)k2F =
∂xj
j=1 l=1

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂
L(θ) - capture important variations
in data
h

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂
L(θ) - capture important variations
in data
h
Ω(θ) - do not capture variations in
data
x

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂
L(θ) - capture important variations
in data
h
Ω(θ) - do not capture variations in
data
Tradeoff - capture only very import- x
ant variations in the data

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
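Concretely, the objective being minimized is L(θ) + Ω(θ). Here is a minimal sketch of this combined loss for a single example, assuming a sigmoid encoder, a linear decoder, squared-error reconstruction, and an optional weighting coefficient lam (the slides simply add Ω(θ), i.e. lam = 1; all names are illustrative).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_loss(x, W, b, W_dec, c, lam=1.0):
    h = sigmoid(W.T @ x + b)                                       # sigmoid encoder (assumption)
    x_hat = W_dec @ h + c                                          # linear decoder (assumption)
    recon = np.sum((x_hat - x) ** 2)                               # L(theta)
    # Omega(theta) = ||J_x(h)||_F^2 with dh_l/dx_j = h_l (1 - h_l) W_jl
    penalty = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))  # Omega(theta)
    return recon + lam * penalty

# Illustrative usage with random parameters
rng = np.random.default_rng(3)
n, k = 8, 32
x = rng.normal(size=n)
W = rng.normal(scale=0.1, size=(n, k)); b = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(n, k)); c = np.zeros(n)
print(contractive_loss(x, W, b, W_dec, c))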
Let us try to understand this with the help of an illustration.

[Figure: data in the (x, y) plane, with a large variation along direction u1 and only a small variation along u2.]

Consider the variations in the data along directions u1 and u2.

It makes sense to encourage a neuron to be sensitive to variations along u1.

At the same time it makes sense to inhibit a neuron from being sensitive to variations along u2 (as the variation along u2 appears to be small noise, unimportant for reconstruction).

By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.

What does this remind you of?
Module 7.7: Summary
[Summary figure: the linear autoencoder x → h → x̂ shown alongside the PCA picture of the data with directions u1 and u2; the two are equivalent.]

PCA:

    P^T X^T X P = D

    min_θ ‖X − H W*‖²_F ,  where the optimal H W* is given by the SVD, U Σ V^T
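As a small numerical illustration of this recap (a sketch with synthetic data, not from the lecture): by the Eckart-Young theorem the truncated SVD gives the best rank-k approximation of X in Frobenius norm, which is the best any linear autoencoder with a k-dimensional hidden layer (so H W* has rank at most k) can achieve on the reconstruction objective above.

import numpy as np

rng = np.random.default_rng(4)
m, n, k = 200, 10, 3
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # synthetic, correlated data

# Best rank-k approximation of X in Frobenius norm: truncated SVD
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

# H W* has rank at most k (h is k-dimensional), so no linear autoencoder can do better than this
print(np.linalg.norm(X - X_k, 'fro') ** 2)               # minimum achievable squared error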
[Summary figure: the denoising autoencoder, where each input x_i is corrupted to x̃_i using P(x̃_ij | x_ij) before being encoded to h and reconstructed as x̂_i.]

Regularization:

    Ω(θ) = λ‖θ‖²                                                          (weight decay)

    Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]   (sparse)

    Ω(θ) = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)²                            (contractive)
