CS7015 (Deep Learning): Lecture 7
Mitesh M. Khapra

Module 7.1: Introduction to Autoencoders
[Figure: an autoencoder, with input xi mapped through encoder weights W to the hidden representation h, and through decoder weights W* to the reconstruction x̂i]

An autoencoder is a special type of feed forward neural network which does the following:

It encodes its input xi into a hidden representation h:

h = g(W xi + b)

It decodes the input again from this hidden representation:

x̂i = f(W* h + c)

The model is trained to minimize a certain loss function which will ensure that x̂i is close to xi (we will see some such loss functions soon).
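As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the autoencoder forward pass above, assuming sigmoid activations for both g and f; the function and variable names (forward, W_star, etc.) are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, b, W_star, c, g=sigmoid, f=sigmoid):
    """One forward pass of an autoencoder:
    h = g(W x + b), x_hat = f(W* h + c)."""
    h = g(W @ x + b)           # hidden representation
    x_hat = f(W_star @ h + c)  # reconstruction
    return h, x_hat

# Toy usage: 5-dimensional input, 3-dimensional hidden layer
rng = np.random.default_rng(0)
x = rng.random(5)
W, b = rng.standard_normal((3, 5)), np.zeros(3)
W_star, c = rng.standard_normal((5, 3)), np.zeros(5)
h, x_hat = forward(x, W, b, W_star, c)
```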
Let us consider the case where dim(h) < dim(xi).

If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?

h is a loss-free encoding of xi. It captures all the important characteristics of xi.

Do you see an analogy with PCA?
Let us consider the case when dim(h) ≥ dim(xi).

In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i.

Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.
The Road Ahead

Choice of f(xi) and g(xi)
Choice of loss function
Suppose all our inputs are binary (each xij ∈ {0, 1}, for example xi = [0, 1, 1, 0, 1]).

Which of the following functions would be most apt for the decoder?

x̂i = tanh(W* h + c)
x̂i = W* h + c
x̂i = logistic(W* h + c)

The logistic function is the most apt choice here, since its output lies in (0, 1) and therefore matches the binary inputs.
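A small sketch (my addition, not from the slides) comparing the output ranges of the three candidate decoders on the same pre-activations; it simply illustrates that only the logistic output stays within (0, 1), which is what we want for binary targets.

```python
import numpy as np

a = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])    # some pre-activations W*h + c

linear   = a                                  # unbounded
tanh_out = np.tanh(a)                         # range (-1, 1)
logistic = 1.0 / (1.0 + np.exp(-a))           # range (0, 1), matches binary inputs

print(linear, tanh_out, logistic, sep="\n")
```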
Suppose all our inputs are real (each xij ∈ R, for example xi = [0.25, 0.5, 1.25, 3.5, 4.5]).

Which of the following functions would be most apt for the decoder?

x̂i = tanh(W* h + c)
x̂i = W* h + c
x̂i = logistic(W* h + c)

What will logistic and tanh do? They will restrict the reconstructed x̂i to lie in [0, 1] or [−1, 1], whereas we want x̂i ∈ R^n. So a linear decoder is the apt choice here.

Again, g (the encoder activation) is typically chosen as the sigmoid function.
Consider the case when the inputs are real valued.

The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible.
This is captured by the squared error loss (note that the loss function is shown for only one training example):

L(θ) = (x̂i − xi)^T (x̂i − xi)

Writing the network as h0 = xi, a1 = W h0 + b, h1 = g(a1), a2 = W* h1 + c, h2 = f(a2) = x̂i, the gradients follow from the chain rule:

∂L(θ)/∂W* = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W*

∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate these intermediate factors when we learnt backpropagation. The first factor is

∂L(θ)/∂h2 = ∂L(θ)/∂x̂i = ∇x̂i {(x̂i − xi)^T (x̂i − xi)} = 2(x̂i − xi)
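A hedged NumPy sketch (my addition) of these gradients for a sigmoid encoder and a linear decoder with squared error loss; the layer names (a1, h1, a2) mirror the slide's notation, everything else is illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 6, 3                       # input dim, hidden dim
x = rng.random(n)
W, b = rng.standard_normal((k, n)) * 0.1, np.zeros(k)
W_star, c = rng.standard_normal((n, k)) * 0.1, np.zeros(n)

# Forward pass: sigmoid encoder, linear decoder
a1 = W @ x + b
h1 = sigmoid(a1)
a2 = W_star @ h1 + c
x_hat = a2                        # h2 = x̂i

# Backward pass for L(θ) = (x̂i − xi)^T (x̂i − xi)
dL_dxhat = 2.0 * (x_hat - x)                 # ∂L/∂h2 = 2(x̂i − xi)
dL_dWstar = np.outer(dL_dxhat, h1)           # ∂L/∂W*
dL_dh1 = W_star.T @ dL_dxhat
dL_da1 = dL_dh1 * h1 * (1.0 - h1)            # sigmoid derivative
dL_dW = np.outer(dL_da1, x)                  # ∂L/∂W
```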
Consider the case when the inputs are binary (for example, xi = [0, 1, 1, 0, 1]).
In this case we use the cross-entropy loss (again shown for a single training example):

L(θ) = − Σ_{j=1}^{n} ( xij log x̂ij + (1 − xij) log(1 − x̂ij) )

The gradients again follow from the chain rule:

∂L(θ)/∂W* = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W*

∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate these expressions when we learnt backpropagation. The first two factors on the RHS can be computed elementwise as:

∂L(θ)/∂h2j = − xij/x̂ij + (1 − xij)/(1 − x̂ij)

∂h2j/∂a2j = σ(a2j)(1 − σ(a2j))

where ∂L(θ)/∂h2 = [ ∂L(θ)/∂h21, ∂L(θ)/∂h22, ..., ∂L(θ)/∂h2n ]^T.
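A small NumPy sketch (my addition, not from the slides) of the cross-entropy loss and the two elementwise factors above, assuming a sigmoid decoder.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])      # binary input
a2 = np.array([-1.0, 2.0, 0.5, -0.3, 1.5])   # decoder pre-activation W*h + c
x_hat = sigmoid(a2)                           # h2 = x̂i

loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

dL_dh2 = -x / x_hat + (1 - x) / (1 - x_hat)   # ∂L/∂h2j
dh2_da2 = x_hat * (1 - x_hat)                 # σ(a2j)(1 − σ(a2j))
dL_da2 = dL_dh2 * dh2_da2                     # simplifies to x_hat − x
```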
Module 7.2: Link between PCA and Autoencoders
[Figure: an autoencoder (xi → h → x̂i) shown alongside PCA, which projects the data onto directions u1, u2 with P^T X^T X P = D]

We will now see that the encoder part of an autoencoder is equivalent to PCA if we

use a linear encoder,
use a linear decoder,
use the squared error loss function, and
normalize the inputs to

x̂ij = (1/√m) ( xij − (1/m) Σ_{k=1}^{m} xkj )
First let us consider the implication of normalizing the inputs to

x̂ij = (1/√m) ( xij − (1/m) Σ_{k=1}^{m} xkj )

The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean).

Let X′ be this zero-mean data matrix; then the above normalization gives us X = (1/√m) X′.

Now X^T X = (1/m) (X′)^T X′ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
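A quick NumPy check (my addition) that this normalization makes X^T X equal to the sample covariance of the original data; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 4
X_raw = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))  # some correlated data

X_prime = X_raw - X_raw.mean(axis=0)      # zero mean along each dimension j
X = X_prime / np.sqrt(m)                  # X = (1/√m) X′

cov_from_X = X.T @ X                      # X^T X = (1/m) X′^T X′
cov_direct = np.cov(X_raw, rowvar=False, bias=True)
assert np.allclose(cov_from_X, cov_direct)
```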
First we will show that, if we use a linear decoder and a squared error loss function, then the optimal solution of the following objective function is obtained from the SVD of X:

(1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (xij − x̂ij)^2
min_θ Σ_{i=1}^{m} Σ_{j=1}^{n} (xij − x̂ij)^2      (1)

This is equivalent to

min_{W*, H} ( ||X − H W*||_F )^2     where ||A||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} a_ij^2 )

(just writing expression (1) in matrix form and using the definition of ||A||_F; we are ignoring the biases).

From SVD we know that the optimal solution to the above problem is given by

H = U_{.,≤k} Σ_{k,k}
W* = V^T_{.,≤k}
We will now show that H is a linear encoding and find an expression for the encoder weights W.

H = U_{.,≤k} Σ_{k,k}
  = (X X^T)(X X^T)^{−1} U_{.,≤k} Σ_{k,k}                        (pre-multiplying (X X^T)(X X^T)^{−1} = I)
  = (X V Σ^T U^T)(U Σ V^T V Σ^T U^T)^{−1} U_{.,≤k} Σ_{k,k}       (using X = U Σ V^T)
  = X V Σ^T U^T (U Σ Σ^T U^T)^{−1} U_{.,≤k} Σ_{k,k}              (V^T V = I)
  = X V Σ^T U^T U (Σ Σ^T)^{−1} U^T U_{.,≤k} Σ_{k,k}              ((ABC)^{−1} = C^{−1} B^{−1} A^{−1})
  = X V Σ^T (Σ Σ^T)^{−1} U^T U_{.,≤k} Σ_{k,k}                    (U^T U = I)
  = X V Σ^T (Σ^T)^{−1} Σ^{−1} U^T U_{.,≤k} Σ_{k,k}               ((AB)^{−1} = B^{−1} A^{−1})
  = X V Σ^{−1} I_{.,≤k} Σ_{k,k}                                  (U^T U_{.,≤k} = I_{.,≤k})
  = X V I_{.,≤k}                                                 (Σ^{−1} I_{.,≤k} = I_{.,≤k} Σ_{k,k}^{−1})
H = X V_{.,≤k}
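A hedged NumPy sketch (my addition) that numerically checks the result H = X V_{.,≤k}: the rank-k truncated SVD gives both the optimal H, W* and a linear encoding of X.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 8, 3
X = rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Optimal solution from SVD: H = U_{.,≤k} Σ_{k,k}, W* = V^T_{.,≤k}
H = U[:, :k] * s[:k]          # U_{.,≤k} Σ_{k,k}
W_star = Vt[:k, :]            # V^T_{.,≤k}

# The derivation above: H is also a linear encoding of X with W = V_{.,≤k}
V = Vt.T
assert np.allclose(H, X @ V[:, :k])

# X_hat = H W* is the best rank-k approximation of X in the squared error sense
X_hat = H @ W_star
```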
We have encoder W = V_{.,≤k}.

From SVD, we know that V is the matrix of eigenvectors of X^T X.

From PCA, we know that P is the matrix of the eigenvectors of the covariance matrix.

We saw earlier that, if the entries of X are normalized by

x̂ij = (1/√m) ( xij − (1/m) Σ_{k=1}^{m} xkj )

then X^T X is indeed the covariance matrix. Hence V = P, and the linear encoder projects the data onto the same directions as PCA.
Remember

The encoder of a linear autoencoder is equivalent to PCA if we

use a linear encoder,
use a linear decoder,
use a squared error loss function,
and normalize the inputs to

x̂ij = (1/√m) ( xij − (1/m) Σ_{k=1}^{m} xkj )
Module 7.3: Regularization in autoencoders (Motivation)
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders.

Here, (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i.

To avoid poor generalization, we need to introduce regularization.
The simplest solution is to add an L2-regularization term to the objective function:

min_{θ = {W, W*, b, c}}  (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)^2 + λ ||θ||^2

This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters).
Another trick is to tie the weights of the encoder and decoder, i.e., W* = W^T.

This effectively reduces the capacity of the autoencoder and acts as a regularizer.
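A brief NumPy sketch (my addition) of both tricks: the decoder reuses W^T (tied weights), and the L2 term λ||θ||^2 simply adds λW-style terms to the gradients. All names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k, lam = 8, 16, 1e-3             # overcomplete: k > n
x = rng.random(n)
W, b, c = rng.standard_normal((k, n)) * 0.1, np.zeros(k), np.zeros(n)

# Tied weights: the decoder uses W^T instead of a separate W*
h = sigmoid(W @ x + b)
x_hat = W.T @ h + c                  # linear decoder with W* = W^T

# Squared error + L2 regularization on the weights
loss = np.sum((x_hat - x) ** 2) + lam * np.sum(W ** 2)

# The regularizer adds 2*lam*W (often written λW with the 2 absorbed into λ)
dL_dxhat = 2.0 * (x_hat - x)
dL_dW_decoder = np.outer(h, dL_dxhat)            # gradient through W^T as decoder
dL_dh = W @ dL_dxhat
dL_dW_encoder = np.outer(dL_dh * h * (1 - h), x)
dL_dW = dL_dW_encoder + dL_dW_decoder + 2 * lam * W
```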
Module 7.4: Denoising Autoencoders
[Figure: the input xi is corrupted to x̃i via P(x̃ij | xij), encoded to h, and decoded to x̂i]

A denoising autoencoder simply corrupts the input data using a probabilistic process (P(x̃ij | xij)) before feeding it to the network.

A simple P(x̃ij | xij) used in practice is the following:

P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q

In other words, with probability q the input is flipped to 0 and with probability (1 − q) it is retained as it is.
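A small NumPy sketch (my addition) of this masking-noise corruption; note that the loss is still computed against the uncorrupted xi, as emphasized next.

```python
import numpy as np

def corrupt(x, q, rng):
    """With probability q, flip each input dimension to 0; keep it otherwise."""
    mask = rng.random(x.shape) >= q     # True with probability (1 - q)
    return x * mask

rng = np.random.default_rng(0)
x = rng.random(784)                     # e.g. a flattened 28x28 image
x_tilde = corrupt(x, q=0.25, rng=rng)   # corrupted input fed to the encoder
# ... the forward pass uses x_tilde, but the reconstruction loss compares x_hat to x
```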
How does this help? It helps because the objective is still to reconstruct the original (uncorrupted) xi:

arg min_θ (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)^2
Task: hand-written digit recognition

[Figure: sample handwritten digit images 0, 1, 2, 3, ..., 9]

|xi| = 784 = 28 × 28 (flattened digit image)
h ∈ R^d
x̂i ∈ R^784
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration xi.

For example,

h1 = σ(W1^T xi)   [ignoring bias b]

Where will h1 be maximum? For inputs of unit norm, the solution of max {W1^T xi} subject to ||xi|| = 1 is xi = W1 / sqrt(W1^T W1).

Thus the inputs

xi = W1 / sqrt(W1^T W1), W2 / sqrt(W2^T W2), ..., Wn / sqrt(Wn^T Wn)

will respectively cause the hidden neurons 1, ..., n to maximally fire. Visualizing these normalized weight vectors therefore shows us what each neuron has learnt to detect.
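A hedged sketch (my addition) of how such filters are typically visualized: each hidden neuron's weight vector is normalized and reshaped back into a 28 × 28 image. The names and shapes are assumptions consistent with the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 784                          # hidden neurons, input dimension (28x28)
W = rng.standard_normal((d, n))         # encoder weights of a trained AE would go here

# Max-activating input for neuron l is W_l / sqrt(W_l^T W_l)
filters = W / np.linalg.norm(W, axis=1, keepdims=True)
filter_images = filters.reshape(d, 28, 28)   # one 28x28 "filter" image per neuron
```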
[Figure: filters learnt by a vanilla AE (no noise), a 25% denoising AE (q = 0.25), and a 50% denoising AE (q = 0.5)]
We saw one form of P(x̃ij | xij) which flips a fraction q of the inputs to zero.

Another way of corrupting the inputs is to add Gaussian noise to the input:

x̃ij = xij + N(0, 1)
[Figure: the data, the filters learnt by an AE, and the filters learnt with weight decay]
Module 7.5: Sparse Autoencoders
A hidden neuron with sigmoid activation will have values between 0 and 1.

We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.

A sparse autoencoder tries to ensure that each neuron is inactive most of the time.
The average value of the activation of a neuron l is given by

ρ̂_l = (1/m) Σ_{i=1}^{m} h(xi)_l

If the neuron l is sparse (i.e. mostly inactive) then ρ̂_l → 0.

A sparse autoencoder uses a sparsity parameter ρ (typically very close to 0, say, 0.005) and tries to enforce the constraint ρ̂_l = ρ.

One way of ensuring this is to add the following term to the objective function:

Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
[Figure: plot of Ω(θ) as a function of ρ̂_l for ρ = 0.2; the term attains its minimum value 0 at ρ̂_l = ρ = 0.2]
Now,

L̂(θ) = L(θ) + Ω(θ)

where L(θ) is the squared error loss or cross-entropy loss and Ω(θ) is the sparsity constraint. We already know how to calculate ∂L(θ)/∂W. Let us see how to calculate ∂Ω(θ)/∂W.

Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]

can be re-written as

Ω(θ) = Σ_{l=1}^{k} [ ρ log ρ − ρ log ρ̂_l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂_l) ]

By the chain rule:

∂Ω(θ)/∂W = (∂Ω(θ)/∂ρ̂) · (∂ρ̂/∂W)

∂Ω(θ)/∂ρ̂ = [ ∂Ω(θ)/∂ρ̂_1, ∂Ω(θ)/∂ρ̂_2, ..., ∂Ω(θ)/∂ρ̂_k ]^T

For each neuron l ∈ 1 ... k in the hidden layer, we have

∂Ω(θ)/∂ρ̂_l = − ρ/ρ̂_l + (1 − ρ)/(1 − ρ̂_l)

and

∂ρ̂_l/∂W = xi (g′(W^T xi + b))^T   (see the derivation below)

Finally,

∂L̂(θ)/∂W = ∂L(θ)/∂W + ∂Ω(θ)/∂W

(and we know how to calculate both terms on the R.H.S.)
Derivation

∂ρ̂/∂W = [ ∂ρ̂_1/∂W, ∂ρ̂_2/∂W, ..., ∂ρ̂_k/∂W ]

For each element in the above equation we can calculate ∂ρ̂_l/∂W (which is the partial derivative of a scalar w.r.t. a matrix, i.e. a matrix). For a single element W_jl of the matrix W:

∂ρ̂_l/∂W_jl = ∂[ (1/m) Σ_{i=1}^{m} g(W_{:,l}^T xi + b_l) ] / ∂W_jl
            = (1/m) Σ_{i=1}^{m} ∂[ g(W_{:,l}^T xi + b_l) ] / ∂W_jl
            = (1/m) Σ_{i=1}^{m} g′(W_{:,l}^T xi + b_l) xij
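A NumPy sketch (my addition) of the sparsity penalty Ω(θ) and its gradient pieces for a sigmoid hidden layer; rho, W and the other names are illustrative, and W here is stored as (n × k) so that its l-th column matches the slide's W_{:,l}.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
m, n, k = 100, 20, 10
X = rng.random((m, n))                       # m training examples
W, b = rng.standard_normal((n, k)) * 0.1, np.zeros(k)
rho = 0.005                                  # target sparsity

H = sigmoid(X @ W + b)                       # hidden activations, shape (m, k)
rho_hat = H.mean(axis=0)                     # ρ̂_l = (1/m) Σ_i h(xi)_l

# Sparsity penalty Ω(θ)
omega = np.sum(rho * np.log(rho / rho_hat)
               + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Chain rule: ∂Ω/∂ρ̂_l  times  ∂ρ̂_l/∂W_jl = (1/m) Σ_i g'(·) x_ij
dOmega_drho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)     # shape (k,)
drho_dW = X.T @ (H * (1 - H)) / m                            # shape (n, k)
dOmega_dW = drho_dW * dOmega_drho                            # broadcast over columns
```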
Module 7.6: Contractive Autoencoders

A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.

It does so by adding the following regularization term to the loss function:

Ω(θ) = ||J_x(h)||_F^2

where J_x(h) is the Jacobian of the encoder outputs with respect to the inputs.
If the input has n dimensions and the hidden layer has k dimensions, then J_x(h) is the k × n matrix with entries

[J_x(h)]_{l,j} = ∂h_l/∂x_j,   l = 1 ... k,  j = 1 ... n

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the l-th neuron with a small variation in the j-th input.

||J_x(h)||_F^2 = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)^2
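A NumPy sketch (my addition) of the contractive penalty for a sigmoid encoder h = σ(Wx + b), for which the Jacobian has the closed form ∂h_l/∂x_j = h_l (1 − h_l) W_{l,j}; this closed form and the variable names are my own working, consistent with the slide's definition.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 10, 30                        # input dim, hidden dim (overcomplete)
x = rng.random(n)
W, b = rng.standard_normal((k, n)) * 0.1, np.zeros(k)

h = sigmoid(W @ x + b)

# For a sigmoid encoder, ∂h_l/∂x_j = h_l (1 - h_l) W_{l,j}
J = (h * (1 - h))[:, None] * W       # k x n Jacobian J_x(h)

# Contractive penalty ||J_x(h)||_F^2 = Σ_j Σ_l (∂h_l/∂x_j)^2
omega = np.sum(J ** 2)
```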
What is the intuition behind this?

Consider ∂h1/∂x1. What does it mean if ∂h1/∂x1 = 0?

It means that this neuron is not very sensitive to variations in the input x1.

But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?
Indeed it does, and that's the idea.

By putting these two contradicting objectives against each other, we ensure that h is sensitive only to the important variations observed in the training data.
Let us try to understand this with the help of an illustration.
51/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
[Figure: 2D data with most of its variance along direction u1 and little variance along direction u2]

Consider the variations in the data along directions u1 and u2.

It makes sense to make a neuron sensitive to variations along u1.

At the same time it makes sense to inhibit a neuron from being sensitive to variations along u2 (as the variation along u2 seems to be small noise, unimportant for reconstruction).

By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.

What does this remind you of?
Module 7.7: Summary
[Figure: the autoencoder ≡ PCA picture from Module 7.2, with min_θ ||X − H W*||_F^2 solved via the SVD X = U Σ V^T and P^T X^T X P = D]

[Figure: the denoising autoencoder from Module 7.4 (xi corrupted to x̃i via P(x̃ij | xij), then encoded and reconstructed), as an example of regularization in autoencoders]