An autoencoder is a type of neural network that encodes an input into a hidden representation and then decodes the hidden representation to reconstruct the input. If an undercomplete autoencoder (where the hidden dimension is less than the input dimension) can perfectly reconstruct the input from the hidden representation, it means the hidden representation captures all the important characteristics of the input. This is analogous to principal component analysis. The document then discusses regularization techniques for autoencoders like denoising, sparse, and contractive autoencoders.
CS7015 (Deep Learning) : Lecture 7

Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module 7.1: Introduction to Autoencoders
[Figure: an autoencoder with input xi at the bottom, hidden layer h connected to it by weights W, and reconstruction x̂i at the top connected to h by weights W∗]

An autoencoder is a special type of feedforward neural network which does the following:

Encodes its input xi into a hidden representation h
h = g(W xi + b)

Decodes the input again from this hidden representation
x̂i = f(W∗ h + c)

The model is trained to minimize a certain loss function which will ensure that x̂i is close to xi (we will see some such loss functions soon).
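To make these two equations concrete, here is a minimal NumPy sketch of a single forward pass. This is only an illustration, not code from the lecture: the dimensions, the choice of the sigmoid for both g and f, and all variable names are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 5, 3                                   # illustrative input and hidden sizes
W = rng.normal(scale=0.1, size=(k, n))        # encoder weights
b = np.zeros(k)                               # encoder bias
W_star = rng.normal(scale=0.1, size=(n, k))   # decoder weights
c = np.zeros(n)                               # decoder bias

def encode(x):
    # h = g(W x + b), with g chosen as the sigmoid
    return sigmoid(W @ x + b)

def decode(h):
    # x_hat = f(W* h + c); here f is also a sigmoid (suitable for binary inputs)
    return sigmoid(W_star @ h + c)

x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
x_hat = decode(encode(x))
```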
Let us consider the case where dim(h) < dim(xi).

If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?

h is a loss-free encoding of xi. It captures all the important characteristics of xi.

Do you see an analogy with PCA?

An autoencoder where dim(h) < dim(xi) is called an undercomplete autoencoder.
Let us consider the case when dim(h) ≥ dim(xi).

In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i.

Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.

An autoencoder where dim(h) ≥ dim(xi) is called an overcomplete autoencoder.
The Road Ahead
Choice of f(xi) and g(xi)
Choice of loss function
Suppose all our inputs are binary (each xij ∈ {0, 1}), e.g. xi = (0, 1, 1, 0, 1).

Which of the following functions would be most apt for the decoder?
x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

Logistic, as it naturally restricts all outputs to be between 0 and 1.

g is typically chosen as the sigmoid function.
Suppose all our inputs are real valued (each xij ∈ R), e.g. xi = (0.25, 0.5, 1.25, 3.5, 4.5).

Which of the following functions would be most apt for the decoder?
x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

What will logistic and tanh do? They will restrict the reconstructed x̂i to lie in [0, 1] or [−1, 1], whereas we want x̂i ∈ Rn. So the linear decoder x̂i = W∗ h + c is the most apt choice here.

Again, g is typically chosen as the sigmoid function.
Consider the case when the inputs are real valued.

The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible. This can be formalized using the following objective function:

min_{W, W∗, c, b}  (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)²

i.e.,  min_{W, W∗, c, b}  (1/m) Σ_{i=1..m} (x̂i − xi)^T (x̂i − xi)

We can then train the autoencoder just like a regular feedforward network using backpropagation.

All we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W, which we will see now.
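As a quick illustration, the reconstruction objective can be written in a couple of lines of NumPy. This is a sketch under the assumption that the inputs are stacked into a batch matrix X of shape m × n and the reconstructions into X_hat of the same shape; the names are placeholders, not from the lecture.

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    """Squared reconstruction error: (1/m) * sum_i sum_j (x_hat_ij - x_ij)^2."""
    m = X.shape[0]
    return np.sum((X_hat - X) ** 2) / m
```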
L(θ) = (x̂i − xi)^T (x̂i − xi)

[Figure: the autoencoder drawn as a two-layer network: h0 = xi, pre-activation a1, hidden layer h1 (weights W), pre-activation a2, output h2 = x̂i (weights W∗)]

Note that the loss function is shown for only one training example.

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗
∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate the expressions in the boxes when we learnt backpropagation.

∂L(θ)/∂h2 = ∂L(θ)/∂x̂i = ∇x̂i {(x̂i − xi)^T (x̂i − xi)} = 2(x̂i − xi)
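For concreteness, here is a small NumPy sketch of these gradients for one training example, assuming a sigmoid encoder and a linear decoder (so that ∂h2/∂a2 is the identity). The function and variable names, and the choice of a linear decoder, are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def squared_loss_grads(x, W, b, W_star, c):
    # Forward pass: h1 = sigmoid(W x + b), x_hat = W* h1 + c (linear decoder)
    h1 = sigmoid(W @ x + b)
    x_hat = W_star @ h1 + c

    # dL/dx_hat = 2 (x_hat - x) for L = (x_hat - x)^T (x_hat - x)
    dL_da2 = 2.0 * (x_hat - x)

    # Decoder gradients: a2 = W* h1 + c
    dW_star = np.outer(dL_da2, h1)
    dc = dL_da2

    # Backpropagate into the hidden layer and through the sigmoid
    dL_dh1 = W_star.T @ dL_da2
    dL_da1 = dL_dh1 * h1 * (1.0 - h1)
    dW = np.outer(dL_da1, x)
    db = dL_da1
    return dW, db, dW_star, dc
```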
Consider the case when the inputs are binary.

We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities.

For a single n-dimensional i-th input we can use the following (cross-entropy) loss function:

min {− Σ_{j=1..n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))}

What value of x̂ij will minimize this function? If xij = 1? If xij = 0? Indeed, the above function will be minimized when x̂ij = xij!

Again, all we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W to use backpropagation.
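A minimal sketch of this cross-entropy loss for one binary input vector and its reconstruction. The small epsilon used to keep the logarithms finite, and the names, are assumptions of this illustration, not part of the lecture.

```python
import numpy as np

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """- sum_j [ x_j log(x_hat_j) + (1 - x_j) log(1 - x_hat_j) ]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
```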
L(θ) = − Σ_{j=1..n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗
∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate the expressions in the square boxes when we learnt backpropagation.

The first two terms on the RHS can be computed as:

∂L(θ)/∂h2j = − xij/x̂ij + (1 − xij)/(1 − x̂ij)
∂h2j/∂a2j = σ(a2j)(1 − σ(a2j))

where ∂L(θ)/∂h2 = [∂L(θ)/∂h21, ∂L(θ)/∂h22, . . . , ∂L(θ)/∂h2n]^T
Module 7.2: Link between PCA and Autoencoders
[Figure: left, the autoencoder xi → h → x̂i; right, PCA projecting the data x onto principal directions u1, u2, with P^T X^T X P = D]

We will now see that the encoder part of an autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use squared error loss function
normalize the inputs to

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)
First let us consider the implication of normalizing the inputs to

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)

The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean).

Let X′ be this zero mean data matrix; then what the above normalization gives us is X = (1/√m) X′.

Now X^T X = (1/m) (X′)^T X′ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
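This claim is easy to check numerically. The following sketch uses random data purely for illustration; the variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 4
X_raw = rng.normal(size=(m, n))

# Normalize: subtract the per-dimension mean and scale by 1/sqrt(m)
X = (X_raw - X_raw.mean(axis=0)) / np.sqrt(m)

# X^T X now equals the (biased) covariance matrix of the original data
cov = np.cov(X_raw, rowvar=False, bias=True)
print(np.allclose(X.T @ X, cov))   # True
```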
First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function

(1/m) Σ_{i=1..m} Σ_{j=1..n} (xij − x̂ij)²

is obtained when we use a linear encoder.
min_θ Σ_{i=1..m} Σ_{j=1..n} (xij − x̂ij)²    (1)

This is equivalent to

min_{W∗, H} (‖X − HW∗‖_F)²,   where ‖A‖_F = √(Σ_{i=1..m} Σ_{j=1..n} a²ij)

(just writing expression (1) in matrix form and using the definition of ‖A‖_F; we are ignoring the biases)

From SVD we know that the optimal solution to the above problem is given by

HW∗ = U_{·,≤k} Σ_{k,k} V^T_{·,≤k}

By matching variables one possible solution is

H = U_{·,≤k} Σ_{k,k}
W∗ = V^T_{·,≤k}
We will now show that H is a linear encoding and find an expression for the encoder weights W.

H = U_{·,≤k} Σ_{k,k}
  = (XX^T)(XX^T)^{-1} U_{·,≤k} Σ_{k,k}                  (pre-multiplying (XX^T)(XX^T)^{-1} = I)
  = (XVΣ^T U^T)(UΣV^T VΣ^T U^T)^{-1} U_{·,≤k} Σ_{k,k}   (using X = UΣV^T)
  = XVΣ^T U^T (UΣΣ^T U^T)^{-1} U_{·,≤k} Σ_{k,k}         (V^T V = I)
  = XVΣ^T U^T U(ΣΣ^T)^{-1} U^T U_{·,≤k} Σ_{k,k}         ((ABC)^{-1} = C^{-1} B^{-1} A^{-1})
  = XVΣ^T (ΣΣ^T)^{-1} U^T U_{·,≤k} Σ_{k,k}              (U^T U = I)
  = XVΣ^T (Σ^T)^{-1} Σ^{-1} U^T U_{·,≤k} Σ_{k,k}        ((AB)^{-1} = B^{-1} A^{-1})
  = XVΣ^{-1} I_{·,≤k} Σ_{k,k}                           (U^T U_{·,≤k} = I_{·,≤k})
  = XV I_{·,≤k}                                         (Σ^{-1} I_{·,≤k} = I_{·,≤k} Σ^{-1}_{k,k})
H = XV_{·,≤k}

Thus H is a linear transformation of X and W = V_{·,≤k}.
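The conclusion H = XV_{·,≤k} is easy to verify numerically with an SVD. This is a sketch: the random data and the choice of k are arbitrary and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 6, 2
X = rng.normal(size=(m, n))
X = (X - X.mean(axis=0)) / np.sqrt(m)      # the normalization from the previous slides

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# H = U_{.,<=k} Sigma_{k,k} and H = X V_{.,<=k} give the same encoding
H_svd = U[:, :k] * S[:k]
H_lin = X @ Vt[:k].T
print(np.allclose(H_svd, H_lin))   # True
```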
We have encoder W = V_{·,≤k}.

From SVD, we know that V is the matrix of eigenvectors of X^T X.

From PCA, we know that P is the matrix of the eigenvectors of the covariance matrix.

We saw earlier that, if the entries of X are normalized by

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)

then X^T X is indeed the covariance matrix.

Thus, the encoder matrix for the linear autoencoder (W) and the projection matrix (P) for PCA could indeed be the same. Hence proved.
Remember
The encoder of a linear autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to

x̂ij = (1/√m) (xij − (1/m) Σ_{k=1..m} xkj)
Module 7.3: Regularization in autoencoders (Motivation)
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders.

Here, (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i.

To avoid poor generalization, we need to introduce regularization.
The simplest solution is to add an L2-regularization term to the objective function:

min_{θ, w, w∗, b, c}  (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)² + λ‖θ‖²

This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters).
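As a small illustration of how little this changes the training code, the regularized gradient is just the usual gradient plus λW. This is a sketch: dW, W and lam are placeholder names, and the factor of 2 from differentiating ‖W‖² is assumed to be folded into λ, as is common.

```python
def l2_regularized_grad(dW, W, lam):
    # dW: gradient of the reconstruction loss w.r.t. W (from backpropagation)
    # lam: regularization strength lambda
    return dW + lam * W
```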
Another trick is to tie the weights of the encoder and decoder, i.e., W∗ = W^T.

This effectively reduces the capacity of the autoencoder and acts as a regularizer.
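A minimal sketch of a tied-weight forward pass: only one weight matrix W is stored and the decoder reuses its transpose. The sigmoid encoder, linear decoder and all names are assumptions of this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tied_autoencoder_forward(x, W, b, c):
    h = sigmoid(W @ x + b)      # encoder: h = g(W x + b)
    x_hat = W.T @ h + c         # decoder reuses the same matrix: x_hat = W^T h + c
    return h, x_hat
```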
Module 7.4: Denoising Autoencoders
A denoising autoencoder simply corrupts the input data using a probabilistic process P(x̃ij | xij) before feeding it to the network.

A simple P(x̃ij | xij) used in practice is the following:

P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q

In other words, with probability q the input is flipped to 0 and with probability (1 − q) it is retained as it is.
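A sketch of this corruption process in NumPy; the masking probability q and the names are illustrative.

```python
import numpy as np

def corrupt(x, q, rng=np.random.default_rng()):
    """With probability q set each component to 0, otherwise keep it unchanged."""
    mask = rng.random(x.shape) >= q   # True where the value is retained
    return x * mask
```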
How does this help?

This helps because the objective is still to reconstruct the original (uncorrupted) xi:

arg min_θ  (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)²

It no longer makes sense for the model to copy the corrupted x̃i into h(x̃i) and then into x̂i (the objective function will not be minimized by doing so).

Instead the model will now have to capture the characteristics of the data correctly.

For example, it will have to learn to reconstruct a corrupted xij correctly by relying on its interactions with other elements of xi.
We will now see a practical application in which AEs are used and then compare denoising autoencoders with regular autoencoders.
Task: Hand-written digit recognition on MNIST. |xi| = 784 = 28 × 28.

[Figure: MNIST data and the basic approach, where the raw 28 × 28 pixels are used directly as input features to a classifier over the digits 0, 1, 2, 3, . . . , 9]

[Figure: the AE approach: first learn the important characteristics of the data as a hidden representation h ∈ Rd, and then train a classifier for the digits on top of this hidden representation]
We will now see a way of visualizing AEs and use this visualization to compare different AEs.
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration xi.

For example,

h1 = σ(W1^T xi)   [ignoring the bias b]

where W1 is the trained vector of weights connecting the input to the first hidden neuron.

What values of xi will cause h1 to be maximum (or maximally activated)?

Suppose we assume that our inputs are normalized so that ‖xi‖ = 1. Then we want

max_{xi} {W1^T xi}   s.t. ‖xi‖² = xi^T xi = 1

Solution: xi = W1 / √(W1^T W1)
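A small sketch of computing and checking this maximally-activating input for one hidden neuron; the random weight vector is used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=784)              # weights into the first hidden neuron

# Unit-norm input that maximally activates this neuron
x_star = W1 / np.sqrt(W1 @ W1)

# Any other unit-norm input gives a smaller (or equal) pre-activation W1^T x
x_other = rng.normal(size=784)
x_other /= np.linalg.norm(x_other)
print(W1 @ x_star >= W1 @ x_other)     # True
```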
Thus the inputs

xi = W1/√(W1^T W1), W2/√(W2^T W2), . . . , Wn/√(Wn^T Wn)

will respectively cause hidden neurons 1 to n to maximally fire.

Let us plot these images (xi's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and different denoising autoencoders.

These xi's are computed by the above formula using the weights (W1, W2, . . . , Wk) learned by the respective autoencoders.
[Figure: filters of a vanilla AE (no noise), a 25% denoising AE (q = 0.25) and a 50% denoising AE (q = 0.5)]

The vanilla AE does not learn many meaningful patterns.

The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or an '8' or a '9').

As the noise increases the filters become wider because the neuron has to rely on more adjacent pixels to feel confident about a stroke.
We saw one form of P(x̃ij | xij) which flips a fraction q of the inputs to zero.

Another way of corrupting the inputs is to add Gaussian noise to the input:

x̃ij = xij + N(0, 1)

We will now use such a denoising AE on a different dataset and see its performance.
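A sketch of this additive Gaussian corruption (unit-variance noise as on the slide; the names are illustrative).

```python
import numpy as np

def corrupt_gaussian(x, rng=np.random.default_rng()):
    """x_tilde_ij = x_ij + N(0, 1): add independent standard Gaussian noise to each component."""
    return x + rng.standard_normal(x.shape)
```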
[Figure: the data, the filters learned by the denoising AE, and the filters learned with weight decay]

The hidden neurons essentially behave like edge detectors.

PCA does not give such edge detectors.
Module 7.5: Sparse Autoencoders
A hidden neuron with sigmoid activation will have values between 0 and 1.

We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.

A sparse autoencoder tries to ensure that the neuron is inactive most of the time.
The average value of the activation of a neuron l is given by

ρ̂l = (1/m) Σ_{i=1..m} h(xi)l

If the neuron l is sparse (i.e., mostly inactive) then ρ̂l → 0.

A sparse autoencoder uses a sparsity parameter ρ (typically very close to 0, say 0.005) and tries to enforce the constraint ρ̂l = ρ.

One way of ensuring this is to add the following term to the objective function:

Ω(θ) = Σ_{l=1..k} [ ρ log(ρ/ρ̂l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂l)) ]

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
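A minimal sketch of computing ρ̂ and this penalty for a batch of hidden activations H of shape m × k. The clipping epsilon and the names are assumptions of this illustration.

```python
import numpy as np

def sparsity_penalty(H, rho=0.005, eps=1e-12):
    """Omega = sum_l [ rho*log(rho/rho_hat_l) + (1-rho)*log((1-rho)/(1-rho_hat_l)) ]."""
    rho_hat = H.mean(axis=0)                    # average activation of each hidden neuron
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)  # keep the logarithms finite
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
```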
[Figure: plot of Ω(θ) as a function of ρ̂l for ρ = 0.2, with its minimum at ρ̂l = 0.2]

The function will reach its minimum value(s) when ρ̂l = ρ.
Now,

    L̂(θ) = L(θ) + Ω(θ)

L(θ) is the squared error loss or cross entropy loss and Ω(θ) is the sparsity constraint.
We already know how to calculate ∂L(θ)/∂W. Let us see how to calculate ∂Ω(θ)/∂W.

    Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]

This can be re-written as

    Ω(θ) = Σ_{l=1}^{k} [ ρ log ρ − ρ log ρ̂_l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂_l) ]

By the chain rule:

    ∂Ω(θ)/∂W = ∂Ω(θ)/∂ρ̂ · ∂ρ̂/∂W

    ∂Ω(θ)/∂ρ̂ = [ ∂Ω(θ)/∂ρ̂_1 , ∂Ω(θ)/∂ρ̂_2 , ... , ∂Ω(θ)/∂ρ̂_k ]^T

For each neuron l ∈ 1 ... k in the hidden layer, we have

    ∂Ω(θ)/∂ρ̂_l = −ρ/ρ̂_l + (1 − ρ)/(1 − ρ̂_l)

and

    ∂ρ̂_l/∂W = x_i (g′(W^T x_i + b))^T   (see the derivation below)

Finally,

    ∂L̂(θ)/∂W = ∂L(θ)/∂W + ∂Ω(θ)/∂W

(and we know how to calculate both terms on the R.H.S.)
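To make the chain rule concrete, here is a minimal NumPy sketch. It assumes a sigmoid encoder h = g(W^T x + b) with W of shape n × k, and averages ρ̂ over a batch of m examples as in its definition; all names and shapes are illustrative, not the lecture's reference implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, k = 100, 8, 32                         # examples, input dim, hidden dim (hypothetical)
X = rng.normal(size=(m, n))
W = rng.normal(scale=0.1, size=(n, k))
b = np.zeros(k)
rho = 0.005

H = sigmoid(X @ W + b)                       # hidden activations, shape (m, k)
rho_hat = H.mean(axis=0)                     # average activation rho_hat_l, shape (k,)

# dOmega/drho_hat_l = -rho/rho_hat_l + (1 - rho)/(1 - rho_hat_l)
dOmega_drho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)

# drho_hat_l/dW_jl = (1/m) sum_i g'(W_{:,l}^T x_i + b_l) x_ij ; for the sigmoid, g' = h(1 - h)
drho_dW = X.T @ (H * (1 - H)) / m            # shape (n, k)

# Chain rule: rho_hat_l depends only on column l of W, so column l is scaled by dOmega/drho_hat_l
dOmega_dW = drho_dW * dOmega_drho[None, :]   # shape (n, k), same shape as W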
Derivation

    ∂ρ̂/∂W = [ ∂ρ̂_1/∂W   ∂ρ̂_2/∂W   ...   ∂ρ̂_k/∂W ]

For each element in the above equation we can calculate ∂ρ̂_l/∂W (the partial derivative of a scalar w.r.t. a matrix, which is itself a matrix). For a single element W_jl of the matrix:

    ∂ρ̂_l/∂W_jl = ∂[ (1/m) Σ_{i=1}^{m} g(W_{:,l}^T x_i + b_l) ] / ∂W_jl
                = (1/m) Σ_{i=1}^{m} ∂[ g(W_{:,l}^T x_i + b_l) ] / ∂W_jl
                = (1/m) Σ_{i=1}^{m} g′(W_{:,l}^T x_i + b_l) x_ij

So in matrix notation we can write it as

    ∂ρ̂_l/∂W = x_i (g′(W^T x_i + b))^T

Here the matrix on the right collects, in its (j, l) entry, the single-example contribution g′(W_{:,l}^T x_i + b_l) x_ij to ∂ρ̂_l/∂W_jl; averaging these contributions over the m training examples gives the derivative derived above.
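The formula can be checked numerically against a finite difference, under the same assumptions and illustrative shapes as the earlier sketch (sigmoid g, W of shape n × k).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, n, k = 50, 6, 10
X = rng.normal(size=(m, n))
W = rng.normal(scale=0.1, size=(n, k))
b = np.zeros(k)

def rho_hat(W):
    # rho_hat_l = (1/m) sum_i g(W_{:,l}^T x_i + b_l)
    return sigmoid(X @ W + b).mean(axis=0)

H = sigmoid(X @ W + b)
analytic = X.T @ (H * (1 - H)) / m           # (j, l) entry = d rho_hat_l / d W_jl

j, l, eps = 2, 3, 1e-6
W_pert = W.copy()
W_pert[j, l] += eps
numeric = (rho_hat(W_pert)[l] - rho_hat(W)[l]) / eps
print(abs(numeric - analytic[j, l]))         # should be very close to zero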
Module 7.6: Contractive Autoencoders

A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.

It does so by adding the following regularization term to the loss function

    Ω(θ) = ‖J_x(h)‖²_F

where J_x(h) is the Jacobian of the encoder.

Let us see what it looks like.
If the input has n dimensions and the hidden layer has k dimensions then

    J_x(h) = [ ∂h_1/∂x_1  ...  ∂h_1/∂x_n
               ∂h_2/∂x_1  ...  ∂h_2/∂x_n
                  ...              ...
               ∂h_k/∂x_1  ...  ∂h_k/∂x_n ]

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the l-th neuron with a small variation in the j-th input.

    ‖J_x(h)‖²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)²
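If the encoder is a sigmoid layer h = g(W^T x + b), as in the earlier sparse-autoencoder sketches (an assumption; the lecture keeps g generic, and the shapes below are illustrative), each Jacobian entry has the closed form ∂h_l/∂x_j = h_l (1 − h_l) W_jl, so the penalty can be computed either from the explicit Jacobian or directly.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n, k = 8, 32                          # input and hidden dimensions (hypothetical)
x = rng.normal(size=n)
W = rng.normal(scale=0.1, size=(n, k))
b = np.zeros(k)

h = sigmoid(W.T @ x + b)              # hidden representation, shape (k,)

# Explicit Jacobian: J[l, j] = dh_l/dx_j = h_l (1 - h_l) W_jl
J = (h * (1 - h))[:, None] * W.T      # shape (k, n)
penalty = np.sum(J ** 2)              # ||J_x(h)||_F^2

# Equivalent closed form: sum_l h_l^2 (1 - h_l)^2 ||W_{:,l}||^2
penalty_closed = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))
print(np.isclose(penalty, penalty_closed))   # True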
What is the intuition behind this?

    ‖J_x(h)‖²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)²

Consider ∂h_1/∂x_1. What does it mean if ∂h_1/∂x_1 = 0?

It means that this neuron is not very sensitive to variations in the input x_1.

But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?
Indeed it does and that’s the idea n X
k 
X ∂hl 2
kJx (h)k2F =
∂xj
j=1 l=1

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂
L(θ) - capture important variations
in data
h

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂
L(θ) - capture important variations
in data
h
Ω(θ) - do not capture variations in
data
x

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does and that’s the idea n X
k 
X ∂hl 2
By putting these two contradicting kJx (h)k2F =
∂xj
objectives against each other we en- j=1 l=1

sure that h is sensitive to only very


important variations as observed in
the training data. x̂
L(θ) - capture important variations
in data
h
Ω(θ) - do not capture variations in
data
Tradeoff - capture only very import- x
ant variations in the data

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
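Concretely, the objective being minimized is L(θ) + Ω(θ). Here is a minimal sketch of this combined loss for a single example, assuming a sigmoid encoder, a linear decoder, squared-error reconstruction, and an optional weighting coefficient lam (the slides simply add Ω(θ), i.e. lam = 1; all names are illustrative).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_loss(x, W, b, W_dec, c, lam=1.0):
    h = sigmoid(W.T @ x + b)                                       # sigmoid encoder (assumption)
    x_hat = W_dec @ h + c                                          # linear decoder (assumption)
    recon = np.sum((x_hat - x) ** 2)                               # L(theta)
    # Omega(theta) = ||J_x(h)||_F^2 with dh_l/dx_j = h_l (1 - h_l) W_jl
    penalty = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))  # Omega(theta)
    return recon + lam * penalty

# Illustrative usage with random parameters
rng = np.random.default_rng(3)
n, k = 8, 32
x = rng.normal(size=n)
W = rng.normal(scale=0.1, size=(n, k)); b = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(n, k)); c = np.zeros(n)
print(contractive_loss(x, W, b, W_dec, c))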
Let us try to understand this with the help of an illustration.

[Figure: data in the (x, y) plane, with a large variation along direction u1 and only a small variation along u2.]

Consider the variations in the data along directions u1 and u2.

It makes sense to encourage a neuron to be sensitive to variations along u1.

At the same time it makes sense to inhibit a neuron from being sensitive to variations along u2 (as the variation along u2 appears to be small noise, unimportant for reconstruction).

By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.

What does this remind you of?
Module 7.7: Summary
[Summary figure: the linear autoencoder x → h → x̂ shown alongside the PCA picture of the data with directions u1 and u2; the two are equivalent.]

PCA:

    P^T X^T X P = D

    min_θ ‖X − H W*‖²_F ,  where the optimal H W* is given by the SVD, U Σ V^T
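As a small numerical illustration of this recap (a sketch with synthetic data, not from the lecture): by the Eckart-Young theorem the truncated SVD gives the best rank-k approximation of X in Frobenius norm, which is the best any linear autoencoder with a k-dimensional hidden layer (so H W* has rank at most k) can achieve on the reconstruction objective above.

import numpy as np

rng = np.random.default_rng(4)
m, n, k = 200, 10, 3
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # synthetic, correlated data

# Best rank-k approximation of X in Frobenius norm: truncated SVD
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

# H W* has rank at most k (h is k-dimensional), so no linear autoencoder can do better than this
print(np.linalg.norm(X - X_k, 'fro') ** 2)               # minimum achievable squared error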
[Summary figure: the denoising autoencoder, where each input x_i is corrupted to x̃_i using P(x̃_ij | x_ij) before being encoded to h and reconstructed as x̂_i.]

Regularization:

    Ω(θ) = λ‖θ‖²                                                          (weight decay)

    Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]   (sparse)

    Ω(θ) = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂h_l/∂x_j)²                            (contractive)
