
Lecture 04 (6hrs) Neural Network and Deep Learning

The document is a lecture on neural networks and deep learning. It discusses gradient descent algorithms, backpropagation algorithms for feedforward neural networks, convolutional neural networks, and deep learning. Specifically, it defines gradient and directional derivative, describes gradient descent algorithms and their geometric meaning, and compares gradient descent to Newton's method. It provides examples and illustrations of gradient and gradient descent.

Neural Network and Deep Learning

Xizhao WANG
Big Data Institute
College of Computer Science
Shenzhen University

March 2021

Outline

1. Gradient Descent Algorithm
2. BP Algorithm for Feed-Forward Neural Network Model
3. Convolutional Neural Network
4. Deep Learning

Machine Learning Lecture – Xizhao Wang Lecture 03: Neural Network and Deep Learning
1. Definition of Gradient
2. Gradient Descent Algorithm (GDA)
3. Difference between GDA and Newton's Method
4. An example

Gradient Descent Algorithm


Definition:

Directional derivative (taking a function of three variables as an example):

Suppose function f is defined in a neighborhood of point P0 (x0, y0, z0), l is a ray from point P0, P (x, y, z) is a point on l contained in that neighborhood, and ρ represents the distance between P and P0.

If  lim (f(P) − f(P0)) / ρ = lim Δf / ρ

exists when ρ → 0, we call this limit the directional derivative of f at P0 along the direction of l.

Generally speaking, the directional derivative is the rate of change of a function in a specified direction.



The geometric meaning of directional derivatives:

Suppose z = f (x, y) is a surface equation and M (x, y, z) is a point on the curved surface. An intersecting curve is formed by the curved surface and the vertical plane which goes through M along the direction of l. Let θ be the angle between l and the tangent of the intersecting curve at M.

Then  ∂f/∂l = tan θ.


Computation of the directional derivative:

If function f is differentiable at P0 (x0, y0, z0), then the directional derivative of f at P0 along any direction l exists, and its expression is:

    ∂f/∂l = (∂f/∂x) cos α + (∂f/∂y) cos β + (∂f/∂z) cos γ,

where cos α, cos β, cos γ are the direction cosines of l.
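The formula above can be checked numerically. A small sketch (the function f(x, y, z) = x·y + z² and the point are illustrative choices, not from the lecture):

```python
import math

# Illustrative function: f(x, y, z) = x*y + z^2.
def f(x, y, z):
    return x * y + z * z

# Its analytic partial derivatives: (fx, fy, fz) = (y, x, 2z).
def grad_f(x, y, z):
    return (y, x, 2 * z)

# Directional derivative via the formula
#   df/dl = fx*cos(a) + fy*cos(b) + fz*cos(g).
def directional_derivative(p0, direction):
    fx, fy, fz = grad_f(*p0)
    ca, cb, cg = direction
    return fx * ca + fy * cb + fz * cg

# Finite-difference version of the definition:
#   (f(P0 + rho*l) - f(P0)) / rho  for small rho.
def numeric_directional_derivative(p0, direction, rho=1e-6):
    x0, y0, z0 = p0
    ca, cb, cg = direction
    return (f(x0 + rho * ca, y0 + rho * cb, z0 + rho * cg) - f(x0, y0, z0)) / rho

p0 = (1.0, 2.0, 3.0)
l = (1 / math.sqrt(3),) * 3   # unit vector: all three direction cosines 1/sqrt(3)
print(directional_derivative(p0, l))           # analytic value
print(numeric_directional_derivative(p0, l))   # agrees to several digits
```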



The gradient of a scalar function f (x1, x2, ∙∙∙, xn) is denoted as

    ∇f (X) = [∂f/∂x1, ∂f/∂x2, ∙∙∙, ∂f/∂xn]^T.

In the three-dimensional Cartesian coordinate system with a Euclidean metric, the gradient, if it exists, is given by:

    ∇f = (∂f/∂x) i + (∂f/∂y) j + (∂f/∂z) k,

where i, j, k are the standard unit vectors in the directions of the coordinates, respectively. For example, the gradient of the function f (x, y, z) = 2x + 3y² − sin(z) is ∇f = 2 i + 6y j − cos(z) k.
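The slide's example gradient can be verified with finite differences; the helper numeric_grad below is a hypothetical utility, not part of the lecture:

```python
import math

# The slide's example function: f(x, y, z) = 2x + 3y^2 - sin(z).
def f(x, y, z):
    return 2 * x + 3 * y ** 2 - math.sin(z)

# Its analytic gradient from the slide: (2, 6y, -cos z).
def grad_f(x, y, z):
    return (2.0, 6.0 * y, -math.cos(z))

# Central finite differences, one coordinate at a time.
def numeric_grad(g, p, h=1e-6):
    out = []
    for i in range(len(p)):
        pp, pm = list(p), list(p)
        pp[i] += h
        pm[i] -= h
        out.append((g(*pp) - g(*pm)) / (2 * h))
    return tuple(out)

p = (1.0, 0.5, 0.3)
print(grad_f(*p))         # analytic gradient
print(numeric_grad(f, p)) # numeric gradient, matching to several digits
```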



Geometric Meaning

The gradient specifies the direction that produces the steepest increase in the function. The negative of the gradient therefore gives the direction of steepest decrease.

In the above two images, the values of the function are represented in black and white, black representing higher values, and the corresponding gradient is represented by blue arrows.



Geometric Meaning

The gradient of the function f (x, y) = −(cos²x + cos²y)² is depicted as a projected vector field on the bottom plane.



For the 2-dimensional case:

Gradient: Suppose z = f (x, y) has first-order continuous partial derivatives on a region D. Then for each P (x, y) ∈ D there exists a vector

    (∂f/∂x, ∂f/∂y) = f_x(x, y) i + f_y(x, y) j,

called the gradient of z = f (x, y) at P (x, y) and written grad f (x, y) or ∇f (x, y), i.e.,

    grad f (x, y) = ∇f (x, y) = (∂f/∂x) i + (∂f/∂y) j.

Along the gradient direction, the function changes most quickly.



Suppose e = [cos α, cos β] is a unit vector in the direction of l. Then

    ∂f/∂l = (∂f/∂x) cos α + (∂f/∂y) cos β
          = (∂f/∂x, ∂f/∂y) ∙ (cos α, cos β)
          = grad f (x, y) ∙ e
          = |grad f (x, y)| |e| cos⟨grad f (x, y), e⟩.

When cos⟨grad f (x, y), e⟩ = 1, the directional derivative ∂f/∂l attains its maximum value, which equals the norm of the gradient, i.e.,

    |grad f (x, y)| = sqrt( (∂f/∂x)² + (∂f/∂y)² ).

Thus when the variables change along the gradient direction, the rate of change of the function attains its maximum value, which is the norm of the gradient.
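This fact can be illustrated numerically: scanning unit directions for a sample function (chosen here purely for illustration) shows that the largest directional derivative occurs in the gradient direction and equals the gradient norm:

```python
import math

# Illustrative function: f(x, y) = x^2 + 3y, with grad f = (2x, 3).
def grad(x, y):
    return (2 * x, 3.0)

# Directional derivative along the unit vector (cos(angle), sin(angle)).
def directional_derivative(x, y, angle):
    gx, gy = grad(x, y)
    return gx * math.cos(angle) + gy * math.sin(angle)

x0, y0 = 1.0, 1.0
# Scan 3600 unit directions; the maximum should occur in the gradient
# direction and be (approximately) the norm of the gradient.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best = max(angles, key=lambda a: directional_derivative(x0, y0, a))
gx, gy = grad(x0, y0)
norm = math.hypot(gx, gy)
print(directional_derivative(x0, y0, best), norm)  # nearly equal
```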

When the gradient is generalized to n-dimensional space, it can be represented as:

    ∇f (X) = [∂f/∂x1, ∂f/∂x2, ∙∙∙, ∂f/∂xn]^T.

Along the gradient direction, the function changes most quickly.



[Figure: a gradient descent path from an initial point down to the minimum value]

The gradient descent algorithm may converge to a local optimum; the global optimum is guaranteed when the loss function is convex.


Notes on the Gradient Descent Algorithm parameters

1. The magnitude of the gradient, epsilon (ε), is one of the termination conditions.

2. Another termination condition is the number of iterations (time control).

3. The learning rate, alpha (α), controls the "walking step": too small a value leads to slow convergence (low efficiency), while too large a value results in oscillation (non-convergence). Its appropriate value depends on the specific function to be minimized.
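A minimal sketch of the algorithm with these three parameters (the function, names, and values are illustrative assumptions, not the lecture's):

```python
# Gradient descent with the two termination conditions from the note:
# a gradient-magnitude threshold epsilon and an iteration cap.
def gradient_descent(grad, x0, alpha=0.1, eps=1e-8, max_iters=10000):
    x = x0
    for it in range(max_iters):      # termination condition 2: iteration cap
        g = grad(x)
        if abs(g) < eps:             # termination condition 1: |gradient| < epsilon
            return x, it
        x -= alpha * g               # one "walking step" against the gradient
    return x, max_iters

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min, iters = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min, iters)
```

With alpha too large for this function (e.g. alpha = 1.1), the iterates satisfy x − 3 → −1.2 (x − 3) and oscillate divergently, which is the non-convergence the note warns about.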


1. Definition of gradient
2. Gradient descent algorithm (GDA)
3. Difference between GDA and Newton's method
4. An example



Suppose the objective function f(x) has second-order continuous partial derivatives, and xk is an approximation of its minimum point. The second-order Taylor polynomial approximation of f(x) near xk is:

    f(x) ≈ f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T H(xk)(x − xk).

Its gradient is

    ∇f(x) ≈ ∇f(xk) + H(xk)(x − xk).

The minimum point of the approximate function satisfies ∇f(x) = 0, then

    x = xk − H(xk)^{-1} ∇f(xk),

where H(xk) is the Hessian matrix of f(x) at point xk.

In the minimizing process of f(x), −H(xk)^{-1} ∇f(xk) is considered as the searching direction.

Machine Learning Lecture – Xizhao Wang Lecture 03: Neural Network and Deep Learning
Gradient Descent Algorithm 1. Definition of Gradient
BP Algorithm for Feed-Forward Neural Network Model 2. Gradient Descent Algorithm (GDA)
Convolutional Neural Network 3. Difference between GDA and Newton's Method
Deep Learning 4. An example

Gradient Descent Algorithm


The minimizing process of Newton's method can be represented as the iteration:

    x(k+1) = xk − H(xk)^{-1} ∇f(xk).


In optimization, Newton's method is applied to the derivative f′ of a twice-differentiable function f to find the roots of the derivative (solutions to f′(x) = 0), also known as the stationary points of f.

In the one-dimensional problem, Newton's method attempts to construct a sequence xn from an initial guess x0 that converges towards some value x* satisfying f′(x*) = 0. This x* is a stationary point of f.

The second-order Taylor expansion fT(x) of f around xn is:

    fT(xn + Δx) = f(xn) + f′(xn) Δx + (1/2) f″(xn) Δx².



We want to find Δx such that xn + Δx is a stationary point. We seek to solve the equation that sets the derivative of this last expression with respect to Δx equal to zero:

    0 = d/d(Δx) [ f(xn) + f′(xn) Δx + (1/2) f″(xn) Δx² ] = f′(xn) + f″(xn) Δx.

For the value Δx = −f′(xn) / f″(xn), which is the solution of this equation, it can be hoped that xn+1 = xn + Δx = xn − f′(xn) / f″(xn) will be closer to a stationary point x*. Provided that f is a twice-differentiable function and other technical conditions are satisfied, the sequence x1, x2, ∙∙∙ will converge to a point x* satisfying f′(x*) = 0.

The above iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, ∇f(x), and the reciprocal of the second derivative with the inverse of the Hessian matrix, H_f(x). One obtains the iterative scheme

    x(n+1) = xn − [H_f(xn)]^{-1} ∇f(xn),  n ≥ 0.



Comparison of GDA and Newton's Method

A comparison of gradient descent (green) and Newton's method (red) for minimizing a function (with small step sizes).

Newton's method uses curvature information to take a more direct route.


1. Definition of gradient
2. Gradient descent algorithm (GDA)
3. Difference between GDA and Newton’s method
4. An example



Minimize: f(x) = x².
Step 1: compute the gradient, ∇ = 2x.
Step 2: move x along the negative direction of the gradient, i.e., x ← x − γ∇, where γ is the learning rate.
Step 3: loop Step 2 until the difference of f(x) between two adjacent iterations is small enough, which indicates that f(x) has attained its local minimum value.
Step 4: output x, which is the optimal solution.



Example

Minimize f(x) = x² by using the Gradient Descent Algorithm.

The initial value of x is 2, and the step length is 0.1.

After 49 iterations, the minimum value 1.273147e-09 of the function is obtained, and the corresponding x value is 3.568119e-05.
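This run is reproducible. Assuming the unstated stopping threshold is 1e-9 on the change of f(x) (an assumption; the slide does not give it), a direct implementation yields the reported numbers:

```python
# Reproducing the run: f(x) = x^2, x0 = 2, step length gamma = 0.1.
x = 2.0
gamma = 0.1
iters = 0
prev_f = x * x
while True:
    x -= gamma * 2 * x               # x <- x - gamma * grad, with grad = 2x
    iters += 1
    if abs(prev_f - x * x) < 1e-9:   # f(x) barely changed between iterations: stop
        break
    prev_f = x * x
print(iters, x * x, x)   # 49 iterations, f ~ 1.2731e-09, x ~ 3.5681e-05
```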



Example

Minimize f(x) = x² by using Newton's Method.

The initial value of x is 2.

After 15 iterations, the minimum value 3.7253e-09 of the function is obtained, and the corresponding x value is 6.1035e-05.


Gradient Descent Algorithm

The End.

1. Brief Introduction
2. Feedforward NN
3. BP Algorithm
4. Notes on BP
5. An Application
6. Questions



BP Algorithm for Feed-Forward Neural Network Model



• Rumelhart and McClelland proposed the BP (Back-Propagation) algorithm for feed-forward neural networks.

  [Photos: David Rumelhart, J. McClelland]

• BP algorithm – key idea
  – Use the error of the output layer to estimate the error of its previous layer; generally, use the error of layer n to estimate the error of layer n−1.



An intuitive understanding of a feed-forward neural network

A feed-forward NN is a smooth function which can be used to approximate an input-output system (a black box).

What is the specific form of the function in the box?


• A Perceptron



• A Perceptron can be used to represent many Boolean functions, such as AND, OR, NAND, and NOR.

A Perceptron cannot be used to represent Boolean functions that are not linearly separable, such as XOR.
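A sketch with hand-chosen, hypothetical weights (w1, w2) and bias b, plus a brute-force search showing that no such weights exist for XOR:

```python
import itertools

# A single perceptron: outputs 1 when w1*x1 + w2*x2 + b > 0, else 0.
def perceptron(w1, w2, b):
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + b > 0 else 0

AND = perceptron(1, 1, -1.5)   # fires only when both inputs are 1
OR = perceptron(1, 1, -0.5)    # fires when at least one input is 1

inputs = list(itertools.product([0, 1], repeat=2))   # (0,0), (0,1), (1,0), (1,1)
print([AND(*p) for p in inputs])   # [0, 0, 0, 1]
print([OR(*p) for p in inputs])    # [0, 1, 1, 1]

# Brute-force search over a small weight grid: no perceptron computes XOR,
# because XOR is not linearly separable.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = [v / 2 for v in range(-8, 9)]
found = any(
    all(perceptron(w1, w2, b)(*p) == t for p, t in xor.items())
    for w1 in grid for w2 in grid for b in grid
)
print(found)   # False
```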



• Sigmoid threshold unit

The Sigmoid unit computes its output as o = σ(w ∙ x), where σ(y) = 1 / (1 + e^(−y)).

It is easy to check that dσ(y)/dy = σ(y)(1 − σ(y)).
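The derivative identity can be checked directly (the sample points are arbitrary):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_prime(y):
    # The identity from the slide: d(sigma)/dy = sigma(y) * (1 - sigma(y))
    s = sigmoid(y)
    return s * (1.0 - s)

# Numeric check of the identity at a few points via central differences.
h = 1e-6
for y in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(y + h) - sigmoid(y - h)) / (2 * h)
    print(y, sigmoid_prime(y), numeric)
```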



• Sigmoid function picture

The closed-form least squares solution is x = (A^T A)^{-1} A^T b.

Iteration methods approach the optimal solution gradually through an updating step repeated each iteration.

Gradient descent, which belongs to the iteration methods, is applicable to least squares problems.

The Gauss-Newton method is a commonly used iterative approach to solving nonlinear least squares problems.

Levenberg-Marquardt is another iterative method for solving nonlinear least squares problems.
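A sketch comparing the closed-form solution with an iterative gradient descent solution on a small random system (NumPy assumed available; the sizes, seed, and step length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))   # hypothetical overdetermined system
b = rng.normal(size=20)

# Closed form via the normal equations: x = (A^T A)^{-1} A^T b.
x_closed = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient descent on f(x) = ||Ax - b||^2, whose gradient is 2 A^T (Ax - b).
x = np.zeros(3)
alpha = 0.01
for _ in range(5000):
    x -= alpha * 2 * A.T @ (A @ x - b)

print(x_closed)
print(x)   # the iterative solution agrees closely with the closed form
```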


The error is a function of the weights, and its minimum exists. The BP algorithm uses the gradient descent technique to find this minimum by gradually updating the weights.

It is easy to know that each weight should be updated along the negative gradient of the error with respect to that weight.

The remaining task is to derive a convenient expression for this gradient.



In summary:
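A sketch of what such a summary typically contains, in the standard textbook formulation of BP for a sigmoid network with learning rate η (the notation here is an assumption, not taken from this lecture):

```latex
% delta for each output unit k, with target t_k and output o_k:
\delta_k = o_k (1 - o_k)(t_k - o_k)
% delta for each hidden unit h, back-propagated from the output deltas:
\delta_h = o_h (1 - o_h) \sum_k w_{kh} \delta_k
% update for the weight from node i to node j, with input x_{ji}:
w_{ji} \leftarrow w_{ji} + \eta \, \delta_j \, x_{ji}
```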


1. Brief introduction
2. Feedforward NN
3. BP algorithm
4. Notes on BP
5. An application
6. Questions



• Learning process:
  – Stimulated by input samples, the connection weights update gradually, such that the network outputs approach the expected outputs step by step.

• Learning essence:
  – Dynamically updating the connection weights.

• Learning rule:
  – The rule by which the connection weights are updated (i.e., what rule is followed).



• Learning type: supervised
• Key idea:
  – The output error (in a suitable form) is back-propagated to the input layer via the hidden layer(s): the error is assigned to all units (nodes) in the layers, and the weight for each node is updated accordingly.
• Features:
  – Signal is forward-propagated
  – Error is back-propagated



• Forward propagation:
  – Input sample → input layer → every hidden layer → output layer
• Judge whether to go to back-propagation:
  – If the difference between the actual and expected outputs (in the output layer) is bigger than a threshold
• Back-propagation:
  – Represent the errors of each layer and update the weight for each node
• Stop if the output error is under a predefined threshold or the number of iterations attains the predefined maximum.
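The procedure above can be sketched end-to-end on a tiny example. This is a from-scratch illustration, not the lecture's code: a 2-2-1 sigmoid network trained on XOR with per-sample updates; the architecture, seed, and learning rate are arbitrary choices, and BP can stall on plateaus for XOR, so results may vary:

```python
import math
import random

random.seed(1)
rand = lambda: random.uniform(-1, 1)
W1 = [[rand() for _ in range(2)] for _ in range(2)]   # hidden-layer weights
b1 = [rand(), rand()]
W2 = [rand(), rand()]                                  # output-layer weights
b2 = rand()
sig = lambda y: 1 / (1 + math.exp(-y))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
eta = 0.5
for _ in range(20000):
    for (x1, x2), t in data:
        # forward propagation: input -> hidden layer -> output layer
        h = [sig(W1[j][0] * x1 + W1[j][1] * x2 + b1[j]) for j in range(2)]
        o = sig(W2[0] * h[0] + W2[1] * h[1] + b2)
        # back-propagation: output delta first, then hidden deltas
        do = o * (1 - o) * (t - o)
        dh = [h[j] * (1 - h[j]) * W2[j] * do for j in range(2)]
        # weight updates for each node
        W2 = [W2[j] + eta * do * h[j] for j in range(2)]
        b2 += eta * do
        for j in range(2):
            W1[j][0] += eta * dh[j] * x1
            W1[j][1] += eta * dh[j] * x2
            b1[j] += eta * dh[j]

def predict(x1, x2):
    h = [sig(W1[j][0] * x1 + W1[j][1] * x2 + b1[j]) for j in range(2)]
    return sig(W2[0] * h[0] + W2[1] * h[1] + b2)

print([round(predict(x1, x2)) for (x1, x2), _ in data])
```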


Related concepts of gradient descent

1. Learning rate: in the process of gradient descent, the function decreases along the negative direction of the gradient. The learning rate determines the descent magnitude for each iteration step.

2. Feature: the inputs of the algorithm, which are used to describe the samples.

3. Hypothesis function: in supervised learning, it aims to fit the learning samples.

4. Loss function: it measures the effectiveness of the hypothesis function; generally it is computed as the square of the difference between the outputs and the predicted fitting values.
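As a toy illustration of these concepts (the loss L(w) = (w − 3)² and the learning rate here are made up for this sketch): gradient descent repeatedly steps along the negative gradient, scaled by the learning rate.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Move w against the gradient; the learning rate lr sets the step size."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# loss L(w) = (w - 3)^2 has gradient dL/dw = 2 * (w - 3)
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
# w_min is very close to the minimizer w = 3
```

With a too-large learning rate (e.g. lr = 1.1 for this loss) the iteration overshoots and diverges, which is why the choice of learning rate matters.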

Standard Gradient Descent: as described in the Gradient Descent Algorithm, the calculation of the gradient is based on all the training samples ⟨x, t⟩.

Stochastic Gradient Descent: whereas the gradient descent training rule presented in the Gradient Descent Algorithm computes the weight updates (to each wi) after summing over all the training examples, the idea behind stochastic gradient descent is to approximate this gradient descent search by updating the weights incrementally, following the calculation of the error for each individual example ⟨x, t⟩.

Batch Gradient Descent: the gradient is based on a batch of the training samples.


Remarks

The key differences between standard gradient descent and stochastic gradient descent are:

• In standard gradient descent, the error is summed over all examples ⟨x, t⟩ before updating the weights wi, whereas in stochastic gradient descent the weights are updated upon examining each training example.

• Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.

• In cases where there are multiple local minima with respect to the error E(w), stochastic gradient descent can sometimes avoid falling into these local minima because it uses the various ∇Ed(w) rather than ∇E(w) to guide its search.
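For a single linear unit trained with the delta rule, the two update schemes can be sketched as follows (an illustrative sketch; the data set and learning rate are made up):

```python
def sgd_epoch(w, data, lr):
    """Stochastic gradient descent: update w after each example (x, t)."""
    for x, t in data:
        w = w + lr * (t - w * x) * x
    return w

def batch_epoch(w, data, lr):
    """Standard (batch) gradient descent: sum the gradient contribution
    over all examples before making a single weight update."""
    w = w + lr * sum((t - w * x) * x for x, t in data)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # t = 2 * x, so the best w is 2
w_sgd = w_batch = 0.0
for _ in range(40):
    w_sgd = sgd_epoch(w_sgd, data, lr=0.05)
    w_batch = batch_epoch(w_batch, data, lr=0.05)
# both w_sgd and w_batch approach 2
```

On this noiseless data both schemes converge to the same weight; they differ in how many updates are made per pass over the data.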

1. Brief introduction
2. Feedforward NN
3. BP algorithm
4. Notes on BP
5. An application
6. Questions

• A 3-layer feed-forward neural network: Neural
network learning to steer an autonomous vehicle

Questions:

1. If the features are not numerical but symbolic — that is, the input-output system has symbolic inputs and a real-valued output — how do you think BP can be used to train the approximator?
2. In comparison with the real-valued case, how is its performance?
3. In your own opinion, how should the step size in the Gradient Descent Algorithm be selected empirically?

Feedforward NN and
BP Algorithm

The End.

Convolutional Neural Network

1. Convolution definition
2. Convolution layer
3. Pooling layer
4. Fully connected layer
5. Example

A. Convolution and deconvolution are mathematical methods of integral transformation and have been widely applied in many fields.
B. Convolution has long produced good results for problems in well-test interpretation; as for deconvolution, it was only recently that Schroeter, Hollaender, Gringarten and others solved the stability problem of its computation, which quickly drew broad attention to the deconvolution method in the well-testing community.
C. Some experts regard the application of deconvolution as another major leap in the history of well-test interpretation methods. They predict that, with the adoption of new testing tools and techniques and a closer integration with research results from other disciplines, the role and importance of well testing in reservoir characterization will keep growing.

Convolution is closely related to the Fourier transform. Using the property that the product of the Fourier transforms of two functions equals the Fourier transform of their convolution, the treatment of many problems in Fourier analysis can be simplified.

The function f*g obtained by convolution is generally smoother than both f and g. In particular, when g is a smooth function with compact support and f is locally integrable, their convolution f*g is also a smooth function. Using this property, for any integrable function f one can easily construct a sequence of smooth functions fs approximating f; this method is called smoothing or regularization of a function.

The notion of convolution can also be extended to sequences, measures, and generalized functions.


Convolution is the result of multiplying two variables over some range and summing the products. If the convolution variables are sequences x(n) and h(n), then the result of the convolution is:

    y(n) = x(n) * h(n) = Σ_i x(i) h(n − i)

where the asterisk * denotes convolution. At time index n = 0, the sequence h(−i) is h(i) with its time index i reversed; this reversal flips h(i) by 180 degrees about the vertical axis, which is why this multiply-then-sum computation is called the convolution sum, or simply convolution. In addition, n is the amount by which h(−i) is shifted, and different n give different convolution results.

If the convolution variables are functions x(t) and h(t), the computation of the convolution becomes:

    y(t) = x(t) * h(t) = ∫ x(p) h(t − p) dp

where p is the integration variable (integration is also a form of summation), t is the amount by which the function h(−p) is shifted, and the asterisk * denotes convolution.
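The convolution sum above can be computed directly; a small sketch (the example sequences are made up):

```python
def conv_sum(x, h):
    """Convolution sum y(n) = sum_i x(i) * h(n - i): reverse h,
    shift it by n, multiply element-wise with x, and add up."""
    y = [0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for i in range(len(x)):
            if 0 <= n - i < len(h):
                y[n] += x[i] * h[n - i]
    return y

print(conv_sum([1, 2, 3], [1, 1]))   # [1, 3, 5, 3]
```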
The connection of convolutional layer.

The connection of pooling layer.


Convolution in Neural Network

The filters play the role of feature detectors, e.g. for edge detection:

Vertical edge detection:      Horizontal edge detection:
1 0 -1                         1  1  1
1 0 -1                         0  0  0
1 0 -1                        -1 -1 -1

The weight values can be other numbers; what we need to do is to train the weights and the bias. Different kinds of filters extract different features.



Convolution in Neural Network
An example of edge detection

The picture is from Andrew Ng.

The filters can become more intricate as they start incorporating information from an increasingly larger spatial extent.
The computation of convolution in a neural network:

Input (6 x 6):           Filter (3 x 3):      Output (4 x 4):
10 10 10  0  0  0         1 0 -1              0 30 30 0
10 10 10  0  0  0         1 0 -1              0 30 30 0
10 10 10  0  0  0    *    1 0 -1        =     0 30 30 0
10 10 10  0  0  0                             0 30 30 0
10 10 10  0  0  0
10 10 10  0  0  0

For the top-left window:

10 10 10        1 0 -1
10 10 10   *    1 0 -1   =   0
10 10 10        1 0 -1

Then slide the local receptive field across the entire input image.
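This computation can be reproduced in a few lines. Note that, as is conventional in CNNs, the filter is slid over the image without flipping (strictly speaking, cross-correlation). A sketch using the 6 x 6 example above:

```python
def conv2d(image, kernel):
    """Valid 2-D convolution as used in CNNs: slide the filter over the
    image, multiply each local receptive field element-wise, and sum."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(cols)]
            for i in range(rows)]

image = [[10, 10, 10, 0, 0, 0]] * 6      # the 6 x 6 input above
vertical_edge = [[1, 0, -1]] * 3         # the 3 x 3 vertical-edge filter
print(conv2d(image, vertical_edge))
# every row of the 4 x 4 output is [0, 30, 30, 0]
```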

Convolution in Neural Network

In the 6 x 6 example above, a higher output value means the local patch matches the feature better.

Convolutions on an RGB image

Each filter (W0, W1, …) spans all three RGB channels; convolving with it produces one feature map (Feature 0, Feature 1, …). The number of filters gives the depth of the output.

Why convolutions? Parameter sharing.

Pooling layers --- Shrinking the image stack

Pooling:
1. Pick a window size (usually 2 or 3).
2. Pick a stride (usually 2).
3. Walk your window across your filtered images.
4. From each window, take the maximum value.
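The four steps above can be sketched directly (the 4 x 4 feature map here is a made-up example):

```python
def max_pool(image, size=2, stride=2):
    """Max pooling: walk a window across the image and keep the
    maximum value of each window."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [[max(image[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(cols)]
            for i in range(rows)]

feature_map = [[1, 3, 2, 1],
               [2, 9, 1, 1],
               [2, 3, 2, 3],
               [5, 6, 1, 2]]
print(max_pool(feature_map))   # [[9, 2], [6, 3]]
```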

Pooling layers --- Shrinking the image stack

Average pooling: calculate the average value of each window.

Input (4 x 4):       Output (2 x 2):
1 3 2 1
2 9 1 1              3.75  1.25
2 3 2 3              4     2
5 6 1 2

Pooling removes redundant information from the convolutional layer:
• By having less spatial information you gain computation performance.
• Less spatial information also means fewer parameters, so less chance to over-fit.
• You get some translation invariance.
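The average-pooling computation can be sketched the same way, using the 4 x 4 example above (window size 2, stride 2):

```python
def avg_pool(image, size=2, stride=2):
    """Average pooling: replace each window by its mean value."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [[sum(image[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size)) / (size * size)
             for j in range(cols)]
            for i in range(rows)]

feature_map = [[1, 3, 2, 1],
               [2, 9, 1, 1],
               [2, 3, 2, 3],
               [5, 6, 1, 2]]
print(avg_pool(feature_map))   # [[3.75, 1.25], [4.0, 2.0]]
```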

Fully connected layer

The convolutional layers help extract certain features from the image; the fully connected layer is then able to generalize from these features to the output space.

All the layers are put together; then the CNN looks like…

For example: say whether a picture is of an X or an O. The input is a two-dimensional array of pixels; the CNN outputs X or O.
What the computer sees:

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1


Features match pieces of the image (feature detectors):

Diagonal \:       X crossing:      Diagonal /:
 1 -1 -1           1 -1  1         -1 -1  1
-1  1 -1          -1  1 -1         -1  1 -1
-1 -1  1           1 -1  1          1 -1 -1


Filtering: the math behind the match


Convolution layer --- One image becomes a stack of filtered images: a stack extracted by three filters, the number of filters giving the depth.

ReLU layer: a stack of images becomes a stack of images with no negative values.
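A sketch of the ReLU operation on a made-up feature map:

```python
def relu(feature_map):
    """ReLU: negative values become 0; non-negative values pass through."""
    return [[max(0, v) for v in row] for row in feature_map]

print(relu([[3, -1, -3], [-3, 1, 0]]))   # [[3, 0, 0], [0, 1, 0]]
```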

Pooling layer ---A stack of images becomes a stack of smaller images

Max pooling

Layers get stacked

The output of one layer becomes the input of the next. Layers can be repeated several (or many) times.

Fully connected layer

Every value gets a vote; the vote depends on how strongly a value predicts X or O.

Summary: putting it all together

A set of pixels becomes a set of votes (the classifier).
Convolutional Layer: another story, based on filters

Each filter is a 3 x 3 x channel tensor (channel = 3 for a colorful image, channel = 1 for black and white). Each filter detects a small pattern (3 x 3 x channel).
Convolutional Layer

Consider channel = 1 (a black-and-white image):

Image (6 x 6):       Filter 1:        Filter 2:
1 0 0 0 0 1           1 -1 -1         -1  1 -1
0 1 0 0 1 0          -1  1 -1         -1  1 -1
0 0 1 1 0 0          -1 -1  1         -1  1 -1
1 0 0 0 1 0
0 1 0 0 1 0          ……
0 0 1 0 1 0

(The values in the filters are unknown parameters.)
Convolutional Layer, Filter 1, stride = 1:

Image (6 x 6):       Filter 1:        Result (4 x 4):
1 0 0 0 0 1           1 -1 -1          3 -1 -3 -1
0 1 0 0 1 0          -1  1 -1         -3  1  0 -3
0 0 1 1 0 0          -1 -1  1         -3 -3  0  1
1 0 0 0 1 0                            3 -2 -2 -1
0 1 0 0 1 0
0 0 1 0 1 0
Convolutional Layer, Filter 2, stride = 1. Do the same process for every filter:

Filter 2:        Feature map (4 x 4):
-1  1 -1         -1 -1 -1 -1
-1  1 -1         -1 -1 -2  1
-1  1 -1         -1 -1 -2  1
                 -1  0 -4  3
Multiple Convolutional Layers

Each filter produces one 4 x 4 feature map. With 64 convolution filters, the output of the layer is an "image" with 64 channels; in the next convolutional layer, each filter therefore has shape 3 x 3 x 64, and the convolution is repeated on this 64-channel "image".
Comparison of Two Stories

A filter is a 3 x 3 x channel tensor, and the patch of the image it is currently applied to is its receptive field (bias is ignored in this slide).
The neurons with different receptive fields share the parameters (each neuron also has a bias). Each filter convolves over the input image.


Convolutional Layer

Neuron-version story: each neuron only considers a receptive field; the neurons with different receptive fields share the parameters.

Filter-version story: there are a set of filters detecting small patterns; each filter convolves over the input image.

They are the same story.
Observation 3

• Subsampling the pixels will not change the object: a subsampled image of a bird is still a bird.
Pooling – Max Pooling

Feature map of Filter 1:     Feature map of Filter 2:
 3 -1 -3 -1                  -1 -1 -1 -1
-3  1  0 -3                  -1 -1 -2  1
-3 -3  0  1                  -1 -1 -2  1
 3 -2 -2 -1                  -1  0 -4  3
Convolutional Layers + Pooling

After 2 x 2 max pooling, each 4 x 4 feature map becomes 2 x 2:

Filter 1:     Filter 2:
3 0           -1 1
3 1            0 3

Convolution followed by pooling can be repeated; the result is again an "image", with one channel per filter (e.g. 64 channels for 64 filters).
The whole CNN

image → Convolution → Pooling → Convolution → Pooling (repeated) → Flatten → Fully Connected Layers → softmax → cat / dog / ……
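The feature-extraction part of this pipeline can be sketched end to end (a self-contained toy sketch, not the lecture's network: one convolution layer with the two 3 x 3 filters from the example, ReLU, one 2 x 2 max pooling, then flattening; the fully connected layers and softmax are omitted):

```python
def conv2d(img, k):
    """Valid convolution (CNN-style cross-correlation) of a 2-D image."""
    kh, kw = len(k), len(k[0])
    return [[sum(img[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def max_pool(m, s=2):
    return [[max(m[i * s + a][j * s + b] for a in range(s) for b in range(s))
             for j in range(len(m[0]) // s)]
            for i in range(len(m) // s)]

def flatten(maps):
    """Concatenate all pooled feature maps into one vector for the FC layers."""
    return [v for m in maps for row in m for v in row]

image = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]
filters = [[[1, -1, -1], [-1, 1, -1], [-1, -1, 1]],
           [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]]

features = flatten([max_pool(relu(conv2d(image, f))) for f in filters])
# 2 filters -> 2 pooled 2 x 2 maps -> a feature vector of length 8
```

The resulting vector would be fed to the fully connected layers, whose softmax output gives the class votes.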
Application: Playing Go

The board is a 19 x 19 matrix treated as an image (black: 1, white: -1, none: 0; Alpha Go actually uses 48 channels), and the network outputs the next move, one of the 19 x 19 positions (19 x 19 classes). A fully-connected network can be used, but a CNN performs much better.
Why CNN for Go playing?

• Some patterns are much smaller than the whole image (Alpha Go uses 5 x 5 filters for the first layer).
• The same patterns appear in different regions.
Why CNN for Go playing?

• Subsampling the pixels will not change the object: but how can this be explained for Go? In fact, Alpha Go does not use pooling……


More Applications

Speech: https://dl.acm.org/doi/10.1109/TASLP.2014.2339736
Natural Language Processing: https://www.aclweb.org/anthology/S15-2079/
Convolutional Neural
Networks

The End.

Deep Learning

1. Introduction
2. What is Deep Learning
3. Partial connections
4. Initial weights
5. Biological & Theoretical Justification
6. Looking Forward

Winter of Neural Network

• Non-convex
• Need a lot of tricks to play with
• Hard to do theoretical analysis

What's wrong with back-propagation

1. It requires labeled training data:
   • Almost all data is unlabeled.
2. The learning time does not scale well:
   • It is very slow in networks with multiple hidden layers.
3. It can get stuck in poor local optima:
   • These are often quite good, but for deep nets they are far from optimal.

The Paradigm of Deep Learning

Neural networks are coming back!

Race on ImageNet (Top 5 Hit Rate)

72%, 2010

74%, 2011

85%, 2012

Answer from Geoff Hinton, 2012.10


The Architecture

• Max-pooling layers follow the first, second, and fifth convolutional layers.
• The number of neurons in each layer is given by 253440, 186624, 64896, 64896, 43264, 4096, 4096, 1000.

Revolution on Speech Recognition, NLP…

Deep Learning in Industry

• First successful deep learning models for speech recognition, by MSR in 2009
• Now deployed in MS products, e.g. Xbox

Deep Learning in Industry

"Google Brain" Project
• Led by Google fellow Jeff Dean
• Published two papers: ICML 2012, NIPS 2012
• Company-wide large-scale deep learning infrastructure
• Big success on images, speech, NLP
What is Deep Learning (DL)

The training of a feed-forward neural network possessing some of the main features ①, ②, ③, ④ (i.e., the process of determining the connection weights from data) is called Deep Learning.

5. Further thinking about DL

Fundamental structures, 2006–2019 timeline: Deep CNN (AlexNet), Recurrent NN (GRU), Multi-scale fusion (Inception), ResNet, DenseNet, randomly connected networks.

Basic training strategy

• The basic training strategy is the BP (back-propagation) algorithm, an old optimization technique based on gradient descent, with some special features:
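As a refresher on the underlying mechanics, one gradient-descent update moves the weights against the gradient of the loss. A minimal sketch (the quadratic loss, learning rate, and iteration count are illustrative assumptions, not part of the lecture):

```python
def gradient_descent_step(w, grad, lr=0.1):
    """One gradient-descent update: move against the gradient."""
    return w - lr * grad

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w = 0.0
for _ in range(100):
    w = gradient_descent_step(w, 2.0 * (w - 3.0), lr=0.1)
# w approaches the minimizer w* = 3
```

BP applies exactly this update, with the gradient computed layer by layer via the chain rule.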

Deep Learning

1. Introduction
2. What is Deep Learning
3. Partial connections
4. Initial weights
5. Biological & Theoretical Justification
6. Looking Forward

Partial Connections - Convolutional Layer

The connections of the convolutional layer.

The connections of the pooling layer.
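To make the "partial connection" concrete: a convolutional unit looks only at a local patch of the input and shares its weights across positions. A minimal sketch of a valid 2-D convolution (really cross-correlation, as CNN layers compute); the example image and kernel are illustrative assumptions:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: each output unit connects
    only to a local patch of the input (partial connection)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])   # horizontal difference filter
result = conv2d(image, edge_kernel)     # output shape (4, 3)
```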

Partial Connections - Pooling Layers

Pooling layers shrink the image stack.

Pooling:
1. Pick a window size (usually 2 or 3).
2. Pick a stride (usually 2).
3. Walk your window across your filtered images.
4. From each window, take the maximum value.
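The four steps above can be sketched directly (window 2, stride 2; the feature map values are an illustrative example):

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Max pooling: slide a window across the map with the given
    stride and keep the maximum value of each patch."""
    h, w = feature_map.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i*stride:i*stride+window,
                                j*stride:j*stride+window]
            out[i, j] = patch.max()
    return out

fm = np.array([[1., 3., 2., 0.],
               [4., 6., 1., 2.],
               [0., 1., 9., 8.],
               [2., 3., 7., 5.]])
pooled = max_pool(fm)   # the 4x4 map shrinks to 2x2
```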

Skip Connections / Cyclic Connections
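The slide's details are in its figure; as a sketch, a skip (residual) connection adds the input back to a block's output, y = F(x) + x, so gradients have a direct path around the transformation F. The linear-plus-ReLU form of F here is an assumption for illustration:

```python
import numpy as np

def residual_block(x, weight):
    """Skip connection: output = F(x) + x, so gradients can flow
    around the transformation F (the idea behind ResNet)."""
    fx = np.maximum(0.0, weight @ x)   # F(x): a linear map + ReLU
    return fx + x                      # add the identity shortcut

x = np.array([1.0, -2.0, 0.5])
w = np.zeros((3, 3))                   # even when F(x) = 0 ...
y = residual_block(x, w)               # ... the input passes through unchanged
```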


Partial Connections – Full Connection Layer

Full connection layer
The convolutional layers extract features from the image; the fully connected layer then generalizes from these features to the output space.

Deep Learning

1. Introduction
2. What is Deep Learning
3. Partial connections
4. Initial weights
5. Biological & Theoretical Justification
6. Looking Forward

Initial Weights - Auto-Encoder Neural Network


Initial Weights - Auto-Encoder Neural Network

[Diagram: Input → Encoder → Code → Decoder → Prediction, with the reconstruction Error measured between input and prediction]
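The Encoder → Code → Decoder → Error pipeline in the diagram can be sketched as a single forward pass (layer sizes, sigmoid activations, and the squared-error loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder and decoder weights (small random init; sizes are illustrative).
W_enc = rng.normal(scale=0.1, size=(2, 4))   # 4 inputs -> 2-unit code
W_dec = rng.normal(scale=0.1, size=(4, 2))   # 2-unit code -> 4 outputs

x = np.array([0.2, 0.8, 0.5, 0.1])   # Input
code = sigmoid(W_enc @ x)            # Encoder -> Code
x_hat = sigmoid(W_dec @ code)        # Decoder -> Prediction
error = np.mean((x - x_hat) ** 2)    # reconstruction Error to minimize
```

Training adjusts W_enc and W_dec to drive this reconstruction error down.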

Sparse Auto-Encoder

[Diagram: Input → Encoder → Code → Decoder → Prediction; the objective combines the reconstruction Error with a Sparsity Penalty on the code]
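One common realization of the sparsity penalty (an assumption; the slide does not fix its form) is the KL divergence between a small target activation rho and each hidden unit's average activation rho_hat:

```python
import numpy as np

def kl_sparsity_penalty(avg_activation, rho=0.05):
    """KL-divergence sparsity penalty, a common choice for sparse
    auto-encoders: penalize hidden units whose average activation
    rho_hat deviates from a small target rho."""
    rho_hat = np.clip(avg_activation, 1e-8, 1 - 1e-8)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

quiet = kl_sparsity_penalty(np.array([0.05, 0.05]))   # at target: no penalty
busy = kl_sparsity_penalty(np.array([0.5, 0.6]))      # too active: penalized
```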

Initial Weights – Stacked Auto-Encoders

Stacked Auto-Encoders

Initial Weights – Sparse Coding

Initial Weights – Restricted Boltzmann Machine

[Diagram: visible variables v connected to hidden variables h through weights W]

The energy of the joint configuration (with visible biases b and hidden biases a):
E(v, h) = − Σ_{i,j} W_ij v_i h_j − Σ_i b_i v_i − Σ_j a_j h_j

Probability of the joint configuration is given by the Boltzmann distribution:
P(v, h) = exp(−E(v, h)) / Z, where Z = Σ_{v,h} exp(−E(v, h))
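The energy and the (unnormalized) Boltzmann probability can be computed directly; the toy configuration and weights below are illustrative assumptions:

```python
import numpy as np

def rbm_energy(v, h, W, b, a):
    """E(v, h) = -v^T W h - b^T v - a^T h."""
    return -(v @ W @ h) - (b @ v) - (a @ h)

def unnormalized_prob(v, h, W, b, a):
    """Numerator of the Boltzmann distribution, exp(-E(v, h));
    dividing by the partition function Z would give P(v, h)."""
    return np.exp(-rbm_energy(v, h, W, b, a))

v = np.array([1.0, 0.0])          # visible configuration
h = np.array([1.0])               # hidden configuration
W = np.array([[0.5], [-0.3]])     # 2 visible units, 1 hidden unit
b = np.zeros(2)                   # visible biases
a = np.zeros(1)                   # hidden biases
e = rbm_energy(v, h, W, b, a)     # = -(1 * 0.5) = -0.5 here
```

Lower-energy configurations receive exponentially higher probability.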

Initial Weights – Restricted Boltzmann Machine

[Diagram: visible variables v, hidden variables h, weights W]

Restricted: no interactions between hidden variables.

Inferring the distribution over the hidden variables is easy, because it factorizes:
P(h_j = 1 | v) = σ(Σ_i W_ij v_i + a_j)

Similarly:
P(v_i = 1 | h) = σ(Σ_j W_ij h_j + b_i)

Initial Weights – Model Parameter Learning

Maximize the (penalized) log-likelihood objective:
L(W) = Σ_n log P(v^(n))

Derivative of the log-likelihood:
∂ log P(v) / ∂W_ij = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model

Initial Weights – Contrastive Divergence
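The slide's derivation is in its figure; as a hedged sketch, contrastive divergence (CD-1) approximates the intractable model expectation ⟨v_i h_j⟩_model with a single Gibbs sampling step. This assumes a binary RBM with the sigmoid conditionals given earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_gradient(v0, W, b, a):
    """CD-1 estimate of the RBM log-likelihood gradient for W:
    <v h>_data - <v h>_model, with the model term approximated
    by one Gibbs sampling step."""
    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visibles, re-infer the hiddens.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)
    return np.outer(v0, ph0) - np.outer(v1, ph1)

v0 = np.array([1.0, 0.0, 1.0])              # one binary training example
W = rng.normal(scale=0.1, size=(3, 2))      # 3 visible x 2 hidden weights
dW = cd1_gradient(v0, W, b=np.zeros(3), a=np.zeros(2))
# Training would then apply: W += learning_rate * dW
```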

Initial Weights – RBM & Auto-Encoders

Auto-encoders:
• also involve activation and reconstruction
• but have an explicit f(x)
• do not necessarily enforce sparsity
• but if sparsity is put on the code a, often get improved results [e.g. sparse RBM, Lee et al. NIPS08]

[Diagram: x → encoding f(x) → code a → decoding g(x) → x′]

Initial Weights – DBNs for MNIST Classification

After layer-by-layer unsupervised pre-training, discriminative fine-tuning by back-propagation achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized back-propagation gets 1.6%.

Initial Weights – Deep Auto-encoders for Unsupervised Feature Learning

Initial Weights – Recap of Deep Learning Tutorial

1. Building blocks:
• RBMs, Auto-encoder Neural Nets, Sparse Coding
2. Go deeper: layer-wise feature learning:
• Layer-by-layer unsupervised training
• Layer-by-layer supervised training
3. Fine-tuning via back-propagation:
• If the data are big enough, direct fine-tuning is enough.
4. Sparsity on hidden layers is often useful.

Deep Learning

1. Introduction
2. What is Deep Learning
3. Partial connections
4. Initial weights
5. Biological & Theoretical Justification
6. Looking Forward

Why Hierarchy?

Biological & Theoretical Justification

Theoretical:
"…well-known depth-breadth tradeoff in circuits design [Hastad 1987]. This suggests many functions can be much more efficiently represented with deeper architectures…" [Bengio & LeCun 2007]

Biological:
The visual cortex is hierarchical (Hubel-Wiesel model).

Sparse DBN: Training on face images

Biological & Theoretical Justification

Feature hierarchy learned from face images: pixels → edges → object parts → object models.

Deep Learning

1. Introduction
2. What is Deep Learning
3. Partial connections
4. Initial weights
5. Biological & Theoretical Justification
6. Looking Forward

Why does it work so well?

Looking Forward

Plan:
• Propose explanatory hypotheses
• Observe the effects of pre-training
• Infer its role & level of agreement with our hypotheses

Regularization hypothesis:
• The unsupervised component constrains the network to model P(x)
• Representations good for P(x) are good for P(y|x)

Optimization hypothesis:
• Unsupervised initialization lands near a better local minimum of P(y|x)
• It reaches a lower local minimum not achievable by random initialization

Open Questions …

Looking Forward

1. Is there a depth that is mostly sufficient for the computations necessary to approach human-level performance on AI tasks?
2. Why is gradient-based training of deep neural networks from random initialization often unsuccessful?
3. Are there other efficiently trainable deep architectures besides Deep Belief Networks, Stacked Auto-encoders, and deep Boltzmann Machines?
4. Why is unsupervised pre-training important? …

Deep Learning

The End