
Applied Deep Learning

Neural Network Basics


March 10th, 2020 · http://adl.miulab.tw
2 Learning ≈ Looking for a Function
◉ Speech Recognition: f( audio ) = "你好" ("hello")
◉ Handwritten Recognition: f( image of a digit ) = "2"
◉ Weather Forecast: f( Thursday's data ) = Saturday's weather
◉ Playing Video Games: f( game screen ) = "move left"
3 Machine Learning Framework
◉ Function input x: "It claims too much."
◉ Function output ŷ: − (negative)
◉ Model: hypothesis function set {f₁, f₂, …}
◉ Training data: {(x¹, ŷ¹), (x², ŷ²), …}
◉ Training: pick the best function f*
◉ Testing data: {(x, ?), …}
◉ Testing: f*(x) = y, e.g. y = + (positive)

Training is to pick the best function given the observed data.
Testing is to predict the label using the learned function.
4 How to Train a Model?
How do we actually train a model?
5 Machine Learning Framework
◉ Function input x: "It claims too much."
◉ Function output ŷ: − (negative)
◉ Model: hypothesis function set {f₁, f₂, …}
◉ Training data: {(x¹, ŷ¹), (x², ŷ²), …}
◉ Training (the training procedure): pick the best function f*
◉ Testing data: {(x, ?), …}
◉ Testing: f*(x) = y, e.g. y = + (positive)

Training is to pick the best function given the observed data.
Testing is to predict the label using the learned function.
6 Training Procedure
◉ Model: hypothesis function set {f₁, f₂, …}
◉ Training: pick the best function f*

◉ Q1. What is the model? (function hypothesis set)
◉ Q2. What does a "good" function mean?
◉ Q3. How do we pick the "best" function?
7 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
8 What is the Model?
9 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
10 Classification Task
◉ Sentiment Analysis – binary classification: input object → Class A (yes) or Class B (no)
○ "這規格有誠意!" ("These specs show real effort!") → +
○ "太爛了吧~" ("This is so bad~") → −
◉ Speech Phoneme Recognition, e.g. /h/
◉ Handwritten Recognition, e.g. "2" – multi-class classification: input object → Class A, Class B, Class C, …

Some cases are not easy to formulate as classification problems.
11 Target Function
◉ Classification Task
f(x) = y, f: Rᴺ → Rᴹ
○ x: input object to be classified → an N-dim vector
○ y: class/label → an M-dim vector

Assume both x and y can be represented as fixed-size vectors.
12 Vector Representation Example
◉ Handwriting Digit Classification f: Rᴺ → Rᴹ
○ x: image → each pixel corresponds to an element in the vector; a 16 × 16 image gives 16 × 16 = 256 dimensions, with 1 for ink and 0 otherwise
○ y: class/label → 10 dimensions for digit recognition, one per class ("1" or not, "2" or not, "3" or not, …); e.g. "1" → [1, 0, 0, …], "2" → [0, 1, 0, …]
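To make this vector representation concrete, here is a minimal numpy sketch (not from the slides); the random pixel content and the assumption that label dimension d corresponds to digit d are illustrative choices.

```python
import numpy as np

# A 16x16 binary image: 1 for ink, 0 otherwise (random pixels, purely for illustration).
image = (np.random.rand(16, 16) > 0.8).astype(np.float32)

# x: flatten the image into a 16*16 = 256-dimensional vector.
x = image.reshape(-1)                # shape (256,)

# y: a 10-dimensional one-hot label, one dimension per digit class
# (assumption: dimension d corresponds to digit d).
digit = 2
y = np.zeros(10, dtype=np.float32)
y[digit] = 1.0

print(x.shape, y)                    # (256,) and a one-hot vector with a 1 in position 2
```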
13 Vector Representation Example
◉ Sentiment Analysis f: Rᴺ → Rᴹ
○ x: word → each element in the vector corresponds to a word in the vocabulary (dimensions = size of the vocabulary); the element for the word (e.g. "love") is 1, all others 0
○ y: class/label → 3 dimensions (positive, negative, neutral), one per class ("+" or not, "−" or not, …); e.g. "+" → [1, 0, 0], "−" → [0, 1, 0]
14 Target Function
◉ Classification Task
f(x) = y, f: Rᴺ → Rᴹ
○ x: input object to be classified → an N-dim vector
○ y: class/label → an M-dim vector

Assume both x and y can be represented as fixed-size vectors.
15 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
16 A Single Neuron
◉ Inputs x1, …, xN with weights w1, …, wN and a bias b
◉ z = w1x1 + w2x2 + … + wNxN + b
◉ Output y = σ(z), where the activation function σ is the sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)

Each neuron is a very simple function.


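The single neuron above can be written in a few lines of numpy. This is a minimal sketch; the values of x, w, and b are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # z = w1*x1 + ... + wN*xN + b, then y = sigma(z)
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([0.5, -1.0, 2.0])   # N = 3 inputs (illustrative values)
w = np.array([1.0, 0.2, -0.5])   # one weight per input
b = 0.1                          # the bias: the weight on an input fixed to 1

print(neuron(x, w, b))           # a value in (0, 1)
```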
17 A Single Neuron
◉ z = w1x1 + w2x2 + … + wNxN + b, y = σ(z) with the sigmoid σ(z) = 1 / (1 + e⁻ᶻ)
◉ The bias b is the weight on an input that is fixed to 1

The bias term is an "always on" feature.
18 Why Bias?
◉ The bias b shifts the activation σ(z) along the z-axis

The bias term gives a class prior.
19 Model Parameters of A Single Neuron
◉ z = w1x1 + w2x2 + … + wNxN + b, y = σ(z) = 1 / (1 + e⁻ᶻ)

w, b are the parameters of this neuron.
20 A Single Neuron f: Rᴺ → Rᴹ
◉ y = σ(w1x1 + … + wNxN + b)
◉ Decision rule: "2" if y ≥ 0.5; not "2" if y < 0.5

A single neuron can only handle binary classification.
21 A Layer of Neurons
◉ Handwriting digit classification f: Rᴺ → Rᴹ
○ y1: "1" or not, y2: "2" or not, y3: "3" or not, … (10 neurons / 10 classes)
○ Which one is max? → the predicted class

A layer of neurons can handle multiple possible outputs, and the result depends on the maximum one.
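A minimal sketch of such a layer of 10 sigmoid neurons; the weights and the input image here are random values purely for illustration, and the prediction is taken as the class with the largest output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, M = 256, 10                            # 16x16 image -> 256 inputs, 10 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))               # one row of weights per output neuron
b = rng.normal(size=M)                    # one bias per output neuron

x = (rng.random(N) > 0.8).astype(float)   # a flattened binary image (illustrative)
y = sigmoid(W @ x + b)                    # 10 outputs: "1" or not, "2" or not, ...

predicted_class = int(np.argmax(y))       # the result depends on the max output
print(y.round(2), predicted_class)
```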
22 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
23 A Layer of Neurons – Perceptron
◉ Output units all operate separately – no shared weights (each output yi has its own weights from x1, …, xN and its own bias)

Adjusting weights moves the location, orientation, and steepness of the cliff.

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
24 Expression of Perceptron
◉ y = σ(w1x1 + w2x2 + b)

A perceptron can represent AND, OR, NOT, etc., but not XOR → it is only a linear separator.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
25 How to Implement XOR?
◉ Truth table:
  A B | A xor B
  0 0 | 0
  0 1 | 1
  1 0 | 1
  1 1 | 0
◉ A xor B = AB' + A'B, which cannot be computed by a single linear unit but can be built by combining simpler operations in two stages

Multiple operations can produce more complicated outputs.


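A minimal sketch of XOR built from two layers of threshold units. It uses an OR-like and an AND-like hidden unit with hand-picked weights, which is one standard construction and not necessarily the exact gates drawn on the slide.

```python
import numpy as np

def step(z):
    # a simple threshold (boolean) activation
    return (z > 0).astype(float)

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: h1 ~ OR(x1, x2), h2 ~ AND(x1, x2)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output layer: fires when OR is on but AND is off -> XOR
    w2 = np.array([1.0, -1.0])
    b2 = -0.5
    return step(np.array([w2 @ h + b2]))[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```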
26 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
27 Neural Networks – Multi-Layer Perceptron
◉ Inputs x1, x2 (plus a bias input of 1) feed the hidden units a1 = σ(z1), a2 = σ(z2)
◉ The hidden units (plus a bias input of 1) feed the output y = σ(z)
28 Expression of Multi-Layer Perceptron
◉ Continuous function w/ 2 layers: combine two opposite-facing threshold functions to make a ridge
◉ Continuous function w/ 3 layers: combine two perpendicular ridges to make a bump
○ Add bumps of various sizes and locations to fit any surface

Multiple layers enhance the model's expressiveness → the model can approximate more complex functions.
http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf
29 Deep Neural Networks (DNN)
◉ Fully connected feedforward network f: Rᴺ → Rᴹ
○ Input vector x = (x1, …, xN) → Layer 1 → Layer 2 → … → Layer L → output vector y = (y1, …, yM)

Deep NN: multiple hidden layers.
30 Notation Definition
◉ Output of a neuron: a_i^l = output of neuron i at layer l
◉ Layer l − 1 has N_(l−1) nodes; layer l has N_l nodes
◉ The outputs of one layer form a vector a^l
31 Notation Definition
◉ w_ij^l = weight from neuron j (layer l − 1) to neuron i (layer l)
◉ Layer l − 1 has N_(l−1) nodes; layer l has N_l nodes
◉ The weights between two layers form a matrix W^l
32 Notation Definition
◉ b_i^l = bias for neuron i at layer l
◉ The biases of all neurons at a layer form a vector b^l
33 Notation Definition
◉ z_i^l = input of the activation function for neuron i at layer l
◉ The activation-function inputs at each layer form a vector z^l
34 Notation Summary
◉ a_i^l: output of a neuron        w_ij^l: a weight
◉ a^l: output vector of a layer    W^l: a weight matrix
◉ z_i^l: input of the activation function    b_i^l: a bias
◉ z^l: input vector of the activation function for a layer    b^l: a bias vector
35 Layer Output Relation
◉ The output vector a^(l−1) of layer l − 1 determines z^l, the activation inputs of layer l, which in turn determine the layer's output a^l
36 Layer Output Relation – from a to z
◉ z_i^l = Σ_j w_ij^l a_j^(l−1) + b_i^l
◉ In vector form: z^l = W^l a^(l−1) + b^l
37 Layer Output Relation – from z to a
◉ a_i^l = σ(z_i^l) for each neuron i
◉ Applying σ element-wise: a^l = σ(z^l)
38 Layer Output Relation
◉ z^l = W^l a^(l−1) + b^l
◉ a^l = σ(z^l)
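A minimal numpy sketch of this layer relation; the layer sizes and the random parameter values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """One layer: z^l = W^l a^(l-1) + b^l, then a^l = sigma(z^l)."""
    z = W @ a_prev + b
    return sigmoid(z), z

rng = np.random.default_rng(0)
N_prev, N_l = 4, 3                      # N_(l-1) = 4 nodes, N_l = 3 nodes (illustrative sizes)
a_prev = rng.random(N_prev)             # a^(l-1): output vector of layer l-1
W = rng.normal(size=(N_l, N_prev))      # W^l: entry (i, j) connects neuron j -> neuron i
b = rng.normal(size=N_l)                # b^l: bias vector

a_l, z_l = layer_forward(a_prev, W, b)
print(z_l, a_l)                         # z^l and a^l = sigma(z^l), both of length N_l
```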
39 Neural Network Formulation
◉ Fully connected feedforward network f: Rᴺ → Rᴹ
○ a^1 = σ(W^1 x + b^1)
○ a^2 = σ(W^2 a^1 + b^2)
○ …
○ y = σ(W^L a^(L−1) + b^L)
40 Neural Network Formulation
◉ Fully connected feedforward network f: Rᴺ → Rᴹ
○ y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
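A minimal sketch that composes the layer relation into the full forward pass y = f(x); the layer sizes and the small random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = f(x) = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)      # a^l = sigma(W^l a^(l-1) + b^l)
    return a

rng = np.random.default_rng(0)
sizes = [256, 128, 64, 10]          # N = 256 inputs, two hidden layers, M = 10 outputs (illustrative)
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.random(256)
y = forward(x, weights, biases)
print(y.shape)                      # (10,)
```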
41 Activation Function
◉ A bounded function (such as the sigmoid above)
42 Activation Function
◉ Boolean (threshold)
◉ Linear
◉ Non-linear
43 Non-Linear Activation Function
◉ Sigmoid

◉ Tanh

◉ Rectified Linear Unit (ReLU)

Non-linear functions are frequently used in neural networks

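A minimal sketch of the three non-linear activations named above, using their standard definitions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # bounded in (0, 1)

def tanh(z):
    return np.tanh(z)                  # bounded in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # Rectified Linear Unit: max(0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```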

44 Why Non-Linearity?
◉ Function approximation
○ Without non-linearity, a deep neural network is equivalent to a single linear transform
○ With non-linearity, networks with more layers can approximate more complex functions

http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
45 What does a "Good" Function Mean?
46 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
47 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
48 Function = Model Parameters
◉ Different parameters W and b → different functions in the function set
◉ Formal definition: the model parameter set θ = {W^1, b^1, W^2, b^2, …, W^L, b^L}
◉ Picking a function f = picking a set of model parameters θ
49 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
50 Model Parameter Measurement
◉ Define a function to measure the quality of a parameter set θ
○ Evaluating by a loss/cost/error function C(θ) → how bad θ is
○ Best model parameter set: θ* = argmin_θ C(θ)
○ Evaluating by an objective/reward function O(θ) → how good θ is
○ Best model parameter set: θ* = argmax_θ O(θ)
51 Loss Function Example
◉ Function input x: "It claims too much."; function output ŷ: − (negative)
◉ Model: hypothesis function set {f₁, f₂, …}; training data: {(x¹, ŷ¹), (x², ŷ²), …}; training: pick the best function f*
◉ A "good" function: f(xᵏ) should be close to ŷᵏ on the training samples
◉ Define an example loss function: C(θ) = Σₖ error(f(xᵏ; θ), ŷᵏ), the sum over the error of all training samples
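A minimal sketch of such a loss summed over the training samples. The single-neuron model, the tiny training set, and the use of square error as the per-sample error are assumptions for illustration, since the slide does not give a specific formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    # a single sigmoid neuron as the hypothesis f(x; theta), theta = (w, b)
    return sigmoid(np.dot(w, x) + b)

def loss(w, b, xs, y_hats):
    # C(theta): sum over the error of all training samples (square error assumed here)
    return sum((model(x, w, b) - y_hat) ** 2 for x, y_hat in zip(xs, y_hats))

# Tiny illustrative training set {(x^1, y_hat^1), (x^2, y_hat^2), ...}
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
y_hats = [1.0, 0.0, 1.0]

w, b = np.array([0.5, -0.5]), 0.0
print(loss(w, b, xs, y_hats))   # how bad this theta is on the training data
```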
52 Frequent Loss Function
◉ Square loss

◉ Hinge loss

◉ Logistic loss

◉ Cross entropy loss

◉ Others: large margin, etc.

https://en.wikipedia.org/wiki/Loss_functions_for_classification
53 How can we Pick the "Best" Function?
54 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
55 Problem Statement
◉ Given a loss function and several model parameter sets
○ Loss function: C(θ)
○ Model parameter sets: θ¹, θ², …
◉ Find the model parameter set that minimizes C(θ)
How do we solve this optimization problem?
◉ 1) Brute force – enumerate all possible θ
◉ 2) Calculus – solve dC(θ)/dθ = 0 analytically

Issue: the whole space of C(θ) is unknown.
56 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
57 Gradient Descent for Optimization
◉ Assume that θ has only one variable
○ θⁱ: the model (parameters) at the i-th iteration
○ Follow the loss curve C(θ) downhill: θ⁰ → θ¹ → θ² → θ³ → …

Idea: drop a ball and find the position where the ball stops rolling (a local minimum).
58 Gradient Descent for Optimization
◉ Assume that θ has only one variable
○ Randomly start at θ⁰
○ Compute dC(θ⁰)/dθ and update: θ¹ = θ⁰ − η · dC(θ⁰)/dθ
○ Compute dC(θ¹)/dθ and update: θ² = θ¹ − η · dC(θ¹)/dθ
○ η is the "learning rate"
59 Gradient Descent for Optimization
◉ Assume that θ has two variables {θ1, θ2}
60 Gradient Descent for Optimization
◉ Assume that θ has two variables {θ1, θ2}
• Randomly start at θ⁰
• Compute the gradient of C(θ) at θ⁰: ∇C(θ⁰) = [∂C(θ⁰)/∂θ1, ∂C(θ⁰)/∂θ2]ᵀ
• Update parameters: θ¹ = θ⁰ − η ∇C(θ⁰)
• Compute the gradient of C(θ) at θ¹, update again, and so on
61 Gradient Descent for Optimization
(Figure: contour plot of C(θ) over (θ1, θ2); at each θⁱ the movement is opposite to the gradient ∇C(θⁱ).)
Algorithm
  Initialization: start at θ⁰
  while (θ^(i+1) ≠ θⁱ) {
    compute the gradient ∇C(θⁱ)
    update parameters: θ^(i+1) = θⁱ − η ∇C(θⁱ)
  }
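A minimal sketch of this loop for a two-variable θ. The quadratic loss C(θ) and the fixed iteration count are illustrative assumptions; the slides' stopping rule θ^(i+1) ≠ θⁱ is simplified here to a fixed number of steps.

```python
import numpy as np

def C(theta):
    # an illustrative two-variable loss with its minimum at (1, -2)
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

def grad_C(theta):
    # gradient of C: [dC/dtheta1, dC/dtheta2]
    return np.array([2.0 * (theta[0] - 1.0), 2.0 * (theta[1] + 2.0)])

eta = 0.1                         # learning rate
theta = np.array([4.0, 3.0])      # a randomly chosen start theta^0
for i in range(100):              # fixed number of iterations instead of theta^(i+1) != theta^i
    theta = theta - eta * grad_C(theta)   # move opposite to the gradient

print(theta, C(theta))            # close to (1, -2), loss close to 0
```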
62 Revisit Neural Network Formulation
◉ Fully connected feedforward network f: Rᴺ → Rᴹ, mapping the input vector x = (x1, …, xN) through layers 1 … L to the output vector y = (y1, …, yM)
63 Gradient Descent for Neural Network
Algorithm
  Initialization: start at θ⁰
  while (θ^(i+1) ≠ θⁱ) {
    compute the gradient ∇C(θⁱ)
    update parameters: θ^(i+1) = θⁱ − η ∇C(θⁱ)
  }
64 Gradient Descent for Optimization – Simple Case
◉ Model: a single sigmoid neuron y = σ(w1x1 + w2x2 + b)
Algorithm
  Initialization: start at θ⁰
  while (θ^(i+1) ≠ θⁱ) {
    compute the gradient at θⁱ
    update parameters
  }
65 Gradient Descent for Optimization – Simple Case: Three Parameters & Square Error Loss
◉ Update the three parameters w1, w2, b at the t-th iteration
◉ Use the square error loss
66 Gradient Descent for Optimization
Simple Case – Square Error Loss
◉ Square Error Loss
67 Gradient Descent for Optimization – Simple Case: Square Error Loss
◉ Compute the gradient by applying the chain rule through the sigmoid function, using dσ(z)/dz = σ(z)(1 − σ(z))
68 Gradient Descent for Optimization
Simple Case – Square Error Loss
◉ Square Error Loss
69 Gradient Descent for Optimization – Simple Case: Three Parameters & Square Error Loss
◉ Update the three parameters w1, w2, b at the t-th iteration for the single neuron y = σ(w1x1 + w2x2 + b)
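A minimal sketch of the chain-rule gradients for this simple case. It assumes the square error is written as C = (y − ŷ)²; the slides do not show the exact formula, so the constant factor of 2 and the example values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads(w, b, x, y_hat):
    """Gradients of C = (y - y_hat)^2 for y = sigma(w.x + b), via the chain rule."""
    z = np.dot(w, x) + b
    y = sigmoid(z)
    dC_dy = 2.0 * (y - y_hat)        # derivative of the square error w.r.t. y
    dy_dz = y * (1.0 - y)            # sigmoid: d sigma/dz = sigma(z)(1 - sigma(z))
    delta = dC_dy * dy_dz
    return delta * x, delta          # dC/dw (vector over w1, w2), dC/db (scalar)

# One gradient-descent step on the three parameters w1, w2, b
w, b, eta = np.array([0.5, -0.5]), 0.0, 0.5
x, y_hat = np.array([1.0, 1.0]), 1.0
dw, db = grads(w, b, x, y_hat)
w, b = w - eta * dw, b - eta * db
print(w, b)
```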
70 Optimization Algorithm
Algorithm
  Initialization: set the parameters w, b at random
  while (stopping criteria not met) {
    for each training sample (x, ŷ):
      compute the gradient and update the parameters
  }
71 Gradient Descent for Neural Network
Algorithm
  Initialization: start at θ⁰
  while (θ^(i+1) ≠ θⁱ) {
    compute the gradient ∇C(θⁱ)
    update parameters
  }

The gradient involves millions of parameters; to compute it efficiently, we use backpropagation.
72 Gradient Descent Issue
◉ Training data: {(x¹, ŷ¹), (x², ŷ²), …}

The model can only be updated after seeing all training samples → slow.
73 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
74 Stochastic Gradient Descent (SGD)
◉ Gradient Descent: update using the gradient of the loss summed over all training samples
◉ Stochastic Gradient Descent (SGD)
○ Pick a single training sample xᵏ from the training data {(x¹, ŷ¹), (x², ŷ²), …} and update using that sample's gradient
○ If all training samples have the same probability of being picked, the expected update matches gradient descent

The model can be updated after seeing one training sample → faster.
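A minimal sketch of SGD for the single-neuron model used earlier: each update uses one randomly picked sample, all samples equally likely. The toy data and labels are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads(w, b, x, y_hat):
    # chain-rule gradients of (y - y_hat)^2 for y = sigma(w.x + b)
    y = sigmoid(np.dot(w, x) + b)
    delta = 2.0 * (y - y_hat) * y * (1.0 - y)
    return delta * x, delta

rng = np.random.default_rng(0)
xs = rng.random((20, 2))                     # 20 illustrative training samples
y_hats = (xs.sum(axis=1) > 1.0).astype(float)

w, b, eta = np.zeros(2), 0.0, 0.5
for step in range(200):
    k = rng.integers(len(xs))                # pick one sample, all equally likely
    dw, db = grads(w, b, xs[k], y_hats[k])
    w, b = w - eta * dw, b - eta * db        # update after seeing a single sample

print(w, b)
```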
75 Epoch Definition
◉ When running SGD, the model starts at θ⁰
◉ Training data: {(x¹, ŷ¹), (x², ŷ²), …, (xᴷ, ŷᴷ)}
◉ Pick x¹, update; pick x², update; …; pick xᴷ, update → seeing all training samples once = one epoch; then pick x¹ again for the next epoch
76 Gradient Descent v.s. SGD
◉ Gradient Descent
✓ Update after seeing all examples (sees all examples per update)
◉ Stochastic Gradient Descent
✓ If there are 20 examples, update 20 times in one epoch (sees only one example per update)

SGD approaches the target point faster than gradient descent.
77 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
78 Mini-Batch SGD
◉ Batch Gradient Descent
○ Use all K samples in each iteration
◉ Stochastic Gradient Descent (SGD)
○ Pick a training sample xᵏ
○ Use 1 sample in each iteration
◉ Mini-Batch SGD
○ Pick a set of B training samples as a batch b (B is the "batch size")
○ Use all B samples in each iteration
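A minimal sketch of mini-batch SGD for the same single-neuron model: each update uses B samples. Averaging the batch gradient (rather than summing it) and the toy data are illustrative choices not specified by the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_grads(w, b, X, y_hat):
    """Average gradient of the square error over a batch, for y = sigma(Xw + b)."""
    y = sigmoid(X @ w + b)
    delta = 2.0 * (y - y_hat) * y * (1.0 - y)       # one delta per sample in the batch
    return X.T @ delta / len(X), delta.mean()

rng = np.random.default_rng(0)
X = rng.random((100, 2))                             # K = 100 samples
y_hat = (X.sum(axis=1) > 1.0).astype(float)

w, b, eta, B = np.zeros(2), 0.0, 0.5, 10             # B is the batch size
for step in range(300):
    idx = rng.choice(len(X), size=B, replace=False)  # pick a batch of B samples
    dw, db = batch_grads(w, b, X[idx], y_hat[idx])
    w, b = w - eta * dw, b - eta * db

print(w, b)
```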
79 Mini-Batch SGD
80 Batch v.s. Mini-Batch
◉ Handwriting digit classification: batch size = 1 (SGD) vs. full-batch gradient descent
81 Gradient Descent v.s. SGD v.s. Mini-Batch
◉ Training speed: mini-batch > SGD > gradient descent
◉ Why is mini-batch faster than SGD?
(Figure: training time in seconds vs. batch size, from 1 (SGD) through 10, 100, 1000, 10000 (mini-batch) to the full set (gradient descent).)
82 SGD v.s. Mini-Batch
◉ Stochastic Gradient Descent (SGD): compute z¹ = W¹x one sample at a time, i.e. a sequence of matrix–vector products (z¹ = W¹x, z¹ = W¹x, …)
◉ Mini-Batch SGD: stack the batch's input vectors into a matrix and compute them all at once with a single matrix–matrix product

Modern computers run one matrix–matrix multiplication faster than the equivalent sequence of matrix–vector multiplications.
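A small timing sketch of this point: one matrix–matrix product versus a loop of matrix–vector products over the same data. The matrix sizes are arbitrary, and absolute timings depend on the hardware and the underlying BLAS library.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(1000, 1000))
batch = rng.normal(size=(1000, 100))     # 100 input vectors x stacked as columns

# SGD style: 100 separate matrix-vector products z1 = W1 x
t0 = time.perf_counter()
zs = [W1 @ batch[:, j] for j in range(batch.shape[1])]
t_vec = time.perf_counter() - t0

# Mini-batch style: one matrix-matrix product for the whole batch
t0 = time.perf_counter()
Z = W1 @ batch
t_mat = time.perf_counter() - t0

print(f"loop of matrix-vector: {t_vec:.4f}s, single matrix-matrix: {t_mat:.4f}s")
```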
83 Big Issue: Local Optima

Neural network training has no guarantee of obtaining the globally optimal solution.
84 Training Procedure Outline

 Model Architecture
✓ A Single Layer of Neurons (Perceptron)
✓ Limitation of Perceptron
✓ Neural Network Model (Multi-Layer Perceptron)
 Loss Function Design
✓ Function = Model Parameters
✓ Model Parameter Measurement
 Optimization
✓ Gradient Descent
✓ Stochastic Gradient Descent (SGD)
✓ Mini-Batch SGD
✓ Practical Tips
85 Initialization
◉ Different initialization parameters may result in different trained models

Do not initialize the parameters equally → set them randomly


86 Learning Rate
(Figure: cost vs. number of parameter updates for different learning rates – very large, large, small, and "just right" – together with the error surface.)
◉ If the learning rate is too large, the cost may explode or oscillate; if it is too small, training converges very slowly

The learning rate should be set carefully.
87 Tips for Mini-Batch Training
◉ Shuffle the training samples before every epoch (see the sketch after this slide)
○ otherwise the network might memorize the order in which you feed the samples
◉ Use a fixed batch size for every epoch
○ this enables fast matrix-multiplication implementations of the calculations
◉ Adapt the learning rate to the batch size
○ larger batch → smaller learning rate
http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-
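A minimal sketch of the shuffling tip: reshuffle the sample order at the start of every epoch and walk through it in fixed-size batches. The data, sizes, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, batch_size, num_epochs = 100, 10, 3

X = rng.random((num_samples, 2))          # illustrative training inputs

for epoch in range(num_epochs):
    order = rng.permutation(num_samples)  # shuffle before every epoch
    for start in range(0, num_samples, batch_size):
        batch_idx = order[start:start + batch_size]   # fixed batch size every epoch
        batch = X[batch_idx]
        # ... compute gradients on `batch` and update parameters here ...
    print("epoch", epoch, "first batch indices:", order[:batch_size])
```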
88 Learning Recipe
◉ Data split: training data (x, ŷ) and testing data; the testing data is further split into a validation set and a real testing set
◉ The "best" function f* is learned from the training data
89 Learning Recipe
◉ Data split: training data (x, ŷ), validation set, and real testing set
◉ Validation: we immediately know the performance
◉ Real testing: we do not know the performance until submission
90 Learning Recipe
◉ Flow: if we do not get good results on the training set → modify the training process
◉ Possible reasons
○ no good function exists (bad hypothesis function set) → reconstruct the model architecture
○ cannot find a good function (e.g. stuck in local optima) → change the training strategy
91 Learning Recipe
◉ Flow: get good results on the training set? (no → modify the training process) → get good results on the dev/validation set? (no → prevent overfitting) → yes → done

Better performance on training but worse performance on dev → overfitting.
92 Overfitting

◉ Possible solutions
○ more training samples
○ some tips: dropout, etc.
93 Concluding Remarks
◉ Q1. What is the model? Model Architecture

◉ Q2. What does a “good” function mean? Loss Function Design


◉ Q3. How do we pick the “best” function? Optimization

Model: hypothesis function set {f₁, f₂, …}
Training: pick the best function f*
