ANN-Unit 7 - Parameter Tuning & Normalization

The document outlines topics related to deep neural networks including hyperparameter tuning, batch normalization, mini-batches, regularization, softmax, and orthogonalization. It discusses exploring hyperparameters randomly from coarse to fine levels and exponential parameter tuning. Batch normalization is explained as normalizing data before activation layers to speed up training and improve model accuracy. Mini-batches and gradient descent implementation with batch normalization are also covered. Softmax regression and its use of the softmax activation function for classification problems are described.


Applied Neural Networks
Unit – 7
Dr. Muhammad Usman Arif
1/1/2024

Lecture Outline
▪ Deep Neural Networks
▪ Hyper-parameter Tuning
▪ Batch Normalization
▪ Mini-Batches
▪ Regularization
▪ Softmax
▪ Orthogonalization


Hyper-parameter Tuning
▪ α (learning rate)
▪ β (momentum)
▪ β1, β2, ε (Adam)
▪ # hidden layers
▪ # hidden units
▪ Learning rates
▪ Mini-batch size
▪ Activation functions
▪ …


Hyper-parameter Tuning
▪ Don't use a grid of preset values
▪ Explore the hyper-parameter space randomly
▪ Coarse to fine: after a coarse random search, zoom in on the best-performing region and sample more densely (see the sketch after this list)
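
A minimal sketch of this coarse random search in NumPy; the parameter ranges and the number of trials are illustrative assumptions, not values from the slides:

import numpy as np

trials = []
for _ in range(25):                              # coarse pass: 25 random combinations
    alpha = 10 ** (-4 * np.random.rand())        # learning rate, sampled on a log scale
    n_hidden = np.random.randint(10, 200)        # number of hidden units, sampled uniformly
    trials.append((alpha, n_hidden))             # in practice: train and record dev-set error

# Fine pass: resample more densely inside the region around the best trials.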


Exponential Parameter Tuning

▪ Choose the right scale: sample on a logarithmic scale rather than uniformly.
▪ r = -4 * np.random.rand(), so r ∈ [-4, 0]
▪ α = 10^r, so α ∈ (10^-4, 10^0)
▪ If β is to be explored between 0.9 and 0.999, then sample r ∈ [-3, -1] and set
  1 - β = 10^r, i.e. β = 1 - 10^r
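
The same sampling written out in NumPy (a minimal sketch; variable names are illustrative):

import numpy as np

# Learning rate alpha on a log scale: alpha in (10^-4, 10^0)
r = -4 * np.random.rand()          # r is uniform on [-4, 0]
alpha = 10 ** r

# Momentum-style beta between 0.9 and 0.999: sample 1 - beta on a log scale
r = np.random.uniform(-3, -1)      # r is uniform on [-3, -1]
beta = 1 - 10 ** r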


Panda vs. Caviar
▪ Panda approach: babysit a single model, nudging its hyper-parameters as training progresses (when compute is scarce).
▪ Caviar approach: train many models with different hyper-parameter settings in parallel and keep the best one.


Batch Normalization
▪ As we discussed before, normalizing the data is very important for a machine learning model.
▪ The batch normalization layer normalizes the data before the activation layer; it makes the model train faster and become more accurate.
▪ Using batch normalization speeds up model training, decreases the importance of the initial weights, regularizes the model a little, and makes it a little better (see the layer sketch below).
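
As one concrete illustration, in Keras (listed later under deep learning frameworks) a batch normalization layer is typically placed between a linear layer and its activation. This is a hedged sketch, not code from the slides, and the layer sizes are arbitrary:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, use_bias=False, input_shape=(20,)),  # linear step (bias dropped; BN adds its own shift)
    layers.BatchNormalization(),                           # normalize z before the activation
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])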


Batch Normalization Implementation

z_norm^(i) = (z^(i) - μ_B) / sqrt(σ² + ε)

z̃^(i) = γ · z_norm^(i) + β

where γ and β are learnable parameters, and μ_B and σ² are the mean and variance computed over the mini-batch.
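
A minimal NumPy sketch of these two equations for one layer's pre-activations (the function name, shapes, and eps value are illustrative assumptions):

import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    # z: (n_units, m) pre-activations for one mini-batch
    mu = z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = z.var(axis=1, keepdims=True)        # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * z_norm + beta              # z_tilde: learnable scale and shift

# Example: 4 hidden units, mini-batch of 8 examples
z = np.random.randn(4, 8)
z_tilde = batch_norm_forward(z, gamma=np.ones((4, 1)), beta=np.zeros((4, 1)))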

Adding Batch Normalization to a Neural Network

X → Z^[1] → [Batch Norm: γ^[1], β^[1]] → z̃^[1] → a^[1] = g^[1](z̃^[1]) → (W^[2], b^[2]) → Z^[2] → [Batch Norm: γ^[2], β^[2]] → z̃^[2] → a^[2] → …


Working with Mini-Batches

The same pipeline is applied to every mini-batch X^{t}:

X^{1} → Z^[1] → [Batch Norm: γ^[1], β^[1]] → z̃^[1] → a^[1] = g^[1](z̃^[1]) → (W^[2], b^[2]) → Z^[2] → [Batch Norm: γ^[2], β^[2]] → z̃^[2] → a^[2] → …
X^{2} → Z^[1] → [Batch Norm] → z̃^[1] → a^[1] → Z^[2] → [Batch Norm] → z̃^[2] → a^[2] → …
X^{3} → …

Parameters: W^[l], b^[l], γ^[l] and β^[l].
Because batch norm subtracts the per-batch mean, the bias b^[l] is cancelled out (β^[l] takes over its role), so
z^[l] = W^[l] a^[l-1] + b^[l]   reduces to   z^[l] = W^[l] a^[l-1],
then z_norm^(i) is computed and z̃^(i) = γ z_norm^(i) + β.

Implementing Gradient Descent

for t = 1 … num_mini_batches:
    Compute forward propagation on X^{t}
    In each hidden layer, use batch norm to replace z^[l] with z̃^[l]
    Use backprop to compute dW^[l], dγ^[l] and dβ^[l] (b^[l] is dropped, as above)
    Update the parameters:
        W^[l] := W^[l] - α dW^[l]
        β^[l] := β^[l] - α dβ^[l]
        γ^[l] := γ^[l] - α dγ^[l]

Works with momentum, RMSprop and Adam as well. A sketch of the update step follows.
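
A minimal NumPy sketch of the update step only, assuming the gradients have already been produced by backprop (the dictionary keys, shapes, and learning rate are illustrative):

import numpy as np

def update_parameters(params, grads, alpha):
    # Plain gradient descent step for W[l], gamma[l], beta[l] (b[l] is dropped with batch norm)
    for key in params:
        params[key] -= alpha * grads[key]
    return params

# One hidden layer with 4 units fed by 3 inputs
params = {"W1": 0.01 * np.random.randn(4, 3),
          "gamma1": np.ones((4, 1)),
          "beta1": np.zeros((4, 1))}
grads = {"W1": np.random.randn(4, 3),        # stand-ins for dW1, dgamma1, dbeta1
         "gamma1": np.random.randn(4, 1),
         "beta1": np.random.randn(4, 1)}
params = update_parameters(params, grads, alpha=0.01)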

Learning on Shifting Input Distribution

▪ Covariate shift: the distribution of the inputs X changes over time, even though the underlying mapping X → Y stays the same, so a model trained on the old input distribution may need to be retrained.


Why this is a Problem with Neural Networks

[Diagram: a deep network with parameters W^[1], b^[1] … W^[4], b^[4]; the activations a_1^[2] … a_4^[2] of layer 2 feed layer 3.]

▪ From the perspective of a later layer, its inputs (the previous layer's activations) keep shifting as the earlier weights are updated during training, i.e. the hidden units experience covariate shift. Batch norm limits how much this distribution can move around, so each layer can learn somewhat more independently.


Batch Normalization as Regularization

▪ Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
▪ This adds some noise to the values z^[l] within that mini-batch, so, similar to dropout, it adds some noise to each hidden layer's activations.
▪ This has a slight regularization effect.


Softmax Regression


Softmax Layer

z^[l] = W^[l] a^[l-1] + b^[l]

Activation function (for C = 4 classes):

t = e^(z^[l])   (element-wise)

a^[l] = t / Σ_{j=1}^{4} t_j ,   i.e.   a_i^[l] = t_i / Σ_{j=1}^{4} t_j

The softmax maps the 4×1 vector z^[l] to a 4×1 vector a^[l] of probabilities that sum to 1 (see the sketch below).
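
A minimal NumPy sketch of this activation; the input values are arbitrary and the max subtraction is a standard numerical-stability trick not shown on the slide:

import numpy as np

def softmax(z):
    # z: (C, 1) pre-activation vector for one example
    t = np.exp(z - np.max(z))   # element-wise exponential (shifted for stability)
    return t / np.sum(t)        # normalize so the outputs sum to 1

z = np.array([[5.0], [2.0], [-1.0], [3.0]])
a = softmax(z)                  # a vector of class probabilities
print(a.ravel(), a.sum())       # sums to 1.0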


Softmax Examples


Softmax Classifier

▪ Softmax vs. hardmax: a hardmax maps the largest element of z^[l] to 1 and all the other elements to 0, e.g. [1, 0, 0, 0]^T, whereas softmax produces a softer mapping to probabilities.
▪ Softmax regression generalizes logistic regression to C classes; with C = 2 it reduces to logistic regression. (A small comparison is sketched below.)
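
A small NumPy comparison of the two mappings on an arbitrary example vector (a sketch, not from the slides):

import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])
t = np.exp(z - z.max())
softmax_out = t / t.sum()                    # ≈ [0.84, 0.04, 0.002, 0.11]
hardmax_out = (z == z.max()).astype(float)   # [1., 0., 0., 0.]
print(softmax_out, hardmax_out)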


Loss Function

Example (C = 4):   y = [0, 1, 0, 0]^T ,   ŷ = [0.3, 0.2, 0.1, 0.4]^T

Loss for one example:

    L(ŷ, y) = - Σ_{j=1}^{4} y_j log ŷ_j

Here only y_2 = 1, so L(ŷ, y) = -y_2 log ŷ_2 = -log ŷ_2.

Cost over the training set:

    J(w, b, …) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i))

Stacking labels and predictions column-wise gives 4 × m matrices:

    Y = [y^(1), y^(2), y^(3), …] =   [0   1    0   …]
                                     [1   0    0   …]
                                     [0   0    1   …]
                                     [0   0    0   …]

    Ŷ = [ŷ^(1), ŷ^(2), ŷ^(3), …] =   [0.3  0.5   0.2  …]
                                     [0.2  0.2   0.1  …]
                                     [0.1  0.15  0.6  …]
                                     [0.4  0.15  0.1  …]
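
A minimal NumPy sketch of this loss on the example above (the eps guard is a small addition to avoid log(0)):

import numpy as np

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # y, y_hat: (C, m) one-hot labels and softmax outputs
    m = y.shape[1]
    return -np.sum(y * np.log(y_hat + eps)) / m

y = np.array([[0.0], [1.0], [0.0], [0.0]])
y_hat = np.array([[0.3], [0.2], [0.1], [0.4]])
print(cross_entropy_loss(y_hat, y))   # -log(0.2) ≈ 1.609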

Gradient Descent with Softmax

▪ Backpropagation:   z^[l] → a^[l] = ŷ → L(ŷ, y)

The gradient of the cost with respect to the output-layer pre-activations is simply

    dz^[l] = ∂J/∂z^[l] = ŷ - y
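
A short NumPy sketch of this backward step for one example; the softmax helper and the input values are illustrative:

import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # output-layer pre-activations
y = np.array([[0.0], [1.0], [0.0], [0.0]])    # one-hot label
y_hat = softmax(z)
dz = y_hat - y                                # dz[l] = y_hat - y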


Deep Learning Frameworks

▪ Caffe/Caffe2
▪ CNTK
▪ DL4J
▪ Keras
▪ Lasagne
▪ mxnet
▪ PaddlePaddle
▪ TensorFlow
▪ Theano
▪ Torch

Choosing a deep learning framework:
- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)

ML Strategies to Improve the System

Ideas:
▪ Collect more data
▪ Collect more diverse training set
▪ Train algorithm longer with gradient descent
▪ Try Adam instead of gradient descent
▪ Try bigger network
▪ Try smaller network
▪ Try dropout
▪ Add L2 regularization
▪ Network architecture
▪ Activation functions
▪ # hidden units
▪ …


Orthogonalization
▪ Orthogonalization (or orthogonality) is a system design property that ensures that modifying an instruction or a component of an algorithm does not create or propagate side effects to other components of the system. It makes it easier to verify the algorithms independently of one another, and it reduces testing and development time.



Assumptions for Orthogonalization

▪ When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal:
1. Fit the training set well on the cost function
   If it doesn't fit well, using a bigger neural network or switching to a better optimization algorithm might help.
2. Fit the development set well on the cost function
   If it doesn't fit well, regularization or using a bigger training set might help.
3. Fit the test set well on the cost function
   If it doesn't fit well, using a bigger development set might help.
4. Perform well in the real world
   If it doesn't perform well, the dev/test set distribution is not set correctly or the cost function is not measuring the right thing.
