ANN-Unit 6 - Deep Neural Networks
Applied Neural Networks
Unit – 6
Lecture Outline
▪ Deep Neural Networks
▪ Vectorized Implementation
▪ DNN Building Blocks
▪ Empirical Learning
▪ Bias and Variance
▪ Regularization to Handle Bias and Variance
▪ Vanishing and Exploding Gradients
[Figure: a four-layer network (L = 4) with n[0] = nx = 3 inputs, n[1] = 5, n[2] = 5, n[3] = 3 hidden units, and n[4] = n[L] = 1 output unit; x = a[0] and ŷ = a[L].]
Vectorized Implementation
▪ Forward propagation, one layer at a time:
▪ z [l] = w [l] a [l-1] + b [l]
▪ a [l] = g [l](z [l])
▪ e.g. for layer 4: z [4] = w [4] a [3] + b [4], a [4] = g [4](z [4])
▪ Vectorized across the m training examples, stacking them as columns, Z [1] = [ z [1](1), z [1](2), …, z [1](m) ]:
▪ Z [1] = W [1] A [0] + b [1], A [1] = g [1](Z [1])
▪ Z [2] = W [2] A [1] + b [2], A [2] = g [2](Z [2])
▪ …
▪ ŷ = g [4](Z [4]) = A [4]
▪ An explicit for loop is still required to iterate over the layers (see the sketch below).
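A minimal NumPy sketch of this vectorized forward pass, assuming ReLU hidden layers and a sigmoid output (the parameters dictionary layout and function names are illustrative, not from the slides):

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation(X, parameters):
    """Vectorized forward pass: A[0] = X, then Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])."""
    L = len(parameters) // 2                    # number of layers (keys W1..WL, b1..bL)
    A = X                                       # A[0] = X, shape (n[0], m)
    caches = []
    for l in range(1, L + 1):                   # explicit loop over the layers is still required
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A + b                           # Z[l] = W[l] A[l-1] + b[l]
        A = sigmoid(Z) if l == L else relu(Z)   # assumed: ReLU hidden layers, sigmoid output layer
        caches.append((Z, A))                   # cache Z[l] (and A[l]) for the backward pass
    return A, caches                            # A = A[L] = y-hat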
Matrix Dimensions
[Figure: a five-layer network with n[0] = nx = 2, n[1] = 3, n[2] = 5, n[3] = 4, n[4] = 2, n[5] = n[L] = 1.]
▪ z [1] = w [1] x + b [1], with z: (3,1) = (n[1], 1) and x: (2,1) = (n[0], 1)
▪ Therefore w [1] = (3,2) or (n[1], n[0])
▪ Likewise w [2] = (5,3) or (n[2], n[1]) and w [3] = (4,5) or (n[3], n[2])
▪ In general w [l]: (n[l], n[l-1]) and b [l]: (n[l], 1), as shown in the sketch below.
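A short sketch that builds parameters with exactly these shapes and asserts them, using the example network above (the function name and the 0.01 scaling are illustrative; better scalings are discussed under Xavier initialization later):

import numpy as np

def initialize_parameters(layer_dims):
    """layer_dims = [n[0], n[1], ..., n[L]]; W[l]: (n[l], n[l-1]), b[l]: (n[l], 1)."""
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
        assert parameters["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
        assert parameters["b" + str(l)].shape == (layer_dims[l], 1)
    return parameters

# The example network: n[0]=2, n[1]=3, n[2]=5, n[3]=4, n[4]=2, n[5]=1
params = initialize_parameters([2, 3, 5, 4, 2, 1])
print(params["W1"].shape, params["W2"].shape, params["W3"].shape)   # (3, 2) (5, 3) (4, 5)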
▪ Z [1] = [ z [1](1), z [1](2), z [1](3), …, z [1](m) ], i.e. the first-layer values for the m training examples stacked as columns.
Building Blocks of DNN
▪ Layer l has parameters w [l], b [l].
▪ Forward block: input a [l-1], output a [l]; cache z [l] (together with w [l], b [l]) for the backward pass.
▪ Backward block: input da [l] (plus the cached values), output da [l-1], dw [l], db [l].
Backward Propagation
▪ Input: da[l]
▪ Output: da[l-1], dW [l], db[l]
▪ For a single example:
▪ dz [l] = da [l] * g[l]’(z[l])
▪ dW [l] = dz [l] . a [l-1]T
▪ db[l] = dz [l]
▪ da[l-1] = W [l]T . dz[l]
▪ (equivalently dz [l] = W [l+1]T dz [l+1] * g[l]’(z[l]))
▪ Vectorized:
▪ dZ [l] = dA [l] * g[l]’(Z[l])
▪ dW [l] = 1/m (dZ [l] . A [l-1]T)
▪ db[l] = 1/m np.sum(dZ [l], axis = 1, keepdims = True)
▪ dA[l-1] = W [l]T . dZ[l]  (see the sketch below)
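A minimal sketch of this backward step for one layer with a ReLU activation, using the values cached during the forward pass (function and variable names are illustrative assumptions):

import numpy as np

def relu_backward(dA, Z):
    """dZ = dA * g'(Z); for ReLU, g'(z) = 1 where z > 0 and 0 elsewhere."""
    return dA * (Z > 0)

def layer_backward(dA, A_prev, W, Z):
    """Input dA[l] plus the cache (A[l-1], W[l], Z[l]); output dA[l-1], dW[l], db[l]."""
    m = A_prev.shape[1]
    dZ = relu_backward(dA, Z)                      # dZ[l] = dA[l] * g[l]'(Z[l])
    dW = (dZ @ A_prev.T) / m                       # dW[l] = (1/m) dZ[l] . A[l-1]T
    db = np.sum(dZ, axis=1, keepdims=True) / m     # db[l] = (1/m) np.sum(dZ[l], axis=1, keepdims=True)
    dA_prev = W.T @ dZ                             # dA[l-1] = W[l]T . dZ[l]
    return dA_prev, dW, db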
Summary
[Figure: forward pass X → ReLU → ReLU → Sigmoid → ŷ, caching z[l] at each layer; the backward pass is initialized with da[3] = -(y/a) + (1-y)/(1-a) and yields dw[1], db[1], dw[2], db[2], dw[3], db[3].]
Empirical Learning
▪ Applied deep learning is an empirical process.
Bias and Variance
[Figure: example classifications labeled y = 1 and y = 0.]
▪ Human error ≈ 0%
▪ Optimal (Bayes) error ≈ 0%
[Flowchart: first check for high bias, then for high variance; if the answer to both is "No", done.]
Regularization
Logistic Regression
▪ min_{w,b} J(w, b), where w ∈ ℝ^nx and b ∈ ℝ
▪ J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ(i), y(i)) + (λ/2m) ||w||₂²  (see the sketch below)
▪ L2 regularization: ||w||₂² = Σ_{j=1}^{nx} wj² = wᵀw
▪ L1 regularization: (λ/m) Σ_{j=1}^{nx} |wj| = (λ/m) ||w||₁
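A small sketch of the L2-regularized logistic regression cost in NumPy (the variable names and the explicit cross-entropy form are assumptions consistent with the formulas above):

import numpy as np

def regularized_cost(w, b, X, Y, lambd):
    """J(w, b) = (1/m) sum of losses + (lambda/2m) ||w||_2^2, for w: (nx, 1), X: (nx, m), Y: (1, m)."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))                       # predictions y-hat, shape (1, m)
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))          # (lambda/2m) * w^T w
    return cross_entropy + l2_penalty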
Neural Network
▪ J(W, b) = (1/m) Σ_{i=1}^{m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1}^{L} ||W [l]||_F²
▪ ||W [l]||_F² = Σ_{i=1}^{n[l]} Σ_{j=1}^{n[l-1]} (w_ij [l])², where W [l]: (n[l], n[l-1])
▪ ||·||_F² is the Frobenius norm (the sum of the squared entries of the matrix)
▪ With regularization, dW [l] = (from backpropagation) + (λ/m) W [l]
▪ W [l] := W [l] - α dW [l]
Weight Decay
▪ Substituting the regularized gradient into the update gives W [l] := (1 - αλ/m) W [l] - α (from backpropagation): L2 regularization shrinks the weights a little on every step, which is why it is also called "weight decay" (see the sketch below).
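A minimal sketch of this update over all layers (grads["dW" + l] stands in for the "from backpropagation" term; names are illustrative):

def update_with_weight_decay(parameters, grads, lambd, alpha, m, L):
    """W[l] := W[l] - alpha*(dW_backprop + (lambda/m) W[l]) = (1 - alpha*lambda/m) W[l] - alpha*dW_backprop."""
    for l in range(1, L + 1):
        dW = grads["dW" + str(l)] + (lambd / m) * parameters["W" + str(l)]
        parameters["W" + str(l)] = parameters["W" + str(l)] - alpha * dW
        parameters["b" + str(l)] = parameters["b" + str(l)] - alpha * grads["db" + str(l)]
    return parameters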
▪ Recall the regularized cost: J(W, b) = (1/m) Σ_{i=1}^{m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1}^{L} ||W [l]||_F²
Dropout Regularization
Implementing Dropout
▪ Illustrate with layer l = 3 and keep_prob = 0.8:
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
# d3 is a boolean mask: each unit of layer 3 is kept with probability keep_prob
a3 = np.multiply(a3, d3)   # a3 *= d3: zero out the dropped units
a3 /= keep_prob            # "inverted dropout": scale up to compensate for the dropped units
▪ With 50 units in layer 3, about 10 units (20%) are shut off, so z [4] = w [4] a [3] + b [4] would be reduced by about 20% in expectation; dividing a3 by keep_prob corrects for this (see the sketch below).
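A small sketch wrapping the same idea as a reusable step with a training flag (the function name is illustrative; at test time dropout is simply not applied, and no extra scaling is needed precisely because of the division by keep_prob during training):

import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True):
    """Inverted dropout on an activation matrix a of shape (units, m)."""
    if not training:
        return a                                      # no dropout (and no rescaling) at test time
    mask = np.random.rand(*a.shape) < keep_prob       # keep each unit with probability keep_prob
    return (a * mask) / keep_prob                     # scale up so the expected activation is unchanged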
[Figure: keep_prob can be set per layer, e.g. 1.0 for the input and small layers and lower values such as 0.7 or 0.5 for larger layers.]
Data Augmentation
Early Stopping
▪ Orthogonalization: treat two tasks separately:
• Optimize the cost function J (gradient descent, …)
• Do not overfit (regularization, …)
▪ Early stopping mixes these two tasks: halting gradient descent early keeps w small but also stops optimizing J.
Vanishing/Exploding Gradients
Xavier Initialization
▪ z = w1 x1 + w2 x2 + ⋯ + wn xn + b
▪ Large n → smaller wi, so scale the variance of each weight by the number of inputs: Var(wi) = 1/n
▪ For tanh, use √(1/n[l-1]) (Xavier initialization); a common variant uses √(2/(n[l-1] + n[l])), and for ReLU √(2/n[l-1]) works well
▪ W [l] = np.random.randn(shape)*np.sqrt(1/n[l-1])  (a fuller sketch over all layers follows below)
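A short sketch applying this scaling across all layers (the relu_layers option using 2/n[l-1] is an assumed alternative; layer_dims and the function name are illustrative):

import numpy as np

def initialize_parameters_xavier(layer_dims, relu_layers=False):
    """Scale W[l] by sqrt(1/n[l-1]) (Xavier, for tanh) or sqrt(2/n[l-1]) (assumed variant for ReLU)."""
    numerator = 2.0 if relu_layers else 1.0
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]                       # n[l-1]
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * np.sqrt(numerator / fan_in)
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters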
Improving Training Speed
Optimization Algorithms
▪ Split the training set into mini-batches X{1}, Y{1}, X{2}, Y{2}, …
▪ However, for big data analytics m could be in the range of m = 5,000,000, so we form mini-batches of 1,000 examples each (5,000 mini-batches in total).
▪ Mini-batch t is written X{t}, Y{t}.
for t = 1, …, 5,000 {
    Forward prop on X{t}
    Compute cost J{t} = (1/1000) Σ_{i=1}^{1000} L(ŷ(i), y(i)) + (λ/(2·1000)) Σ_{l=1}^{L} ||W [l]||_F²
    Backprop to compute gradients w.r.t. J{t} (using (X{t}, Y{t}))
    W [l] := W [l] - α dW [l], b[l] := b[l] - α db[l]
}
▪ One pass through the training set is called an epoch; repeat the loop for multiple epochs (sketched in code below).
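A compact sketch of one epoch of this loop; train_step stands in for the forward prop, cost J{t}, backprop, and parameter update shown above (the function signature is an illustrative assumption):

def minibatch_epoch(X, Y, parameters, train_step, batch_size=1000):
    """One pass (epoch) over the training set, one gradient descent step per mini-batch X{t}, Y{t}."""
    m = X.shape[1]                                       # number of training examples (columns)
    for t in range(0, m, batch_size):
        X_t = X[:, t:t + batch_size]                     # mini-batch X{t}
        Y_t = Y[:, t:t + batch_size]                     # mini-batch Y{t}
        parameters = train_step(X_t, Y_t, parameters)    # forward prop, compute J{t}, backprop, update
    return parameters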
Exponentially Weighted Averages
V0 = 0
V1 = 0.9 V0 + 0.1 θ1
V2 = 0.9 V1 + 0.1 θ2
V3 = 0.9 V2 + 0.1 θ3
…
Vt = 0.9 Vt-1 + 0.1 θt
Vθ = 0
Repeat {
Get next θt
Vθ := β Vθ + (1 - β) θt
}
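A tiny sketch of this running average (β = 0.9 matches the example above; the input sequence is illustrative):

def exponentially_weighted_average(thetas, beta=0.9):
    """V := beta*V + (1 - beta)*theta_t, starting from V = 0; returns all V_t values."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

print(exponentially_weighted_average([10, 12, 9, 11, 13]))   # smoothed version of the input sequence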
Momentum
[Figure: cost contours with gradient descent paths labeled "slower learning" and "faster learning" (with momentum).]
On iteration t:
    Compute dW, db on the current mini-batch
    VdW = β VdW + (1 - β) dW   (same form as Vθ := β Vθ + (1 - β) θt)
    Vdb = β Vdb + (1 - β) db
    W := W - α VdW,  b := b - α Vdb   (see the sketch below)
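A minimal sketch of one momentum step; v_dW and v_db are carried over between iterations and start at zero (names are illustrative):

def momentum_step(W, b, dW, db, v_dW, v_db, beta=0.9, alpha=0.01):
    """Gradient descent with momentum: v := beta*v + (1 - beta)*grad, then move along v."""
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db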
Implementation Details
Learning Rate Decay
α = α₀ / (1 + decay_rate × epoch_num)
With α₀ = 0.2 and decay_rate = 1:
Epoch   α
1       0.1
2       0.067
3       0.05
4       0.04
…       …
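A one-function sketch of this schedule, reproducing the table above for α₀ = 0.2 and decay_rate = 1:

def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch in range(1, 5):
    print(epoch, round(decayed_learning_rate(0.2, 1, epoch), 3))   # 0.1, 0.067, 0.05, 0.04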
[Figure: cost surface plotted over parameters W1 and W2.]