CV w6 - Deep Learning
Objectives
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Traditional Machine Learning and Deep Learning
l As a model based on unsupervised feature learning and feature hierarchy learning, deep learning has great advantages in fields such as computer vision, speech recognition, and natural language processing.
Traditional Machine Learning
Figure: traditional machine learning workflow (issue analysis, problem locating, and model training).
Deep Learning
l Generally, the deep learning architecture is a deep neural network. "Deep" in "deep learning" refers to the number of layers of the neural network.
Figure: analogy between a biological neuron (nucleus, axon) and a neural network with an input layer, hidden layers, and an output layer.
Neural Network
l Currently, the definition of the neural network has not been determined yet. Hecht-Nielsen, a neural network researcher in the U.S., defines a neural network as a computer system composed of simple and highly interconnected processing elements, which process information by dynamic response to external inputs.
Development History of Neural Networks
Figure: timeline of neural network development, from the perceptron and the XOR problem (golden age, then AI winter) to MLP, SVM, and deep networks.
Single-Layer Perceptron
l Input vector: 𝑋 = [𝑥0, 𝑥1, …, 𝑥𝑛]ᵀ
l Weight: 𝑊 = [𝜔0, 𝜔1, …, 𝜔𝑛]ᵀ, in which 𝜔0 is the offset.
l Activation function: 𝑂 = sign(𝑛𝑒𝑡) = 1 if 𝑛𝑒𝑡 > 0, and −1 otherwise, where 𝑛𝑒𝑡 = Σ_{𝑖=0}^{𝑛} 𝜔𝑖𝑥𝑖 = 𝑾ᵀ𝐗.
l The preceding perceptron is equivalent to a classifier. It uses the high-dimensional vector 𝑋 as the input and performs binary classification on input samples in the high-dimensional space. When 𝑾ᵀ𝐗 > 0, O = 1 and the sample is classified into one type; otherwise O = −1 and the sample is classified into the other type. The boundary between the two types is 𝑾ᵀ𝐗 = 0, which is a high-dimensional hyperplane. A minimal training sketch follows.
Figure: the AND, OR, and XOR classification problems.
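l The sketch below is a minimal NumPy illustration (not taken from the slides) that trains the sign-activation perceptron above on the AND problem; the data encoding, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

def sign(net):
    return np.where(net > 0, 1, -1)

# AND problem: each input has a leading 1 so that w0 acts as the offset
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t = np.array([-1, -1, -1, 1])           # target labels in {-1, +1}

w = np.zeros(3)                         # [w0, w1, w2]
eta = 0.1                               # learning rate (assumed value)

for epoch in range(20):
    for x_i, t_i in zip(X, t):
        o = sign(w @ x_i)               # O = sign(W^T X)
        w += eta * (t_i - o) * x_i      # classic perceptron update rule

print(sign(X @ w))                      # expected: [-1 -1 -1  1]
```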
Feedforward Neural Network
Figure: a feedforward network with an input layer, two hidden layers, and an output layer.
Solution of XOR
Figure: a feedforward network with one hidden layer (weights w0 to w5) that solves the XOR problem.
Impacts of Hidden Layers on a Neural Network
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Gradient Descent and Loss Function
l The gradient of the multivariate function 𝑜 = 𝑓(𝑥) = 𝑓(𝑥0, 𝑥1, …, 𝑥𝑛) at 𝑋′ = [𝑥0′, 𝑥1′, …, 𝑥𝑛′]ᵀ is shown as follows:
∇𝑓(𝑥0′, 𝑥1′, …, 𝑥𝑛′) = [∂𝑓/∂𝑥0, ∂𝑓/∂𝑥1, …, ∂𝑓/∂𝑥𝑛]ᵀ |_{𝑋=𝑋′}
The direction of the gradient vector is the fastest growing direction of the function. As a result, the direction of the negative gradient vector −∇𝑓 is the fastest descent direction of the function.
l During the training of the deep learning network, target classification errors must be parameterized. A loss function (error function) is used, which reflects the error between the target output and the actual output of the perceptron. For a single training sample 𝑥, the most common error function is the quadratic cost function:
𝐸(𝑤) = 1/2 Σ_{𝑑∈𝐷} (𝑡𝑑 − 𝑜𝑑)²
Figure: the error surface of 𝐸(𝑊) = 1/2 Σ_{𝑑∈𝐷} (𝑡𝑑 − 𝑜𝑑)² over the weight space.
Common Loss Functions in Deep Learning
l Quadratic cost function:
𝐸(𝑊) = 1/2 Σ_{𝑑∈𝐷} (𝑡𝑑 − 𝑜𝑑)²
l Cross-entropy error function:
𝐸(𝑊) = −1/𝑛 Σ_{𝑑∈𝐷} [𝑡𝑑 ln 𝑜𝑑 + (1 − 𝑡𝑑) ln(1 − 𝑜𝑑)]
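l As a rough illustration (not from the slides), the two cost functions can be computed as follows; `t` and `o` are assumed to be arrays of targets and network outputs.

```python
import numpy as np

def quadratic_cost(t, o):
    """E(W) = 1/2 * sum_d (t_d - o_d)^2"""
    return 0.5 * np.sum((t - o) ** 2)

def cross_entropy_cost(t, o, eps=1e-12):
    """E(W) = -1/n * sum_d [t_d*ln(o_d) + (1 - t_d)*ln(1 - o_d)]"""
    o = np.clip(o, eps, 1 - eps)        # avoid log(0)
    return -np.mean(t * np.log(o) + (1 - t) * np.log(1 - o))

t = np.array([1.0, 0.0, 1.0])
o = np.array([0.9, 0.2, 0.7])
print(quadratic_cost(t, o), cross_entropy_cost(t, o))
```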
Batch Gradient Descent Algorithm (BGD)
l In the training sample set 𝐷, each sample is recorded as <𝑋, 𝑡>, in which 𝑋 is the input vector, 𝑡 the target output, 𝑜 the actual output, and 𝜂 the learning rate. The procedure (a runnable sketch follows):
p Initialize each 𝑤𝑖 to a random value with a small absolute value.
p Before the end condition is met:
n Initialize each ∆𝑤𝑖 to zero.
n For each <𝑋, 𝑡> in 𝐷:
− Input 𝑋 to this unit and calculate the output 𝑜.
− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −𝜂 · 1/𝑛 · Σ_{𝑑∈𝐷} ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖.
n For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖.
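l A compact sketch of batch gradient descent (assumptions not taken from the slide: a single linear unit, quadratic cost, and synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
t = X @ true_w                                 # target outputs

w = rng.normal(scale=0.01, size=3)             # small random initial weights
eta = 0.1

for epoch in range(200):
    o = X @ w                                  # outputs for the WHOLE training set
    grad = X.T @ (o - t) / len(X)              # dE/dw for E = 1/(2n) * sum (t - o)^2
    w -= eta * grad                            # one update per full pass over D

print(w)                                       # approaches [2.0, -1.0, 0.5]
```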
Stochastic Gradient Descent Algorithm (SGD)
l To address the BGD algorithm defect, a common variant called the Incremental Gradient Descent algorithm is used, which is also called the Stochastic Gradient Descent (SGD) algorithm. One implementation is called Online Learning, which updates the gradient based on each sample:
n For each <𝑋, 𝑡> in 𝐷, for each 𝑤𝑖 in this unit: 𝑤𝑖 += −𝜂 ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖.
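l The same toy problem as in the BGD sketch, updated one sample at a time (again a sketch under assumed data and hyperparameters, not the slide's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w, eta = rng.normal(scale=0.01, size=3), 0.1

for epoch in range(20):
    for x_d, t_d in zip(X, t):              # one weight update per training sample
        o_d = x_d @ w
        w -= eta * (o_d - t_d) * x_d        # gradient of 1/2 * (t_d - o_d)^2

print(w)                                    # approaches [2.0, -1.0, 0.5]
```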
Mini-Batch Gradient Descent Algorithm (MBGD)
l To address the defects of the previous two gradient descent algorithms, the Mini-batch Gradient Descent Algorithm (MBGD) was proposed and has become the most widely used. A small number of samples, the batch size (BS), are used at a time to calculate ∆𝑤𝑖, and then the weight is updated accordingly (a runnable sketch follows):
p Initialize each 𝑤𝑖 to a random value with a small absolute value.
p Before the end condition is met:
n Initialize each ∆𝑤𝑖 to zero.
n For each <𝑋, 𝑡> in the BS samples of the next batch in 𝐷:
− Input 𝑋 to this unit and calculate the output 𝑜.
− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −𝜂 · 1/𝑛 · Σ_{𝑑∈BS} ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖.
n For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖.
n After the last batch, the training samples are shuffled into a random order.
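l A mini-batch sketch on the same assumed toy problem (batch size and other hyperparameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w, eta, bs = rng.normal(scale=0.01, size=3), 0.1, 16   # bs = batch size (assumed)

for epoch in range(100):
    perm = rng.permutation(len(X))                      # shuffle samples each epoch
    for start in range(0, len(X), bs):
        idx = perm[start:start + bs]
        xb, tb = X[idx], t[idx]
        grad = xb.T @ (xb @ w - tb) / len(idx)          # average gradient over the mini-batch
        w -= eta * grad

print(w)                                                # approaches [2.0, -1.0, 0.5]
```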
Backpropagation Algorithm (1)
l Signals are propagated in the forward direction, and errors are propagated in the backward direction.
Figure: forward propagation direction and backpropagation direction through the network layers.
Backpropagation Algorithm (2)
l According to the following formulas, errors in the input, hidden, and output layers are accumulated to generate the error in the loss function:
E = 1/2 Σ_{d∈D} [t_d − f(net_d)]² = 1/2 Σ_{d∈D} [t_d − f(Σ_{c∈C} w_c y_c)]²
p Expanded hidden layer error:
E = 1/2 Σ_{d∈D} [t_d − f(Σ_{c∈C} w_c f(net_c))]²
p Expanded input layer error:
E = 1/2 Σ_{d∈D} [t_d − f(Σ_{c∈C} w_c f(Σ_{b∈B} w_b x_b))]²
Backpropagation Algorithm (3)
l To minimize error E, gradient descent iterative calculation can be used to solve w_c and w_b, that is, calculating w_c and w_b to minimize error E.
l Formula:
∆w_c = −η ∂E/∂w_c, c ∈ C
∆w_b = −η ∂E/∂w_b, b ∈ B
l If there are multiple hidden layers, chain rules are used to take a derivative for each layer to obtain the optimized parameters by iteration.
Backpropagation Algorithm (4)
l For a neural network with any number of layers, the arranged formula for training is as follows:
∆w^l_{jk} = −η δ^{l+1}_k f_j(z^l_j)
δ^l_j = f'_j(z^l_j)(t_j − f_j(z^l_j)), if l ∈ outputs, (1)
δ^l_j = (Σ_k δ^{l+1}_k w^l_{jk}) f'_j(z^l_j), otherwise. (2)
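l Below is a minimal numerical sketch of this chain-rule computation for one sample (assumptions not from the slides: sigmoid activations, quadratic cost, arbitrary layer sizes; the sign convention computes dE/dz directly rather than the slides' δ = (t − o)f'). The last line checks one backpropagated gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input sample
t = np.array([1.0, 0.0])          # target output
W1 = rng.normal(size=(4, 3))      # hidden layer weights (4 units)
W2 = rng.normal(size=(2, 4))      # output layer weights (2 units)

def loss(W1, W2):
    return 0.5 * np.sum((t - sigmoid(W2 @ sigmoid(W1 @ x))) ** 2)

# Forward pass
z1 = W1 @ x; h = sigmoid(z1)
z2 = W2 @ h; o = sigmoid(z2)

# Backward pass: output-layer delta (rule 1), then hidden-layer delta (rule 2)
delta2 = (o - t) * o * (1 - o)            # dE/dz2
delta1 = (W2.T @ delta2) * h * (1 - h)    # dE/dz1
grad_W2 = np.outer(delta2, h)             # dE/dW2
grad_W1 = np.outer(delta1, x)             # dE/dW1

# Compare one entry with a numerical derivative; the two values should agree closely
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(grad_W1[0, 0], (loss(W1p, W2) - loss(W1, W2)) / eps)
```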
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Activation Function
l Activation functions are important for the neural network model to learn and understand complex non-linear functions. They allow introduction of non-linear features to the network.
output = f(w1·x1 + w2·x2 + w3·x3) = f(𝑾 · 𝐗)
Sigmoid
𝑓(𝑥) = 1 / (1 + 𝑒^(−𝑥))
Tanh
tanh(𝑥) = (𝑒^𝑥 − 𝑒^(−𝑥)) / (𝑒^𝑥 + 𝑒^(−𝑥))
Softsign
𝑓(𝑥) = 𝑥 / (|𝑥| + 1)
Rectified Linear Unit (ReLU)
𝑦 = 𝑥, if 𝑥 ≥ 0; 𝑦 = 0, if 𝑥 < 0
Softplus
𝑓(𝑥) = ln(𝑒^𝑥 + 1)
Softmax
l Softmax function:
σ(𝑧)𝑗 = 𝑒^{𝑧𝑗} / Σ_{𝑘=1}^{𝐾} 𝑒^{𝑧𝑘}
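l The activation functions above can be written in a few lines; the sketch below is an illustration with arbitrary test values, not part of the original slides.

```python
import numpy as np

def sigmoid(x):  return 1 / (1 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def softsign(x): return x / (np.abs(x) + 1)
def relu(x):     return np.maximum(0, x)
def softplus(x): return np.log(1 + np.exp(x))

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / np.sum(e)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), softmax(x), sep="\n")
```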
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Normalizer
l Regularization is an important and effective technique to reduce generalization error in machine learning. It is especially useful for deep learning models, which tend to over-fit due to their large number of parameters. Researchers have therefore proposed many effective techniques to prevent over-fitting, including:
p Adding constraints to parameters, such as 𝐿1 and 𝐿2 norms
p Expanding the training set, for example by adding noise or transforming data
p Dropout
p Early stopping
Penalty Parameters
l Many regularization methods restrict the learning capability of models by adding a penalty parameter Ω(𝜃) to the objective function 𝐽. Assume that the target function after regularization is 𝐽̃:
𝐽̃(𝜃; 𝑋, 𝑦) = 𝐽(𝜃; 𝑋, 𝑦) + 𝛼Ω(𝜃),
l where 𝛼 ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω and the standard objective function 𝐽(𝑋; 𝜃). If 𝛼 is set to 0, no regularization is performed. The penalty in regularization increases with 𝛼.
𝐿1 Regularization
l Add an 𝐿1 norm constraint to the model parameters, that is,
𝐽̃(𝑤; 𝑋, 𝑦) = 𝐽(𝑤; 𝑋, 𝑦) + 𝛼‖𝑤‖₁,
l If a gradient method is used to solve the problem, the parameter gradient is
∇𝐽̃(𝑤) = 𝛼·sign(𝑤) + ∇𝐽(𝑤).
𝐿2 Regularization
l Add an 𝐿2 norm penalty term to prevent overfitting:
𝐽̃(𝑤; 𝑋, 𝑦) = 𝐽(𝑤; 𝑋, 𝑦) + (𝛼/2)‖𝑤‖₂²,
l so that one gradient update step becomes
𝑤 ← (1 − 𝜀𝛼)𝑤 − 𝜀∇𝐽(𝑤),
l where 𝜀 is the learning rate. Compared with a common gradient optimization formula, this formula multiplies the parameter by a reduction factor.
𝐿1 vs. 𝐿2
l The major differences between 𝐿2 and 𝐿1 (a small numerical illustration follows):
p According to the preceding analysis, 𝐿1 can generate a sparser model than 𝐿2. When the value of parameter 𝑤 is small, 𝐿1 regularization can directly reduce the parameter value to 0, so it can be used for feature selection.
p From the perspective of probability, many norm constraints are equivalent to adding a prior probability distribution to the parameters. In 𝐿2 regularization, the parameter values comply with a Gaussian distribution; in 𝐿1 regularization, the parameter values comply with a Laplace distribution.
Figure: contours of the 𝐿1 and 𝐿2 penalties.
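l As a hedged sketch (names and values are illustrative, not from the slides), the two penalty terms enter a gradient step as follows:

```python
import numpy as np

def l1_grad(w, alpha):
    """Gradient contribution of the L1 penalty: alpha * sign(w)."""
    return alpha * np.sign(w)

def l2_update(w, grad_J, eps, alpha):
    """One L2-regularized (weight-decay) step: w <- (1 - eps*alpha)*w - eps*grad_J."""
    return (1 - eps * alpha) * w - eps * grad_J

w = np.array([0.5, -0.2, 0.0])
print(l1_grad(w, alpha=0.01))
print(l2_update(w, grad_J=np.array([0.1, 0.1, 0.1]), eps=0.1, alpha=0.01))
```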
Dataset Expansion
l The most effective way to prevent over-fitting is to enlarge the training set: a larger training set has a smaller over-fitting probability. Dataset expansion is a time-saving method, but it varies across fields.
p A common method in the object recognition field is to rotate or scale images. (The prerequisite of such image transformations is that the class of the image cannot be changed by the transformation. For example, in handwritten digit recognition, the categories 6 and 9 can easily be confused after rotation.)
Dropout
l Dropout is a common and simple regularization method, which has been widely used since 2014. Simply put, Dropout randomly discards some inputs during the training process. In this case, the parameters corresponding to the discarded inputs are not updated. As an integration method, Dropout effectively combines the results of all the sub-networks obtained by randomly dropping inputs. A minimal sketch follows.
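l A minimal sketch of the dropping operation (the inverted-dropout scaling shown here is one common variant and is not spelled out on the slide; the drop probability is an assumed value):

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly zero inputs during training; scale the rest so the expected value is unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h))     # dropped entries become 0, kept entries are scaled to 2.0
```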
Early Stopping
l A test on the validation set data can be inserted during training. When the loss on the validation set starts to increase, training is stopped early, as in the sketch below.
Figure: training curve showing the early stopping point where the validation loss begins to rise.
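l A sketch of the stopping logic on a simulated validation-loss curve (the curve, patience value, and variable names are illustrative assumptions, not from the slide):

```python
# Simulated validation-loss curve: it decreases, then starts to rise (over-fitting).
val_losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.41, 0.43, 0.46, 0.50, 0.55]

best_loss, best_epoch, patience, bad_epochs = float("inf"), -1, 3, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # validation loss has risen for `patience` epochs
            print(f"early stopping at epoch {epoch}; best epoch was {best_epoch}")
            break
```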
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Optimizer
l There are various optimized versions of gradient descent algorithms. In object-oriented language implementations, different gradient descent algorithms are often encapsulated into objects called optimizers.
Momentum Optimizer
l A most basic improvement is to add a momentum term for ∆𝑤𝑗𝑖. Assume that the weight correction of the 𝑛-th iteration is ∆𝑤𝑗𝑖(𝑛). The weight correction rule (sketched below) is:
∆𝑤𝑗𝑖(𝑛) = −𝜂 δ^{𝑙+1}_𝑗 𝑥^𝑙_𝑖(𝑛) + 𝛼∆𝑤𝑗𝑖(𝑛 − 1),
l where 𝛼 (0 ≤ 𝛼 < 1) is the momentum factor.
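l A sketch of the momentum rule on the same assumed toy linear-regression problem used earlier (learning rate and momentum factor are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, alpha = 0.05, 0.9                        # learning rate and momentum factor (assumed)
delta_w = np.zeros_like(w)

for epoch in range(200):
    grad = X.T @ (X @ w - t) / len(X)
    delta_w = -eta * grad + alpha * delta_w   # new correction = gradient step + momentum term
    w += delta_w

print(w)                                      # approaches [2.0, -1.0, 0.5]
```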
Advantages and Disadvantages of Momentum Optimizer
l Advantages:
p Enhances the stability of the gradient correction direction and reduces mutations.
p In areas where the gradient direction is stable, the ball rolls faster and faster (there is a speed upper limit because 𝛼 < 1), which helps the ball quickly overshoot flat areas and accelerates convergence.
p A small ball with inertia is more likely to roll over some narrow local extrema.
l Disadvantages:
p The learning rate 𝜂 and momentum 𝛼 need to be manually set, which often requires more experiments to determine the appropriate values.
AdaGrad Optimizer (1)
l The common feature of the stochastic gradient descent algorithm (SGD), mini-batch gradient descent algorithm (MBGD), and momentum optimizer is that each parameter is updated with the same learning rate.
l According to the approach of AdaGrad, different learning rates are set for different parameters (a sketch follows):
𝑔𝑡 = ∂C(𝑡, 𝑜)/∂𝑤𝑡 (gradient calculation)
𝑟𝑡 = 𝑟𝑡−1 + 𝑔𝑡² (square gradient accumulation)
∆𝑤𝑡 = −𝜂 / (𝜀 + √𝑟𝑡) · 𝑔𝑡 (computing update)
𝑤𝑡+1 = 𝑤𝑡 + ∆𝑤𝑡 (application update)
l 𝑔𝑡 indicates the 𝑡-th gradient, and 𝑟 is a gradient accumulation variable whose initial value is 0 and which accumulates the squared gradients. 𝜂 is the global learning rate, which needs to be set manually, and 𝜀 is a small constant added for numerical stability.
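l An AdaGrad sketch on the assumed toy problem from the earlier optimizer examples (hyperparameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, eps = 0.5, 1e-7
r = np.zeros_like(w)                         # accumulated squared gradients

for step in range(1000):
    g = X.T @ (X @ w - t) / len(X)           # gradient
    r += g ** 2                              # square gradient accumulation
    w += -eta / (eps + np.sqrt(r)) * g       # per-parameter step shrinks as r grows

print(w)                                     # moves toward [2.0, -1.0, 0.5]
```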
AdaGrad Optimizer (2)
l The AdaGrad optimization algorithm shows that 𝑟 continues increasing while the overall learning rate keeps decreasing as the algorithm iterates. This is because we want the learning rate to decrease as the number of updates increases. In the initial learning phase, we are far away from the optimal solution to the loss function. As the number of updates increases, we get closer to the optimal solution, and therefore the learning rate can decrease.
l Pros:
p The learning rate is automatically updated. As the number of updates increases, the learning rate decreases.
l Cons:
p The denominator keeps accumulating, so the learning rate will eventually become very small and the algorithm will become ineffective.
RMSProp Optimizer
l The RMSProp optimizer is an improved AdaGrad optimizer. It introduces an attenuation coefficient so that 𝑟 decays by a certain ratio in each round (sketched below):
𝑟𝑡 = 𝛽𝑟𝑡−1 + (1 − 𝛽)𝑔𝑡² (decaying square gradient accumulation)
∆𝑤𝑡 = −𝜂 / (𝜀 + √𝑟𝑡) · 𝑔𝑡 (computing update)
𝑤𝑡+1 = 𝑤𝑡 + ∆𝑤𝑡 (application update)
l 𝑔𝑡 indicates the 𝑡-th gradient, and 𝑟 is a gradient accumulation variable whose initial value is 0; unlike in AdaGrad, it does not simply keep increasing but is adjusted using the attenuation factor 𝛽. 𝜂 indicates the global learning rate, which needs to be set manually. 𝜀 is a small constant, set to about 10⁻⁷ for numerical stability.
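l An RMSProp sketch on the same assumed toy problem (the decaying average of squared gradients replaces AdaGrad's ever-growing sum; hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, beta, eps = 0.05, 0.9, 1e-7
r = np.zeros_like(w)

for step in range(500):
    g = X.T @ (X @ w - t) / len(X)
    r = beta * r + (1 - beta) * g ** 2           # decaying average of squared gradients
    w += -eta / (eps + np.sqrt(r)) * g

print(w)                                         # approaches [2.0, -1.0, 0.5]
```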
Adam Optimizer (1)
l Adaptive Moment Estimation (Adam): Developed based on AdaGrad and AdaDelta, Adam maintains two additional variables 𝑚𝑡 and 𝑣𝑡 for each variable to be trained:
𝑚𝑡 = 𝛽1𝑚𝑡−1 + (1 − 𝛽1)𝑔𝑡
𝑣𝑡 = 𝛽2𝑣𝑡−1 + (1 − 𝛽2)𝑔𝑡²
Adam Optimizer (2)
l If 𝑚𝑡 and 𝑣𝑡 are initialized using the zero vector, they stay close to 0 during the initial iterations, especially when 𝛽1 and 𝛽2 are close to 1. To solve this problem, we use the bias-corrected estimates 𝑚̂𝑡 and 𝑣̂𝑡:
𝑚̂𝑡 = 𝑚𝑡 / (1 − 𝛽1^𝑡)
𝑣̂𝑡 = 𝑣𝑡 / (1 − 𝛽2^𝑡)
l The weight update rule of Adam is as follows (sketched below):
𝑤𝑡+1 = 𝑤𝑡 − 𝜂 · 𝑚̂𝑡 / (√𝑣̂𝑡 + 𝜖)
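l An Adam sketch on the same assumed toy problem (the default-style hyperparameter values here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = np.zeros_like(w), np.zeros_like(w)

for step in range(1, 1001):
    g = X.T @ (X @ w - t) / len(X)
    m = beta1 * m + (1 - beta1) * g              # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** step)              # bias correction
    v_hat = v / (1 - beta2 ** step)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)                                         # approaches [2.0, -1.0, 0.5]
```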
Figure: comparison of optimization algorithms in contour maps of loss functions, and comparison of optimization algorithms at a saddle point.
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Convolutional Neural Network
l A convolutional neural network (CNN) is a feedforward neural network. Its artificial neurons may respond to surrounding units within the coverage range. CNN excels at image processing. It includes a convolutional layer, a pooling layer, and a fully connected layer.
Main Concepts of CNN
l Local receptive field: It is generally considered that human perception of the outside world goes from local to global. Spatial correlations among local pixels of an image are closer than those among distant pixels. Therefore, each neuron does not need to know the global image; it only needs to know the local image. The local information is combined at a higher level to generate global information.
Architecture of Convolutional Neural Network
Figure: a CNN that maps an input image through feature maps (convolution + nonlinearity, max pooling, vectorization) to an output layer giving the probabilities P_bird, P_dog, and P_cat.
Single-Filter Calculation (1)
l Description of the convolution calculation.
Single-Filter Calculation (2)
l Demonstration of the convolution calculation.
Convolutional Layer
l The basic architecture of a CNN is multi-channel convolution consisting of multiple single convolutions. The output of the previous layer (or the original image at the first layer) is used as the input of the current layer. It is then convolved with the filters in the layer, and the result serves as the output of this layer. The convolution kernel of each layer contains the weights to be learned. Similar to a fully connected network, after the convolution is complete, the result is biased and activated through activation functions before being input to the next layer.
Figure: input tensor convolved with kernels W1…Wn, biases b1…bn added, activation applied, producing the output tensor (feature maps F1…Fn).
Pooling Layer
l Pooling combines nearby units to reduce the size of the input to the next layer, reducing dimensionality. Common pooling operations include max pooling and average pooling. With max pooling, the maximum value in a small square area is selected as the representative of that area, while with average pooling the mean value is selected. The side length of this small area is the pooling window size. A small sketch of both operations follows.
Figure: a max pooling operation with a pooling window of size 2 sliding across the feature map.
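l The sketch below illustrates a single-channel "valid" convolution (strictly, the cross-correlation used in CNNs) and non-overlapping max pooling; the input values and kernel are arbitrary examples, not taken from the slides.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
fmap = conv2d(image, kernel)          # 3x3 feature map
print(fmap)
print(max_pool(fmap))                 # pooled result with window size 2
```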
Fully Connected Layer
l The fully connected layer is essentially a classifier. The features extracted by the convolutional layer and pooling layer are straightened (flattened) and placed at the fully connected layer to output and classify results:
σ(𝑧)𝑗 = 𝑒^{𝑧𝑗} / Σ_{𝑘=1}^{𝐾} 𝑒^{𝑧𝑘}
Recurrent Neural Network
l The recurrent neural network (RNN) is a neural network that captures dynamic information in sequential data through periodical connections of hidden layer nodes. It can classify sequential data.
Recurrent Neural Network Architecture (1)
l 𝑋𝑡 is the input of the input sequence at time 𝑡.
l 𝑆𝑡 is the memory unit of the sequence at time 𝑡 and caches previous information: 𝑆𝑡 = 𝜎(𝑈𝑋𝑡 + 𝑊𝑆𝑡−1).
l 𝑂𝑡 = tanh(𝑉𝑆𝑡); after passing through multiple hidden layers, it yields the final output of the sequence at time 𝑡. A forward-pass sketch follows.
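l A minimal forward-pass sketch of this recurrence (the weight shapes, sequence length, and random values are assumptions for illustration only):

```python
import numpy as np

def rnn_forward(X_seq, U, W, V, s0):
    """Simple RNN forward pass: S_t = sigmoid(U @ X_t + W @ S_{t-1}), O_t = tanh(V @ S_t)."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    S, outputs = s0, []
    for X_t in X_seq:
        S = sigmoid(U @ X_t + W @ S)      # memory unit caches previous information
        outputs.append(np.tanh(V @ S))    # output at time t
    return outputs

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, T = 3, 4, 2, 5
U = rng.normal(size=(d_hidden, d_in))
W = rng.normal(size=(d_hidden, d_hidden))
V = rng.normal(size=(d_out, d_hidden))
X_seq = rng.normal(size=(T, d_in))
print(rnn_forward(X_seq, U, W, V, s0=np.zeros(d_hidden)))
```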
Recurrent Neural Network Architecture (2)
Types of Recurrent Neural Networks
Backpropagation Through Time (BPTT)
l BPTT:
p BPTT is traditional backpropagation extended along the time sequence.
p There are two sources of error for the memory unit at time 𝑡 in the sequence: the first is the error from the hidden layer output at time 𝑡; the second is the error propagated back from the memory cell at the next time step 𝑡 + 1.
p The gradient of each weight is computed from these accumulated errors.
Recurrent Neural Network Problem
l 𝑆𝑡 = 𝜎(𝑈𝑋𝑡 + 𝑊𝑆𝑡−1) is expanded along the time sequence.
Generative Adversarial Network (GAN)
l A Generative Adversarial Network is a framework that trains a generator G and a discriminator D through an adversarial process. Through this process, the discriminator learns to tell whether a sample from the generator is fake or real. GAN adopts the mature backpropagation algorithm for training.
GAN Architecture
l Generator/Discriminator
Generative Model and Discriminative Model
l Generative network
p Generates sample data
n Input: Gaussian white noise vector z
n Output: sample data vector x = G(z; θ_G)
l Discriminator network
p Determines whether sample data is real
n Input: real sample data 𝑥_real and generated sample data 𝑥 = 𝐺(𝑧)
n Output: probability y = D(x; θ_D) that the sample is real
Training Rules of GAN
l Optimization objective:
p Value function
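p The value function itself did not survive extraction; the standard GAN minimax objective, which this slide presumably shows, is:
min_G max_D V(D, G) = 𝔼_{x∼p_data(x)}[log D(x)] + 𝔼_{z∼p_z(z)}[log(1 − D(G(z)))]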
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Data Imbalance (1)
l Problem description: In a dataset consisting of various task categories, the number of samples varies greatly from one category to another, and one or more of the predicted categories contain very few samples.
l Impacts:
p Due to the unbalanced number of samples, we cannot get the optimal result, because the model/algorithm never examines the categories with very few samples adequately.
p Since the few observed objects may not be representative of their class, we may fail to obtain adequate samples for verification and test.
Data Imbalance (2)
Vanishing Gradient and Exploding Gradient Problem (1)
l Vanishing gradient: As network layers increase, the derivative value computed during backpropagation decreases, which causes a vanishing gradient problem.
Figure: a chain of layers with weights w2, w3, w4 and biases b1, b2, b3 leading to the cost C.
p According to the chain rule, as layers increase, the derivation result ∂C/∂b1 decreases, resulting in the vanishing gradient problem.
l Solution: For example, gradient clipping is used to alleviate the exploding gradient problem (see the sketch below), while the ReLU activation function and LSTM are used to alleviate the vanishing gradient problem.
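l A minimal sketch of norm-based gradient clipping (the threshold value and example gradient are arbitrary illustrations, not from the slides):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])           # an "exploding" gradient with norm 50
print(clip_gradient(g))               # rescaled to norm 5: [ 3. -4.]
```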
Overfitting
l Problem description: The model performs well on the training set, but badly on the test set.
Summary
Quiz
A. True
B. False
Quiz
3. (Single-choice) There are many types of deep learning neural networks. Which of the following is not a deep learning neural network? ( )
A. CNN
B. RNN
C. LSTM
D. Logistic
A. Activation function
B. Convolutional kernel
C. Pooling
Recommendations
l Online learning website
p https://e.huawei.com/cn/talent/#/home
l Huawei Knowledge Base
p https://support.huawei.com/enterprise/servicecenter?lang=zh
Thank you.
Bring digital to every person, home, and organization for a fully connected, intelligent world.