CV w6 - Deep Learning
Objectives
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Traditional Machine Learning and Deep Learning
l As a model based on unsupervised feature learning and feature hierarchy learning, deep learning has great advantages in fields such as computer vision, speech recognition, and natural language processing.
Traditional Machine Learning
Figure: traditional machine learning workflow (issue analysis, problem locating, and model training).
Deep Learning
l Generally, the deep learning architecture is a deep neural network. "Deep" in "deep learning" refers to the number of layers of the neural network.
Figure: analogy between a biological neuron (nucleus, axon) and a neural network with an input layer, hidden layers, and an output layer.
Neural Network
l Currently, the definition of the neural network has not been determined yet. Hecht-Nielsen, a neural network researcher in the U.S., defines a neural network as a computer system composed of simple and highly interconnected processing elements, which process information by dynamic response to external inputs.
Development History of Neural Networks
Figure: timeline of neural network development, from the perceptron and the XOR problem (golden age, then AI winter) to MLP, SVM, and deep networks.
Single-Layer Perceptron
l Input vector: 𝑋 = [𝑥0, 𝑥1, …, 𝑥𝑛]ᵀ
l Weight: 𝑊 = [𝜔0, 𝜔1, …, 𝜔𝑛]ᵀ, in which 𝜔0 is the offset.
l Activation function: 𝑂 = sign(𝑛𝑒𝑡) = 1 if 𝑛𝑒𝑡 > 0, and −1 otherwise, where 𝑛𝑒𝑡 = Σ_{𝑖=0}^{𝑛} 𝜔𝑖𝑥𝑖 = 𝑾ᵀ𝐗.
l The preceding perceptron is equivalent to a classifier. It uses the high-dimensional vector 𝑋 as the input and performs binary classification on input samples in the high-dimensional space. When 𝑾ᵀ𝐗 > 0, O = 1 and the sample is classified into one type; otherwise O = −1 and the sample is classified into the other type. The boundary between the two types is 𝑾ᵀ𝐗 = 0, which is a high-dimensional hyperplane. A minimal training sketch follows.
Figure: the AND, OR, and XOR classification problems.
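l The sketch below is a minimal NumPy illustration (not taken from the slides) that trains the sign-activation perceptron above on the AND problem; the data encoding, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

def sign(net):
    return np.where(net > 0, 1, -1)

# AND problem: each input has a leading 1 so that w0 acts as the offset
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t = np.array([-1, -1, -1, 1])           # target labels in {-1, +1}

w = np.zeros(3)                         # [w0, w1, w2]
eta = 0.1                               # learning rate (assumed value)

for epoch in range(20):
    for x_i, t_i in zip(X, t):
        o = sign(w @ x_i)               # O = sign(W^T X)
        w += eta * (t_i - o) * x_i      # classic perceptron update rule

print(sign(X @ w))                      # expected: [-1 -1 -1  1]
```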
Feedforward Neural Network
Figure: a feedforward network with an input layer, two hidden layers, and an output layer.
Solution of XOR
Figure: a feedforward network with one hidden layer (weights w0 to w5) that solves the XOR problem.
Impacts of Hidden Layers on a Neural Network
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Gradient Descent and Loss Function
l The gradient of the multivariate function 𝑜 = 𝑓(𝑥) = 𝑓(𝑥0, 𝑥1, …, 𝑥𝑛) at 𝑋′ = [𝑥0′, 𝑥1′, …, 𝑥𝑛′]ᵀ is shown as follows:
∇𝑓(𝑥0′, 𝑥1′, …, 𝑥𝑛′) = [∂𝑓/∂𝑥0, ∂𝑓/∂𝑥1, …, ∂𝑓/∂𝑥𝑛]ᵀ |_{𝑋=𝑋′}
The direction of the gradient vector is the fastest growing direction of the function. As a result, the direction of the negative gradient vector −∇𝑓 is the fastest descent direction of the function.
l During the training of the deep learning network, target classification errors must be parameterized. A loss function (error function) is used, which reflects the error between the target output and the actual output of the perceptron. For a single training sample 𝑥, the most common error function is the quadratic cost function:
𝐸(𝑤) = 1/2 Σ_{𝑑∈𝐷} (𝑡𝑑 − 𝑜𝑑)²
Figure: the error surface of 𝐸(𝑊) = 1/2 Σ_{𝑑∈𝐷} (𝑡𝑑 − 𝑜𝑑)² over the weight space.
Common Loss Functions in Deep Learning
l Quadratic cost function:
𝐸(𝑊) = 1/2 Σ_{𝑑∈𝐷} (𝑡𝑑 − 𝑜𝑑)²
l Cross-entropy error function:
𝐸(𝑊) = −1/𝑛 Σ_{𝑑∈𝐷} [𝑡𝑑 ln 𝑜𝑑 + (1 − 𝑡𝑑) ln(1 − 𝑜𝑑)]
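l As a rough illustration (not from the slides), the two cost functions can be computed as follows; `t` and `o` are assumed to be arrays of targets and network outputs.

```python
import numpy as np

def quadratic_cost(t, o):
    """E(W) = 1/2 * sum_d (t_d - o_d)^2"""
    return 0.5 * np.sum((t - o) ** 2)

def cross_entropy_cost(t, o, eps=1e-12):
    """E(W) = -1/n * sum_d [t_d*ln(o_d) + (1 - t_d)*ln(1 - o_d)]"""
    o = np.clip(o, eps, 1 - eps)        # avoid log(0)
    return -np.mean(t * np.log(o) + (1 - t) * np.log(1 - o))

t = np.array([1.0, 0.0, 1.0])
o = np.array([0.9, 0.2, 0.7])
print(quadratic_cost(t, o), cross_entropy_cost(t, o))
```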
Batch Gradient Descent Algorithm (BGD)
l In the training sample set 𝐷, each sample is recorded as <𝑋, 𝑡>, in which 𝑋 is the input vector, 𝑡 the target output, 𝑜 the actual output, and 𝜂 the learning rate. The procedure (a runnable sketch follows):
p Initialize each 𝑤𝑖 to a random value with a small absolute value.
p Before the end condition is met:
n Initialize each ∆𝑤𝑖 to zero.
n For each <𝑋, 𝑡> in 𝐷:
− Input 𝑋 to this unit and calculate the output 𝑜.
− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −𝜂 · 1/𝑛 · Σ_{𝑑∈𝐷} ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖.
n For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖.
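l A compact sketch of batch gradient descent (assumptions not taken from the slide: a single linear unit, quadratic cost, and synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
t = X @ true_w                                 # target outputs

w = rng.normal(scale=0.01, size=3)             # small random initial weights
eta = 0.1

for epoch in range(200):
    o = X @ w                                  # outputs for the WHOLE training set
    grad = X.T @ (o - t) / len(X)              # dE/dw for E = 1/(2n) * sum (t - o)^2
    w -= eta * grad                            # one update per full pass over D

print(w)                                       # approaches [2.0, -1.0, 0.5]
```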
Stochastic Gradient Descent Algorithm (SGD)
l To address the BGD algorithm defect, a common variant called the Incremental Gradient Descent algorithm is used, which is also called the Stochastic Gradient Descent (SGD) algorithm. One implementation is called Online Learning, which updates the gradient based on each sample:
n For each <𝑋, 𝑡> in 𝐷, for each 𝑤𝑖 in this unit: 𝑤𝑖 += −𝜂 ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖.
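l The same toy problem as in the BGD sketch, updated one sample at a time (again a sketch under assumed data and hyperparameters, not the slide's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w, eta = rng.normal(scale=0.01, size=3), 0.1

for epoch in range(20):
    for x_d, t_d in zip(X, t):              # one weight update per training sample
        o_d = x_d @ w
        w -= eta * (o_d - t_d) * x_d        # gradient of 1/2 * (t_d - o_d)^2

print(w)                                    # approaches [2.0, -1.0, 0.5]
```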
Mini-Batch Gradient Descent Algorithm (MBGD)
l To address the defects of the previous two gradient descent algorithms, the Mini-batch Gradient Descent Algorithm (MBGD) was proposed and has become the most widely used. A small number of samples, the batch size (BS), are used at a time to calculate ∆𝑤𝑖, and then the weight is updated accordingly (a runnable sketch follows):
p Initialize each 𝑤𝑖 to a random value with a small absolute value.
p Before the end condition is met:
n Initialize each ∆𝑤𝑖 to zero.
n For each <𝑋, 𝑡> in the BS samples of the next batch in 𝐷:
− Input 𝑋 to this unit and calculate the output 𝑜.
− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −𝜂 · 1/𝑛 · Σ_{𝑑∈BS} ∂C(𝑡𝑑, 𝑜𝑑)/∂𝑤𝑖.
n For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖.
n After the last batch, the training samples are shuffled into a random order.
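l A mini-batch sketch on the same assumed toy problem (batch size and other hyperparameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w, eta, bs = rng.normal(scale=0.01, size=3), 0.1, 16   # bs = batch size (assumed)

for epoch in range(100):
    perm = rng.permutation(len(X))                      # shuffle samples each epoch
    for start in range(0, len(X), bs):
        idx = perm[start:start + bs]
        xb, tb = X[idx], t[idx]
        grad = xb.T @ (xb @ w - tb) / len(idx)          # average gradient over the mini-batch
        w -= eta * grad

print(w)                                                # approaches [2.0, -1.0, 0.5]
```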
Backpropagation Algorithm (1)
l Signals are propagated in the forward direction, and errors are propagated in the backward direction.
Figure: forward propagation direction and backpropagation direction through the network layers.
Backpropagation Algorithm (2)
l According to the following formulas, errors in the input, hidden, and output layers are accumulated to generate the error in the loss function:
E = 1/2 Σ_{d∈D} [t_d − f(net_d)]² = 1/2 Σ_{d∈D} [t_d − f(Σ_{c∈C} w_c y_c)]²
p Expanded hidden layer error:
E = 1/2 Σ_{d∈D} [t_d − f(Σ_{c∈C} w_c f(net_c))]²
p Expanded input layer error:
E = 1/2 Σ_{d∈D} [t_d − f(Σ_{c∈C} w_c f(Σ_{b∈B} w_b x_b))]²
Backpropagation Algorithm (3)
l To minimize error E, gradient descent iterative calculation can be used to solve w_c and w_b, that is, calculating w_c and w_b to minimize error E.
l Formula:
∆w_c = −η ∂E/∂w_c, c ∈ C
∆w_b = −η ∂E/∂w_b, b ∈ B
l If there are multiple hidden layers, chain rules are used to take a derivative for each layer to obtain the optimized parameters by iteration.
Backpropagation Algorithm (4)
l For a neural network with any number of layers, the arranged formula for training is as follows:
∆w^l_{jk} = −η δ^{l+1}_k f_j(z^l_j)
δ^l_j = f'_j(z^l_j)(t_j − f_j(z^l_j)), if l ∈ outputs, (1)
δ^l_j = (Σ_k δ^{l+1}_k w^l_{jk}) f'_j(z^l_j), otherwise. (2)
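l Below is a minimal numerical sketch of this chain-rule computation for one sample (assumptions not from the slides: sigmoid activations, quadratic cost, arbitrary layer sizes; the sign convention computes dE/dz directly rather than the slides' δ = (t − o)f'). The last line checks one backpropagated gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input sample
t = np.array([1.0, 0.0])          # target output
W1 = rng.normal(size=(4, 3))      # hidden layer weights (4 units)
W2 = rng.normal(size=(2, 4))      # output layer weights (2 units)

def loss(W1, W2):
    return 0.5 * np.sum((t - sigmoid(W2 @ sigmoid(W1 @ x))) ** 2)

# Forward pass
z1 = W1 @ x; h = sigmoid(z1)
z2 = W2 @ h; o = sigmoid(z2)

# Backward pass: output-layer delta (rule 1), then hidden-layer delta (rule 2)
delta2 = (o - t) * o * (1 - o)            # dE/dz2
delta1 = (W2.T @ delta2) * h * (1 - h)    # dE/dz1
grad_W2 = np.outer(delta2, h)             # dE/dW2
grad_W1 = np.outer(delta1, x)             # dE/dW1

# Compare one entry with a numerical derivative; the two values should agree closely
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(grad_W1[0, 0], (loss(W1p, W2) - loss(W1, W2)) / eps)
```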
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Activation Function
l Activation functions are important for the neural network model to learn and understand complex non-linear functions. They allow introduction of non-linear features to the network.
output = f(w1·x1 + w2·x2 + w3·x3) = f(𝑾 · 𝐗)
Sigmoid
𝑓(𝑥) = 1 / (1 + 𝑒^(−𝑥))
Tanh
tanh(𝑥) = (𝑒^𝑥 − 𝑒^(−𝑥)) / (𝑒^𝑥 + 𝑒^(−𝑥))
Softsign
𝑓(𝑥) = 𝑥 / (|𝑥| + 1)
Rectified Linear Unit (ReLU)
𝑦 = 𝑥, if 𝑥 ≥ 0; 𝑦 = 0, if 𝑥 < 0
Softplus
𝑓(𝑥) = ln(𝑒^𝑥 + 1)
Softmax
l Softmax function:
σ(𝑧)𝑗 = 𝑒^{𝑧𝑗} / Σ_{𝑘=1}^{𝐾} 𝑒^{𝑧𝑘}
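l The activation functions above can be written in a few lines; the sketch below is an illustration with arbitrary test values, not part of the original slides.

```python
import numpy as np

def sigmoid(x):  return 1 / (1 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def softsign(x): return x / (np.abs(x) + 1)
def relu(x):     return np.maximum(0, x)
def softplus(x): return np.log(1 + np.exp(x))

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / np.sum(e)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), softmax(x), sep="\n")
```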
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Normalizer
l Regularization is an important and effective technique to reduce generalization error in machine learning. It is especially useful for deep learning models, which tend to over-fit due to their large number of parameters. Researchers have therefore proposed many effective techniques to prevent over-fitting, including:
p Adding constraints to parameters, such as 𝐿1 and 𝐿2 norms
p Expanding the training set, for example by adding noise or transforming data
p Dropout
p Early stopping
Penalty Parameters
l Many regularization methods restrict the learning capability of models by adding a penalty parameter Ω(𝜃) to the objective function 𝐽. Assume that the target function after regularization is 𝐽̃:
𝐽̃(𝜃; 𝑋, 𝑦) = 𝐽(𝜃; 𝑋, 𝑦) + 𝛼Ω(𝜃),
l where 𝛼 ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω and the standard objective function 𝐽(𝑋; 𝜃). If 𝛼 is set to 0, no regularization is performed. The penalty in regularization increases with 𝛼.
𝐿1 Regularization
l Add an 𝐿1 norm constraint to the model parameters, that is,
𝐽̃(𝑤; 𝑋, 𝑦) = 𝐽(𝑤; 𝑋, 𝑦) + 𝛼‖𝑤‖₁,
l If a gradient method is used to solve the problem, the parameter gradient is
∇𝐽̃(𝑤) = 𝛼·sign(𝑤) + ∇𝐽(𝑤).
𝐿2 Regularization
l Add an 𝐿2 norm penalty term to prevent overfitting:
𝐽̃(𝑤; 𝑋, 𝑦) = 𝐽(𝑤; 𝑋, 𝑦) + (𝛼/2)‖𝑤‖₂²,
l so that one gradient update step becomes
𝑤 ← (1 − 𝜀𝛼)𝑤 − 𝜀∇𝐽(𝑤),
l where 𝜀 is the learning rate. Compared with a common gradient optimization formula, this formula multiplies the parameter by a reduction factor.
𝐿1 vs. 𝐿2
l The major differences between 𝐿2 and 𝐿1 (a small numerical illustration follows):
p According to the preceding analysis, 𝐿1 can generate a sparser model than 𝐿2. When the value of parameter 𝑤 is small, 𝐿1 regularization can directly reduce the parameter value to 0, so it can be used for feature selection.
p From the perspective of probability, many norm constraints are equivalent to adding a prior probability distribution to the parameters. In 𝐿2 regularization, the parameter values comply with a Gaussian distribution; in 𝐿1 regularization, the parameter values comply with a Laplace distribution.
Figure: contours of the 𝐿1 and 𝐿2 penalties.
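l As a hedged sketch (names and values are illustrative, not from the slides), the two penalty terms enter a gradient step as follows:

```python
import numpy as np

def l1_grad(w, alpha):
    """Gradient contribution of the L1 penalty: alpha * sign(w)."""
    return alpha * np.sign(w)

def l2_update(w, grad_J, eps, alpha):
    """One L2-regularized (weight-decay) step: w <- (1 - eps*alpha)*w - eps*grad_J."""
    return (1 - eps * alpha) * w - eps * grad_J

w = np.array([0.5, -0.2, 0.0])
print(l1_grad(w, alpha=0.01))
print(l2_update(w, grad_J=np.array([0.1, 0.1, 0.1]), eps=0.1, alpha=0.01))
```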
Dataset Expansion
l The most effective way to prevent over-fitting is to enlarge the training set: a larger training set has a smaller over-fitting probability. Dataset expansion is a time-saving method, but it varies across fields.
p A common method in the object recognition field is to rotate or scale images. (The prerequisite of such image transformations is that the class of the image cannot be changed by the transformation. For example, in handwritten digit recognition, the categories 6 and 9 can easily be confused after rotation.)
Dropout
l Dropout is a common and simple regularization method, which has been widely used since 2014. Simply put, Dropout randomly discards some inputs during the training process. In this case, the parameters corresponding to the discarded inputs are not updated. As an integration method, Dropout effectively combines the results of all the sub-networks obtained by randomly dropping inputs. A minimal sketch follows.
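l A minimal sketch of the dropping operation (the inverted-dropout scaling shown here is one common variant and is not spelled out on the slide; the drop probability is an assumed value):

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly zero inputs during training; scale the rest so the expected value is unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h))     # dropped entries become 0, kept entries are scaled to 2.0
```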
Early Stopping
l A test on the validation set data can be inserted during training. When the loss on the validation set starts to increase, training is stopped early, as in the sketch below.
Figure: training curve showing the early stopping point where the validation loss begins to rise.
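l A sketch of the stopping logic on a simulated validation-loss curve (the curve, patience value, and variable names are illustrative assumptions, not from the slide):

```python
# Simulated validation-loss curve: it decreases, then starts to rise (over-fitting).
val_losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.41, 0.43, 0.46, 0.50, 0.55]

best_loss, best_epoch, patience, bad_epochs = float("inf"), -1, 3, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # validation loss has risen for `patience` epochs
            print(f"early stopping at epoch {epoch}; best epoch was {best_epoch}")
            break
```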
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Optimizer
l There are various optimized versions of gradient descent algorithms. In object-oriented language implementations, different gradient descent algorithms are often encapsulated into objects called optimizers.
Momentum Optimizer
l A most basic improvement is to add a momentum term for ∆𝑤𝑗𝑖. Assume that the weight correction of the 𝑛-th iteration is ∆𝑤𝑗𝑖(𝑛). The weight correction rule (sketched below) is:
∆𝑤𝑗𝑖(𝑛) = −𝜂 δ^{𝑙+1}_𝑗 𝑥^𝑙_𝑖(𝑛) + 𝛼∆𝑤𝑗𝑖(𝑛 − 1),
l where 𝛼 (0 ≤ 𝛼 < 1) is the momentum factor.
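l A sketch of the momentum rule on the same assumed toy linear-regression problem used earlier (learning rate and momentum factor are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, alpha = 0.05, 0.9                        # learning rate and momentum factor (assumed)
delta_w = np.zeros_like(w)

for epoch in range(200):
    grad = X.T @ (X @ w - t) / len(X)
    delta_w = -eta * grad + alpha * delta_w   # new correction = gradient step + momentum term
    w += delta_w

print(w)                                      # approaches [2.0, -1.0, 0.5]
```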
Advantages and Disadvantages of Momentum Optimizer
l Advantages:
p Enhances the stability of the gradient correction direction and reduces mutations.
p In areas where the gradient direction is stable, the ball rolls faster and faster (there is a speed upper limit because 𝛼 < 1), which helps the ball quickly overshoot flat areas and accelerates convergence.
p A small ball with inertia is more likely to roll over some narrow local extrema.
l Disadvantages:
p The learning rate 𝜂 and momentum 𝛼 need to be manually set, which often requires more experiments to determine the appropriate values.
AdaGrad Optimizer (1)
l The common feature of the stochastic gradient descent algorithm (SGD), mini-batch gradient descent algorithm (MBGD), and momentum optimizer is that each parameter is updated with the same learning rate.
l According to the approach of AdaGrad, different learning rates are set for different parameters (a sketch follows):
𝑔𝑡 = ∂C(𝑡, 𝑜)/∂𝑤𝑡 (gradient calculation)
𝑟𝑡 = 𝑟𝑡−1 + 𝑔𝑡² (square gradient accumulation)
∆𝑤𝑡 = −𝜂 / (𝜀 + √𝑟𝑡) · 𝑔𝑡 (computing update)
𝑤𝑡+1 = 𝑤𝑡 + ∆𝑤𝑡 (application update)
l 𝑔𝑡 indicates the 𝑡-th gradient, and 𝑟 is a gradient accumulation variable whose initial value is 0 and which accumulates the squared gradients. 𝜂 is the global learning rate, which needs to be set manually, and 𝜀 is a small constant added for numerical stability.
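l An AdaGrad sketch on the assumed toy problem from the earlier optimizer examples (hyperparameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, eps = 0.5, 1e-7
r = np.zeros_like(w)                         # accumulated squared gradients

for step in range(1000):
    g = X.T @ (X @ w - t) / len(X)           # gradient
    r += g ** 2                              # square gradient accumulation
    w += -eta / (eps + np.sqrt(r)) * g       # per-parameter step shrinks as r grows

print(w)                                     # moves toward [2.0, -1.0, 0.5]
```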
AdaGrad Optimizer (2)
l The AdaGrad optimization algorithm shows that 𝑟 continues increasing while the overall learning rate keeps decreasing as the algorithm iterates. This is because we want the learning rate to decrease as the number of updates increases. In the initial learning phase, we are far away from the optimal solution to the loss function. As the number of updates increases, we get closer to the optimal solution, and therefore the learning rate can decrease.
l Pros:
p The learning rate is automatically updated. As the number of updates increases, the learning rate decreases.
l Cons:
p The denominator keeps accumulating, so the learning rate will eventually become very small and the algorithm will become ineffective.
RMSProp Optimizer
l The RMSProp optimizer is an improved AdaGrad optimizer. It introduces an attenuation coefficient so that 𝑟 decays by a certain ratio in each round (sketched below):
𝑟𝑡 = 𝛽𝑟𝑡−1 + (1 − 𝛽)𝑔𝑡² (decaying square gradient accumulation)
∆𝑤𝑡 = −𝜂 / (𝜀 + √𝑟𝑡) · 𝑔𝑡 (computing update)
𝑤𝑡+1 = 𝑤𝑡 + ∆𝑤𝑡 (application update)
l 𝑔𝑡 indicates the 𝑡-th gradient, and 𝑟 is a gradient accumulation variable whose initial value is 0; unlike in AdaGrad, it does not simply keep increasing but is adjusted using the attenuation factor 𝛽. 𝜂 indicates the global learning rate, which needs to be set manually. 𝜀 is a small constant, set to about 10⁻⁷ for numerical stability.
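l An RMSProp sketch on the same assumed toy problem (the decaying average of squared gradients replaces AdaGrad's ever-growing sum; hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, beta, eps = 0.05, 0.9, 1e-7
r = np.zeros_like(w)

for step in range(500):
    g = X.T @ (X @ w - t) / len(X)
    r = beta * r + (1 - beta) * g ** 2           # decaying average of squared gradients
    w += -eta / (eps + np.sqrt(r)) * g

print(w)                                         # approaches [2.0, -1.0, 0.5]
```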
Adam Optimizer (1)
l Adaptive Moment Estimation (Adam): Developed based on AdaGrad and AdaDelta, Adam maintains two additional variables 𝑚𝑡 and 𝑣𝑡 for each variable to be trained:
𝑚𝑡 = 𝛽1𝑚𝑡−1 + (1 − 𝛽1)𝑔𝑡
𝑣𝑡 = 𝛽2𝑣𝑡−1 + (1 − 𝛽2)𝑔𝑡²
Adam Optimizer (2)
l If 𝑚𝑡 and 𝑣𝑡 are initialized using the zero vector, they stay close to 0 during the initial iterations, especially when 𝛽1 and 𝛽2 are close to 1. To solve this problem, we use the bias-corrected estimates 𝑚̂𝑡 and 𝑣̂𝑡:
𝑚̂𝑡 = 𝑚𝑡 / (1 − 𝛽1^𝑡)
𝑣̂𝑡 = 𝑣𝑡 / (1 − 𝛽2^𝑡)
l The weight update rule of Adam is as follows (sketched below):
𝑤𝑡+1 = 𝑤𝑡 − 𝜂 · 𝑚̂𝑡 / (√𝑣̂𝑡 + 𝜖)
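l An Adam sketch on the same assumed toy problem (the default-style hyperparameter values here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5])
w = rng.normal(scale=0.01, size=3)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = np.zeros_like(w), np.zeros_like(w)

for step in range(1, 1001):
    g = X.T @ (X @ w - t) / len(X)
    m = beta1 * m + (1 - beta1) * g              # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2         # second-moment estimate
    m_hat = m / (1 - beta1 ** step)              # bias correction
    v_hat = v / (1 - beta2 ** step)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)                                         # approaches [2.0, -1.0, 0.5]
```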
Figure: comparison of optimization algorithms in contour maps of loss functions, and comparison of optimization algorithms at a saddle point.
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Convolutional Neural Network
l A convolutional neural network (CNN) is a feedforward neural network. Its artificial neurons may respond to surrounding units within the coverage range. CNN excels at image processing. It includes a convolutional layer, a pooling layer, and a fully connected layer.
Main Concepts of CNN
l Local receptive field: It is generally considered that human perception of the outside world goes from local to global. Spatial correlations among local pixels of an image are closer than those among distant pixels. Therefore, each neuron does not need to know the global image; it only needs to know the local image. The local information is combined at a higher level to generate global information.
Architecture of Convolutional Neural Network
Figure: a CNN that maps an input image through feature maps (convolution + nonlinearity, max pooling, vectorization) to an output layer giving the probabilities P_bird, P_dog, and P_cat.
Single-Filter Calculation (1)
l Description of the convolution calculation.
Single-Filter Calculation (2)
l Demonstration of the convolution calculation.
Convolutional Layer
l The basic architecture of a CNN is multi-channel convolution consisting of multiple single convolutions. The output of the previous layer (or the original image at the first layer) is used as the input of the current layer. It is then convolved with the filters in the layer, and the result serves as the output of this layer. The convolution kernel of each layer contains the weights to be learned. Similar to a fully connected network, after the convolution is complete, the result is biased and activated through activation functions before being input to the next layer.
Figure: input tensor convolved with kernels W1…Wn, biases b1…bn added, activation applied, producing the output tensor (feature maps F1…Fn).
Pooling Layer
l Pooling combines nearby units to reduce the size of the input to the next layer, reducing dimensionality. Common pooling operations include max pooling and average pooling. With max pooling, the maximum value in a small square area is selected as the representative of that area, while with average pooling the mean value is selected. The side length of this small area is the pooling window size. A small sketch of both operations follows.
Figure: a max pooling operation with a pooling window of size 2 sliding across the feature map.
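l The sketch below illustrates a single-channel "valid" convolution (strictly, the cross-correlation used in CNNs) and non-overlapping max pooling; the input values and kernel are arbitrary examples, not taken from the slides.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
fmap = conv2d(image, kernel)          # 3x3 feature map
print(fmap)
print(max_pool(fmap))                 # pooled result with window size 2
```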
Fully Connected Layer
l The fully connected layer is essentially a classifier. The features extracted by the convolutional layer and pooling layer are straightened (flattened) and placed at the fully connected layer to output and classify results:
σ(𝑧)𝑗 = 𝑒^{𝑧𝑗} / Σ_{𝑘=1}^{𝐾} 𝑒^{𝑧𝑘}
Recurrent Neural Network
l The recurrent neural network (RNN) is a neural network that captures dynamic information in sequential data through periodical connections of hidden layer nodes. It can classify sequential data.
Recurrent Neural Network Architecture (1)
l 𝑋𝑡 is the input of the input sequence at time 𝑡.
l 𝑆𝑡 is the memory unit of the sequence at time 𝑡 and caches previous information: 𝑆𝑡 = 𝜎(𝑈𝑋𝑡 + 𝑊𝑆𝑡−1).
l 𝑂𝑡 = tanh(𝑉𝑆𝑡); after passing through multiple hidden layers, it yields the final output of the sequence at time 𝑡. A forward-pass sketch follows.
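l A minimal forward-pass sketch of this recurrence (the weight shapes, sequence length, and random values are assumptions for illustration only):

```python
import numpy as np

def rnn_forward(X_seq, U, W, V, s0):
    """Simple RNN forward pass: S_t = sigmoid(U @ X_t + W @ S_{t-1}), O_t = tanh(V @ S_t)."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    S, outputs = s0, []
    for X_t in X_seq:
        S = sigmoid(U @ X_t + W @ S)      # memory unit caches previous information
        outputs.append(np.tanh(V @ S))    # output at time t
    return outputs

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, T = 3, 4, 2, 5
U = rng.normal(size=(d_hidden, d_in))
W = rng.normal(size=(d_hidden, d_hidden))
V = rng.normal(size=(d_out, d_hidden))
X_seq = rng.normal(size=(T, d_in))
print(rnn_forward(X_seq, U, W, V, s0=np.zeros(d_hidden)))
```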
Recurrent Neural Network Architecture (2)
Types of Recurrent Neural Networks
Backpropagation Through Time (BPTT)
l BPTT:
p BPTT is traditional backpropagation extended along the time sequence.
p There are two sources of error for the memory unit at time 𝑡 in the sequence: the first is the error from the hidden layer output at time 𝑡; the second is the error propagated back from the memory cell at the next time step 𝑡 + 1.
p The gradient of each weight is computed from these accumulated errors.
Recurrent Neural Network Problem
l 𝑆𝑡 = 𝜎(𝑈𝑋𝑡 + 𝑊𝑆𝑡−1) is expanded along the time sequence.
Generative Adversarial Network (GAN)
l A Generative Adversarial Network is a framework that trains a generator G and a discriminator D through an adversarial process. Through this process, the discriminator learns to tell whether a sample from the generator is fake or real. GAN adopts the mature backpropagation algorithm for training.
GAN Architecture
l Generator/Discriminator
Generative Model and Discriminative Model
l Generative network
p Generates sample data
n Input: Gaussian white noise vector z
n Output: sample data vector x = G(z; θ_G)
l Discriminator network
p Determines whether sample data is real
n Input: real sample data 𝑥_real and generated sample data 𝑥 = 𝐺(𝑧)
n Output: probability y = D(x; θ_D) that the sample is real
Training Rules of GAN
l Optimization objective:
p Value function
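p The value function itself did not survive extraction; the standard GAN minimax objective, which this slide presumably shows, is:
min_G max_D V(D, G) = 𝔼_{x∼p_data(x)}[log D(x)] + 𝔼_{z∼p_z(z)}[log(1 − D(G(z)))]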
Contents
1. Deep Learning Summary
2. Training Rules
3. Activation Function
4. Normalizer
5. Optimizer
6. Types of Neural Networks
7. Common Problems
Data Imbalance (1)
l Problem description: In a dataset consisting of various task categories, the number of samples varies greatly from one category to another, and one or more of the predicted categories contain very few samples.
l Impacts:
p Due to the unbalanced number of samples, we cannot get the optimal result, because the model/algorithm never examines the categories with very few samples adequately.
p Since the few observed objects may not be representative of their class, we may fail to obtain adequate samples for verification and test.
Data Imbalance (2)
Vanishing Gradient and Exploding Gradient Problem (1)
l Vanishing gradient: As network layers increase, the derivative value computed during backpropagation decreases, which causes a vanishing gradient problem.
Figure: a chain of layers with weights w2, w3, w4 and biases b1, b2, b3 leading to the cost C.
p According to the chain rule, as layers increase, the derivation result ∂C/∂b1 decreases, resulting in the vanishing gradient problem.
l Solution: For example, gradient clipping is used to alleviate the exploding gradient problem (see the sketch below), while the ReLU activation function and LSTM are used to alleviate the vanishing gradient problem.
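l A minimal sketch of norm-based gradient clipping (the threshold value and example gradient are arbitrary illustrations, not from the slides):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])           # an "exploding" gradient with norm 50
print(clip_gradient(g))               # rescaled to norm 5: [ 3. -4.]
```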
Overfitting
l Problem description: The model performs well on the training set, but badly on the test set.
Summary
Quiz
A. True
B. False
Quiz
3. (Single-choice) There are many types of deep learning neural networks. Which of the following is not a deep learning neural network? ( )
A. CNN
B. RNN
C. LSTM
D. Logistic
A. Activation function
B. Convolutional kernel
C. Pooling
Recommendations
l Online learning website
p https://e.huawei.com/cn/talent/#/home
l Huawei Knowledge Base
p https://support.huawei.com/enterprise/servicecenter?lang=zh
Thank you.
Bring digital to every person, home, and organization for a fully connected, intelligent world.