
COMP9444: Neural Networks and Deep Learning
Week 3b. Hidden Unit Dynamics
Alan Blair
School of Computer Science and Engineering
June 11, 2024

Outline

➛ geometry of hidden unit activations (8.2)
➛ limitations of 2-layer networks
➛ vanishing / exploding gradients
➛ alternative activation functions (6.3)
➛ ways to avoid overfitting in neural networks (5.2-5.3)

Encoder Networks

Inputs   Outputs
10000    10000
01000    01000
00100    00100
00010    00010
00001    00001

➛ identity mapping through a bottleneck
➛ also called the N–M–N task
➛ used to investigate hidden unit representations

N–2–N Encoder

Hidden Unit Space: [figure]
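For concreteness, here is a minimal sketch of the N–M–N encoder task in PyTorch (an assumption; the slides name no framework), using the 8–3–8 case from the exercise below. The network must reproduce each one-hot input through an M-unit bottleneck; the hyperparameters are illustrative only.

import torch
import torch.nn as nn

N, M = 8, 3
X = torch.eye(N)                    # the N one-hot patterns serve as both inputs and targets

net = nn.Sequential(
    nn.Linear(N, M), nn.Tanh(),     # input-to-hidden weights
    nn.Linear(M, N), nn.Sigmoid(),  # hidden-to-output weights
)
optimizer = torch.optim.SGD(net.parameters(), lr=1.0)
loss_fn = nn.BCELoss()

for epoch in range(10000):
    optimizer.zero_grad()
    loss = loss_fn(net(X), X)       # identity mapping: target equals input
    loss.backward()
    optimizer.step()

hidden = net[1](net[0](X))          # hidden unit activations: one point per input pattern
print(hidden.detach())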
8–3–8 Encoder

Exercise:
➛ Draw the hidden unit space for 2-2-2, 3-2-3, 4-2-4 and 5-2-5 encoders.
➛ Represent the input-to-hidden weights for each input unit by a point, and the hidden-to-output weights for each output unit by a line.
➛ Now consider the 8-3-8 encoder with its 3-dimensional hidden unit space.
  → what shape would be formed by the 8 points representing the input-to-hidden weights for the 8 input units?
  → what shape would be formed by the planes representing the hidden-to-output weights for each output unit?

Hint: think of two platonic solids, which are “dual” to each other.

Hinton Diagrams

[figure: network with a 30x32 sensor input retina, 4 hidden units, and 30 output units spanning Sharp Left, Straight Ahead and Sharp Right]

➛ used to visualize weights in higher dimensions
➛ white = positive, black = negative
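A Hinton diagram is easy to produce with matplotlib (an assumption; any plotting library would do): each weight becomes a square whose area is proportional to its magnitude, white for positive and black for negative, so a whole weight matrix can be read at a glance.

import numpy as np
import matplotlib.pyplot as plt

def hinton(matrix, ax=None):
    # Draw each weight as a square: area ~ |weight|, white = positive, black = negative.
    ax = ax or plt.gca()
    ax.set_facecolor('gray')
    ax.set_aspect('equal', 'box')
    max_weight = np.abs(matrix).max()
    for (row, col), w in np.ndenumerate(matrix):
        colour = 'white' if w > 0 else 'black'
        size = np.sqrt(abs(w) / max_weight)
        ax.add_patch(plt.Rectangle((col - size / 2, row - size / 2), size, size,
                                   facecolor=colour, edgecolor=colour))
    ax.set_xlim(-1, matrix.shape[1])
    ax.set_ylim(-1, matrix.shape[0])
    ax.invert_yaxis()

hinton(np.random.randn(4, 8))   # e.g. a 4x8 block of hidden-to-output weights
plt.show()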

Learning Face Direction

[figures]
Weight Space Symmetry (8.2)

➛ swap any pair of hidden nodes, overall function will be the same
➛ on any hidden node, reverse the sign of all incoming and outgoing weights (assuming symmetric transfer function); a quick numerical check is sketched below
➛ hidden nodes with identical input-to-hidden weights in theory would never separate; so, they all have to begin with different random weights
➛ in practice, all hidden nodes may try to do a similar job at first, then gradually specialize.

Controlled Nonlinearity

➛ for small weights, each layer implements an approximately linear function, so multiple layers also implement an approximately linear function.
➛ for large weights, the transfer function approximates a step function, so computation becomes digital and learning becomes very slow.
➛ with typical weight values, a two-layer neural network implements a function which is close to linear, but takes advantage of a limited degree of nonlinearity.
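Below is the numerical check of the sign-flip symmetry mentioned above (a sketch assuming PyTorch and tanh hidden units): negating every weight into and out of one hidden node leaves the network function unchanged, because tanh(−x) = −tanh(x).

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 3), nn.Tanh(), nn.Linear(3, 1))
x = torch.randn(10, 2)
y_before = net(x).detach().clone()

k = 1  # pick any hidden node
with torch.no_grad():
    net[0].weight[k, :] *= -1   # incoming weights
    net[0].bias[k] *= -1        # incoming bias
    net[2].weight[:, k] *= -1   # outgoing weights

y_after = net(x)
print(torch.allclose(y_before, y_after, atol=1e-6))   # True: same overall function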


Limitations of Two-Layer Neural Networks

Some functions are difficult for a 2-layer network to learn.

[figure: the Twin Spirals data, two interleaved spirals plotted on axes from −6 to 6]

For example, this Twin Spirals problem is difficult to learn with a 2-layer network, but it can be learned using a 3-layer network.
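For reference, here is one common construction of the Twin Spirals data (a NumPy sketch; the point count and radii are the conventional benchmark choices, not values taken from these slides): 97 points per spiral, with the second spiral obtained by reflecting the first through the origin.

import numpy as np

def twin_spirals(n=97):
    # Each spiral winds three times around the origin, radius shrinking from about 6.5 to 0.5.
    i = np.arange(n)
    angle = i * np.pi / 16.0
    radius = 6.5 * (104 - i) / 104.0
    x, y = radius * np.sin(angle), radius * np.cos(angle)
    data = np.concatenate([np.stack([x, y], axis=1),      # spiral 1: class 0
                           np.stack([-x, -y], axis=1)])   # spiral 2: class 1 (point-reflected)
    labels = np.concatenate([np.zeros(n), np.ones(n)])
    return data.astype(np.float32), labels.astype(np.float32)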

First Hidden Layer

[figure]

Second Hidden Layer

[figure]

Network Output

[figure]

Adding Hidden Layers

➛ twin spirals can be learned by a 3-layer network
➛ first hidden layer learns linearly separable features
➛ second hidden layer combines these to produce more complex features
➛ learning rate and initial weight values must be small
➛ learning can be improved using the Adam optimizer (a training sketch appears below)

Vanishing / Exploding Gradients

➛ training by backpropagation in networks with many layers is difficult
➛ when the weights are small, the differentials become smaller and smaller as we backpropagate through the layers, and end up having no effect
➛ when the weights are large, the activations in the higher layers may saturate to extreme values
➛ when the weights are large, the differentials may sometimes get multiplied twice in succession in places where the transfer function is steep, causing them to blow up to large values
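Tying the Adding Hidden Layers bullets back to the Twin Spirals example, here is a sketch (assuming PyTorch, and reusing the twin_spirals generator sketched earlier) of a network with two hidden layers trained with the Adam optimizer. Layer sizes and learning rate are illustrative guesses, not values from the lecture.

import torch
import torch.nn as nn

data, labels = twin_spirals()                  # generator sketched earlier
X = torch.from_numpy(data)
y = torch.from_numpy(labels).unsqueeze(1)

net = nn.Sequential(
    nn.Linear(2, 20), nn.Tanh(),               # first hidden layer: linearly separable features
    nn.Linear(20, 20), nn.Tanh(),              # second hidden layer: combinations of those features
    nn.Linear(20, 1),                          # output logit
)
opt = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10000):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()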
Vanishing / Exploding Gradients

Ways to avoid vanishing / exploding gradients:

➛ new activation functions
➛ weight initialization (Week 4)
➛ batch normalization (Week 4)
➛ skip connections (Week 4)
➛ long short term memory (LSTM) (Week 5)

Activation Functions (6.3)

[figure: four plots over the range −4 to 4: Sigmoid, Rectified Linear Unit (ReLU), Hyperbolic Tangent, Scaled Exponential Linear Unit (SELU)]
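The four panels above can be reproduced from PyTorch's built-in activation functions (an assumption; the slides only show the plots), which is a quick way to compare their shapes and saturation behaviour.

import torch
import matplotlib.pyplot as plt

x = torch.linspace(-4, 4, 200)
activations = {
    'Sigmoid': torch.sigmoid(x),
    'Rectified Linear Unit (ReLU)': torch.relu(x),
    'Hyperbolic Tangent': torch.tanh(x),
    'Scaled Exponential Linear Unit (SELU)': torch.selu(x),
}
fig, axes = plt.subplots(2, 2)
for ax, (name, y) in zip(axes.flat, activations.items()):
    ax.plot(x.numpy(), y.numpy())
    ax.set_title(name)
    ax.set_ylim(-2, 4)
plt.tight_layout()
plt.show()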

Activation Functions

➛ sigmoid and hyperbolic tangent traditionally used for 2-layer networks, but suffer from the vanishing gradient problem in deeper networks.
➛ rectified linear units (ReLUs) are popular for deep networks (including convolutional networks); gradients will not vanish because the derivative is either 0 or 1.
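A small numerical illustration of that last point (a sketch assuming PyTorch with its default weight initialisation; exact numbers will vary, but the sigmoid stack's first-layer gradient should come out many orders of magnitude smaller than the ReLU stack's):

import torch
import torch.nn as nn

def first_layer_grad_norm(activation, depth=20, width=32):
    # Build a deep stack of Linear + activation layers, backpropagate a simple loss,
    # and report the gradient norm reaching the first layer's weights.
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    net = nn.Sequential(*layers)
    x = torch.randn(64, width)
    net(x).sum().backward()
    return net[0].weight.grad.norm().item()

print('sigmoid:', first_layer_grad_norm(nn.Sigmoid))   # gradient shrinks layer by layer
print('relu:   ', first_layer_grad_norm(nn.ReLU))      # gradient survives much better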
