Notes_ML_02_Slides_RNN_ANN
Figure: biological neuron (dendrites, axon).

The perceptron output:

OUT = f(net) = +1 if net > 0, −1 if net ≤ 0,   with net = Σj wij · Inj

f is the activation function.
Perceptron: example
• Two inputs: x, y
• Two weights: w1, w2
• Bias input: w0 (on a constant input of 1)

out = f(w0 + w1·x + w2·y)

The equation w0 + w1·x + w2·y = 0 defines a line in the (x, y) plane:
• where w0 + w1·x + w2·y > 0 the output is +1
• where w0 + w1·x + w2·y < 0 the output is −1

Figure: the perceptron diagram and its separating line in the (x, y) plane.
Representational power of Perceptrons
A perceptron can represent any linearly separable function:
it represents a hyperplane decision surface in the n-dimensional space of instances.
• 1 neuron: 2 classes (+1 and −1)
• 2 neurons: 4 classes (+1+1, +1−1, −1+1, −1−1)
• N neurons: 2^N classes

Figure: two class distributions A and B, one linearly separable (OK) and one that is not (not OK).
Exercise Table 1
Figure 1 – Perceptron.
One answer: w0 = 1, w1 = −2, w2 = 2
(0 represents NL0, 1 represents NL1)
The Perceptron training rule (Delta Rule)

wi ← wi + Δwi   with   Δwi = η (t − o) xi
The Perceptron training algorithm
• Initialize the weight vector w
• For each training sample i, with (xi, s(xi)) = (xi, ti):
  – Use the current w to calculate s'(xi) = oi
  – If |ti − oi| > ε then update: w ← w + η·(ti − oi)·xi
• Stop when |ti − oi| ≤ ε for all samples (xi, ti)
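A minimal NumPy sketch of this training loop; the OR dataset, learning rate, tolerance, and epoch limit are illustrative assumptions, not part of the slides.

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, eps=0.0, max_epochs=100):
    """Perceptron training rule: w <- w + eta * (t_i - o_i) * x_i.
    X has one sample per row (a constant bias input of 1 is prepended);
    t holds targets in {-1, +1}."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input for w0
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x_i, t_i in zip(X, t):
            o_i = 1.0 if w @ x_i > 0 else -1.0     # out = f(net), threshold at 0
            if abs(t_i - o_i) > eps:               # misclassified: update
                w += eta * (t_i - o_i) * x_i
                updated = True
        if not updated:                            # |t_i - o_i| <= eps for all samples
            break
    return w

# Example: learn the (linearly separable) OR function with targets in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1], dtype=float)
print(perceptron_train(X, t))   # one separating hyperplane (w0, w1, w2)
```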
• One intermediate (hidden) layer is sufficient to approximate any continuous function.
• Two intermediate layers are sufficient to approximate any mathematical function.
MLP: Multi-layer Perceptrons

Example: in the (X1, X2) plane, classification requires 3 straight lines (A, B, C) that
create 7 compartments and two decision regions: one region formed by some of the
compartments and one formed by the others.

Solution: an MLP with inputs x1 and x2, a hidden layer of three units (A, B, C, one per
line), and an output unit that maps the hidden codes (−1−1−1), (−1−1+1), …, (+1+1+1)
to Class 1 (out = +1) or Class 2 (out = −1).

Figure: the three separating lines and the corresponding two-layer network.
Example of an MLP
• x1, x2: binary inputs.
• w1 = w2 = w3 = w4 = w5 = 1 and w6 = −2.
• f1(x·w) = 1 if the activation level ≥ 0.5, 0 otherwise.
• f2(x·w) = 1 if the activation level ≥ 1.5, 0 otherwise.

Truth table:
x1 x2 | Out
 0  0 |  0
 0  1 |  1
 1  0 |  1
 1  1 |  0

Figure: the network, with two hidden units (thresholds 0.5 and 1.5) feeding an output
unit (threshold 0.5); the weight from the 1.5-threshold unit to the output is −2.

We got a network for XOR!
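A small check of this network in Python, assuming the wiring suggested by the figure (two hidden units with thresholds 0.5 and 1.5, an output unit with threshold 0.5, and weight −2 from the 1.5-threshold unit to the output):

```python
def step(net, theta):
    """f(x·w) = 1 if the activation level >= theta, 0 otherwise."""
    return 1 if net >= theta else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + 1 * x2, 0.5)     # OR-like hidden unit  (threshold 0.5)
    h2 = step(1 * x1 + 1 * x2, 1.5)     # AND-like hidden unit (threshold 1.5)
    return step(1 * h1 - 2 * h2, 0.5)   # output unit: w5 = 1, w6 = -2, threshold 0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table
```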
Activation functions

Threshold (step) function:
Out = +1 if net > θ
Out = −1 if net < θ

Hyperbolic tangent function:
Out = tanh(λ·net)

MLP / Sigmoid
σ(net) = 1 / (1 + e^(−net))   or   σ(net) = tanh(net)

Interesting property: simple derivatives.
σ(x) = 1 / (1 + e^(−x))  ⇒  σ'(x) = σ(x)·(1 − σ(x))
σ(x) = tanh(x)           ⇒  σ'(x) = 1 − σ²(x)

Figure: step, logistic, and hyperbolic tangent activation functions.
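A quick numerical check of these derivative identities (a sketch; the test points and step size are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
h = 1e-6

# central-difference derivative vs. the closed-form expressions
num_sig = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(num_sig, sigmoid(x) * (1 - sigmoid(x))))   # True

num_tanh = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)
print(np.allclose(num_tanh, 1 - np.tanh(x) ** 2))            # True
```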
MLP training
• Key idea: use gradient descent to search the hypothesis space of possible weight
vectors for the weights that best fit the training examples.
• Learn the wi that minimize the squared error:

E[w] = (1/2) Σd∈D (td − od)²

where D is the training data.
Gradient Descent

E[w] = (1/2) Σd∈D (td − od)²

Gradient:
∇E[w] = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]

Rule:
Δw = −η ∇E[w],   i.e.   Δwi = −η ∂E/∂wi   (η positive)

because we want to move the weight vector in the direction in which the error E decreases.
Gradient Descent (one layer)

∂E/∂wi = ∂/∂wi [ (1/2) Σd (td − od)² ]
       = (1/2) Σd ∂/∂wi (td − od)²
       = (1/2) Σd 2 (td − od) ∂/∂wi (td − od)
       = Σd (td − od) ∂/∂wi (td − σ(w·xd))          (σ is the logistic function)
       = Σd (td − od) · (− ∂/∂wi σ(w·xd))
       = −Σd (td − od) · σ(w·xd) (1 − σ(w·xd)) · xi,d
       = −Σd (td − od) · od (1 − od) · xi,d
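A sketch of batch gradient descent for a single sigmoid unit using the gradient just derived; the toy AND-like dataset, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_step(w, X, t, eta=0.5):
    """One batch step of Δw = −η ∂E/∂w for a single sigmoid unit, using
    ∂E/∂wi = −Σ_d (t_d − o_d) · o_d (1 − o_d) · x_{i,d}."""
    o = sigmoid(X @ w)                        # outputs o_d for every sample
    grad = -(X.T @ ((t - o) * o * (1 - o)))   # ∂E/∂w
    return w - eta * grad                     # Δw = −η ∇E[w]

# Toy data: an AND-like target; the first column of X is the bias input
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0.0, 0.0, 0.0, 1.0])
w = np.zeros(3)
for _ in range(5000):
    w = gradient_step(w, X, t)
print(sigmoid(X @ w))   # outputs move toward [0, 0, 0, 1]
```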
Gradient Descent: multiple outputs

E[w] = (1/2) Σd∈D Σk∈outputs (tkd − okd)²
The Backpropagation algorithm
• A feed-forward network with nin inputs, one hidden layer of nhid sigmoidal units, and
nout sigmoidal output units, trained on several samples <x, t>.
• x is the input, o the net output, t the target output, W the weights.
http://www.trapexit.org/images/b/ba/Animate_ANN.gif
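A compact sketch of one backpropagation step for such a network (sigmoid hidden and output units, squared error); the bias handling and the per-sample stochastic update are implementation choices, not prescribed by the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eta=0.3):
    """One stochastic-gradient step for a nin -> nhid -> nout sigmoid network
    minimizing (1/2) Σ_k (t_k − o_k)².  Both weight matrices have a bias column."""
    x = np.append(x, 1.0)                     # bias input for the hidden layer
    h = sigmoid(W_hid @ x)                    # hidden activations
    hb = np.append(h, 1.0)                    # bias input for the output layer
    o = sigmoid(W_out @ hb)                   # network outputs

    delta_out = (t - o) * o * (1 - o)                         # output error terms
    delta_hid = h * (1 - h) * (W_out[:, :-1].T @ delta_out)   # backpropagated error terms

    W_out += eta * np.outer(delta_out, hb)    # Δw = η · δ · activation
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out, o
```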
Example:
initial weights in (−0.1, +0.1); η = 0.3

Input      Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
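Using the sigmoid and backprop_step sketch above on this 8→3→8 identity task; the random seed and the number of epochs are assumptions:

```python
import numpy as np   # reuses sigmoid() and backprop_step() from the sketch above

rng = np.random.default_rng(0)
X = np.eye(8)                                  # the 8 one-hot input/target patterns
W_hid = rng.uniform(-0.1, 0.1, size=(3, 9))    # initial weights in (-0.1, +0.1), plus a bias column
W_out = rng.uniform(-0.1, 0.1, size=(8, 4))
for _ in range(5000):                          # epoch count is an assumption
    for x in X:
        W_hid, W_out, o = backprop_step(x, x, W_hid, W_out, eta=0.3)

# hidden-layer code for the first input after training
print(sigmoid(W_hid @ np.append(X[0], 1.0)))
```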
Learning the representation (inner layer)
Figure: the 3-unit hidden code learned for the 8→3→8 identity task.

Representations matter!
Figure: the same data plotted in Cartesian coordinates (x, y) and in polar coordinates
(r, θ); choosing the right representation can make a problem much easier.
Figure: how different AI approaches relate:
• Rule-based systems: hand-designed program → output.
• Classic machine learning: hand-designed features → mapping from features → output.
• Representation learning: learned features → mapping from features → output.
• Deep learning: simple features → additional layers of more abstract features →
  mapping from features → output.
Deep Learning
https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
Historical Trends: Growing Datasets

Historical Trends: Growing Connections per Neuron

Figure 1.12 (ILSVRC classification error rate, 2010–2015): since deep networks reached
the scale necessary to compete in the ImageNet Large Scale Visual Recognition
Challenge, they have consistently won the competition every year, and yielded lower
and lower error rates each time. Data from Russakovsky et al.
Deep Learning and GPUs
• What is the relation between DL and GPUs?
Problems with DNN (Deep MLP)
• Overfitting:
  – The more layers you have, the more degrees of freedom you have.
  – DNNs model rare dependencies in the training data.
• Diffusion of gradient: the error attenuates as it propagates back to the early layers.
  – Early layers never learn!
https://en.wikipedia.org/wiki/Deep_learning
Main Deep Learning Architectures
• Deep Belief Networks / Autoencoders
  – Greedy layer-wise pretraining, by Hinton et al., 2006.
• Deep Convolutional Neural Networks
  – LeNet, by LeCun et al., 1998.
• Deep Recurrent Networks
  – Long Short-Term Memory, by Hochreiter & Schmidhuber, 1997.
AUTOENCODERS
Greedy layer-wise pretraining, by Hinton et al., 2006.
Autoencoders
• An autoencoder is trained, with an absolutely standard weight-adjustment algorithm,
to reproduce the input.
• By making this happen with (many) fewer hidden units than inputs, this forces the
'hidden layer' units to become good feature detectors.
https://www.macs.hw.ac.uk/~dwcorne/Teaching/introdl.ppt
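A minimal autoencoder sketch in PyTorch on the 8→3→8 identity patterns; the optimizer, learning rate, and epoch count are assumptions, and the 3-unit bottleneck is what forces the hidden layer to learn a compact feature code:

```python
import torch
from torch import nn

# Tiny autoencoder: 8 inputs, 3 hidden units, 8 outputs
encoder = nn.Sequential(nn.Linear(8, 3), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(3, 8), nn.Sigmoid())
model = nn.Sequential(encoder, decoder)

X = torch.eye(8)                                   # targets are the inputs themselves
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)     # minimize reconstruction error
    loss.backward()
    opt.step()

print(encoder(X).detach())                         # learned hidden codes (feature detectors)
```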
Representation learning (hidden layer)
Figure: pretraining scheme. An autoencoder learns the second (hidden) layer;
supervised learning is used in the last layer.
CNN
• We know it is good to learn a small model.
• From this fully connected model, do we really need all the edges?
• Can some of these be shared?
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Consider learning an image:
• Some patterns are much smaller than the whole image (e.g. a "beak" detector).
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Same pattern appears in different places:
• They can be compressed! What about training a lot of such "small" detectors, where
each detector must "move around"? (e.g. an "upper-left beak" detector and a
"middle beak" detector)
https://cs.uwaterloo.ca/~mli/cs898-2017.html
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers).
A convolutional layer has a number of filters that perform the convolution operation.

Figure: a filter acting as a beak detector.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution
The filters are the network parameters to be learned. Each filter detects a small
pattern (3 × 3).

Filter 1:          Filter 2:
 1 -1 -1           -1  1 -1
-1  1 -1           -1  1 -1
-1 -1  1           -1  1 -1

6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution (Filter 1, stride = 1)
Slide the 3 × 3 filter over the 6 × 6 image and take the dot product with each patch;
the first two positions of the top row give 3 and −1.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution (Filter 1, stride = 2)
With stride 2 the filter moves two pixels at a time; the first row of results is 3 and −3.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution (Filter 1, stride = 1)
The complete 4 × 4 output for Filter 1:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
https://cs.uwaterloo.ca/~mli/cs898-2017.html
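A NumPy sketch that reproduces these feature maps; note that the CNN "convolution" here is the dot-product (cross-correlation) form used in the slides:

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

def conv2d(img, filt, stride=1):
    """Dot product of the filter with each patch (the CNN 'convolution')."""
    k = filt.shape[0]
    out_size = (img.shape[0] - k) // stride + 1
    out = np.empty((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * filt)
    return out

print(conv2d(image, filter1))             # 4 x 4 feature map; first row: 3 -1 -3 -1
print(conv2d(image, filter1, stride=2))   # stride 2: 2 x 2 map; first row: 3 -3
```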
Convolution (Filter 2, stride = 1)
Repeat this for each filter. Filter 2 gives another 4 × 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together the two filters produce two 4 × 4 images, forming a 2 × 4 × 4 feature map.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Color image: RGB, 3 channels
For a color image, each filter is a 3 × 3 × 3 cube (one 3 × 3 slice per channel), and the
image itself is a stack of three 6 × 6 channels.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution as a sparsely connected layer
Flatten the 6 × 6 image into 36 inputs (1, 2, 3, …). The first output of Filter 1
(value 3) is a neuron connected only to the 9 inputs covered by the filter, not fully
connected to all 36 inputs: fewer parameters!
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Shared weights
The second output neuron (value −1) is connected to a different set of 9 inputs, but it
uses the same 9 weights as the first one (shared weights): even fewer parameters.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
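Quick arithmetic for this toy example, comparing parameter counts (biases ignored):

```python
# Parameter counts for the 6x6 image -> 4x4 output layer above (biases ignored):
fully_connected   = 36 * 16   # every output connected to every input
locally_connected = 16 * 9    # each output connected to only 9 inputs
shared_weights    = 9         # one 3x3 filter reused at all 16 positions
print(fully_connected, locally_connected, shared_weights)   # 576 144 9
```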
The whole CNN
image → Convolution → Max Pooling → Convolution → Max Pooling (can repeat many
times) → Flattened → Fully Connected Feedforward network → cat, dog, ……
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Max Pooling
The two 4 × 4 feature maps produced by Filter 1 and Filter 2:
 3 -1 -3 -1        -1 -1 -1 -1
-3  1  0 -3        -1 -1 -2  1
-3 -3  0  1        -1 -1 -2  1
 3 -2 -2 -1        -1  0 -4  3
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Why Pooling
• Subsampling pixels does not change the object: a subsampled bird is still a bird.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Max Pooling
Convolution followed by 2 × 2 max pooling turns the 6 × 6 image into a new but
smaller image; each filter gives one channel:
Filter 1:  3 0      Filter 2:  -1 1
           3 1                  0 3
The result is a 2 × 2 image with 2 channels.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
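A NumPy sketch of 2 × 2 max pooling applied to the two feature maps computed earlier, reproducing the 2 × 2 × 2 result:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep the maximum of each non-overlapping 2 x 2 block."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap1 = np.array([[ 3,-1,-3,-1],
                  [-3, 1, 0,-3],
                  [-3,-3, 0, 1],
                  [ 3,-2,-2,-1]])
fmap2 = np.array([[-1,-1,-1,-1],
                  [-1,-1,-2, 1],
                  [-1,-1,-2, 1],
                  [-1, 0,-4, 3]])
print(max_pool_2x2(fmap1))   # [[3 0] [3 1]]
print(max_pool_2x2(fmap2))   # [[-1 1] [0 3]]
```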
The whole CNN
After one round of Convolution + Max Pooling, the 6 × 6 image has become the
smaller 2 × 2 × 2 image above, which can again go through Convolution and Max
Pooling.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
The whole CNN
Convolution → Max Pooling → … → the final feature maps (here 2 × 2 × 2) are
flattened into a vector and fed to a fully connected feedforward network, which
outputs the class (cat, dog, ……).
https://cs.uwaterloo.ca/~mli/cs898-2017.html
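A PyTorch sketch of this pipeline for the toy 6 × 6 single-channel image: two 3 × 3 filters, 2 × 2 max pooling, flattening, and a fully connected layer. The ReLU and the two output classes ("cat"/"dog") are illustrative choices, not part of the slide.

```python
import torch
from torch import nn

# 1-channel 6x6 image -> 2 filters (3x3) -> 2x4x4 feature maps -> 2x2 max pooling
# -> 2x2x2 -> flatten (8 values) -> fully connected layer with 2 class scores.
cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3),  # convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                                          # max pooling
    nn.Flatten(),                                             # flattened
    nn.Linear(2 * 2 * 2, 2),                                  # fully connected feedforward
)

x = torch.rand(1, 1, 6, 6)        # a batch with one 6x6 grayscale image
print(cnn(x).shape)               # torch.Size([1, 2]): one score per class
```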
CNNs
http://stats.stackexchange.com/questions/146413/why-convolutional-neural-networks-belong-to-deep-learning
LeNet-5
LeCun, Bottou, Bengio & Haffner, 1998

AlexNet
Krizhevsky, Sutskever & Hinton, 2012

VGG
Simonyan & Zisserman, 2014
https://arxiv.org/abs/1409.1556

DeepFace
Taigman, Yang, Ranzato & Wolf, 2014
http://research.google.com/pubs/pub43022.html
Conclusion
• CNN: a special-purpose net, just for images or problems with strong grid-like local
spatial/temporal correlation.
• Once trained on one problem, the same net (often fine-tuned) can be used for a new,
similar problem: a general creator of vision features.
• An autoencoder can be used to find initial parameters.
• Lots of hand-crafting and tuning to find the right recipe of receptive fields, layer
interconnections, etc.
  – Many more hyperparameters than standard nets, and even than other deep
    networks, since the structures of CNNs are more handcrafted.
  – CNNs are getting wider and deeper with speed-up techniques (e.g. GPUs, ReLU),
    and there is lots of current research, excitement, and success.
Fully Supervised Deep Learning
• Much recent success in doing fully supervised deep learning, with extensions that
diminish the effect of early learning difficulties (unstable gradient, etc.)
• Patience (now that we know it may be worth it), faster computers, and use of GPUs
• More efficient activation functions (e.g. ReLUs), in terms of both computation and
avoiding f'(net) saturation
Open problems
• A more scientific approach is needed, not just building better systems…
  – Geoff Hinton, Yoshua Bengio & Yann LeCun, NIPS 2015