Notes_ML_02_Slides_RNN_ANN

The document discusses Artificial Neural Networks (ANNs), detailing their history, design, and applications in various fields such as face recognition and autonomous vehicle navigation. It covers the structure of perceptrons, multilayer perceptrons, and the backpropagation algorithm for training these networks. Additionally, it highlights recent advances in neural networks, including deep learning and integration with fuzzy logic.


[Figure: biological neuron — dendrites, axon, and terminal branches of the axon]
Artificial Neural Network


Anna Helena Reali Costa
PCS
ANN
• A different style of computation: parallel
distributed processing
• A universal computational architecture: the
same structure carries out many different
functions
• It can learn new knowledge and is therefore adaptive
History
• McCulloch and Pitts introduced the artificial
neuron in 1943.
– Simplified model of a biological neuron
• Fell out of favor in the late 1960's
– Perceptron limitations (Minsky and Papert)
• Resurgence in the mid 1980's
– Nonlinear Neuron Functions
– Backpropagation training (Werbos)
• Currently enjoying resounding success in the form of deep neural networks (Deep NN)
Applications
• Face / speech recognition, user authentication
• Identification of military targets: B-52, Boeing 747, Space Shuttle
• Oil exploration: lithology, etc.
• Autonomous vehicle navigation
• Prediction in the financial market
Design of an ANN
• The design of an ANN involves determining the
following elements:
– Neurons and activation function.
– Connections and arrangement of neurons: topology
(network architecture).
– Synaptic weights: values (in the case of learned
weights) or a training algorithm to be used and its
parameters.
– Recall: procedure to be used for the network to
calculate the outputs for given new inputs

• Unfortunately, there is no single "recipe" ...


Perceptron (Frank Rosenblatt, 1958)
[Figure: perceptron unit i — inputs In_1 … In_n weighted by w_i1 … w_in produce the output OUT]

OUT = f(net) = { +1 if net > 0; −1 if net ≤ 0 },  with  net = Σ_j w_ij · In_j

f is the activation function.
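A one-line Python sketch of this unit may help fix the definition (the weights and inputs below are arbitrary illustrative values, not taken from the slides):

```python
def perceptron_out(weights, inputs):
    """Threshold unit: OUT = +1 if net > 0, else -1, with net = sum_j w_j * in_j."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > 0 else -1

# Illustrative values only: three inputs and three weights.
print(perceptron_out([0.5, -1.0, 2.0], [1, 1, 0]))   # net = -0.5 -> output -1
```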
Perceptron: example
• Two inputs: x, y
• Two weights: w1, w2
• Bias weight: w0 (bias input fixed at 1)

out = f [w0 + w1 x + w2 y]

[Figure: the line w0 + w1 x + w2 y = 0 splits the (x, y) plane into two regions: f(·) outputs +1 where w0 + w1 x + w2 y > 0 and −1 where w0 + w1 x + w2 y < 0]
Representational power of Perceptrons
Perceptron can represent any linearly separable function.
It represents a hyperplane decision surface in the n-dimensional space of instances.
• 1 neuron: 2 classes (+1 and −1)
• 2 neurons: 4 classes (+1+1, +1−1, −1+1, −1−1)
• N neurons: 2^N classes

[Figure: classes A and B separable by a single line (OK) vs. interleaved classes A and B that no single line can separate (not OK)]
Exercise
• Design a perceptron (Figure 1) which calculates the logical implication function y = x1 → x2, described in Table 1.

Table 1
x1  x2  y
0   0   1
0   1   1
1   0   0
1   1   1

Figure 1 – Perceptron.

One possible answer: w0 = 1, w1 = −2, w2 = 2
(0 represents logic level 0, 1 represents logic level 1)
The Perceptron training rule (Delta Rule)
• Let t be the target output for the current training example, o be the output generated by the perceptron, and η be a positive constant called the learning rate.

Delta Rule:   wi ← wi + Δwi   with   Δwi = η (t − o) xi
The Perceptron training algorithm
• Initialize the weight vector w
• For each training sample i, with (xi, s(xi)) = (xi, ti):
  – Use the current w to calculate s'(xi) = oi
  – If |ti − oi| > ε then update:  w ← w + η (ti − oi) xi
• Stop when ε ≥ |ti − oi| for all samples (xi, ti)

If |ti − oi| = |d| ≤ ε ⇒ w does not change
If d > 0 and |d| > ε ⇒ w increases (because oi is too small)
If d < 0 and |d| > ε ⇒ w decreases (because oi is too large)
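A minimal NumPy sketch of this training loop, applied to the implication function from the earlier exercise; the learning rate, epoch limit, bias handling and the −1/+1 target encoding are illustrative assumptions:

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: +1 if net > 0, else -1 (the bias input is x[0] = 1)."""
    return 1 if np.dot(w, x) > 0 else -1

def train_perceptron(samples, eta=0.1, eps=0.0, max_epochs=100):
    """Perceptron training rule: w <- w + eta * (t - o) * x,
    repeated until |t - o| <= eps for all samples (or max_epochs is reached)."""
    w = np.zeros(len(samples[0][0]))
    for _ in range(max_epochs):
        converged = True
        for x, t in samples:
            x = np.asarray(x, dtype=float)
            o = perceptron_output(w, x)
            if abs(t - o) > eps:
                w = w + eta * (t - o) * x
                converged = False
        if converged:
            break
    return w

# Logical implication y = x1 -> x2 (Table 1), targets encoded as -1/+1,
# with a constant bias input of 1 prepended to each sample.
data = [([1, 0, 0], +1), ([1, 0, 1], +1), ([1, 1, 0], -1), ([1, 1, 1], +1)]
w = train_perceptron(data)
print(w, [perceptron_output(w, np.asarray(x, dtype=float)) for x, _ in data])
```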
Limitations of a single-layer Perceptron
• Unfortunately, various functions of interest are not linearly separable.
• For example, the Perceptron cannot represent XOR (exclusive OR).

     x1  x2  XOR
a:   0   0   0    (Out = 0)
b:   0   1   1    (Out = 1)
c:   1   0   1    (Out = 1)
d:   1   1   0    (Out = 0)

[Figure: points a and d (Out = 0) and points b and c (Out = 1) in the (x1, x2) plane — no single straight line separates the two classes]
MLP: Multilayer Perceptrons
• Sets of perceptrons arranged in several layers.
• At least one hidden layer.

• An intermediate layer is
sufficient to approximate
any continuous function.
• Two intermediate layers
are sufficient to
approximate any
mathematical function.
MLP: Multi-layer Perceptrons
[Figure: in the (x1, x2) plane, classification requires 3 straight lines A, B and C, which create 7 compartments and two decision regions: one for the compartment labelled (1,1,1) and one for the remaining compartments (-1,-1,-1), (-1,-1,1), …]

Solution: a hidden layer with three units A, B and C (one per line), feeding an output unit out that gives +1 for Class 1 and −1 for Class 2.
Example of an MLP
• x1, x2: binary inputs.
• w1 = w2 = w3 = w4 = w5 = 1 and w6 = −2
• f1(x·w) = 1 if the activation level ≥ 0.5, 0 otherwise.
• f2(x·w) = 1 if the activation level ≥ 1.5, 0 otherwise.

[Figure: x1 and x2 feed, with weights 1, a hidden unit with threshold 0.5 (f1) and a hidden unit with threshold 1.5 (f2); the first hidden unit reaches the output unit (threshold 0.5, f1) with weight 1 and the second with weight −2]

Recall of this network (filling the table):

x1  x2  Out
0   0   0
0   1   1
1   0   1
1   1   0

We got a network for XOR !!!
We now need an algorithm to train the weights ...
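A short Python sketch of this recall, under the reading of the figure given above (two hidden units with thresholds 0.5 and 1.5 feeding an output unit with threshold 0.5):

```python
def step(net, theta):
    """Threshold unit: 1 if the activation level >= theta, else 0."""
    return 1 if net >= theta else 0

def recall(x1, x2):
    """Recall of the hand-built network: h1 (threshold 0.5) acts as OR,
    h2 (threshold 1.5) acts as AND, and the output unit (threshold 0.5)
    combines them with weights +1 and -2."""
    h1 = step(1 * x1 + 1 * x2, 0.5)
    h2 = step(1 * x1 + 1 * x2, 1.5)
    return step(1 * h1 - 2 * h2, 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", recall(x1, x2))   # reproduces the XOR column of the table
```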
Types of Activation Function
• Linear-threshold functions:
  – Binary step: Out = 1 if net > θ, 0 if net < θ
  – Bipolar step: Out = 1 if net > θ, −1 if net < θ
• Sigmoid functions:
  – Logistic function: Out = 1 / (1 + exp(−λ · net))
  – Hyperbolic tangent: Out = tanh(λ · net)
MLP / Sigmoid
σ(net) = 1 / (1 + e^(−net))   or   σ(net) = tanh(net)

An interesting property — both derivatives can be written in terms of the function value itself:
σ(x) = 1 / (1 + e^(−x))  ⇒  σ'(x) = σ(x) (1 − σ(x))
σ(x) = tanh(x)           ⇒  σ'(x) = 1 − σ²(x)
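A quick NumPy check of this property (the grid of evaluation points is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 5)                          # arbitrary evaluation points
closed_form = sigmoid(x) * (1.0 - sigmoid(x))          # sigma'(x) = sigma (1 - sigma)
numerical = (sigmoid(x + 1e-6) - sigmoid(x)) / 1e-6    # finite-difference check
print(np.allclose(closed_form, numerical, atol=1e-5))  # True

tanh_closed = 1.0 - np.tanh(x) ** 2                    # tanh'(x) = 1 - tanh^2(x)
tanh_numerical = (np.tanh(x + 1e-6) - np.tanh(x)) / 1e-6
print(np.allclose(tanh_closed, tanh_numerical, atol=1e-5))  # True
```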
MLP training
• Key idea: use gradient descent to search the hypothesis space of possible weight vectors and find the weights that best fit the training examples.
• Learn the wi's that minimize the squared error:

E[w] = ½ Σ_{d∈D} (t_d − o_d)²,   D = training data
Gradient Descent
E[w] = ½ Σ_{d∈D} (t_d − o_d)²,   η positive

[Figure: error surface E(w0, w1) for one neuron with two inputs (and two weights w0 and w1); gradient descent moves downhill on this surface]

Gradient:  ∇E[w] = [ ∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn ]

Rule:  Δw = −η ∇E[w],   i.e.   Δwi = −η ∂E/∂wi

The minus sign is used because it is desired to move the weight vector in the direction in which the error E decreases.
Gradient Descent (one layer)
∂E/∂wi = ∂/∂wi [ ½ Σ_d (t_d − o_d)² ]
       = ½ Σ_d ∂/∂wi (t_d − o_d)²
       = ½ Σ_d 2 (t_d − o_d) ∂/∂wi (t_d − o_d)
       = Σ_d (t_d − o_d) ∂/∂wi (t_d − σ(w · x_d))
       = Σ_d (t_d − o_d) ( −∂/∂wi σ(w · x_d) )
       = −Σ_d (t_d − o_d) σ(w · x_d) (1 − σ(w · x_d)) x_{i,d}

where σ is the logistic function and o_d = σ(w · x_d).
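A minimal batch-gradient-descent sketch for a single sigmoid unit, using the implication data from the earlier exercise as an illustrative training set; the learning rate and number of epochs are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_single_sigmoid_unit(X, t, eta=0.5, epochs=5000, seed=0):
    """Batch gradient descent on E[w] = 1/2 sum_d (t_d - o_d)^2 for one
    sigmoid unit: dE/dw_i = -sum_d (t_d - o_d) o_d (1 - o_d) x_{i,d}."""
    w = np.random.default_rng(seed).uniform(-0.1, 0.1, X.shape[1])
    for _ in range(epochs):
        o = sigmoid(X @ w)                         # outputs for all samples
        grad = -(X.T @ ((t - o) * o * (1 - o)))    # gradient of E[w]
        w -= eta * grad                            # w <- w - eta * grad(E)
    return w

# Implication y = x1 -> x2 as illustrative data; first column is a bias input of 1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([1, 1, 0, 1], dtype=float)
w = train_single_sigmoid_unit(X, t)
print(np.round(sigmoid(X @ w), 2))                 # outputs approach the targets
```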
Gradient Descent: multiple outputs
The error must now be redefined over the k outputs:

E[w] = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²,   D = training data
The Backpropagation algorithm
• A feed-forward network with one hidden layer of nhid sigmoidal units, nout sigmoidal output units, nin inputs, and several training samples <x, t>.

[Figure: inputs x (nin) → hidden layer (nhid, weights W) → network output o (nout), compared against the target output t]

Input parameters of the backpropagation algorithm: {<x1, t1>, <x2, t2>, ...}, η, nhid, nin, nout
The Backpropagation algorithm
• Initialize w to small random numbers
• Until the termination condition is met, do:
  – For each <x, t> in training-examples, do:
    // Propagate the input forward:
    1. Propagate the input forward and compute each output ok.
    // Propagate the error backward:
    2. For each output unit k, calculate the error δk:  δk ← ok (1 − ok) (tk − ok)
    3. For each hidden unit h, calculate the error δh:  δh ← oh (1 − oh) Σ_{k∈outputs} whk δk
    4. Update each network weight wij:  wij ← wij + η δj xij   (where xij is the input from i to j)
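A compact NumPy sketch of this loop for one hidden layer, trained on XOR as an illustrative task; the network size, learning rate, epoch count and the explicit bias inputs are assumptions, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, T, nhid=3, eta=0.5, epochs=20000, seed=0):
    """Stochastic backpropagation for a nin-nhid-nout sigmoidal network
    with explicit bias inputs (a constant 1 appended to x and to h)."""
    rng = np.random.default_rng(seed)
    nin, nout = X.shape[1], T.shape[1]
    W1 = rng.uniform(-0.1, 0.1, (nin + 1, nhid))   # input (+bias) -> hidden
    W2 = rng.uniform(-0.1, 0.1, (nhid + 1, nout))  # hidden (+bias) -> output
    for _ in range(epochs):
        for x, t in zip(X, T):
            xb = np.append(x, 1.0)                 # 1. forward pass
            h = sigmoid(xb @ W1)
            hb = np.append(h, 1.0)
            o = sigmoid(hb @ W2)
            delta_o = o * (1 - o) * (t - o)                # 2. output errors
            delta_h = h * (1 - h) * (W2[:-1] @ delta_o)    # 3. hidden errors
            W2 += eta * np.outer(hb, delta_o)              # 4. weight updates
            W1 += eta * np.outer(xb, delta_h)
    return W1, W2

def recall(X, W1, W2):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    Hb = np.hstack([sigmoid(Xb @ W1), np.ones((len(X), 1))])
    return sigmoid(Hb @ W2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets
W1, W2 = train_backprop(X, T)
print(np.round(recall(X, W1, W2), 2))   # typically close to 0, 1, 1, 0 (depends on the initialization)
```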
Backpropagation:
animation

http://www.trapexit.org/images/b/ba/Animate_ANN.gif
46
Example:
An 8 × 3 × 8 network is trained to reproduce its input; initial weights in (−0.1, +0.1); η = 0.3.

Input            Output
10000000  →  10000000
01000000  →  01000000
00100000  →  00100000
00010000  →  00010000
00001000  →  00001000
00000100  →  00000100
00000010  →  00000010
00000001  →  00000001
Learning the representation (inner layer)

Input           Hidden-unit outputs      Output
10000000  →   .89 .04 .08   →   10000000
01000000  →   .15 .99 .99   →   01000000
00100000  →   .01 .97 .27   →   00100000
00010000  →   .99 .97 .71   →   00010000
00001000  →   .03 .05 .02   →   00001000
00000100  →   .01 .11 .88   →   00000100
00000010  →   .80 .01 .98   →   00000010
00000001  →   .60 .94 .01   →   00000001

Rounding the hidden activations gives a distinct 3-bit code per input (e.g. .89 .04 .08 → 1 0 0, .15 .99 .99 → 0 1 1): the intermediate representation "discovers" a binary code !!!
Application × ANN type
• There are many other types of ANN:
– For classification or prediction tasks, usually feed-
forward networks (such as MLP) are used.
– For clustering tasks, the types of network used
are: Simple Competitive Networks, Adaptive
Resonance Theory (ART) networks, Kohonen Self-
Organizing Maps (SOM).
– In association tasks, an ANN can be trained to "remember" a number of patterns; the type of ANN usually used for this task is the Hopfield network.
Recent advances and future
applications of ANNs
• Integration of fuzzy logic into neural networks
• Pulsed neural networks
• Hardware specialized for neural networks
• Deep Learning:
– enable much deeper (and larger) networks (5 to 10 hidden layers): they have the ability to build up a complex hierarchy of concepts.
– one can show that there are functions which a k-layer
network can represent compactly (with a number of
hidden units that is polynomial in the number of inputs),
that a (k − 1)-layer network cannot represent unless it has
an exponentially large number of hidden units.
References
• Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach. 2nd edition. Prentice Hall, 2003. Ch. 20.5.
• Mitchell, T. M. Machine Learning. WCB/McGraw-Hill, 1997. Ch. 4.
• Haykin, S. Neural Networks: A Comprehensive Foundation.
• Book in Portuguese: Braga, Ludermir and Carvalho. Redes Neurais Artificiais. LTC.

There is a lot of very interesting material on the web.
Deep Learning

Anna Helena Reali Costa


“More intuitive” applications
• Generating natural language descriptions (e.g. “Baseball player is throwing ball in game.”)
• Face recognition
• Dog or mop? Chihuahua or muffin?

Deep Learning Applications
Machine Learning depends on Representation
• The performance of simple machine learning algorithms depends heavily on the representation of the data they are given (features!).

[Figure 1.1 — Example of different representations: the same two-class data plotted in Cartesian coordinates (x, y) and in polar coordinates (r, θ). Representations matter: in one representation the classes are easy to separate, in the other they are not.]
Representation Learning

• Solution: to use machine learning to discover


not only the mapping from representation to
output but also the representation itself.
– This approach is known as representation learning.

• Deep learning allows the computer to build


complex representations out of simpler
representations.

56
[Figure: flowchart comparing the four approaches. Rule-based systems: input → hand-designed program → output. Classic machine learning: input → hand-designed features → mapping from features → output. Representation learning: input → learned features → mapping from features → output. Deep learning: input → simple features → additional layers of more abstract features → mapping from features → output.]
Deep Learning

https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-
machine-learning-deep-learning-ai/ 58
Historical Trends:
Growing Datasets

59
Historical Trends:
Growing Connections per Neuron

9: COTS HPC unsupervised CNN (Coates et al., 2013)


10: GoogLeNet (Szegedy et al., 2014) 60
Historical Trends:
Growing Number of Neurons

9: Echo state network (Jaeger and Haas, 2004)


20: GoogLeNet (Szegedy et al., 2014) 61
Historical Trends: Increasing Accuracy
• DL has solved increasingly complicated
applications with increasing accuracy:
– Image Recognition and Object Recognition:
• Large Scale Visual Recognition Challenge 2017 (ILSVRC2017)

[Figure 1.12: ILSVRC classification error rate by year (2010–2015). Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year and yielded lower and lower error rates each time. Data from Russakovsky et al.]
Deep Learning x GPUs
• What is the relation between DL and GPUs?

Deep Learning & GPU:


– Matrix operations
– Parallel computing paradigm

63
Problems with DNN (Deep MLP)
• Overfitting:
– The more layers you have, the
more degrees of freedom you
have.
– DNNs model rare dependencies
in the training data.
• Diffusion of Gradient: error
attenuates as it propagates to early
layers.
– Early layers never learn!

https://en.wikipedia.org/wiki/Deep_learning 64
Main Deep Learning Architectures
• Deep Belief Networks / Autoencoders
– Greedy layer-wise pretraining, by Hinton et al.,
2006
• Deep Convolutional Neural Networks
– LeNet, by LeCun et al., 1998.
• Deep Recurrent Networks
– Long Short-Term Memory, by Hochreiter & Schmidhuber, 1997.

65
Greedy layer-wise pretraining,
by Hinton et al., 2006.

AUTOENCODERS
Autoencoders

• A basic autoencoder has just one hidden layer.

• The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.
Autoencoders
• An autoencoder is trained, with an absolutely standard weight-adjustment algorithm, to reproduce the input.
• By making this happen with (many) fewer hidden units than inputs, the ‘hidden layer’ units are forced to become good feature detectors.

68
https://www.macs.hw.ac.uk/~dwcorne/Teaching/introdl.ppt
Representation learning (hidden layer)

Input           Hidden-unit outputs      Output
10000000  →   .89 .04 .08   →   10000000
01000000  →   .15 .99 .99   →   01000000
00100000  →   .01 .97 .27   →   00100000
00010000  →   .99 .97 .71   →   00010000
00001000  →   .03 .05 .02   →   00001000
00000100  →   .01 .11 .88   →   00000100
00000010  →   .80 .01 .98   →   00000010
00000001  →   .60 .94 .01   →   00000001

Rounding the hidden activations gives a distinct 3-bit code per input: the intermediate representation "discovers" a binary code !!!
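A minimal sketch of this 8-3-8 experiment using scikit-learn's MLPRegressor (assuming scikit-learn is installed); the exact hidden code obtained depends on the run, and the helper `encode` is a hypothetical name introduced here:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.eye(8)                              # the 8 one-hot patterns: input = target

# Undercomplete network: 3 logistic hidden units, trained to reproduce the input.
net = MLPRegressor(hidden_layer_sizes=(3,), activation='logistic',
                   solver='lbfgs', max_iter=10000, tol=1e-7, random_state=0)
net.fit(X, X)

def encode(X, model):
    """Hidden-layer activations computed from the fitted weights."""
    return 1.0 / (1.0 + np.exp(-(X @ model.coefs_[0] + model.intercepts_[0])))

print(np.round(encode(X, net), 2))         # roughly one distinct 3-value code per input
```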
Greedy Layer-Wise Training
Geoffrey E. Hinton, Simon Osindero and Yee-Whye Teh. A fast learning algorithm for deep belief nets, 2006.

1. Train the first layer using your data, without the labels.
2. Then freeze the first-layer parameters and start training the second layer, using the output of the first layer as the input to the second layer.
3. Repeat this for as many layers as desired:
   – this builds our set of robust features.
4. Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s) (leave the early weights frozen).
5. Unfreeze all weights and fine-tune the full network by training with a supervised approach, given the pre-trained weight settings.

(A rough sketch of this recipe appears below.)
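A rough Keras sketch of the recipe (assuming TensorFlow/Keras is available); the data, layer sizes, epochs and optimizer are placeholder choices, and plain autoencoders are used here in place of the RBMs of Hinton et al.'s original deep belief nets:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def pretrain_layer(H, n_hidden, epochs=100):
    """Steps 1-2: train a one-hidden-layer autoencoder on H (no labels used)
    and return the learned encoder weights [W, b]."""
    ae = keras.Sequential([
        keras.Input(shape=(H.shape[1],)),
        layers.Dense(n_hidden, activation='sigmoid', name='encoder'),
        layers.Dense(H.shape[1]),                    # linear decoder
    ])
    ae.compile(optimizer='adam', loss='mse')
    ae.fit(H, H, epochs=epochs, verbose=0)
    return ae.get_layer('encoder').get_weights()

# Placeholder data: 64-dimensional inputs with binary labels (illustrative only).
X = np.random.rand(256, 64).astype('float32')
y = np.random.randint(0, 2, (256, 1))

# Step 3: repeat layer by layer, feeding each autoencoder the output of the
# previously trained (frozen) encoders.
layer_sizes, pretrained, H = (32, 16), [], X
for n_hidden in layer_sizes:
    w = pretrain_layer(H, n_hidden)
    pretrained.append(w)
    H = 1.0 / (1.0 + np.exp(-(H @ w[0] + w[1])))     # sigmoid encoding of H

# Step 4: stack the frozen pre-trained encoders under a supervised output layer.
frozen = [layers.Dense(n, activation='sigmoid', trainable=False) for n in layer_sizes]
model = keras.Sequential([keras.Input(shape=(64,))] + frozen
                         + [layers.Dense(1, activation='sigmoid')])
for layer, w in zip(frozen, pretrained):
    layer.set_weights(w)
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, y, epochs=20, verbose=0)

# Step 5: unfreeze all weights and fine-tune the full network.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, y, epochs=20, verbose=0)
```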
How to train this network?
[Figure sequence: (1) an autoencoder is trained on the first layer; (2) an autoencoder is trained on the second layer; (3) supervised learning is applied to the last layer; (4) the full network is fine-tuned with supervised learning.]

Advanced Machine Learning and Neural Networks, Tony Martinez
http://axon.cs.byu.edu/~martinez/classes/678/
http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
CONVOLUTIONAL NEURAL
NETWORKS

76
CNN
• We know it is good to learn a small model.
• From this fully connected model, do we really need all the
edges?
• Can some of these be shared?

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Consider learning an image:
• Some patterns are much smaller than the
whole image

Can represent a small region with fewer parameters

“beak” detector
https://cs.uwaterloo.ca/~mli/cs898-2017.html
Same pattern appears in different places: they can be compressed!
What about training a lot of such "small" detectors, with each detector "moving around" the image?

[Figure: an "upper-left beak" detector and a "middle beak" detector can be compressed to the same parameters.]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.

[Figure: a filter acting as a beak detector]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution
The filters are the network parameters to be learned. Each filter detects a small (3 × 3) pattern.

6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution, Filter 1, stride = 1:
Slide the 3 × 3 filter over the 6 × 6 image and take the dot product at each position; the top-left position gives 3 and the next position to the right gives -1.

Convolution, Filter 1, stride = 2:
The filter jumps two positions at a time; the first row of the output becomes 3, -3.

Convolution, Filter 1, stride = 1 — complete 4 × 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Convolution, Filter 2, stride = 1 — repeat this for each filter; Filter 2 yields a second 4 × 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 × 4 feature maps together form a 2 × 4 × 4 matrix.
https://cs.uwaterloo.ca/~mli/cs898-2017.html
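A small NumPy sketch reproducing this computation; the function follows the slides' convention (a sliding dot product with no padding and no kernel flipping), and its name is chosen here for illustration:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """'Valid' 2-D convolution as used on the slides: slide the kernel over
    the image and take the dot product at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

print(convolve2d(image, filter1, stride=1))   # the 4 x 4 feature map from the slide
print(convolve2d(image, filter1, stride=2))   # a 2 x 2 map when stride = 2
```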
Color image: RGB, 3 channels
[Figure: for a color image the 6 × 6 image becomes a stack of three 6 × 6 channels, and each 3 × 3 filter correspondingly becomes a 3 × 3 × 3 filter that slides over the three channels together.]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Convolution as a sparsely connected layer with shared weights
[Figure: the 6 × 6 image is flattened into a 36-dimensional input vector. The first feature-map value (3) is a neuron that connects to only 9 of the inputs — the pixels covered by the filter — rather than being fully connected: fewer parameters! The next feature-map value (-1) is another neuron connected to a different group of 9 inputs but using the same 9 weights: shared weights, even fewer parameters.]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
The whole CNN
[Figure: image → Convolution → Max Pooling → (Convolution → Max Pooling, can be repeated many times) → Flattened → Fully Connected Feedforward network → outputs such as "cat", "dog", …]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Max Pooling
Starting from the two 4 × 4 feature maps produced by Filter 1 and Filter 2:

Filter 1 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Filter 2 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Why Pooling
• Subsampling pixels will not change the object: a subsampled bird is still a bird.
• We can subsample the pixels to make the image smaller → fewer parameters are needed to characterize the image.

https://cs.uwaterloo.ca/~mli/cs898-2017.html
A CNN compresses a fully connected
network in two ways:
• Reducing number of connections
• Shared weights on the edges
• Max pooling further reduces the complexity

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Max Pooling
[Figure: the 6 × 6 image goes through Conv and then 2 × 2 Max Pooling, producing a new but smaller 2 × 2 image per filter — 3 0 / 3 1 for Filter 1 and -1 1 / 0 3 for Filter 2. Each filter is a channel.]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
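A minimal NumPy sketch of non-overlapping 2 × 2 max pooling applied to the Filter 1 feature map from the slides:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping size x size max pooling."""
    h, w = feature_map.shape
    out = feature_map[:h - h % size, :w - w % size]       # trim to a multiple of size
    out = out.reshape(h // size, size, w // size, size)   # split into size x size blocks
    return out.max(axis=(1, 3))                           # max within each block

fmap1 = np.array([[ 3, -1, -3, -1],
                  [-3,  1,  0, -3],
                  [-3, -3,  0,  1],
                  [ 3, -2, -2, -1]])
print(max_pool2d(fmap1))   # [[3, 0], [3, 1]] — the smaller 2 x 2 image
```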
The whole CNN
[Figure: each Convolution → Max Pooling pair turns the image into a new, smaller image (here the 2 × 2 maps 3 0 / 3 1 and -1 1 / 0 3); the pair can be repeated many times. The number of channels of the new image is the number of filters.]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
The whole CNN
[Figure: after the repeated Convolution → Max Pooling stages, the final small image is Flattened and fed into a Fully Connected Feedforward network that produces the outputs ("cat", "dog", …).]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
Flattening
[Figure: the 2 × 2 pooled maps (3 0 / 3 1 and -1 1 / 0 3) are flattened into a single vector, which is then fed into the Fully Connected Feedforward network.]

https://cs.uwaterloo.ca/~mli/cs898-2017.html
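A rough Keras sketch of this whole pipeline (assuming TensorFlow/Keras); the input shape, filter counts and number of classes are placeholder choices:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution and max pooling repeated, then flattening and a fully connected classifier.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                       # placeholder image size
    layers.Conv2D(2, kernel_size=3, activation='relu'),   # 2 filters of 3 x 3
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(4, kernel_size=3, activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(32, activation='relu'),
    layers.Dense(2, activation='softmax'),                # e.g. cat vs. dog
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
```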
CNNs

http://stats.stackexchange.com/questions/146413/why-
convolutional-neural-networks-belong-to-deep-learning 97
LeNet-5
LeCun, Bottou, Bengio & Haffner, 1998

98
AlexNet
Krizhevsky, Sutskever & Hinton, 2012

99
VGG
Simonyan & Zisserman, 2014

https://arxiv.org/abs/1409.1556 100
DeepFace
Taigman, Yang, Ranzato & Wolf, 2014

DeepFace: Closing the Gap to Human-Level Performance in Face Verification


https://research.fb.com/publications/deepface-closing-the-gap-to-human-level-performance-in-
face-verification/ 101
GoogLeNet
Szegedy et al, 2015.

http://research.google.com/pubs/pub43022.html 102
Conclusion
• CNN: Special purpose net – Just for images or problems
with strong grid-like local spatial/temporal correlation
• Once trained on one problem, the same net (often fine-tuned) can be used for a new, similar problem – a general creator of vision features
• An autoencoder could be used to find initial parameters
• Lots of hand crafting and tuning to find the right recipe
of receptive fields, layer interconnections, etc.
– Lots more Hyperparameters than standard nets, and even than other deep
networks, since the structures of CNNs are more handcrafted
– CNNs getting wider and deeper with speed-up techniques (e.g. GPU, ReLU,
etc.) and lots of current research, excitement, and success
103
Fully Supervised Deep Learning
• Much recent success in doing fully supervised
deep learning with extensions which diminish the
effect of early learning difficulties (unstable
gradient, etc.)
• Patience (now that we know it may be worth it),
faster computers, and use of GPUs
• More efficient activation functions (e.g. ReLUs) in
terms of both computation and avoiding f'(net)
saturation
104
Open problems
• A More Scientific Approach is Needed, not
Just Building Better Systems…
– Geoff Hinton, Yoshua Bengio & Yann LeCun, NIPS
2015

105
