Lecture 15: Neural Nets
1
Learning highly non-linear functions
f: X → Y
– f might be a non-linear function
– X: (vector of) continuous and/or discrete variables
– Y: (vector of) continuous and/or discrete variables
Activation function
$X = \sum_{i=1}^{n} x_i w_i, \qquad Y = \begin{cases} 1, & \text{if } X \geq 0 \\ -1, & \text{otherwise} \end{cases}$
[Figure: a single unit, drawn with an Input Layer, Middle Layer, and Output Layer]
© Eric Xing @ CMU, 2006-2011 3
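A minimal sketch of this thresholded unit in Python (the specific weights, inputs, and threshold value below are illustrative, not from the slides):

```python
import numpy as np

def threshold_unit(x, w, theta=0.0):
    """Weighted sum of inputs followed by a hard threshold: returns +1 or -1."""
    net = np.dot(w, x)              # X = sum_i x_i * w_i
    return 1 if net >= theta else -1

# Example with made-up weights and inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.8, -0.1])
print(threshold_unit(x, w))         # +1 or -1 depending on the weighted sum
```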
Connectionist Models
Consider humans:
– Neuron switching time: ~0.001 second
– Number of neurons: ~10^10
[Figure: a biological neuron with dendrites, axon, and synapses, annotated as nodes, weights (synapses), and +/- signals]
– Loss function (examples: mean-squared error, cross entropy)
7
Background: A Recipe for Machine Learning
1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)
8
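Written out, the four ingredients take a standard form; this is a sketch using conventional notation, since the original slide's equations did not survive the transcript:

$$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}, \qquad \hat{y} = f_{\theta}(\mathbf{x}), \qquad \ell(\hat{y}, y)$$
$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \ell\!\left(f_{\theta}(\mathbf{x}^{(i)}),\, y^{(i)}\right), \qquad \theta \leftarrow \theta - \eta\, \nabla_{\theta}\, \ell\!\left(f_{\theta}(\mathbf{x}^{(i)}),\, y^{(i)}\right)$$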
Background: A Recipe for Machine Learning (Gradients)
1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
9
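A minimal SGD loop for step 4, as a sketch; the gradient function, learning rate, and toy data below are placeholders of our own, not the lecture's notation:

```python
import random

def sgd(grad_loss, theta, data, lr=0.1, epochs=20):
    """Step 4 of the recipe: take small steps opposite the gradient, one example at a time."""
    for _ in range(epochs):
        random.shuffle(data)                      # visit examples in random order
        for x, y in data:
            theta = theta - lr * grad_loss(theta, x, y)
    return theta

# Toy example: fit y ~ theta * x by least squares (made-up data)
grad = lambda theta, x, y: 2 * (theta * x - y) * x
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
print(sgd(grad, theta=0.0, data=data))            # converges near 2.0
```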
Background: A Recipe for Machine Learning
Goals for Today’s Lecture:
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
10
Decision Functions: Linear Regression
Output y
θ1 θ2 θ3 θM
Input x1 x2 x3 … xM
11
Decision Functions: Logistic Regression
Output y
θ1 θ2 θ3 θM
Input x1 x2 x3 … xM
12
Decision Functions: Logistic Regression
Output y
[Figure: three example images labeled “Face”, “Face”, “Not a face”]
θ1 θ2 θ3 θM
Input x1 x2 x3 … xM
13
Decision Functions: Logistic Regression
Output y
[Figure: training points labeled 1, 1, 0 plotted in the (x1, x2) plane]
θ1 θ2 θ3 … θM
Input x1 x2 x3 … xM
14
Decision Functions: Logistic Regression
Output y
θ1 θ2 θ3 θM
Input x1 x2 x3 … xM
15
Neural Network Model
[Figure: inputs Age = 34, Gender = 2, Stage = 4 feed through weighted edges (.6, .4, .2, .1, .5, .3, .8, .7, .2, …) into sigmoid (Σ) hidden units and a sigmoid output unit, producing the output 0.6 = “Probability of being alive”]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)
Decision Functions: Neural Network
Output y
Hidden Layer a1 a2 … aD
Input x1 x2 x3 … xM
22
Decision Functions: Neural Network
Output y
z1 z2 … zD
Hidden Layer
x1 x2 x3 … xM
Input
23
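A sketch of the forward pass for this one-hidden-layer network, assuming sigmoid units at both layers (the dimensions and random weights below are made up for illustration):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, b1, w2, b2):
    """One hidden layer: z = sigmoid(W1 x + b1), y = sigmoid(w2 . z + b2)."""
    z = sigmoid(W1 @ x + b1)        # hidden units z_1 ... z_D
    y = sigmoid(w2 @ z + b2)        # single output y
    return y, z

# Example with M = 3 inputs and D = 2 hidden units (random weights)
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.2])
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)
w2, b2 = rng.normal(size=2), 0.0
y, z = forward(x, W1, b1, w2, b2)
print(y)
```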
Building a Neural Net
Output y
Features x1 x2 … xM
24
Building a Neural Net
Output y
Hidden Layer a1 a2 … aD
D = M
[Figure: each hidden unit connected to its corresponding input with weight 1]
Input x1 x2 … xM
25
Building a Neural Net
Output y
Hidden Layer a1 a2 … aD
D=M
Input x1 x2 … xM
26
Building a Neural Net
Output y
Hidden Layer a1 a2 … aD
D=M
Input x1 x2 … xM
27
Building a Neural Net
Output y
Hidden Layer a1 a2 … aD
D<M
Input x1 x2 x3 … xM
28
Decision Boundary
• 0 hidden layers: linear classifier
– Hyperplanes
[Figure: network computing y from x1, x2, and the resulting linear decision boundary in the (x1, x2) plane]
• 2 hidden layers
– Combinations of convex regions
[Figure: decision regions in the (x1, x2) plane]
Output y1 … yK
Hidden Layer a1 a2 … aD
Input x1 x2 x3 … xM
32
Decision Functions: Deeper Networks
Next lecture:
Output y
a1 a2 … aD
Hidden Layer 1
x1 x2 x3 … xM
Input
33
Decision Functions: Deeper Networks
Next lecture:
Output y
b1 b2 … bE
Hidden Layer 2
a1 a2 … aD
Hidden Layer 1
x1 x2 x3 … xM
Input
34
Decision Functions: Deeper Networks
Next lecture: Making the neural networks deeper
Output y
Hidden Layer 3 c1 c2 … cF
Hidden Layer 2 b1 b2 … bE
Hidden Layer 1 a1 a2 … aD
Input x1 x2 x3 … xM
35
Decision Functions: Different Levels of Abstraction
• We don’t know the “right” levels of abstraction
• So let the model figure it out!
36
Example from Honglak Lee (NIPS 2010)
Decision Functions: Different Levels of Abstraction
Face Recognition:
– Deep Network can build up increasingly higher levels of abstraction
– Lines, parts, regions
37
Example from Honglak Lee (NIPS 2010)
Decision Functions: Different Levels of Abstraction
Output y
c1 c2 … cF
Hidden Layer 3
b1 b2 … bE
Hidden Layer 2
a1 a2 … aD
Hidden Layer 1
x1 x2 x3 … xM
Input
38
Example from Honglak Lee (NIPS 2010)
ARCHITECTURES
39
Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make (a configurable sketch follows below):
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
40
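As a sketch of how those four choices become configuration parameters, here is a hypothetical minimal NumPy multi-layer perceptron (not code from the lecture): depth and width come from `sizes`, the nonlinearity from `activation`, and the objective would be chosen separately when training.

```python
import numpy as np

class MLP:
    """Multi-layer perceptron with configurable depth, width, and nonlinearity."""
    def __init__(self, sizes, activation=np.tanh, seed=0):
        # sizes = [M, D1, ..., K]: input dim, hidden widths (depth = len(sizes) - 2), output dim
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[1:], sizes[:-1])]
        self.b = [np.zeros(m) for m in sizes[1:]]
        self.activation = activation             # choice of nonlinearity

    def forward(self, x):
        for W, b in zip(self.W[:-1], self.b[:-1]):
            x = self.activation(W @ x + b)        # hidden layers
        return self.W[-1] @ x + self.b[-1]        # linear output (to be paired with a loss)

net = MLP(sizes=[4, 8, 8, 3])                     # 2 hidden layers of width 8
print(net.forward(np.ones(4)))
```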
Activation Functions
Neural Network with sigmoid
activation functions
Output y
z1 z2 … zD
Hidden Layer
x1 x2 x3 … xM
Input
41
Activation Functions
Neural Network with arbitrary
nonlinear activation functions
Output y
z1 z2 … zD
Hidden Layer
x1 x2 x3 … xM
Input
42
Activation Functions
Sigmoid / Logistic Function
So far, we’ve assumed that the activation function (nonlinearity) is always the sigmoid function…
43
Activation Functions
• A new change: modifying the nonlinearity
– The logistic is not widely used in modern ANNs
Alternate 1: tanh
[Figure: plot comparing the sigmoid and tanh curves (annotated “depth 4?”)]
48
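For comparison, the two nonlinearities side by side in a small sketch (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))     # range (0, 1)

def tanh(v):
    return np.tanh(v)                    # range (-1, 1), zero-centered

v = np.linspace(-4, 4, 9)
print(np.round(sigmoid(v), 3))
print(np.round(tanh(v), 3))
# tanh is a rescaled, shifted sigmoid: tanh(v) = 2*sigmoid(2v) - 1
```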
Multi-Class Output
Output y1 … yK
Hidden Layer a1 a2 … aD
Input x1 x2 x3 … xM
49
Multi-Class Output
Softmax:
Output y1 … yK
Hidden Layer a1 a2 … aD
Input x1 x2 x3 … xM
50
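The softmax itself, written in standard form with $b_k$ denoting the pre-activation of output unit $k$ (the slide's exact symbols are not preserved in this transcript):

$$y_k = \frac{\exp(b_k)}{\sum_{j=1}^{K} \exp(b_j)}, \qquad k = 1, \dots, K$$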
Cross-entropy vs. Quadratic loss
52
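For reference, the two objectives being compared, stated in their standard single-example form for a target $y^{*}$ and prediction $y$ (the slide's own figure did not survive the transcript):

$$\ell_{\text{quadratic}}(y, y^{*}) = \tfrac{1}{2}\,(y - y^{*})^{2}, \qquad \ell_{\text{cross-entropy}}(y, y^{*}) = -\,y^{*} \log y - (1 - y^{*}) \log(1 - y)$$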
Objective Functions
Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer.

Minimizing…
1) …sum of squared errors…
2) …sum of squared errors plus squared Euclidean norm of weights…
3) …cross-entropy…
4) …hinge loss…

…gives…
5) …MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value
54
Background: A Recipe for Machine Learning
1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)
55
Training: Backpropagation
• Question 1:
When can we compute the gradients of the
parameters of an arbitrary neural network?
• Question 2:
When can we make the gradient
computation efficient?
56
Training: Chain Rule
Given: $y = g(u)$ and $u = h(x)$
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j}\,\frac{du_j}{dx_k} \quad \forall i, k$
[Figure: computation graph with output y1, intermediate quantities u1 u2 … uJ, and input x2]
57
Training: Chain Rule
Given: $y = g(u)$ and $u = h(x)$
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j}\,\frac{du_j}{dx_k} \quad \forall i, k$
Backpropagation is just repeated application of the chain rule from Calculus 101.
[Figure: computation graph with output y1, intermediate quantities u1 u2 … uJ, and input x2]
58
Training: Chain Rule
Given: $y = g(u)$ and $u = h(x)$
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j}\,\frac{du_j}{dx_k} \quad \forall i, k$
[Figure: computation graph with output y1, intermediate quantities u1 u2 … uJ, and input x2]
Backpropagation (sketched in code below):
1. Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node’s intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.
60
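A compact sketch of those four steps on a toy computation graph (the graph $y = x^2 \sin x$ and the class below are ours, purely for illustration):

```python
import math

# Toy DAG: x -> u1 = x**2, u2 = sin(x), y = u1 * u2
class Node:
    def __init__(self, value, parents=()):       # parents: list of (parent, local_derivative)
        self.value, self.parents, self.grad = value, parents, 0.0   # partials start at 0

x  = Node(1.5)
u1 = Node(x.value ** 2,        parents=[(x, 2 * x.value)])
u2 = Node(math.sin(x.value),   parents=[(x, math.cos(x.value))])
y  = Node(u1.value * u2.value, parents=[(u1, u2.value), (u2, u1.value)])

# Sweep in reverse topological order, adding each node's contribution to its parents
y.grad = 1.0
for node in [y, u2, u1, x]:
    for parent, local_deriv in node.parents:
        parent.grad += node.grad * local_deriv

print(x.grad)                                           # dy/dx via backprop
print(2 * 1.5 * math.sin(1.5) + 1.5**2 * math.cos(1.5)) # analytic check: 2x sin x + x^2 cos x
```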
Training: Backpropagation
61
Training: Backpropagation
Case 1: Logistic Regression
Output y
θ1 θ2 θ3 … θM
Input x1 x2 x3 … xM
62
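For Case 1, backpropagation reduces to the familiar logistic regression gradient; a sketch assuming the cross-entropy objective (the variable names are ours):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def logistic_regression_grad(theta, x, y_star):
    """Forward pass, then gradient of cross-entropy w.r.t. theta via the chain rule."""
    a = np.dot(theta, x)          # linear score
    y = sigmoid(a)                # predicted probability
    dJ_da = y - y_star            # dJ/dy * dy/da simplifies to (y - y*)
    return dJ_da * x              # dJ/dtheta_m = (y - y*) * x_m

theta = np.array([0.2, -0.5, 0.1])
x, y_star = np.array([1.0, 2.0, -1.0]), 1.0
print(logistic_regression_grad(theta, x, y_star))
```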
Training: Backpropagation
Output y
z1 z2 … zD
Hidden Layer
x1 x2 x3 … xM
Input
63
Training: Backpropagation
Output y
z1 z2 … zD
Hidden Layer
x1 x2 x3 … xM
Input
64
Training: Backpropagation
Case 2: Neural Network
Output y
Hidden Layer z1 z2 … zD
Input x1 x2 x3 … xM
65
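For Case 2, a sketch of backprop through one hidden layer with sigmoid units and cross-entropy loss (the weight names and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop(x, y_star, W1, w2):
    """Forward pass, then gradients of cross-entropy loss w.r.t. W1 and w2."""
    z = sigmoid(W1 @ x)                 # hidden layer z_1 ... z_D
    y = sigmoid(w2 @ z)                 # output y
    dJ_db = y - y_star                  # derivative w.r.t. the output pre-activation
    grad_w2 = dJ_db * z                 # dJ/dw2_j = (y - y*) z_j
    dJ_da = dJ_db * w2 * z * (1 - z)    # back through w2 and the hidden sigmoids
    grad_W1 = np.outer(dJ_da, x)        # dJ/dW1_{jm} = dJ/da_j * x_m
    return grad_W1, grad_w2

rng = np.random.default_rng(0)
x, y_star = np.array([1.0, 0.5, -0.2]), 1.0
W1, w2 = rng.normal(size=(2, 3)), rng.normal(size=2)
gW1, gw2 = backprop(x, y_star, W1, w2)
print(gW1.shape, gw2.shape)             # (2, 3) (2,)
```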
Given: $y = g(u)$ and $u = h(x)$
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j}\,\frac{du_j}{dx_k} \quad \forall i, k$
[Figure: computation graph with intermediate quantities u1 u2 … uJ and input x2]
Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each node represents a Tensor.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivatives of the goal with respect to that node’s Tensor.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.
[Figure: the computation organized as a stack of modules (Module 1, Module 2, Module 3); a code sketch follows below]
68
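Matching the module picture, each module can store its forward quantities and accumulate partial derivatives in a backward sweep; a hypothetical minimal sketch (module types and shapes are our own choices):

```python
import numpy as np

class Linear:
    """Module with a Tensor parameter W; stores forward inputs and accumulates parameter grads."""
    def __init__(self, W):
        self.W, self.dW = W, np.zeros_like(W)
    def forward(self, x):
        self.x = x                          # store quantity from the forward pass
        return self.W @ x
    def backward(self, dy):                 # dy = partial of goal w.r.t. this module's output
        self.dW += np.outer(dy, self.x)     # add contribution to the parameter's partials
        return self.W.T @ dy                # pass partials back to the module's input

class Sigmoid:
    def forward(self, a):
        self.z = 1.0 / (1.0 + np.exp(-a))
        return self.z
    def backward(self, dz):
        return dz * self.z * (1.0 - self.z)

# Module 1 -> Module 2 -> Module 3 (made-up shapes), goal J = sum of outputs
modules = [Linear(np.ones((3, 2)) * 0.1), Sigmoid(), Linear(np.ones((1, 3)) * 0.5)]
h = np.array([1.0, -2.0])
for m in modules:
    h = m.forward(h)
g = np.ones_like(h)                         # dJ/d(output) for J = sum(outputs)
for m in reversed(modules):
    g = m.backward(g)
print(modules[0].dW)                        # gradient of J w.r.t. the first module's W
```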
Background: A Recipe for Machine Learning (Gradients)
1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
69
Summary
1. Neural Networks…
– provide a way of learning features
– are highly nonlinear prediction functions
– (can be) a highly parallel network of logistic regression classifiers
– discover useful hidden representations of the input
2. Backpropagation…
– provides an efficient way to compute gradients
– is a special case of reverse-mode automatic differentiation
70