Lecture 15: Neural Nets

The document discusses neural networks and their biological inspiration. It introduces the basic components of artificial neurons and how they are connected to form neural networks. It also covers early applications of neural networks in areas like speech recognition and how they learn through supervised learning and gradient descent.


NEURAL NETWORKS

1

Learning highly non-linear functions
f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables

Examples: the XOR gate; speech recognition

© Eric Xing @ CMU, 2006-2011 2


Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
  [Figure: a biological neuron (soma, dendrites, axon, synapses) side by side with
  an artificial neuron: inputs x1 and x2 enter through weights w1 and w2, a linear
  combiner sums them, and a hard limiter with threshold θ produces the output Y.]

• Activation function:
  X = Σ_{i=1}^{n} x_i w_i
  Y = +1 if X ≥ θ, and −1 if X < θ

• Artificial neuron networks
  [Figure: input signals enter the input layer, pass through a middle (hidden)
  layer, and leave as output signals from the output layer.]
  – supervised learning
  – gradient descent

© Eric Xing @ CMU, 2006-2011 3
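To make the threshold unit above concrete, here is a minimal sketch in Python/NumPy; the AND example, names, and threshold value are illustrative assumptions, not from the slides.

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Hard-limiter unit from the slide: output +1 if the weighted sum of the
    inputs reaches the threshold theta, otherwise output -1."""
    X = np.dot(x, w)              # linear combiner: X = sum_i x_i * w_i
    return 1 if X >= theta else -1

# Example: a perceptron computing logical AND of two binary inputs
w = np.array([1.0, 1.0])
theta = 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x), w, theta))
```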
Connectionist Models
• Consider humans:
  – Neuron switching time: ~0.001 second
  – Number of neurons: ~10^10
  – Connections per neuron: ~10^4 to 10^5
  – Scene recognition time: ~0.1 second
  – 100 inference steps doesn't seem like enough
    → much parallel computation
• Properties of artificial neural nets (ANN):
  – Many neuron-like threshold switching units
  – Many weighted interconnections among units
  – Highly parallel, distributed processes
[Figure: a biological neuron labeled with dendrites, soma, axon, and synapses,
relabeled as nodes, weights (synapses), and +/− connections.]

© Eric Xing @ CMU, 2006-2011 4


Motivation: Why is everyone talking about Deep Learning?
• Because a lot of money is invested in it…
  – DeepMind: acquired by Google for $400 million
  – DNNResearch: a three-person startup (including Geoff Hinton) acquired by
    Google for an unknown price tag
  – Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups
    commanding millions of VC dollars
• Because it made the front page of the New York Times
5
Motivation: Why is everyone talking about Deep Learning?
[Timeline in the margin: 1960s, 1980s, 1990s, 2006, 2016]
Deep learning:
– Has won numerous pattern recognition competitions
– Does so with minimal feature engineering
This wasn't always the case!
Since the 1980s the form of the models hasn't changed much, but there are lots of new tricks…
– More hidden units
– Better (online) optimization
– New nonlinear functions (ReLUs)
– Faster computers (CPUs and GPUs)
6
Background: A Recipe for Machine Learning
1. Given training data
   [Figure: example images labeled "Face", "Face", "Not a face"]
2. Choose each of these:
   – Decision function
     Examples: Linear regression, Logistic regression, Neural Network
   – Loss function
     Examples: Mean-squared error, Cross entropy
7
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)
   (a minimal sketch of the whole recipe follows below)
8
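Putting the four steps together, here is a minimal sketch of the recipe, assuming a linear decision function and squared loss; the data, names, and step size are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                     # 1. training data (inputs)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)   #    and targets

theta = np.zeros(3)            # 2. parameters of a linear decision function f(x) = theta . x
lr = 0.01                      # SGD step size

for epoch in range(50):        # 4. train with SGD
    for i in rng.permutation(len(X)):
        y_hat = X[i] @ theta                   # decision function on one example
        grad = (y_hat - y[i]) * X[i]           # gradient of the squared loss 0.5*(y_hat - y)^2
        theta -= lr * grad                     # small step opposite the gradient

print(theta)   # approaches [2.0, -1.0, 0.5], i.e. the goal (3.) of minimizing average loss
```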
Background: A Recipe for Machine Learning (Gradients)
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it's a special case of a more
general algorithm called reverse-mode automatic differentiation that can
compute the gradient of any differentiable function efficiently!
9
Background: A Recipe for Machine Learning (Goals for Today's Lecture)
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
10
Decision Functions: Linear Regression
[Figure: output y connected to inputs x1, x2, x3, …, xM by weights θ1, θ2, θ3, …, θM.]
11
Decision Functions: Logistic Regression
[Figures (slides 12–15): the same network (output y, weights θ1, θ2, θ3, …, θM,
inputs x1, x2, x3, …, xM). The training images are labeled "Face", "Face",
"Not a face", then 1, 1, 0, and plotted in the (x1, x2) plane.]
12–15
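A minimal sketch of these two decision functions, assuming the standard parameterizations y = θᵀx for linear regression and P(y = 1 | x) = σ(θᵀx) for logistic regression; the example numbers are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_regression(x, theta):
    """Decision function for linear regression: a weighted sum of the inputs."""
    return np.dot(theta, x)

def logistic_regression(x, theta):
    """Decision function for logistic regression: squash the weighted sum through
    the sigmoid to get P(y = 1 | x), e.g. the probability of 'face'."""
    return sigmoid(np.dot(theta, x))

x = np.array([0.5, -1.2, 3.0])       # one example with M = 3 features
theta = np.array([0.1, 0.4, -0.3])   # illustrative weights
print(linear_regression(x, theta), logistic_regression(x, theta))
```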
Neural Network Model
[Figure: inputs Age = 34, Gender = 2, Stage = 4 feed a hidden layer of sigmoid
(Σ) units through weights such as .6, .4, .2, .1, .5, .3, .2, .8; the hidden
units feed an output sigmoid unit (weights .7, .2) that produces 0.6, the
"Probability of beingAlive".]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)

© Eric Xing @ CMU, 2006-2011 16


“Combined logistic models”
[Figures (slides 17–19): the same Age / Gender / Stage network, with different
subsets of the weights highlighted. Each hidden unit is itself a logistic model
of the inputs, and the output unit is a logistic model of the hidden units,
producing the 0.6 "Probability of beingAlive".]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)

© Eric Xing @ CMU, 2006-2011 17–19


Not really, no target for hidden units...
[Figure: the full Age / Gender / Stage network again, with all weights shown
(.6, .4, .2, .1, .5, .3, .2, .8, .7, .2) and the output 0.6, the
"Probability of beingAlive".]
Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)

© Eric Xing @ CMU, 2006-2011 20


Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”

Logistic Regression Model (the sigmoid unit)
[Figure: inputs Age = 34, Gender = 1, Stage = 4 feed a single sigmoid unit
through coefficients 5, 4, 8, producing the output 0.6, the
"Probability of beingAlive".]
Independent variables (x1, x2, x3) → Coefficients (a, b, c) → Dependent variable (p), the Prediction

© Eric Xing @ CMU, 2006-2011 21
Decision Functions: Neural Network
[Figures (slides 22–23): output y; a hidden layer of units a1, a2, …, aD
(written z1, z2, …, zD on the second slide); inputs x1, x2, x3, …, xM.]
22–23
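A minimal sketch of the forward pass for this one-hidden-layer network, assuming sigmoid activations throughout; the weight names alpha and beta and the layer sizes are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, alpha, beta):
    """One-hidden-layer network: inputs x (M,) -> hidden z (D,) -> scalar output y."""
    z = sigmoid(alpha @ x)     # hidden layer: D sigmoid units over the inputs
    y = sigmoid(beta @ z)      # output: a sigmoid unit over the hidden units
    return y, z

rng = np.random.default_rng(0)
M, D = 4, 3                               # M inputs, D hidden units
alpha = rng.normal(size=(D, M))           # input-to-hidden weights
beta = rng.normal(size=D)                 # hidden-to-output weights
y, z = forward(rng.normal(size=M), alpha, beta)
print(y)                                  # a value in (0, 1)
```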
Building a Neural Net
[Figures (slides 24–28): start with the output y computed directly from the
features x1, x2, …, xM; then insert a hidden layer a1, a2, …, aD. First D = M
with the weights fixed to 1 (each hidden unit just passes one input through),
then D = M with learned weights, and finally D < M, so the hidden layer becomes
a smaller, learned representation of the input.]
24–28
Decision Boundary
• 0 hidden layers: linear classifier
  – Hyperplanes
[Figure: linear decision boundaries in the (x1, x2) plane.]
Example from Eric Postma via Jason Eisner 29

Decision Boundary
• 1 hidden layer
  – Boundary of convex region (open or closed)
[Figure: convex decision regions in the (x1, x2) plane.]
Example from Eric Postma via Jason Eisner 30

Decision Boundary
• 2 hidden layers
  – Combinations of convex regions
[Figure: non-convex decision regions built from combinations of convex regions
in the (x1, x2) plane.]
Example from Eric Postma via Jason Eisner 31

Decision Functions: Multi-Class Output
[Figure: outputs y1, …, yK; hidden layer a1, a2, …, aD; inputs x1, x2, x3, …, xM.]
32
Decision Functions: Deeper Networks
Next lecture: making the neural networks deeper.
[Figures (slides 33–35): the network grows from one hidden layer (a1, …, aD),
to two (adding b1, …, bE), to three (adding c1, …, cF), all between the inputs
x1, x2, x3, …, xM and the output y.]
33–35
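A minimal sketch of how the forward pass generalizes as hidden layers are stacked; the layer sizes and the sigmoid nonlinearity are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_forward(x, weight_matrices):
    """Apply each layer in turn: a = sigma(W1 x), b = sigma(W2 a), c = sigma(W3 b), ..."""
    h = x
    for W in weight_matrices:
        h = sigmoid(W @ h)
    return h

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2, 1]   # M inputs -> hidden layers of size D, E, F -> scalar output
Ws = [rng.normal(size=(out, inp)) for inp, out in zip(sizes[:-1], sizes[1:])]
print(deep_forward(rng.normal(size=5), Ws))
```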
Decision Functions: Different Levels of Abstraction
• We don't know the “right” levels of abstraction
• So let the model figure it out!
Example from Honglak Lee (NIPS 2010) 36

Decision Functions: Different Levels of Abstraction
Face Recognition:
– A Deep Network can build up increasingly higher levels of abstraction
– Lines, parts, regions
Example from Honglak Lee (NIPS 2010) 37

Decision Functions: Different Levels of Abstraction
[Figure: the three-hidden-layer network (inputs x1, x2, x3, …, xM; hidden
layers a, b, c; output y) aligned with the face-recognition feature hierarchy.]
Example from Honglak Lee (NIPS 2010) 38
ARCHITECTURES

39
Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make
(a small configuration sketch follows after this list):
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function

40
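To make these four choices concrete, here is a minimal sketch of a network constructor parameterized by exactly these decisions; the function names, defaults, and initialization scheme are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def build_network(n_inputs, depth=2, width=16, activation=relu, seed=0):
    """Decisions 1-3: `depth` hidden layers of `width` units each, with the given
    nonlinearity, plus a single linear output unit. Decision 4 (the objective
    function) is chosen separately at training time."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + [width] * depth + [1]
    weights = [rng.normal(scale=0.1, size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
    return weights, activation

def forward(x, weights, activation):
    h = x
    for W in weights[:-1]:
        h = activation(W @ h)      # hidden layers: depth, width, nonlinearity
    return weights[-1] @ h         # linear output; pair with a loss function

weights, act = build_network(n_inputs=5, depth=3, width=8, activation=sigmoid)
print(forward(np.zeros(5), weights, act))
```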
Activation Functions
Neural Network with sigmoid activation functions
[Figure: output y; hidden layer z1, z2, …, zD; inputs x1, x2, x3, …, xM.]
41

Activation Functions
Neural Network with arbitrary nonlinear activation functions
[Figure: the same network, with the sigmoids replaced by an arbitrary nonlinearity.]
42
Activation Functions
Sigmoid / Logistic Function
So far, we’ve assumed that the activation function (nonlinearity) is always the
sigmoid function…
43
Activation Functions
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs
Alternate 1: tanh
• Like the logistic function but shifted to the range [-1, +1]
Slide from William Cohen

AI Stats 2010
[Figure: sigmoid vs. tanh activations compared (depth-4 networks).]
Figure from Glorot & Bengio (2010)


Activation Functions
• A new change: modifying the nonlinearity
  – ReLU is often used in vision tasks
Alternate 2: rectified linear unit (ReLU)
• Linear with a cutoff at zero
• (Implementation: clip the gradient when you pass zero)
Slide from William Cohen

Activation Functions
• A new change: modifying the nonlinearity
  – ReLU is often used in vision tasks
Alternate 2: rectified linear unit (ReLU)
• Soft version: log(exp(x) + 1)
• Doesn't saturate (at one end)
• Sparsifies outputs
• Helps with vanishing gradient
Slide from William Cohen
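A minimal sketch of the activation functions discussed so far; the softplus is the "soft version" of the ReLU mentioned above, and the test values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # logistic: output in (0, 1), saturates at both ends

def tanh(x):
    return np.tanh(x)                  # like the logistic but shifted to (-1, +1)

def relu(x):
    return np.maximum(0.0, x)          # linear with a cutoff at zero

def softplus(x):
    return np.log1p(np.exp(x))         # soft version of ReLU: log(exp(x) + 1)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, softplus):
    print(f.__name__, np.round(f(xs), 3))
```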


Objective Functions for NNs
• Regression:
  – Use the same objective as Linear Regression
  – Quadratic loss (i.e. mean squared error)
• Classification:
  – Use the same objective as Logistic Regression
  – Cross-entropy (i.e. negative log likelihood)
  – This requires probabilities, so we add an additional “softmax” layer at the
    end of our network (a sketch of both losses follows below)
48
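A minimal sketch of the two objectives, assuming a real-valued output for regression and a probability output for binary classification; the epsilon clipping is just a numerical-stability convenience, not from the slides.

```python
import numpy as np

def quadratic_loss(y_hat, y):
    """Regression objective: mean squared error."""
    return np.mean((y_hat - y) ** 2)

def cross_entropy_loss(p_hat, y, eps=1e-12):
    """Binary classification objective: negative log likelihood of the labels under
    a Bernoulli whose parameter is the network's output probability."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1.0, 0.0, 1.0])
p_hat = np.array([0.9, 0.2, 0.6])
print(quadratic_loss(p_hat, y), cross_entropy_loss(p_hat, y))
```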
Multi-Class Output
[Figure: outputs y1, …, yK; hidden layer a1, a2, …, aD; inputs x1, x2, x3, …, xM.]
49

Multi-Class Output
Softmax: the K output scores are exponentiated and normalized to sum to one, so
they can be read as class probabilities.
[Figure: the same network with a softmax over the outputs y1, …, yK.]
50
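A minimal sketch of the softmax layer; subtracting the maximum score is a standard numerical-stability trick, not something shown on the slide. In practice this layer is paired with the cross-entropy objective from the previous slide.

```python
import numpy as np

def softmax(scores):
    """Map K real-valued output scores to K probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

print(softmax(np.array([2.0, 1.0, 0.1])))   # approximately [0.659, 0.242, 0.099]
```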
Cross-entropy vs. Quadratic loss

Figure from Glorot & Bengio (2010)


Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)
52
Objective Functions
Matching Quiz: Suppose you are given a neural net with a single output, y, and
one hidden layer.

1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…

…gives…

5) …MLE estimates of weights assuming target follows a Bernoulli with parameter
   given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value

A. 1=5, 2=7, 3=6, 4=8    D. 1=7, 2=5, 3=6, 4=8
B. 1=5, 2=7, 3=8, 4=6    E. 1=8, 2=6, 3=5, 4=7
C. 1=7, 2=5, 3=5, 4=7    F. 1=8, 2=6, 3=8, 4=6
53
BACKPROPAGATION

54
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)
55
Training: Backpropagation
• Question 1: When can we compute the gradients of the parameters of an
  arbitrary neural network?
• Question 2: When can we make the gradient computation efficient?
56
Training: The Chain Rule
Given: y1 is a function of u1, …, uJ, and each u_j is a function of x2.
Chain Rule: dy1/dx2 = Σ_{j=1}^{J} (∂y1/∂u_j) (∂u_j/∂x2)
[Figure: y1 above intermediate nodes u1, u2, …, uJ, above x2.]
57
Training: The Chain Rule
Given: y1 is a function of u1, …, uJ, and each u_j is a function of x2.
Chain Rule: dy1/dx2 = Σ_{j=1}^{J} (∂y1/∂u_j) (∂u_j/∂x2)

Backpropagation is just repeated application of the chain rule from Calculus 101.
[Figure: y1 above intermediate nodes u1, u2, …, uJ, above x2.]
58
Training: The Chain Rule
Given: y1 is a function of u1, …, uJ, and each u_j is a function of x2.
Chain Rule: dy1/dx2 = Σ_{j=1}^{J} (∂y1/∂u_j) (∂u_j/∂x2)
[Figure: y1 above intermediate nodes u1, u2, …, uJ, above x2.]

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each
   intermediate quantity is a node.
2. At each node, store (a) the quantity computed in the forward pass and
   (b) the partial derivative of the goal with respect to that node's
   intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its
   contribution to the partial derivatives of its parents.

This algorithm is also called reverse-mode automatic differentiation
(a small worked sketch follows below). 59
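Here is a minimal worked sketch of the four steps above on a tiny scalar graph. The Node class and the example function (a one-weight logistic unit) are illustrative, not from the slides.

```python
import math

class Node:
    """One intermediate quantity in the computation graph."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value                # (a) quantity computed in the forward pass
        self.parents = parents            # nodes this quantity was computed from
        self.local_grads = local_grads    # d(self)/d(parent) for each parent
        self.partial = 0.0                # (b) d(goal)/d(self), initialized to 0

def backprop(goal, topo_order):
    goal.partial = 1.0                               # d(goal)/d(goal) = 1
    for node in reversed(topo_order):                # visit in reverse topological order
        for parent, local in zip(node.parents, node.local_grads):
            parent.partial += node.partial * local   # add contribution to the parents

# Example: goal y = sigmoid(x * w), a one-weight logistic unit
x = Node(2.0)
w = Node(-0.5)
u = Node(x.value * w.value, parents=(x, w), local_grads=(w.value, x.value))
s = 1.0 / (1.0 + math.exp(-u.value))
y = Node(s, parents=(u,), local_grads=(s * (1.0 - s),))

backprop(y, [x, w, u, y])
print(w.partial)   # dy/dw = x * sigmoid'(x * w) = 2 * s * (1 - s)
```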


Training: Backpropagation
60

Training: Backpropagation
61
Training: Backpropagation
Case 1: Logistic Regression
[Figure: output y connected to inputs x1, x2, x3, …, xM by weights θ1, θ2, θ3, …, θM.]
62
Training: Backpropagation
[Figures (slides 63–64): the one-hidden-layer network (output y; hidden layer
z1, z2, …, zD; inputs x1, x2, x3, …, xM).]
63–64
Training: Backpropagation
Case 2: Neural Network
[Figure: output y; hidden layer z1, z2, …, zD; inputs x1, x2, x3, …, xM.]
65
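A minimal sketch of Case 2, assuming sigmoid hidden and output units and the cross-entropy loss; the weight names alpha and beta and the sizes are illustrative. The gradients are just the chain rule applied layer by layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, alpha, beta):
    """One example: forward pass, then backpropagation of the cross-entropy loss."""
    # Forward pass
    z = sigmoid(alpha @ x)                 # hidden layer, shape (D,)
    y_hat = sigmoid(beta @ z)              # scalar output probability
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Backward pass (chain rule, layer by layer)
    d_score = y_hat - y                    # d loss / d (beta @ z) for sigmoid + cross-entropy
    d_beta = d_score * z                   # d loss / d beta, shape (D,)
    d_z = d_score * beta                   # d loss / d z, shape (D,)
    d_alpha = (d_z * z * (1 - z))[:, None] * x[None, :]   # d loss / d alpha, shape (D, M)
    return loss, d_alpha, d_beta

rng = np.random.default_rng(0)
M, D = 4, 3
alpha, beta = rng.normal(size=(D, M)), rng.normal(size=D)
loss, d_alpha, d_beta = forward_backward(rng.normal(size=M), 1.0, alpha, beta)
print(loss, d_alpha.shape, d_beta.shape)
```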
Training: The Chain Rule
Given: y1 is a function of u1, …, uJ, and each u_j is a function of x2.
Chain Rule: dy1/dx2 = Σ_{j=1}^{J} (∂y1/∂u_j) (∂u_j/∂x2)
[Figure: y1 above intermediate nodes u1, u2, …, uJ, above x2.]

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each
   intermediate quantity is a node.
2. At each node, store (a) the quantity computed in the forward pass and
   (b) the partial derivative of the goal with respect to that node's
   intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its
   contribution to the partial derivatives of its parents.

This algorithm is also called reverse-mode automatic differentiation. 66


Training: The Chain Rule
Given: y1 is a function of u1, …, uJ, and each u_j is a function of x2.
Chain Rule: dy1/dx2 = Σ_{j=1}^{J} (∂y1/∂u_j) (∂u_j/∂x2)
[Figure: y1 above intermediate nodes u1, u2, …, uJ, above x2.]

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each node
   represents a Tensor.
2. At each node, store (a) the quantity computed in the forward pass and
   (b) the partial derivatives of the goal with respect to that node's Tensor.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its
   contribution to the partial derivatives of its parents.

This algorithm is also called reverse-mode automatic differentiation. 67


Training: Backpropagation
Case 2: Neural Network
[Figure: the same one-hidden-layer network (inputs x1, x2, x3, …, xM; hidden
layer z1, z2, …, zD; output y), with the computation decomposed into Modules
1–5 that are differentiated one at a time and chained by backpropagation.]
68
Background: A Recipe for Machine Learning (Gradients)
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it's a special case of a more
general algorithm called reverse-mode automatic differentiation that can
compute the gradient of any differentiable function efficiently!
69
Summary
1. Neural Networks…
   – provide a way of learning features
   – are highly nonlinear prediction functions
   – (can be) a highly parallel network of logistic regression classifiers
   – discover useful hidden representations of the input
2. Backpropagation…
   – provides an efficient way to compute gradients
   – is a special case of reverse-mode automatic differentiation
70
