
Deep Feedforward Networks: Overview
Sargur N. Srihari
[email protected]


Topics in DFF Networks


1. Overview
2. Example: Learning XOR
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation
6. Historical Notes


Sub-topics in Overview of DFF


1. Goal of a Feed-Forward Network
2. Feedforward vs Recurrent Networks
3. Function Approximation as Goal
4. Extending Linear Models (SVM)
5. Example of XOR


Goal of a feedforward network


• Feedforward nets are the quintessential deep learning models
• Deep feedforward networks are also called
  – Feedforward neural networks or
  – Multilayer Perceptrons (MLPs)
• Their goal is to approximate some function f *
  – E.g., a classifier y = f *(x) maps an input x to a category y
  – A feedforward network defines a mapping y = f (x; θ)
    • and learns the values of the parameters θ that result in the best function approximation

Feedforward network for MNIST

MNIST 28x28 images

Source: https://towardsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f

Another view: a network with 2 hidden layers

https://www.easy-tensorflow.com/tf-tutorials/neural-networks/two-layer-neural-network


Flow of Information
• Models are called feedforward because: y = f (x)
  – To evaluate f (x), information flows one way: from x, through the intermediate computations defining f, to the output y
• There are no feedback connections
  – No outputs of the model are fed back into itself


Feedforward Net: US Election


• US Presidential Election: y = f (x)
• Output: y = {y1, y2}
  – the electoral college votes for each candidate
• Input: x = {x1,..,x50}
  – the vote vectors cast for the 2 candidates in the 50 states
• W converts votes to electoral votes
  – E.g., winner-takes-all or proportionate
• h is defined for each state
  – h is the electoral college, as shown in the map
  – Each state has a fixed number of electors
• w maps the 50 states to the 2 outputs
  – Simple addition

Importance of Feedforward Networks


• They are extremely important to ML practice
• Form basis for many commercial applications
1. CNNs are a special kind of feedforward network
  • They are used for recognizing objects from photos
2. They are conceptual stepping stones to RNNs
  • RNNs power many NLP applications


Feedforward vs. Recurrent

• When feedforward neural networks are extended to include feedback connections, they are called Recurrent Neural Networks (RNNs)

(Figures: an RNN, the unrolled RNN, and an RNN with a learning component)


Feedforward Neural Network Structures

• They are called networks because they are composed of many different functions
• The model is associated with a directed acyclic graph describing how the functions are composed
  – E.g., functions f (1), f (2), f (3) connected in a chain to form f (x) = f (3)[ f (2)[ f (1)(x)]]
    • f (1) is called the first layer of the network (its output is a vector)
    • f (2) is called the second layer, etc.
• These chain structures are the most commonly used structures of neural networks
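
As a concrete illustration of this chain structure, here is a minimal NumPy sketch (not the author's code); the layer sizes and the tanh/identity choices are illustrative assumptions.

```python
import numpy as np

# A depth-3 chain f(x) = f3(f2(f1(x))).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first layer:  3 inputs -> 4 units
W2 = rng.normal(size=(3, 4))   # second layer: 4 units  -> 3 units
W3 = rng.normal(size=(1, 3))   # output layer: 3 units  -> 1 output

f1 = lambda x: np.tanh(W1 @ x)
f2 = lambda h: np.tanh(W2 @ h)
f3 = lambda h: W3 @ h

def f(x):
    return f3(f2(f1(x)))       # the chain has depth 3

print(f(np.ones(3)))
```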

Definition of Depth
• Overall length of the chain is the depth of the
model
– Ex: the composite function f (x) = f (3)[ f (2)[ f (1)(x)]] has a depth of 3
• The name deep learning arises from this
terminology
• Final layer of a feedforward network, ex f (3), is
called the output layer


Training the Network


• In network training we drive f (x) to match f* (x)
• Training data provides us with noisy,
approximate examples of f* (x) evaluated at
different training points
• Each example x is accompanied by a label y ≈ f *(x)
• Training examples specify directly what the
output layer must do at each point x
– It must produce a value that is close to y


Definition of Hidden Layer


• Behavior of other layers is not directly specified
by the data
• The learning algorithm must decide how to use those layers to produce a value that is close to y
• Training data does not say what individual
layers should do
• Since the desired output for these layers is not
shown, they are called hidden layers


A net with depth 2: one hidden layer

K outputs y1,..yK for a given input x


Hidden layer consists of M units

$$y_k(\mathbf{x},\mathbf{w}) \;=\; \sigma\!\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$$

f (x)= f (2) [ f (1)(x)]


f (1) is a vector of M dimensions and
f (2) is a vector of K dimensions

fm(1) = zm = h(xTw(1)),  m = 1,..,M

fk(2) = σ(zTw(2)),  k = 1,..,K
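
A minimal NumPy sketch of this depth-2 network follows; the sizes D, M, K and the choice of tanh for the hidden activation h are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, M, K = 5, 3, 2                      # assumed sizes: inputs, hidden units, outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(M, D + 1))       # rows w_j^(1); last column holds the biases w_j0
W2 = rng.normal(size=(K, M + 1))       # rows w_k^(2); last column holds the biases w_k0

def forward(x):
    z = np.tanh(W1 @ np.append(x, 1.0))        # f^(1): hidden layer of M units
    return sigmoid(W2 @ np.append(z, 1.0))     # f^(2): K outputs y_1,..,y_K

print(forward(np.ones(D)))             # K values in (0, 1)
```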


Feedforward net with depth 2


• Recognition of printed characters (OCR)
f (x)= f (2) [ f (1)(x)]
– Hidden layer f (1) compares raw pixel inputs to
component patterns


Width of Model
• Each hidden layer is typically vector-valued
• Dimensionality of hidden layer vector is width of
the model


Units of a model
• Each element of the vector is viewed as a neuron
  – Instead of thinking of the layer as a vector-to-vector function, we regard its elements as units that act in parallel
• Each unit receives inputs from many other units
and computes its own activation value


Depth versus Width


• Going deeper makes the network more expressive
  – It can capture variations of the data better
  – It yields expressiveness more efficiently than width
• The tradeoff for more expressiveness is an increased tendency to overfit
  – You will need more data or additional regularization
  • The network should be as deep as the training data allows
  – But you can only determine a suitable depth by experiment
• Also, computation increases with the number of layers.

Very Deep CNNs


CNNs with depth 11 to 19
Depth increases from left (A) to right (E)
as more layers are added
(the added layers are shown in bold)

Convolutional layer parameters denoted


“conv (receptive field size) –(no. of channels)”

ReLU activation not shown for brevity
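
To make the "conv(receptive field size)–(no. of channels)" notation concrete, here is a PyTorch-style sketch of a VGG-like stack (assuming PyTorch is available); the channel counts are illustrative, not the exact configuration from the figure.

```python
import torch.nn as nn

# "conv3-64" reads as: 3x3 receptive field, 64 output channels.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   nn.ReLU(),   # conv3-64
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  nn.ReLU(),   # conv3-64
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),   # conv3-128
    nn.MaxPool2d(2),
)
```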


Why are they neural networks?


• These networks are loosely inspired by
neuroscience
• Each unit resembles a neuron
– Receives input from many other units
– Computes its own activation value
• Choice of functions f (i)(x):
– Loosely guided by neuroscientific observations
about biological neurons
• Modern neural networks are guided by many
mathematical and engineering disciplines
• They do not perfectly model the brain

Function Approximation is the Goal


• Think of feedforward networks as function
approximation machines
– Designed to achieve statistical generalization
• Occasionally draw insights from what we know
about the brain
– Rather than as models of brain function


Understanding Feedforward Nets


• Begin with linear networks and understand their
limitations
• Linear models such as logistic regression and
linear regression can be fit reliably and
efficiently using either
– Closed-form solution
– Convex optimization
• Limitation: capacity is restricted to linear functions, so the model cannot capture interactions between input variables


Extending Linear Models


• To represent non-linear functions of x
  – apply a linear model to a transformed input ϕ(x)
    • where ϕ is non-linear
  – Equivalently, the kernel trick of the SVM obtains nonlinearity
SVM Kernel trick

• Many ML algorithms can be rewritten as dot products between examples:
  f (x) = wTx + b can be written as b + Σi αi xTx(i)
  where x(i) is a training example and α is a vector of coefficients
  – This allows us to replace x with a feature function ϕ(x), and the dot product with a function k(x, x(i)) = ϕ(x)·ϕ(x(i)) called a kernel
    • The · operator represents an inner product analogous to ϕ(x)Tϕ(x(i))
    • For some feature spaces we may not literally use an inner product
      – In continuous spaces, an inner product based on integration
  – Gaussian kernel
    • Consider k(u,v) = exp(−||u−v||²/2σ²)
      – Expanding the square, ||u−v||² = uTu + vTv − 2uTv,
      – gives k(u,v) = exp(−uTu/2σ²) exp(uTv/σ²) exp(−vTv/2σ²)
    • Validity follows from kernel construction rules
SVM Prediction

• Use linear regression on the Lagrangian to determine the weights αi
• We can make predictions using
– f (x)= b + Σiαi k(x,x(i))
– Function is nonlinear wrt x but relationship between
ϕ(x) and f (x) is linear
– Also the relationship between α and f (x) is linear
– We can think of ϕ as providing a set of features
• describing x or providing a new representation for x
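
A small NumPy sketch of this kernelized predictor; the Gaussian kernel matches the previous slide, but the training points, α, and b below are made-up illustration values rather than the result of actually solving the SVM problem.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
alpha   = np.array([-1., 1., 1., -1.])     # assumed coefficients for illustration
b       = 0.0

def f(x):
    # f(x) = b + sum_i alpha_i k(x, x^(i)) -- nonlinear in x, linear in alpha
    return b + sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X_train))

print(f(np.array([0., 1.])))
```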

Disadvantages of Kernel Methods


• Cost of decision function evaluation: linear in m
– Because the ith example contributes term αi k(x, x(i))
to the decision function
– Can mitigate this by learning an α with mostly zeros
• Classification requires evaluating the kernel function only
for training examples that have a nonzero αi
• These are known as support vectors
• Cost of training: high with large data sets
• Generic kernels struggle to generalize well
– Neural net outperformed RBF-SVM on MNIST
• Also, how do we choose the mapping ϕ?

Options for choosing mapping ϕ


1. Generic feature function ϕ (x)
– Radial basis function
2. Manually engineer ϕ
– Feature engineering
3. Principle of Deep Learning: Learn ϕ


Option 1 to choose the mapping ϕ


• Generic feature function ϕ(x)
  – The infinite-dimensional ϕ that is implicitly used by kernel machines based on the RBF
    • RBF: N(x; x(i), σ²I) centered at x(i)
      – x(i): obtained from k-means clustering
      – σ: the mean distance between each unit j and its closest neighbor
  – If ϕ(x) is of high enough dimension, we have enough capacity to fit the training set
    • Generalization to the test set remains poor
    • Generic feature mappings are based only on the principle of smoothness
      – They do not include enough prior information to solve advanced problems
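
A short sketch of option 1: a fixed RBF feature map followed by a linear model. The centers would normally come from k-means and σ from the mean-distance rule; here they are hand-picked stand-ins, and the weights are arbitrary illustration values.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    # phi_i(x) = exp(-||x - x^(i)||^2 / (2 sigma^2)), one feature per center
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

centers = np.array([[0., 0.], [1., 1.], [0.5, 0.]])   # stand-ins for k-means centers
sigma   = 0.7                                         # stand-in for the mean-distance rule
w       = np.array([0.2, -0.4, 1.0])                  # linear model applied to phi(x)

x = np.array([0.9, 0.1])
print(w @ rbf_features(x, centers, sigma))
```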

Option 2 to choose the mapping ϕ


• Manually engineer ϕ
• This was the dominant approach until the arrival of deep learning
• Requires decades of effort
– e.g., speech recognition, computer vision
• Little transfer between domains


Option 3 to choose the mapping ϕ


• Strategy of Deep learning: Learn ϕ
• Model is y=f (x; θ,w) = ϕ(x; θ)T w
– θ used to learn ϕ from broad class of functions
– Parameters w map from ϕ (x) to output
– Defines a feedforward network in which ϕ defines a hidden layer
• Unlike the other two approaches (fixed basis functions, manual engineering), this one gives up on the convexity of the training problem
  – But its benefits outweigh the harms

Extend Linear Methods to Learn ϕ


K outputs y1,..,yK for a given input x; the hidden layer consists of M units ϕ1,..,ϕM (with ϕ0 as the bias unit), input-to-hidden weights θji, and hidden-to-output weights wkj

$$y_k(\mathbf{x};\boldsymbol{\theta},\mathbf{w}) \;=\; \sum_{j=1}^{M} w_{kj}\,\phi_j\!\left(\sum_{i=1}^{D}\theta_{ji}\,x_i + \theta_{j0}\right) + w_{k0}$$

yk = fk(x; θ,w) = ϕ(x; θ)Tw

This can be viewed as a generalization of linear models:
• a nonlinear function fk with M+1 parameters wk = (wk0,..,wkM)
• M basis functions ϕj, j = 1,..,M, each with D parameters θj = (θj1,..,θjD)
• Both wk and θj are learnt from data

Approaches to Learning ϕ
• Parameterize the basis functions as ϕ(x;θ)
– Use optimization to find θ that corresponds to a
good representation
• Approach can capture benefit of first approach
(fixed basis functions) by being highly generic
– By using a broad family for ϕ(x;θ)
• Can also capture benefits of second approach
– Human practitioners design families of ϕ(x;θ) that
will perform well
– Need only find the right function family rather than precisely the right function

Importance of Learning ϕ
• Learning ϕ is discussed beyond this first introduction to feedforward networks
  – It is a recurring theme throughout deep learning, applicable to all kinds of models
• Feedforward networks are the application of this principle to learning deterministic mappings from x to y without feedback
• The same principle is applicable to
  – learning stochastic mappings
  – functions with feedback
  – learning probability distributions over a single vector

Plan of Discussion: Feedforward Networks


1. A simple example: learning XOR
2. Design decisions for a feedforward network
– Many are the same as for designing a linear model
  • Basics of gradient descent
    – Choosing the optimizer, the cost function, and the form of the output units
– Some are unique
  • The concept of a hidden layer
    – Makes it necessary to have activation functions
  • The architecture of the network
    – How many layers, how they are connected to each other, and how many units in each layer
  • Learning requires gradients of complicated functions
    – Backpropagation and modern generalizations

1. Ex: XOR problem


• XOR: an operation on binary variables x1 and x2
– When exactly one of the values equals 1 it returns 1; otherwise it returns 0
– Target function is y=f *(x) that we want to learn
• Our model is y =f ([x1, x2] ; θ) which we learn, i.e., adapt
parameters θ to make it similar to f *
• Not concerned with statistical generalization
– Perform correctly on four training points:
• X={[0,0]T, [0,1]T,[1,0]T, [1,1]T}
– Challenge is to fit the training set
• We want f ([0,0]T; θ) = f ([1,1]T; θ) = 0
• f ([0,1]T; θ) = f ([1,0]T; θ) = 1

ML for XOR: linear model doesn’t fit


• Treat it as regression with an MSE loss function
$$J(\theta) \;=\; \frac{1}{4}\sum_{x\in X}\bigl(f^{*}(x)-f(x;\theta)\bigr)^{2} \;=\; \frac{1}{4}\sum_{n=1}^{4}\bigl(f^{*}(x_n)-f(x_n;\theta)\bigr)^{2}$$
  – MSE is usually not used for binary data, but here the math is simple
  – An alternative is the cross-entropy
$$J(\theta) = -\ln p(t\mid\theta) = -\sum_{n=1}^{N}\bigl\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\bigr\},\qquad y_n = \sigma(\theta^{T}x_n)$$
• We must choose the form of the model
  – Consider a linear model with θ = {w, b} where f (x; w,b) = xTw + b
  – Minimize
$$J(\theta) \;=\; \frac{1}{4}\sum_{n=1}^{4}\bigl(t_n - x_n^{T}w - b\bigr)^{2}$$
    to get a closed-form solution
  • Differentiating wrt w and b gives w = 0 and b = ½
  – Then the linear model f (x; w,b) = ½ simply outputs 0.5 everywhere
  – Why does this happen?
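
The closed-form result can be checked numerically with a quick least-squares fit (a sketch, using NumPy's generic solver rather than deriving the normal equations by hand):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([0., 1., 1., 0.])

A = np.hstack([X, np.ones((4, 1))])            # last column multiplies the bias b
theta, *_ = np.linalg.lstsq(A, t, rcond=None)  # minimizes sum (t_n - x_n^T w - b)^2
w, b = theta[:2], theta[2]

print(w, b)        # w is (approximately) [0, 0] and b is 0.5
print(A @ theta)   # the model outputs 0.5 at every one of the four points
```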

Linear model cannot solve XOR


• The bold numbers in the figure are the values the system must output
• When x1 = 0, the output has to increase with x2
• When x1 = 1, the output has to decrease with x2

• A linear model f (x; w,b) = x1w1 + x2w2 + b has to assign a single weight to x2, so it cannot solve this problem
• A better solution:
– use a model to learn a different representation
• in which a linear model is able to represent the solution
– We use a simple feedforward network
• one hidden layer containing two hidden units


Feedforward Network for XOR

• Introduce a simple feedforward network
  – with one hidden layer containing two units
• Same network drawn in two different
styles
– Matrix W describes mapping from x to h
– Vector w describes mapping from h to y
– Intercept parameters b are omitted

Functions computed by Network


• Layer 1 (hidden layer): vector of hidden
units h computed by function f (1)(x; W,c)
– c are bias variables
• Layer 2 (output layer) computes
f (2)(h; w,b)
– w are linear regression weights
– Output is linear regression applied to h
rather than to x
• The complete model is
  f (x; W,c,w,b) = f (2)( f (1)(x))

Linear vs Nonlinear functions


• If we choose both f (1) and f (2) to be linear, the total function will still be linear
  – Suppose f (1)(x) = WTx and f (2)(h) = hTw
  – Then we could represent this function as f (x) = xTw′ where w′ = Ww
• Since linear is insufficient, we must use a nonlinear function to describe the features
  – We use the strategy of neural networks:
  – a nonlinear activation function h = g(WTx + c)
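
The collapse of two linear layers into one can be verified directly (a sketch with arbitrary random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))     # f1(x) = W^T x maps 2 inputs to 3 hidden values
w = rng.normal(size=3)          # f2(h) = h^T w maps the 3 hidden values to 1 output
x = rng.normal(size=2)

composite = (W.T @ x) @ w       # f2(f1(x))
single    = x @ (W @ w)         # the same function with w' = W w

print(np.allclose(composite, single))   # True: the composition is still linear in x
```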

Activation Function
• In linear regression we used a vector of weights w and a scalar bias b,
  f (x; w,b) = xTw + b,
  – to describe an affine transformation from an input vector to an output scalar
• Now we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed
• The activation function g is typically chosen to be applied element-wise: hi = g(xTW:,i + ci)
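
In NumPy terms (a sketch, using ReLU from the next slide as the example g, and the small W, c that appear later in the XOR solution):

```python
import numpy as np

g = lambda z: np.maximum(0.0, z)   # applied element-wise

W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
x = np.array([1., 0.])

h = g(x @ W + c)                   # h_i = g(x^T W[:, i] + c_i) for every i at once
print(h)                           # [1. 0.]
```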


Default Activation Function


• Activation: g(z) = max{0, z}
  – Applying this to the output of a linear transformation yields a nonlinear transformation
  – However, the function remains close to linear
    • Piecewise linear, with two pieces
    • It therefore preserves the properties that make linear models easy to optimize with gradient-based methods
    • It also preserves many of the properties that make linear models generalize well

A principle of CS: build complicated systems from minimal components. A Turing machine's memory needs only 0 and 1 states. We can build a universal function approximator from ReLUs.

Specifying the Network using ReLU


• Activation: g(z)=max{0,z}
• We can now specify the complete network as
f (x; W,c,w,b)=f (2)(f (1)(x))=wT max {0,WTx+c}+b

We can now specify the XOR Solution

f (x; W,c,w,b) = wT max{0, WTx + c} + b

• Let
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},\qquad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix},\qquad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix},\qquad b = 0$$
• Now walk through how the model processes a batch of inputs
• The design matrix X of all four points, and the first step XW:
$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix},\qquad XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$
• Adding c (broadcast to each row):
$$XW + c^{T} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$
  – In this space all points lie along a line with slope 1, which cannot be implemented by a linear model
• Compute h using ReLU:
$$\max\{0,\; XW + c^{T}\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$
  – This has changed the relationship among the examples: they no longer lie on a single line, and a linear model suffices
• Finish by multiplying by w:
$$f(X) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$
• The network has obtained the correct answer for all 4 examples
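
The walkthrough above can be reproduced in a few lines of NumPy (values exactly as specified on this slide):

```python
import numpy as np

W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.0

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
H = np.maximum(0.0, X @ W + c)   # hidden representation: one row of h per example
print(H @ w + b)                 # [0. 1. 1. 0.] -- the XOR of each input row
```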


Learned representation for XOR


• The two points that must have output 1 have been collapsed into a single point
  – The points x = [0,1]T and x = [1,0]T have both been mapped to h = [1,0]T
• In the original x space: when x1 = 0 the output has to increase with x2, and when x1 = 1 the output has to decrease with x2
• In the learned h space: when h1 = 0 the output is 0, when h1 = 1 the output is 1, and when h1 = 2 the output is 0
  – The solution can now be described by a linear model in h: for fixed h2, the output increases linearly with h1

About the XOR example


• We simply specified the solution
– Then showed that it achieves zero error
• In real situations there might be billions of
parameters and billions of training examples
– So one cannot simply guess the solution
• Instead gradient descent optimization can find
parameters that produce very little error
– The solution described is at the global minimum
• Gradient descent could converge to this solution
• Convergence depends on initial values
• Would not always find easily understood integer solutions
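
For completeness, here is a sketch of finding such parameters by gradient descent on the MSE loss with hand-coded gradients. As the slide notes, whether it reaches near-zero error (or the tidy integer solution above) depends on the initialization and learning rate; these hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([0., 1., 1., 0.])

W = rng.normal(size=(2, 2)); c = np.zeros(2)      # hidden layer parameters
w = rng.normal(size=2);      b = 0.0              # output layer parameters
lr = 0.05

for step in range(5000):
    Z = X @ W + c                    # pre-activations
    H = np.maximum(0.0, Z)           # ReLU hidden units
    y = H @ w + b                    # network outputs
    err = (y - t) / len(X)           # gradient of the (halved) MSE wrt y

    grad_w = H.T @ err
    grad_b = err.sum()
    dZ = np.outer(err, w) * (Z > 0)  # backpropagate through the ReLU
    grad_W = X.T @ dZ
    grad_c = dZ.sum(axis=0)

    w -= lr * grad_w; b -= lr * grad_b
    W -= lr * grad_W; c -= lr * grad_c

print(np.round(np.maximum(0.0, X @ W + c) @ w + b, 3))   # ideally close to [0 1 1 0]
```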
