Lecture 03 - Feedforward Networks - 4p

The document discusses supervised machine learning with a focus on deep learning and feedforward neural networks (FFNN). It covers various topics including the structure of neural networks, the importance of non-linear models, and the use of maximum likelihood for training. Additionally, it highlights the challenges of gradient-based learning and the implications of different output units and cost functions in deep learning models.


"Your brain does not manufacture thoughts. Your thoughts shape neural networks."
- D. Chopra

Supervised Machine Learning
CSE555 Deep Learning, Spring 2025
© 2016-2025 Yakup Genç & Y. Sinan Akgul

• Supervised learning selects, from a hypothesis/program space, the hypothesis that scores best under a goodness/performance measure g on the observed data:

  f̂ = arg max_f Σ_{i=1..N} g(y_i, f(x_i))

  where N is the number of experiences, y_i the observed outcome, and x_i the observed input.

Feedforward Deep Networks

Note: in an FFNN there is no backward information flow; networks whose next output depends on a previous output/state (e.g., for time series) are a different, special class of NNs.

Feedforward Deep Networks
(these slides are a summarized version of the book by Goodfellow et al.)

Deep Feedforward Networks
• Feedforward networks, also often called neural networks or multilayer perceptrons (MLPs)
• Used to approximate a function y = f*(x)
• A feedforward network defines a mapping y = f(x; θ) and learns the parameters θ yielding the best approximation of f*
• Feedforward – to calculate the function, information flows through the network from input to output
• Layers are composed: y = f^(3)(f^(2)(f^(1)(x))) (see the sketch below)
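The layer-composition view above can be sketched directly. A minimal NumPy example, assuming arbitrary layer sizes (3 → 4 → 4 → 1), random weights, and tanh/identity activations that are not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" f1, f2, f3; each is an affine map followed by a nonlinearity.
# Sizes and activation choices are illustrative assumptions.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

def f1(x): return np.tanh(W1 @ x + b1)
def f2(h): return np.tanh(W2 @ h + b2)
def f3(h): return W3 @ h + b3          # linear output unit

x = rng.normal(size=3)
y = f3(f2(f1(x)))                      # y = f^(3)(f^(2)(f^(1)(x))): information flows forward only
print(y)
```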

Why Networks?
• Directed acyclic graph – defining the function decomposition y = f^(3)(f^(2)(f^(1)(x)))
• [Figure: chain-structured graph from input x through f^(1), f^(2), f^(3) to output y; the first layer is the one closest to the input, and the number of composed layers gives the depth.]

Linear Models
• Linear models: logistic regression or linear regression
• Efficient and reliable fitting
  • Closed-form solutions or convex optimization
• Capacity is limited to linear functions – cannot capture the interaction between any two input variables:
  y = a_1 x_1 + a_2 x_2 + ⋯ + a_n x_n + a_0
  vs.
  y = ⋯ + x_i x_j + ⋯  (a cross term a linear model cannot represent)

Non-Linear Models
• Apply a mapping φ(x) followed by a linear model
• φ is a new representation of x providing a set of new features
• How to get φ(x)?
  • Use a generic function, e.g., RBFs (see the sketch below)
    • If very high dimensional, can explain everything (recall the bias and variance discussions)
  • Manually engineer the features
    • Done a lot, but little possibility to transfer between domains
  • Deep learning: learn φ itself…

RBF NN
[Figure: a radial basis function network.]
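A minimal sketch of the "generic mapping φ followed by a linear model" idea, using Gaussian RBF features and a closed-form least-squares fit. The 1-D input grid, the centers, the bandwidth, and the noisy-sine target are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)                     # 1-D inputs
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)  # assumed target: noisy sine

# Generic feature map phi(x): Gaussian RBFs on a fixed grid of centers.
centers = np.linspace(-3, 3, 15)
width = 0.5
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

# Linear model on top of phi(x), fit in closed form (least squares).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
print("training MSE:", np.mean((y - y_hat) ** 2))
```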

Deep Learning Strategy
• Learn the model y = f(x; θ, w) = φ(x; θ)ᵀ w
  • θ parameterizes the representation φ, drawn from a broad class of functions
  • w maps from φ(x) to the output; parameters are found through optimization
• Gives up the convexity of the training problem – but the benefits outweigh the harms
• We can tilt the balance between the first and second strategies above (generic vs. manually engineered features)…

XOR Problem
• Learn the XOR function y = f*(x) such that
  • (0,0) and (1,1) yield 0, and (1,0) and (0,1) yield 1
• We use the model y = f(x; θ) to approximate the function
• Treat it as a regression problem using the MSE loss function

XOR Problem
[Figure: the four XOR input points; no linear model can fit them exactly.]

Nonlinear Solution to XOR
[Figure: a feedforward network with one hidden layer of two units.]

Nonlinear Solution to XOR
• Two-layer model: y = f^(2)(f^(1)(x))
• Hidden layer: f^(1)(x) = g(Wᵀx + c), with g applied elementwise

Rectified Linear Activation Function
[Figure: the rectified linear activation g(z) = max{0, z}.]

Nonlinear Solution to XOR
• A complete solution (see the sketch below):
  W = [ 1 1 ; 1 1 ],  c = [ 0 ; −1 ],  w = [ 1 ; −2 ],  b = 0

Batch Processing
• The whole design matrix of XOR inputs, X = [ 0 0 ; 0 1 ; 1 0 ; 1 1 ], can be pushed through the network at once: compute XW + c, apply the rectifier max{0, ·}, then multiply by w.
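The weights above can be checked directly. A minimal NumPy sketch applying the one-hidden-layer ReLU network to the full design matrix of XOR inputs (batch processing):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # design matrix: all four XOR inputs
W = np.array([[1, 1], [1, 1]])                  # hidden-layer weights
c = np.array([0, -1])                           # hidden-layer biases
w = np.array([1, -2])                           # output weights
b = 0                                           # output bias

H = np.maximum(0, X @ W + c)   # ReLU hidden layer: g(W^T x + c)
y = H @ w + b                  # linear output layer
print(y)                       # -> [0 1 1 0], the XOR function
```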

Shallow MLP for Regression
[Figure: a shallow MLP with one hidden layer used for regression.]

Gradient-Based Learning
• The non-linearity of a neural network causes most interesting loss functions to become non-convex
• Gradient-based optimizers are used to reduce the value of the loss function
• Convex problems (e.g., linear models) come with a convergence guarantee
• SGD applied to non-convex loss functions has no such convergence guarantee
• Initialization is important – small random values for the weights, 0 for the biases… (see the sketch below)

Note: A problem is convex if its loss function has a single global minimum and no local minima; for example, f(x) = x² is convex because it has only one minimum. If the problem is convex, gradient descent is guaranteed to converge to the optimal solution – linear regression is a convex problem, so gradient descent always finds the best solution. Deep neural networks, however, are non-convex, so there is no guarantee of finding the globally best solution.
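A minimal sketch of the initialization advice above (small random weights, zero biases). The layer sizes and the 0.01 scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_in, n_out, scale=0.01):
    """Small random weights (to break symmetry), biases set to zero."""
    W = scale * rng.standard_normal((n_out, n_in))
    b = np.zeros(n_out)
    return W, b

W1, b1 = init_layer(784, 128)            # sizes are illustrative
W2, b2 = init_layer(128, 10)
print(round(W1.std(), 4), b1.sum())      # weights on a small scale, biases exactly zero
```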

Solutions to Linear Models or SVM
• Can be solved using SGD
  • Useful when a lot of data is available
  • Gradient calculation is not difficult…
• For SVMs recall that we
  • minimize φ(w) = ½ wᵀw such that y_i (wᵀx_i + b) ≥ 1
• For FFNNs the gradient calculation is more complicated

Note: It is computationally expensive to process all the data as a full batch. The dataset is therefore split into random (mini-)batches, and this is where the word "stochastic" comes from (see the sketch below).

Cost Functions
• In most cases the parametric model defines a distribution p(y|x; θ) and uses the principle of maximum likelihood
  • The cross-entropy between the training data and the model's predictions is used as the cost function
• Simplify:
  • Instead of predicting a complete probability distribution over y
  • Predict some statistic of y conditioned on x
  • Simple example – MSE
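A minimal sketch of mini-batch SGD for a linear model with MSE loss, illustrating where "stochastic" comes from. The synthetic data, learning rate, batch size, and epoch count are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))              # random batches each epoch -> "stochastic"
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of the MSE on the mini-batch
        w -= lr * grad
print(np.round(w, 2))   # close to true_w
```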

Learning Conditional Distributions with Maximum Likelihood
• Most modern neural networks are trained using maximum likelihood
• The cross-entropy between the training data and the model distribution is the cost function:
  J(θ) = −E_{x,y ~ p̂_data} log p_model(y | x)
  (p̂_data is the true/empirical data distribution, p_model the predicted probability distribution)
• If p_model(y | x) = N(y; f(x; θ), I) (Gaussian negative log-likelihood)
• Then J(θ) = ½ E_{x,y ~ p̂_data} ‖y − f(x; θ)‖² + const

Learning Conditional Distributions with Maximum Likelihood
• The equivalence between maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error was shown earlier for a linear model
• The equivalence in fact holds regardless of the f(x; θ) used to predict the mean of the Gaussian
• An advantage of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model → specifying a model p(y|x) automatically determines a cost function −log p(y|x)

Note: The model assumes that the output y follows a Normal (Gaussian) distribution centered at f(x; θ), the model's prediction. I is the identity matrix, indicating that the noise is assumed to be independent and identically distributed. The negative log-likelihood of this Gaussian turns into the mean squared error (MSE); the constant term does not depend on θ, so it does not affect optimization.

Final conclusion:
• For classification problems, MLE leads to the cross-entropy loss.
• For regression problems with Gaussian noise, MLE leads to the MSE loss.
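The step from the Gaussian assumption to MSE, written out as a standard derivation consistent with the notes above (m denotes the output dimension):

```latex
\begin{aligned}
p_{\text{model}}(\boldsymbol{y}\mid\boldsymbol{x}) &= \mathcal{N}\!\big(\boldsymbol{y};\, f(\boldsymbol{x};\boldsymbol{\theta}),\, \boldsymbol{I}\big) \\
-\log p_{\text{model}}(\boldsymbol{y}\mid\boldsymbol{x}) &= \tfrac{1}{2}\,\lVert \boldsymbol{y}-f(\boldsymbol{x};\boldsymbol{\theta})\rVert^{2} + \tfrac{m}{2}\log(2\pi) \\
J(\boldsymbol{\theta}) &= \mathbb{E}_{\boldsymbol{x},\boldsymbol{y}\sim \hat{p}_{\text{data}}}\big[-\log p_{\text{model}}(\boldsymbol{y}\mid\boldsymbol{x})\big]
 = \tfrac{1}{2}\,\mathbb{E}\,\lVert \boldsymbol{y}-f(\boldsymbol{x};\boldsymbol{\theta})\rVert^{2} + \text{const}
\end{aligned}
```

Since the constant does not depend on θ, maximizing the likelihood is the same as minimizing the MSE.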

Learning Conditional Distributions with Maximum Likelihood
• The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm
• Functions that saturate (become very flat) undermine this objective because they make the gradient become very small
• In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate
• The negative log-likelihood helps to avoid this problem for many models
  • Log of exponents – the log undoes the exp that causes saturation in many output units…

Negative Infinity
• One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice.
• For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model.
• For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution), then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
• Regularization techniques provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

Learning Conditional Statistics
• Instead of learning a full probability distribution p(y|x; θ) we often want to learn just one conditional statistic of y given x
• For example, we may have a predictor f(x; θ) with which we wish to predict the mean of y; with the MSE cost
  f* = arg min_f E_{x,y ~ p_data} ‖y − f(x)‖²
  the optimal f*(x) is the mean of y for each x
• Another choice (mean absolute error):
  f* = arg min_f E_{x,y ~ p_data} ‖y − f(x)‖₁
  yields the conditional median of y for each x

Learning Conditional Statistics
• If we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x
• Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization
  • Some output units that saturate produce very small gradients when combined with these cost functions
• This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y|x)

Output Units
• The choice of cost function is tightly coupled with the choice of output unit
• We focus on the use of these units as outputs of the model
• The feedforward network provides a set of hidden features defined by h = f(x; θ)
• The role of the output layer is to provide some additional transformation from the features to complete the task that the network must perform

Linear Units for Gaussian Output Distributions
• One simple kind of output unit is based on an affine transformation with no nonlinearity (linear units):
  ŷ = Wᵀh + b
• Linear output layers are often used to produce the mean of a conditional Gaussian distribution:
  p(y|x) = N(y; ŷ, I)
• Maximizing the log-likelihood is then equivalent to minimizing the mean squared error
• Covariance estimates need special care
• Linear output units do not saturate – no difficulty for gradient-based learning and other optimization methods

Note: An affine transformation is a linear mapping method that preserves points, straight lines, and planes.

Sigmoid Units for Bernoulli Output Distributions
• Many tasks require predicting the value of a binary variable y (e.g., classification problems)
• The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x
• If a linear output is used and clipped to the unit interval:
  P(y = 1 | x) = max{0, min{1, wᵀh + b}}
• Difficulty with the gradient whenever wᵀh + b is outside the unit interval – the gradient will be 0

Sigmoid Units for Bernoulli Output Distributions
• Instead we use a sigmoid output unit:
  ŷ = σ(wᵀh + b)

Sigmoid Units for Bernoulli Output Distributions
• Two steps – an affine layer computing z = wᵀh + b, followed by the sigmoid activation function converting z into a probability
• Predicting probabilities in log space…
• The maximum-likelihood loss is −log σ((2y − 1)z)

Sigmoid and Cost Functions
• It is better to use maximum likelihood when sigmoid outputs are trained
• When we use other loss functions, such as mean squared error:
  • The loss function can saturate anytime σ(z) saturates
  • The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive
  • The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer
• Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]
• In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z) (see the sketch below)
  • If the sigmoid underflows to zero, then taking the logarithm of ŷ yields negative infinity
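A minimal sketch of the numerical point above: compute the Bernoulli negative log-likelihood from the logit z rather than from ŷ = σ(z), using the identity −log σ((2y − 1)z) = ζ((1 − 2y)z) with ζ the softplus. The extreme logit values are chosen only to expose the problem:

```python
import numpy as np

def softplus(a):
    # Numerically stable softplus: log(1 + exp(a)) = max(a, 0) + log1p(exp(-|a|))
    return np.maximum(a, 0) + np.log1p(np.exp(-np.abs(a)))

def bce_from_logits(z, y):
    """Negative log-likelihood -log sigma((2y - 1) z), computed from z directly."""
    return softplus((1 - 2 * y) * z)

def bce_naive(z, y):
    """Same loss via y_hat = sigma(z); breaks down for extreme z."""
    y_hat = 1 / (1 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([-800.0, 0.0, 800.0])
y = np.array([1.0, 1.0, 0.0])
print(bce_from_logits(z, y))   # finite values: [800., 0.693, 800.]
print(bce_naive(z, y))         # inf (plus overflow/log warnings) where sigma(z) rounds to 0 or 1
```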

Softplus
• The maximum-likelihood loss can be written with the softplus as ζ((1 − 2y)z); it saturates only when the model already has the right answer and does not shrink the gradient at all when the answer is wrong…
• [Figure: the softplus function ζ(z) = log(1 + exp(z)).]

Softmax Units for Multinoulli Output Distributions
• Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function:
  softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
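A minimal sketch of the softmax and the corresponding cross-entropy loss, computed from logits with the usual max-subtraction stabilization. The example logits are arbitrary:

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(z, target):
    """-log softmax(z)_target, computed from the logits via log-sum-exp."""
    z = z - np.max(z, axis=-1, keepdims=True)
    log_probs = z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))
    return -log_probs[..., target]

z = np.array([2.0, 1.0, -1.0, 1000.0])   # the large logit would overflow a naive exp
print(softmax(z))                         # a valid probability distribution
print(cross_entropy(z, target=3))         # ~0: the model is confident in the right class
```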

Other Output Types
[Figure/table: other output distributions and the corresponding output units.]

Mixture Density
[Figure: a mixture density network, whose outputs parametrize a mixture distribution over y.]

Mixture Density
[Figure.]

Hidden Units
• Rectified linear units
• Some units are not differentiable at all input points
  • Gradient descent should have a problem here
  • In practice, gradient descent still performs well enough for these models to be used for machine learning tasks
  • This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly
  • Use left and right derivatives

Rectified Linear Units and Their Generalizations
• g(z) = max{0, z}
• Easy to optimize as they are very similar to linear units
• Derivatives
  • Remain large and consistent whenever the unit is active
  • The second derivative is 0 almost everywhere
  • The gradient direction is very useful for learning

Derivative
[Figure: the rectified linear unit and its derivative.]

Rectified Linear Units
• The activation is usually applied on top of an affine transformation: h = g(Wᵀx + b)
• Initialization
  • b should be close to zero
  • Weights should be within [−0.1, +0.1]
  • This forces the units to be active at the beginning and allows the derivatives to pass through
• ReLUs still cannot learn on examples for which their activation is zero

Rectified Linear Units and Their Generalizations
• The generalizations use a non-zero slope α_i when z_i < 0:
  h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
• Absolute value rectification: α_i = −1, giving g(z) = |z|
  • Used in object recognition – invariant under a polarity reversal of the input illumination
• Leaky ReLU: fixes α_i to a small value such as 0.01
• Parametric ReLU (PReLU) learns α_i

Rectified Linear Units and Their Generalizations
• Maxout units: instead of applying an elementwise function g(z), they divide z into groups of k values; each maxout unit then outputs the maximum element of one of these groups
  • Learns a piecewise linear function that responds to multiple directions in the input x space
  • Learning the activation function: with large enough k this can approximate any convex function
• Needs more regularization than rectified linear units – works well without regularization if the training set is large and k is low
• A small k for the next layer can be advantageous

Maxout
• A maxout unit computes the function max(W₁ᵀx + b₁, W₂ᵀx + b₂) (here with k = 2 pieces)
• Special forms
  • ReLU is recovered when one of the pieces is fixed to zero (e.g., W₂ = 0, b₂ = 0)
• Enjoys all the benefits of ReLU (see the sketch below)
  • Linear operations
  • No saturation
• Avoids drawbacks of ReLU
  • Dying ReLU
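A minimal NumPy sketch of the generalizations above: the leaky/parametric ReLU family h_i = max(0, z_i) + α_i min(0, z_i), and a maxout unit taking the max over k affine pieces. The sizes, α values, and random weights are illustrative assumptions:

```python
import numpy as np

def generalized_relu(z, alpha):
    """h_i = max(0, z_i) + alpha_i * min(0, z_i).
    alpha = 0 -> ReLU, alpha = -1 -> |z| (absolute value rectification),
    alpha = 0.01 -> leaky ReLU, a learned alpha -> parametric ReLU (PReLU)."""
    return np.maximum(0, z) + alpha * np.minimum(0, z)

def maxout(x, W, b):
    """Maxout unit: elementwise max over k affine pieces W[j] @ x + b[j]."""
    return np.max(np.stack([W[j] @ x + b[j] for j in range(len(W))]), axis=0)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, alpha=0.0))    # ReLU
print(generalized_relu(z, alpha=0.01))   # leaky ReLU
print(generalized_relu(z, alpha=-1.0))   # absolute value rectification

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W0 = rng.normal(size=(3, 4))
W = np.stack([W0, np.zeros((3, 4))])     # k = 2 pieces; the second piece is fixed to zero
b = np.zeros((2, 3))
print(maxout(x, W, b))                   # equals np.maximum(0, W0 @ x): the ReLU special case
```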

Logistic Sigmoid and Hyperbolic Tangent
• When not linear…
  • g(z) = σ(z) or g(z) = tanh(z)
  • Related by tanh(z) = 2σ(2z) − 1
• These saturate across most of their domain
  • Difficulty for gradient-based learning
  • Discouraged as hidden units in feedforward networks
  • Their use as output units is compatible with gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer
• When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid; it resembles the identity function more closely
• Sigmoidal units are more commonly used in other networks such as recurrent networks, autoencoders, and probabilistic models

Other Hidden Units
• h = cos(Wx + b)
• No activation at all – purely linear in some layers…

Other Hidden Units
• Radial basis function (RBF) units
• Softplus
• Hard tanh

Architecture Design
• Layers
  • In chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer.
  • A network with even one hidden layer is sufficient to fit the training set.
  • Deeper networks are often able to use far fewer units per layer and far fewer parameters and often generalize better to the test set, but they are also often harder to optimize.
  • The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

Universal Approximation Properties and Depth
• A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in a convex optimization problem when applied to linear models.
• Unfortunately, we often want to learn nonlinear functions…
• At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn.
• Fortunately, feedforward networks with hidden layers provide a universal approximation framework.

Universal Approximation Properties and Depth
• The universal approximation theorem (Hornik et al., 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units
• The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well

Universal Approximation Properties and Depth
• The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function
• However, we are not guaranteed that the training algorithm will be able to learn that function
• The "no free lunch" theorem says that there is no universal machine learning algorithm
• The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but it does not say how large this network will be

Depth Choice
• There exist families of functions that can be efficiently approximated by a NN with depth d
• If we restrict the depth to less than d, the model may need to be much larger (less efficient) – in many cases shallow models require an exponential number of hidden units

Universal Approximation Properties and Depth
• A deep model may be wanted for statistical reasons
  • Anytime we choose an ML method we are implicitly stating something about the prior beliefs we have about the problem
  • Choosing a deep model → the function we want to model should involve the composition of several simpler functions
  • We believe the learning problem consists of discovering a set of underlying factors of variation – which can in turn be described in terms of other, simpler underlying factors of variation…

Universal Approximation Properties and Depth
[Figure.]

Back-Propagation and Other Differentiation Algorithms
• During training, forward propagation continues onward until it produces a scalar cost J(θ)
• Back-propagation allows the cost information to flow backwards through the network in order to compute the gradient
• Stochastic gradient descent is then used to reduce the cost, using the gradient calculated by backprop

Gradient Descent
• Gradient descent update: θ ← θ − ε ∇_θ J(θ), for learning rate ε…

Derivatives
• Required by the learning process
• Needed to analyze the learned model
• Needed not just for the cost function but for other parts of the network as well…

Example
• f(x, y) = eˣ + xy
• Computational graph: v₁ = eˣ, v₂ = x·y, v₃ = v₁ + v₂ = f(x, y)
• [Graph: x feeds into v₁ and v₂, y feeds into v₂, and v₁ and v₂ feed into v₃ = f(x, y) (see the sketch below).]

Computational Graphs
[Figure: examples of computational graphs.]
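A minimal sketch of forward evaluation and reverse-mode (back-propagated) derivatives for f(x, y) = eˣ + xy on the graph above, applying the chain rule node by node:

```python
import math

def f_forward_backward(x, y):
    # Forward phase: evaluate each node and store its value.
    v1 = math.exp(x)      # v1 = e^x
    v2 = x * y            # v2 = x * y
    v3 = v1 + v2          # v3 = f(x, y)

    # Backward phase: propagate df/d(node) from the output back to the inputs.
    dv3 = 1.0                          # df/dv3
    dv1 = dv3 * 1.0                    # v3 = v1 + v2  ->  dv3/dv1 = 1
    dv2 = dv3 * 1.0                    #               ->  dv3/dv2 = 1
    dx = dv1 * math.exp(x) + dv2 * y   # dv1/dx = e^x, dv2/dx = y
    dy = dv2 * x                       # dv2/dy = x
    return v3, dx, dy

print(f_forward_backward(1.0, 2.0))    # f = e + 2, df/dx = e + 2, df/dy = 1
```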

Computational Graphs
[Figure: further examples of computational graphs.]

Chain Rule of Calculus
• The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known.
• Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Chain Rule of Calculus
• For y = g(x) and z = f(y) = f(g(x)):  dz/dx = (dz/dy)(dy/dx)
• For vector-valued intermediate quantities:  ∇ₓz = (∂y/∂x)ᵀ ∇_y z

Recursively Applying the Chain Rule
• The gradient of a scalar with respect to any node in the graph can be calculated – it just needs some attention
• Many sub-expressions may be repeated several times within the overall expression for the gradient
  • Store these sub-expressions – costs memory
  • Or recalculate them as needed – costs time
• A naïve implementation of the chain rule is grossly inefficient

Recursively Applying the Chain Rule: Forward Phase
[Algorithm listing: compute and store the value of each node, in order from the inputs to the output.]

Recursively Applying the Chain Rule: Backward Phase
[Algorithm listing: visit the nodes in reverse order, accumulating the gradient of the output with respect to each node.]

Back-Propagation Computation in Fully-Connected MLP
[Algorithm listing: forward propagation through the layers of a fully connected MLP and computation of the cost (see the sketch below).]

Symbolic Computation
• Algebraic expressions (and computational graphs) operate on symbols (or variables) → symbolic representations
• When training a network we assign specific values to these symbols…
• Symbolic-to-numeric differentiation: assign numeric inputs, then calculate and return numerical values describing the gradient at each node
  • Torch and Caffe
• Symbolic-to-symbolic differentiation: the alternative is to add additional nodes to the graph providing a symbolic description of the desired derivatives
  • Theano and TensorFlow

Back-Propagation Computation in Fully-Connected MLP
[Algorithm listing: the backward computation of the gradients on the activations, weights, and biases.]

Symbolic-to-Symbolic
[Figure: a computational graph extended with additional nodes that describe the derivatives symbolically.]
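A minimal sketch in the spirit of the fully connected MLP back-propagation computation above: a one-hidden-layer ReLU network with MSE cost, gradients obtained layer by layer with the chain rule and spot-checked against a finite difference. The sizes, random data, and the single-example setting are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))            # single input example
t = rng.normal(size=(2,))            # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward phase: store the intermediate values needed by the backward phase.
a1 = W1 @ x + b1
h1 = np.maximum(0, a1)               # ReLU hidden layer
y = W2 @ h1 + b2                     # linear output layer
J = 0.5 * np.sum((y - t) ** 2)       # MSE cost

# Backward phase: chain rule, from the cost back to each parameter.
g = y - t                            # dJ/dy
dW2 = np.outer(g, h1); db2 = g       # gradients of the output layer
g = W2.T @ g                         # dJ/dh1
g = g * (a1 > 0)                     # dJ/da1 (ReLU derivative: 0 or 1)
dW1 = np.outer(g, x); db1 = g        # gradients of the hidden layer

# Spot-check dW1[0, 0] with a finite difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Jp = 0.5 * np.sum((W2 @ np.maximum(0, W1p @ x + b1) + b2 - t) ** 2)
print(dW1[0, 0], (Jp - J) / eps)     # the two numbers should agree closely
```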

Reverse-mode Automatic Differentiation
• The general form of back-propagation
• Operates on a computational graph

Thanks for listening!
