Lecture 03 - Feedforward Networks - 4p

The document discusses supervised machine learning with a focus on deep learning and feedforward neural networks (FFNN). It covers various topics including the structure of neural networks, the importance of non-linear models, and the use of maximum likelihood for training. Additionally, it highlights the challenges of gradient-based learning and the implications of different output units and cost functions in deep learning models.


"Your brain does not manufacture thoughts. Your thoughts shape neural networks."
- D. Chopra

Supervised Machine Learning
CSE555 Deep Learning, Spring 2025
© 2016-2025 Yakup Genç & Y. Sinan Akgul

• Supervised learning selects, from a hypothesis/program space, the hypothesis that scores best under a goodness/performance measure g on the observed data:

  f̂ = arg max_f Σ_{i=1..N} g(y_i, f(x_i))

  where N is the number of experiences, y_i the observed outcome, and x_i the observed input.

Feedforward Deep Networks

Note: in an FFNN there is no backward information flow; networks whose next output depends on a previous output/state (e.g., for time series) are a different, special class of NNs.

Feedforward Deep Networks
(these slides are a summarized version of the book by Goodfellow et al.)

Deep Feedforward Networks
• Feedforward networks, also often called neural networks or multilayer perceptrons (MLPs)
• Used to approximate a function y = f*(x)
• A feedforward network defines a mapping y = f(x; θ) and learns the parameters θ yielding the best approximation of f*
• Feedforward – to calculate the function, information flows through the network from input to output
• Layers are composed: y = f^(3)(f^(2)(f^(1)(x))) (see the sketch below)
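The layer-composition view above can be sketched directly. A minimal NumPy example, assuming arbitrary layer sizes (3 → 4 → 4 → 1), random weights, and tanh/identity activations that are not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" f1, f2, f3; each is an affine map followed by a nonlinearity.
# Sizes and activation choices are illustrative assumptions.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

def f1(x): return np.tanh(W1 @ x + b1)
def f2(h): return np.tanh(W2 @ h + b2)
def f3(h): return W3 @ h + b3          # linear output unit

x = rng.normal(size=3)
y = f3(f2(f1(x)))                      # y = f^(3)(f^(2)(f^(1)(x))): information flows forward only
print(y)
```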

Why Networks?
• Directed acyclic graph – defining the function decomposition y = f^(3)(f^(2)(f^(1)(x)))
• [Figure: chain-structured graph from input x through f^(1), f^(2), f^(3) to output y; the first layer is the one closest to the input, and the number of composed layers gives the depth.]

Linear Models
• Linear models: logistic regression or linear regression
• Efficient and reliable fitting
  • Closed-form solutions or convex optimization
• Capacity is limited to linear functions – cannot capture the interaction between any two input variables:
  y = a_1 x_1 + a_2 x_2 + ⋯ + a_n x_n + a_0
  vs.
  y = ⋯ + x_i x_j + ⋯  (a cross term a linear model cannot represent)

Non-Linear Models
• Apply a mapping φ(x) followed by a linear model
• φ is a new representation of x providing a set of new features
• How to get φ(x)?
  • Use a generic function, e.g., RBFs (see the sketch below)
    • If very high dimensional, can explain everything (recall the bias and variance discussions)
  • Manually engineer the features
    • Done a lot, but little possibility to transfer between domains
  • Deep learning: learn φ itself…

RBF NN
[Figure: a radial basis function network.]
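A minimal sketch of the "generic mapping φ followed by a linear model" idea, using Gaussian RBF features and a closed-form least-squares fit. The 1-D input grid, the centers, the bandwidth, and the noisy-sine target are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)                     # 1-D inputs
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)  # assumed target: noisy sine

# Generic feature map phi(x): Gaussian RBFs on a fixed grid of centers.
centers = np.linspace(-3, 3, 15)
width = 0.5
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

# Linear model on top of phi(x), fit in closed form (least squares).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
print("training MSE:", np.mean((y - y_hat) ** 2))
```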

Deep Learning Strategy
• Learn the model y = f(x; θ, w) = φ(x; θ)ᵀ w
  • θ parameterizes the representation φ, drawn from a broad class of functions
  • w maps from φ(x) to the output; parameters are found through optimization
• Gives up the convexity of the training problem – but the benefits outweigh the harms
• We can tilt the balance between the first and second strategies above (generic vs. manually engineered features)…

XOR Problem
• Learn the XOR function y = f*(x) such that
  • (0,0) and (1,1) yield 0, and (1,0) and (0,1) yield 1
• We use the model y = f(x; θ) to approximate the function
• Treat it as a regression problem using the MSE loss function

XOR Problem
[Figure: the four XOR input points; no linear model can fit them exactly.]

Nonlinear Solution to XOR
[Figure: a feedforward network with one hidden layer of two units.]

Nonlinear Solution to XOR
• Two-layer model: y = f^(2)(f^(1)(x))
• Hidden layer: f^(1)(x) = g(Wᵀx + c), with g applied elementwise

Rectified Linear Activation Function
[Figure: the rectified linear activation g(z) = max{0, z}.]

Nonlinear Solution to XOR
• A complete solution (see the sketch below):
  W = [ 1 1 ; 1 1 ],  c = [ 0 ; −1 ],  w = [ 1 ; −2 ],  b = 0

Batch Processing
• The whole design matrix of XOR inputs, X = [ 0 0 ; 0 1 ; 1 0 ; 1 1 ], can be pushed through the network at once: compute XW + c, apply the rectifier max{0, ·}, then multiply by w.
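The weights above can be checked directly. A minimal NumPy sketch applying the one-hidden-layer ReLU network to the full design matrix of XOR inputs (batch processing):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # design matrix: all four XOR inputs
W = np.array([[1, 1], [1, 1]])                  # hidden-layer weights
c = np.array([0, -1])                           # hidden-layer biases
w = np.array([1, -2])                           # output weights
b = 0                                           # output bias

H = np.maximum(0, X @ W + c)   # ReLU hidden layer: g(W^T x + c)
y = H @ w + b                  # linear output layer
print(y)                       # -> [0 1 1 0], the XOR function
```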

Shallow MLP for Regression
[Figure: a shallow MLP with one hidden layer used for regression.]

Gradient-Based Learning
• The non-linearity of a neural network causes most interesting loss functions to become non-convex
• Gradient-based optimizers are used to reduce the value of the loss function
• Convex problems (e.g., linear models) come with a convergence guarantee
• SGD applied to non-convex loss functions has no such convergence guarantee
• Initialization is important – small random values for the weights, 0 for the biases… (see the sketch below)

Note: A problem is convex if its loss function has a single global minimum and no local minima; for example, f(x) = x² is convex because it has only one minimum. If the problem is convex, gradient descent is guaranteed to converge to the optimal solution – linear regression is a convex problem, so gradient descent always finds the best solution. Deep neural networks, however, are non-convex, so there is no guarantee of finding the globally best solution.
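A minimal sketch of the initialization advice above (small random weights, zero biases). The layer sizes and the 0.01 scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_in, n_out, scale=0.01):
    """Small random weights (to break symmetry), biases set to zero."""
    W = scale * rng.standard_normal((n_out, n_in))
    b = np.zeros(n_out)
    return W, b

W1, b1 = init_layer(784, 128)            # sizes are illustrative
W2, b2 = init_layer(128, 10)
print(round(W1.std(), 4), b1.sum())      # weights on a small scale, biases exactly zero
```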

Solutions to Linear Models or SVM
• Can be solved using SGD
  • Useful when a lot of data is available
  • Gradient calculation is not difficult…
• For SVMs recall that we
  • minimize φ(w) = ½ wᵀw such that y_i (wᵀx_i + b) ≥ 1
• For FFNNs the gradient calculation is more complicated

Note: It is computationally expensive to process all the data as a full batch. The dataset is therefore split into random (mini-)batches, and this is where the word "stochastic" comes from (see the sketch below).

Cost Functions
• In most cases the parametric model defines a distribution p(y|x; θ) and uses the principle of maximum likelihood
  • The cross-entropy between the training data and the model's predictions is used as the cost function
• Simplify:
  • Instead of predicting a complete probability distribution over y
  • Predict some statistic of y conditioned on x
  • Simple example – MSE
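A minimal sketch of mini-batch SGD for a linear model with MSE loss, illustrating where "stochastic" comes from. The synthetic data, learning rate, batch size, and epoch count are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))              # random batches each epoch -> "stochastic"
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of the MSE on the mini-batch
        w -= lr * grad
print(np.round(w, 2))   # close to true_w
```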

Learning Conditional Distributions with Maximum Likelihood
• Most modern neural networks are trained using maximum likelihood
• The cross-entropy between the training data and the model distribution is the cost function:
  J(θ) = −E_{x,y ~ p̂_data} log p_model(y | x)
  (p̂_data is the true/empirical data distribution, p_model the predicted probability distribution)
• If p_model(y | x) = N(y; f(x; θ), I) (Gaussian negative log-likelihood)
• Then J(θ) = ½ E_{x,y ~ p̂_data} ‖y − f(x; θ)‖² + const

Learning Conditional Distributions with Maximum Likelihood
• The equivalence between maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error was shown earlier for a linear model
• The equivalence in fact holds regardless of the f(x; θ) used to predict the mean of the Gaussian
• An advantage of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model → specifying a model p(y|x) automatically determines a cost function −log p(y|x)

Note: The model assumes that the output y follows a Normal (Gaussian) distribution centered at f(x; θ), the model's prediction. I is the identity matrix, indicating that the noise is assumed to be independent and identically distributed. The negative log-likelihood of this Gaussian turns into the mean squared error (MSE); the constant term does not depend on θ, so it does not affect optimization.

Final conclusion:
• For classification problems, MLE leads to the cross-entropy loss.
• For regression problems with Gaussian noise, MLE leads to the MSE loss.
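The step from the Gaussian assumption to MSE, written out as a standard derivation consistent with the notes above (m denotes the output dimension):

```latex
\begin{aligned}
p_{\text{model}}(\boldsymbol{y}\mid\boldsymbol{x}) &= \mathcal{N}\!\big(\boldsymbol{y};\, f(\boldsymbol{x};\boldsymbol{\theta}),\, \boldsymbol{I}\big) \\
-\log p_{\text{model}}(\boldsymbol{y}\mid\boldsymbol{x}) &= \tfrac{1}{2}\,\lVert \boldsymbol{y}-f(\boldsymbol{x};\boldsymbol{\theta})\rVert^{2} + \tfrac{m}{2}\log(2\pi) \\
J(\boldsymbol{\theta}) &= \mathbb{E}_{\boldsymbol{x},\boldsymbol{y}\sim \hat{p}_{\text{data}}}\big[-\log p_{\text{model}}(\boldsymbol{y}\mid\boldsymbol{x})\big]
 = \tfrac{1}{2}\,\mathbb{E}\,\lVert \boldsymbol{y}-f(\boldsymbol{x};\boldsymbol{\theta})\rVert^{2} + \text{const}
\end{aligned}
```

Since the constant does not depend on θ, maximizing the likelihood is the same as minimizing the MSE.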

Learning Conditional Distributions with Maximum Likelihood
• The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm
• Functions that saturate (become very flat) undermine this objective because they make the gradient become very small
• In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate
• The negative log-likelihood helps to avoid this problem for many models
  • Log of exponents – the log undoes the exp that causes saturation in many output units…

Negative Infinity
• One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice.
• For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model.
• For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution), then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
• Regularization techniques provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

Learning Conditional Statistics
• Instead of learning a full probability distribution p(y|x; θ) we often want to learn just one conditional statistic of y given x
• For example, we may have a predictor f(x; θ) with which we wish to predict the mean of y; with the MSE cost
  f* = arg min_f E_{x,y ~ p_data} ‖y − f(x)‖²
  the optimal f*(x) is the mean of y for each x
• Another choice (mean absolute error):
  f* = arg min_f E_{x,y ~ p_data} ‖y − f(x)‖₁
  yields the conditional median of y for each x

Learning Conditional Statistics
• If we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x
• Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization
  • Some output units that saturate produce very small gradients when combined with these cost functions
• This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y|x)

Output Units
• The choice of cost function is tightly coupled with the choice of output unit
• We focus on the use of these units as outputs of the model
• The feedforward network provides a set of hidden features defined by h = f(x; θ)
• The role of the output layer is to provide some additional transformation from the features to complete the task that the network must perform

Linear Units for Gaussian Output Distributions
• One simple kind of output unit is based on an affine transformation with no nonlinearity (linear units):
  ŷ = Wᵀh + b
• Linear output layers are often used to produce the mean of a conditional Gaussian distribution:
  p(y|x) = N(y; ŷ, I)
• Maximizing the log-likelihood is then equivalent to minimizing the mean squared error
• Covariance estimates need special care
• Linear output units do not saturate – no difficulty for gradient-based learning and other optimization methods

Note: An affine transformation is a linear mapping method that preserves points, straight lines, and planes.

Sigmoid Units for Bernoulli Output Distributions
• Many tasks require predicting the value of a binary variable y (e.g., classification problems)
• The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x
• If a linear output is used and clipped to the unit interval:
  P(y = 1 | x) = max{0, min{1, wᵀh + b}}
• Difficulty with the gradient whenever wᵀh + b is outside the unit interval – the gradient will be 0

Sigmoid Units for Bernoulli Output Distributions
• Instead we use a sigmoid output unit:
  ŷ = σ(wᵀh + b)

Sigmoid Units for Bernoulli Output Distributions
• Two steps – an affine layer computing z = wᵀh + b, followed by the sigmoid activation function converting z into a probability
• Predicting probabilities in log space…
• The maximum-likelihood loss is −log σ((2y − 1)z)

Sigmoid and Cost Functions
• It is better to use maximum likelihood when sigmoid outputs are trained
• When we use other loss functions, such as mean squared error:
  • The loss function can saturate anytime σ(z) saturates
  • The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive
  • The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer
• Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]
• In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z) (see the sketch below)
  • If the sigmoid underflows to zero, then taking the logarithm of ŷ yields negative infinity
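A minimal sketch of the numerical point above: compute the Bernoulli negative log-likelihood from the logit z rather than from ŷ = σ(z), using the identity −log σ((2y − 1)z) = ζ((1 − 2y)z) with ζ the softplus. The extreme logit values are chosen only to expose the problem:

```python
import numpy as np

def softplus(a):
    # Numerically stable softplus: log(1 + exp(a)) = max(a, 0) + log1p(exp(-|a|))
    return np.maximum(a, 0) + np.log1p(np.exp(-np.abs(a)))

def bce_from_logits(z, y):
    """Negative log-likelihood -log sigma((2y - 1) z), computed from z directly."""
    return softplus((1 - 2 * y) * z)

def bce_naive(z, y):
    """Same loss via y_hat = sigma(z); breaks down for extreme z."""
    y_hat = 1 / (1 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([-800.0, 0.0, 800.0])
y = np.array([1.0, 1.0, 0.0])
print(bce_from_logits(z, y))   # finite values: [800., 0.693, 800.]
print(bce_naive(z, y))         # inf (plus overflow/log warnings) where sigma(z) rounds to 0 or 1
```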

Softplus
• The maximum-likelihood loss can be written with the softplus as ζ((1 − 2y)z); it saturates only when the model already has the right answer and does not shrink the gradient at all when the answer is wrong…
• [Figure: the softplus function ζ(z) = log(1 + exp(z)).]

Softmax Units for Multinoulli Output Distributions
• Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function:
  softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
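A minimal sketch of the softmax and the corresponding cross-entropy loss, computed from logits with the usual max-subtraction stabilization. The example logits are arbitrary:

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(z, target):
    """-log softmax(z)_target, computed from the logits via log-sum-exp."""
    z = z - np.max(z, axis=-1, keepdims=True)
    log_probs = z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))
    return -log_probs[..., target]

z = np.array([2.0, 1.0, -1.0, 1000.0])   # the large logit would overflow a naive exp
print(softmax(z))                         # a valid probability distribution
print(cross_entropy(z, target=3))         # ~0: the model is confident in the right class
```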

Other Output Types
[Figure/table: other output distributions and the corresponding output units.]

Mixture Density
[Figure: a mixture density network, whose outputs parametrize a mixture distribution over y.]

Mixture Density
[Figure.]

Hidden Units
• Rectified linear units
• Some units are not differentiable at all input points
  • Gradient descent should have a problem here
  • In practice, gradient descent still performs well enough for these models to be used for machine learning tasks
  • This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly
  • Use left and right derivatives

Rectified Linear Units and Their Generalizations
• g(z) = max{0, z}
• Easy to optimize as they are very similar to linear units
• Derivatives
  • Remain large and consistent whenever the unit is active
  • The second derivative is 0 almost everywhere
  • The gradient direction is very useful for learning

Derivative
[Figure: the rectified linear unit and its derivative.]

Rectified Linear Units
• The activation is usually applied on top of an affine transformation: h = g(Wᵀx + b)
• Initialization
  • b should be close to zero
  • Weights should be within [−0.1, +0.1]
  • This forces the units to be active at the beginning and allows the derivatives to pass through
• ReLUs still cannot learn on examples for which their activation is zero

Rectified Linear Units and Their Generalizations
• The generalizations use a non-zero slope α_i when z_i < 0:
  h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
• Absolute value rectification: α_i = −1, giving g(z) = |z|
  • Used in object recognition – invariant under a polarity reversal of the input illumination
• Leaky ReLU: fixes α_i to a small value such as 0.01
• Parametric ReLU (PReLU) learns α_i

Rectified Linear Units and Their Generalizations
• Maxout units: instead of applying an elementwise function g(z), they divide z into groups of k values; each maxout unit then outputs the maximum element of one of these groups
  • Learns a piecewise linear function that responds to multiple directions in the input x space
  • Learning the activation function: with large enough k this can approximate any convex function
• Needs more regularization than rectified linear units – works well without regularization if the training set is large and k is low
• A small k for the next layer can be advantageous

Maxout
• A maxout unit computes the function max(W₁ᵀx + b₁, W₂ᵀx + b₂) (here with k = 2 pieces)
• Special forms
  • ReLU is recovered when one of the pieces is fixed to zero (e.g., W₂ = 0, b₂ = 0)
• Enjoys all the benefits of ReLU (see the sketch below)
  • Linear operations
  • No saturation
• Avoids drawbacks of ReLU
  • Dying ReLU
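A minimal NumPy sketch of the generalizations above: the leaky/parametric ReLU family h_i = max(0, z_i) + α_i min(0, z_i), and a maxout unit taking the max over k affine pieces. The sizes, α values, and random weights are illustrative assumptions:

```python
import numpy as np

def generalized_relu(z, alpha):
    """h_i = max(0, z_i) + alpha_i * min(0, z_i).
    alpha = 0 -> ReLU, alpha = -1 -> |z| (absolute value rectification),
    alpha = 0.01 -> leaky ReLU, a learned alpha -> parametric ReLU (PReLU)."""
    return np.maximum(0, z) + alpha * np.minimum(0, z)

def maxout(x, W, b):
    """Maxout unit: elementwise max over k affine pieces W[j] @ x + b[j]."""
    return np.max(np.stack([W[j] @ x + b[j] for j in range(len(W))]), axis=0)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, alpha=0.0))    # ReLU
print(generalized_relu(z, alpha=0.01))   # leaky ReLU
print(generalized_relu(z, alpha=-1.0))   # absolute value rectification

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W0 = rng.normal(size=(3, 4))
W = np.stack([W0, np.zeros((3, 4))])     # k = 2 pieces; the second piece is fixed to zero
b = np.zeros((2, 3))
print(maxout(x, W, b))                   # equals np.maximum(0, W0 @ x): the ReLU special case
```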

Logistic Sigmoid and Hyperbolic Tangent
• When not linear…
  • g(z) = σ(z) or g(z) = tanh(z)
  • Related by tanh(z) = 2σ(2z) − 1
• These saturate across most of their domain
  • Difficulty for gradient-based learning
  • Discouraged as hidden units in feedforward networks
  • Their use as output units is compatible with gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer
• When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid; it resembles the identity function more closely
• Sigmoidal units are more commonly used in other networks such as recurrent networks, autoencoders, and probabilistic models

Other Hidden Units
• h = cos(Wx + b)
• No activation at all – purely linear in some layers…

Other Hidden Units
• Radial basis function (RBF) units
• Softplus
• Hard tanh

Architecture Design
• Layers
  • In chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer.
  • A network with even one hidden layer is sufficient to fit the training set.
  • Deeper networks are often able to use far fewer units per layer and far fewer parameters and often generalize better to the test set, but they are also often harder to optimize.
  • The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

Universal Approximation Properties and Depth
• A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in a convex optimization problem when applied to linear models.
• Unfortunately, we often want to learn nonlinear functions…
• At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn.
• Fortunately, feedforward networks with hidden layers provide a universal approximation framework.

Universal Approximation Properties and Depth
• The universal approximation theorem (Hornik et al., 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units
• The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well

Universal Approximation Properties and Depth
• The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function
• However, we are not guaranteed that the training algorithm will be able to learn that function
• The "no free lunch" theorem says that there is no universal machine learning algorithm
• The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but it does not say how large this network will be

Depth Choice
• There exist families of functions that can be efficiently approximated by a NN with depth d
• If we restrict the depth to less than d, the model may need to be much larger (less efficient) – in many cases shallow models require an exponential number of hidden units

Universal Approximation Properties and Depth
• A deep model may be wanted for statistical reasons
  • Anytime we choose an ML method we are implicitly stating something about the prior beliefs we have about the problem
  • Choosing a deep model → the function we want to model should involve the composition of several simpler functions
  • We believe the learning problem consists of discovering a set of underlying factors of variation – which can in turn be described in terms of other, simpler underlying factors of variation…

Universal Approximation Properties and Depth
[Figure.]

Back-Propagation and Other Differentiation Algorithms
• During training, forward propagation continues onward until it produces a scalar cost J(θ)
• Back-propagation allows the cost information to flow backwards through the network in order to compute the gradient
• Stochastic gradient descent is then used to reduce the cost, using the gradient calculated by backprop

Gradient Descent
• Gradient descent update: θ ← θ − ε ∇_θ J(θ), for learning rate ε…

Derivatives
• Required by the learning process
• Needed to analyze the learned model
• Needed not just for the cost function but for other parts of the network as well…

Example
• f(x, y) = eˣ + xy
• Computational graph: v₁ = eˣ, v₂ = x·y, v₃ = v₁ + v₂ = f(x, y)
• [Graph: x feeds into v₁ and v₂, y feeds into v₂, and v₁ and v₂ feed into v₃ = f(x, y) (see the sketch below).]

Computational Graphs
[Figure: examples of computational graphs.]
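A minimal sketch of forward evaluation and reverse-mode (back-propagated) derivatives for f(x, y) = eˣ + xy on the graph above, applying the chain rule node by node:

```python
import math

def f_forward_backward(x, y):
    # Forward phase: evaluate each node and store its value.
    v1 = math.exp(x)      # v1 = e^x
    v2 = x * y            # v2 = x * y
    v3 = v1 + v2          # v3 = f(x, y)

    # Backward phase: propagate df/d(node) from the output back to the inputs.
    dv3 = 1.0                          # df/dv3
    dv1 = dv3 * 1.0                    # v3 = v1 + v2  ->  dv3/dv1 = 1
    dv2 = dv3 * 1.0                    #               ->  dv3/dv2 = 1
    dx = dv1 * math.exp(x) + dv2 * y   # dv1/dx = e^x, dv2/dx = y
    dy = dv2 * x                       # dv2/dy = x
    return v3, dx, dy

print(f_forward_backward(1.0, 2.0))    # f = e + 2, df/dx = e + 2, df/dy = 1
```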

Computational Graphs
[Figure: further examples of computational graphs.]

Chain Rule of Calculus
• The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known.
• Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Chain Rule of Calculus
• For y = g(x) and z = f(y) = f(g(x)):  dz/dx = (dz/dy)(dy/dx)
• For vector-valued intermediate quantities:  ∇ₓz = (∂y/∂x)ᵀ ∇_y z

Recursively Applying the Chain Rule
• The gradient of a scalar with respect to any node in the graph can be calculated – it just needs some attention
• Many sub-expressions may be repeated several times within the overall expression for the gradient
  • Store these sub-expressions – costs memory
  • Or recalculate them as needed – costs time
• A naïve implementation of the chain rule is grossly inefficient

Recursively Applying the Chain Rule: Forward Phase
[Algorithm listing: compute and store the value of each node, in order from the inputs to the output.]

Recursively Applying the Chain Rule: Backward Phase
[Algorithm listing: visit the nodes in reverse order, accumulating the gradient of the output with respect to each node.]

Back-Propagation Computation in Fully-Connected MLP
[Algorithm listing: forward propagation through the layers of a fully connected MLP and computation of the cost (see the sketch below).]

Symbolic Computation
• Algebraic expressions (and computational graphs) operate on symbols (or variables) → symbolic representations
• When training a network we assign specific values to these symbols…
• Symbolic-to-numeric differentiation: assign numeric inputs, then calculate and return numerical values describing the gradient at each node
  • Torch and Caffe
• Symbolic-to-symbolic differentiation: the alternative is to add additional nodes to the graph providing a symbolic description of the desired derivatives
  • Theano and TensorFlow

Back-Propagation Computation in Fully-Connected MLP
[Algorithm listing: the backward computation of the gradients on the activations, weights, and biases.]

Symbolic-to-Symbolic
[Figure: a computational graph extended with additional nodes that describe the derivatives symbolically.]
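A minimal sketch in the spirit of the fully connected MLP back-propagation computation above: a one-hidden-layer ReLU network with MSE cost, gradients obtained layer by layer with the chain rule and spot-checked against a finite difference. The sizes, random data, and the single-example setting are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))            # single input example
t = rng.normal(size=(2,))            # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward phase: store the intermediate values needed by the backward phase.
a1 = W1 @ x + b1
h1 = np.maximum(0, a1)               # ReLU hidden layer
y = W2 @ h1 + b2                     # linear output layer
J = 0.5 * np.sum((y - t) ** 2)       # MSE cost

# Backward phase: chain rule, from the cost back to each parameter.
g = y - t                            # dJ/dy
dW2 = np.outer(g, h1); db2 = g       # gradients of the output layer
g = W2.T @ g                         # dJ/dh1
g = g * (a1 > 0)                     # dJ/da1 (ReLU derivative: 0 or 1)
dW1 = np.outer(g, x); db1 = g        # gradients of the hidden layer

# Spot-check dW1[0, 0] with a finite difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Jp = 0.5 * np.sum((W2 @ np.maximum(0, W1p @ x + b1) + b2 - t) ** 2)
print(dW1[0, 0], (Jp - J) / eps)     # the two numbers should agree closely
```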

Reverse-mode Automatic Differentiation
• The general form of back-propagation
• Operates on a computational graph

Thanks for listening!
