Lecture 03 - Feedforward Networks
“Your brain does not manufacture thoughts. Your thoughts shape neural networks.”
- D. Chopra
Supervised Machine Learning
CSE555
[Figure: supervised learning setup – observed inputs, observed outcomes, and the number of experiences (training examples)]
Note: in a feedforward network there is no backward flow of information; networks with feedback connections are a separate class (e.g., recurrent networks for time series, where the next output depends on the previous output/state).
Why Networks?
• Directed acyclic graph – defining the function decomposition
$y = f^{(3)}(f^{(2)}(f^{(1)}(x)))$
• $f^{(1)}$ is the first layer, $f^{(2)}$ the second layer, and so on; the length of this chain of functions is the depth of the model.
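A minimal NumPy sketch of this decomposition (the layer sizes, random weights, and ReLU nonlinearity are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, c, x):
    # one layer: a nonlinearity applied to an affine transformation, g(Wx + c)
    return np.maximum(0, W @ x + c)   # g = ReLU here (illustrative choice)

W1, c1 = rng.normal(size=(4, 2)), np.zeros(4)   # first layer  f(1)
W2, c2 = rng.normal(size=(3, 4)), np.zeros(3)   # second layer f(2)
W3, c3 = rng.normal(size=(1, 3)), np.zeros(1)   # output layer f(3)

x = np.array([0.5, -1.0])
y = layer(W3, c3, layer(W2, c2, layer(W1, c1, x)))  # y = f(3)(f(2)(f(1)(x)))
print(y)
```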
Linear Models
• Linear models: logistic regression or linear regression
• Efficient and reliable fitting
• Closed-form solutions or convex optimization
• Capacity is limited to linear functions – they cannot capture the interaction between any two input variables:
$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + a_0$
vs
$y = \dots + a_{ij}\, x_i x_j + \dots$ (an interaction term that a linear model cannot represent)
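A small check of this limitation (the XOR data below is used only as an illustration and is not part of this slide): when the target depends on the product $x_1 x_2$, the best purely linear fit collapses to a constant.

```python
import numpy as np

# Least-squares linear fit to XOR, a target that needs the x1*x2 interaction.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 0.])

Xb = np.hstack([X, np.ones((4, 1))])          # add a bias column
a, *_ = np.linalg.lstsq(Xb, y, rcond=None)    # closed-form linear fit
print(Xb @ a)   # -> [0.5 0.5 0.5 0.5]: the linear model predicts 0.5 everywhere
```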
$y = f^{(2)}(f^{(1)}(x))$
$f^{(1)}(x) = g(W^{\top} x + c)$
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \qquad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}$$
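These are the weights of the classic XOR example. A minimal NumPy check (assuming a ReLU activation for $g$ and a zero output bias, as in the standard construction) shows that the network $y = w^{\top}\max\{0,\, W^{\top}x + c\}$ reproduces XOR on all four inputs:

```python
import numpy as np

# Verifying the XOR solution with the weights above:
# h = ReLU(W^T x + c),  y = w^T h  (output bias assumed to be 0).
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
H = np.maximum(0, X @ W + c)    # hidden layer with ReLU activation
print(H @ w)                    # -> [0. 1. 1. 0.], i.e. XOR of the two inputs
```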
If the problem is convex, standard gradient-based methods (like gradient descent) are guaranteed to converge to the optimal solution.
Example: linear regression is a convex problem, so gradient descent always finds the best solution.
Deep neural networks, however, are non-convex, so there is no guarantee of finding the globally best solution.
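A quick sketch of this convexity point (the synthetic data, step size, and iteration count are illustrative assumptions): gradient descent on the convex linear-regression objective lands on the same parameters as the closed-form least-squares solution.

```python
import numpy as np

# Gradient descent on a convex problem (linear regression with MSE):
# it converges to the same optimum as the closed-form least-squares fit.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 1)), np.ones((50, 1))])
true_a = np.array([2.0, -1.0])
y = X @ true_a + 0.1 * rng.normal(size=50)

a = np.zeros(2)
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ a - y)   # gradient of the MSE
    a -= 0.1 * grad

a_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a, a_closed)   # both reach (numerically) the same optimum
```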
• Most modern neural networks are trained using maximum likelihood
• The cross-entropy between the training data and the model distribution is used as the cost function (the negative log-likelihood):
$J(\boldsymbol{\theta}) = -\,\mathbb{E}_{\mathbf{x},\mathbf{y}\sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y}\mid\mathbf{x})$,
where $\hat{p}_{\text{data}}$ is the true data distribution and $p_{\text{model}}$ is the predicted probability distribution
• If $p_{\text{model}}(\mathbf{y}\mid\mathbf{x}) = \mathcal{N}(\mathbf{y};\, f(\mathbf{x};\boldsymbol{\theta}),\, \mathbf{I})$
• Then the cost reduces to the mean squared error (plus a constant that does not depend on $\boldsymbol{\theta}$)
• The equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model
• The equivalence holds regardless of the $f(\mathbf{x};\boldsymbol{\theta})$ used to predict the mean of the Gaussian
• An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model: specifying a model $p(\mathbf{y}\mid\mathbf{x})$ automatically determines a cost function $-\log p(\mathbf{y}\mid\mathbf{x})$
The model assumes that the output y follows a Normal (Gaussian) distribution centered at f(x; θ), which is the model's prediction.
I represents the identity matrix, indicating that the noise is assumed to be independent and identically distributed across output dimensions.
The negative log-likelihood of a Gaussian turns into the mean squared error (MSE) function.
The constant term does not depend on θ, so it does not affect optimization.
Final conclusion:
For classification problems, MLE leads to the cross-entropy loss.
For regression problems with Gaussian noise, MLE leads to the MSE loss.
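The step from the Gaussian likelihood to MSE, written out (a standard derivation; the output dimensionality $d$ below is notation introduced here, not from the slides):

$$-\log \mathcal{N}\!\left(\mathbf{y};\, f(\mathbf{x};\boldsymbol{\theta}),\, \mathbf{I}\right)
 = \tfrac{1}{2}\,\lVert \mathbf{y} - f(\mathbf{x};\boldsymbol{\theta}) \rVert^{2} + \tfrac{d}{2}\log(2\pi)$$

The second term is the constant that does not depend on $\boldsymbol{\theta}$, so minimizing the negative log-likelihood over $\boldsymbol{\theta}$ is the same as minimizing the squared error.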
affine transformation: a linear mapping method that preserves points, straight lines, and planes
Sigmoid Units for Bernoulli Output Distributions
• Many tasks require predicting the value of a binary variable y (e.g., classification problems)
• The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x
• If a linear output unit is used, there is a difficulty with the gradient whenever the output falls outside the unit interval – the gradient will be 0
• Instead, we use a sigmoid output unit
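A sketch of the sigmoid output unit paired with its maximum-likelihood (cross-entropy) cost; the softplus form of the loss is a standard identity, and the logit values below are illustrative:

```python
import numpy as np

# Sigmoid output unit for a Bernoulli target: P(y = 1 | x) = sigmoid(z),
# where z is the logit (w.h + b). The negative log-likelihood can be written
# with softplus, which only saturates when the prediction is already correct.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # numerically stable log(1 + exp(z))
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0)

def bernoulli_nll(z, y):
    # -log P(y | x) = softplus((1 - 2y) * z)
    return softplus((1 - 2 * y) * z)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))               # predicted P(y = 1 | x)
print(bernoulli_nll(z, y=1))    # large loss (and gradient) for z = -5, tiny for z = +5
```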
• Does not shrink the gradient at all…
• Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function
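A minimal sketch of the softmax function over n logits; subtracting the maximum logit is a standard numerical-stability trick, and the example values are illustrative:

```python
import numpy as np

# Softmax over n possible values of a discrete variable.
def softmax(z):
    z = z - np.max(z)         # softmax(z) = softmax(z - c) for any constant c
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())             # a valid probability distribution (sums to 1)
```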
[Computational-graph / back-propagation example for a function $f(x, y)$ with intermediate node values $v$]
Back-Propagation Computation in Fully-Connected MLP

Symbolic Computation
• Algebraic expressions (and computational graphs) operate on symbols (or variables) – symbolic representations
• When training a network, we assign specific numerical values to these symbols
• Symbolic-to-numeric differentiation: given numeric inputs, calculate and return numerical values describing the gradient at each node (e.g., Torch and Caffe)
• Symbolic-to-symbolic differentiation: the alternative is to add additional nodes to the graph that provide a symbolic description of the desired derivatives (e.g., Theano and TensorFlow)
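A toy contrast between the two styles (an illustrative sketch only; this is not how Torch/Caffe or Theano/TensorFlow implement differentiation internally):

```python
# Toy expression: y = (x * w) ** 2.

# Symbolic-to-numeric: given numeric inputs, return numeric gradient values.
def grad_numeric(x, w):
    z = x * w            # forward pass with concrete numbers
    dy_dz = 2 * z        # backward pass produces numbers, not expressions
    return dy_dz * w, dy_dz * x          # (dy/dx, dy/dw)

# Symbolic-to-symbolic: return new expressions (here represented as strings)
# describing the derivatives; they can be evaluated later or differentiated again.
def grad_symbolic(x="x", w="w"):
    z = f"({x} * {w})"
    dy_dz = f"(2 * {z})"
    return f"({dy_dz} * {w})", f"({dy_dz} * {x})"

print(grad_numeric(3.0, 4.0))    # (96.0, 72.0)
print(grad_symbolic())           # ('((2 * (x * w)) * w)', '((2 * (x * w)) * x)')
```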