UDL Errata
Much gratitude to everyone who has pointed out mistakes. If you find a problem not
listed here, please contact me via GitHub or by mailing me at [email protected].
1.1 Errors
These are things that might genuinely confuse you.
• Figure 4.7b had the wrong calculated numbers in it (but the pattern is the same). The
correct version is in figure 1.1 of this document.
• Section 7.5.1 The expectation (mean) E[f_i′] of the intermediate values f_i′ is:
• Equation 15.7 The optimal discriminator for an example x̃ depends on the underlying probabilities:

$$
Pr(\text{real}|\tilde{x}) = \text{sig}\bigl[\text{f}[\tilde{x},\boldsymbol{\phi}]\bigr]
= \frac{Pr(\tilde{x}|\text{real})}{Pr(\tilde{x}|\text{generated}) + Pr(\tilde{x}|\text{real})}
= \frac{Pr(x)}{Pr(x^{*}) + Pr(x)}, \tag{1.1}
$$

where on the right-hand side, we evaluate x̃ against the generated distribution
Pr(x*) and the real distribution Pr(x).
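As a quick numeric sanity check of this corrected formula, the short Python sketch below plugs in made-up density values for Pr(x) and Pr(x*) at a single point; the numbers are purely illustrative.

```python
# Numeric check of the corrected equation 15.7 at one point x~ (made-up densities).
pr_real = 0.30        # Pr(x): real density evaluated at x~
pr_generated = 0.10   # Pr(x*): generated density evaluated at x~

pr_real_given_x = pr_real / (pr_generated + pr_real)
print(pr_real_given_x)  # 0.75 = sig[f[x~, phi]] for the optimal discriminator
```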
• Equation 15.9. The first integrand should be with respect to x*. Correct version is:

$$
\begin{aligned}
D_{JS}\Bigl[Pr(x^{*})\,\Big|\Big|\,Pr(x)\Bigr]
&= \frac{1}{2} D_{KL}\biggl[Pr(x^{*})\,\bigg|\bigg|\,\frac{Pr(x^{*})+Pr(x)}{2}\biggr]
 + \frac{1}{2} D_{KL}\biggl[Pr(x)\,\bigg|\bigg|\,\frac{Pr(x^{*})+Pr(x)}{2}\biggr] \\
&= \underbrace{\frac{1}{2}\int Pr(x^{*})\log\!\left[\frac{2\,Pr(x^{*})}{Pr(x^{*})+Pr(x)}\right]dx^{*}}_{\text{quality}}
 + \underbrace{\frac{1}{2}\int Pr(x)\log\!\left[\frac{2\,Pr(x)}{Pr(x^{*})+Pr(x)}\right]dx}_{\text{coverage}}.
\end{aligned}
$$
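For readers who want to see the quality and coverage terms separately, here is a minimal numpy sketch that evaluates the two integrals on a grid; the two densities are arbitrary normals used purely as an example, not distributions from the book.

```python
import numpy as np

# Discretised evaluation of the corrected equation 15.9 for two example densities.
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
p_gen = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)    # Pr(x*), generated
p_real = np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)   # Pr(x), real
mix = p_gen + p_real

quality = 0.5 * np.sum(p_gen * np.log(2 * p_gen / mix)) * dx
coverage = 0.5 * np.sum(p_real * np.log(2 * p_real / mix)) * dx
print(quality, coverage, quality + coverage)   # D_JS lies between 0 and log(2)
```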
• Equation 15.12.

$$
D_{w}\Bigl[Pr(x)\,\big|\big|\,q(x)\Bigr]
= \max_{\text{f}}\left[\sum_{i} Pr(x=i)f_{i} - \sum_{j} q(x=j)f_{j}\right], \tag{1.2}
$$
Figure 1.1 Corrected version of figure 4.7: The maximum number of linear regions
for neural networks increases rapidly with the network depth. a) Network with
D_i = 1 input. Each curve represents a fixed number of hidden layers K, as
we vary the number of hidden units D per layer. For a fixed parameter budget
(horizontal position), deeper networks produce more linear regions than shallower
ones. A network with K = 5 layers and D = 10 hidden units per layer has 471
parameters (highlighted point) and can produce 161,051 regions. b) Network with
D_i = 10 inputs. Each subsequent point along a curve represents ten hidden units.
Here, a model with K = 5 layers and D = 50 hidden units per layer has 10,801
parameters (highlighted point) and can create more than 10^40 linear regions.
• Section 15.2.4 Consider distributions Pr(x = i) and q(x = j) defined over K bins.
Assume there is a cost C_{ij} associated with moving one unit of mass from bin i in
the first distribution to bin j in the second;
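To make the discrete transport problem in this entry concrete, here is a minimal Python sketch that solves it as a linear program over the transport plan; the two distributions and the cost C_ij = |i − j| are made-up examples, not values from the book.

```python
import numpy as np
from scipy.optimize import linprog

# Discrete earth mover's (Wasserstein) distance between two example distributions over K bins.
K = 5
p = np.array([0.1, 0.4, 0.3, 0.1, 0.1])
q = np.array([0.3, 0.1, 0.1, 0.2, 0.3])
C = np.abs(np.arange(K)[:, None] - np.arange(K)[None, :])   # cost C_ij = |i - j|

# Minimise sum_ij P_ij * C_ij over transport plans P with row sums p and column sums q.
A_eq = np.zeros((2 * K, K * K))
for i in range(K):
    A_eq[i, i * K:(i + 1) * K] = 1.0    # mass leaving bin i of p
    A_eq[K + i, i::K] = 1.0             # mass arriving at bin i of q
b_eq = np.concatenate([p, q])
result = linprog(C.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("discrete Wasserstein distance:", result.fun)
```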
• Equation 15.14. Missing bracket, and we don't need to use x* notation here. Correct
version is:

$$
D_{w}\Bigl[Pr(x), q(x)\Bigr]
= \min_{\pi[\bullet,\bullet]}\left[\iint \pi(x_{1}, x_{2})\cdot\|x_{1}-x_{2}\|\,dx_{1}\,dx_{2}\right].
$$
• Equation 15.15. We don't need to use x* notation here, and the second term on the
right-hand side should have a q(x) term, not Pr(x). Correct version is:

$$
D_{w}\Bigl[Pr(x), q(x)\Bigr]
= \max_{\text{f}[x]}\left[\int Pr(x)\,\text{f}[x]\,dx - \int q(x)\,\text{f}[x]\,dx\right].
$$
$$
\text{f}[h_{d},\boldsymbol{\phi}] = \left(\sum_{k=1}^{b-1}\phi_{k}\right) + (h_{K}-b)\,\phi_{b}.
$$
• Equation 17.34.

$$
\begin{aligned}
\frac{\partial}{\partial\boldsymbol{\phi}}\,\mathbb{E}_{Pr(x|\boldsymbol{\phi})}\bigl[\text{f}[x]\bigr]
&= \mathbb{E}_{Pr(x|\boldsymbol{\phi})}\left[\text{f}[x]\,\frac{\partial}{\partial\boldsymbol{\phi}}\log\bigl[Pr(x|\boldsymbol{\phi})\bigr]\right] \\
&\approx \frac{1}{I}\sum_{i=1}^{I}\text{f}[x_{i}]\,\frac{\partial}{\partial\boldsymbol{\phi}}\log\bigl[Pr(x_{i}|\boldsymbol{\phi})\bigr].
\end{aligned}
$$
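Since this is the score-function (REINFORCE) estimator, a minimal sketch may help; the choice of Pr(x|ϕ) = Norm[ϕ, 1] and f[x] = x² below is purely illustrative, and in that case the exact gradient of E[x²] = ϕ² + 1 with respect to ϕ is 2ϕ.

```python
import numpy as np

# Monte Carlo score-function estimate of d/dphi E_{Pr(x|phi)}[f[x]] (equation 17.34),
# for the illustrative case Pr(x|phi) = Norm[phi, 1] and f[x] = x^2.
rng = np.random.default_rng(0)
phi, I = 1.5, 100_000
x = rng.normal(phi, 1.0, size=I)         # samples x_i ~ Pr(x|phi)
score = x - phi                          # d/dphi log Norm[x_i; phi, 1]
estimate = np.mean(x ** 2 * score)       # (1/I) sum_i f[x_i] * score_i
print(estimate, "vs. exact gradient", 2 * phi)
```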
• Figure 19.11 is wrong in that only the state-action values corresponding to the
current state-action pair should be modified. Correct version above.
• Equation B.4. Square root sign should cover x. Correct version is:

$$
x! \approx \sqrt{2\pi x}\left(\frac{x}{e}\right)^{x}.
$$
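A quick numeric check of the corrected approximation (the sample values of x are arbitrary):

```python
import math

# Stirling's approximation x! ~ sqrt(2*pi*x) * (x/e)^x versus the exact factorial.
for x in (5, 10, 20):
    approx = math.sqrt(2 * math.pi * x) * (x / math.e) ** x
    print(x, math.factorial(x), round(approx))
```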
$$
x = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2}z.
$$
1.2 Minor errors
These are things that are wrong and need to be fixed, but that will probably not affect
your understanding (e.g., math symbols that are in bold but should not be).
• Figure 3.5 legend: The universal approximation theorem proves that, with enough
hidden units, there exists a shallow neural network that can describe any given
continuous function defined on a compact subset of ℝ^{D_i} to arbitrary precision.
• Notes page 38 Most of these are attempts to avoid the dying ReLU problem while
limiting the gradient for negative values.
• Figure 4.1 legend: The first network maps inputs x ∈ [−1, 1] to outputs y ∈
[−1, 1] using a function comprising three linear regions that are chosen so that they
alternate the sign of their slope (fourth linear region is outside range of graph).
$$
\begin{aligned}
h &= \text{a}[\theta_{0} + \theta x] \\
h' &= \text{a}[\psi_{0} + \Psi h] \\
y' &= \phi_{0}' + \phi' h', \\
y &= \phi_{0}' + \phi' h'
\end{aligned}
$$
• Equation 4.17 is not technically wrong, but the product is unnecessary and it's
unclear if the last term should be included in it (no). Better written as:

$$
N_{r} = \left(\frac{D}{D_{i}}+1\right)^{D_{i}(K-1)}\cdot\sum_{j=0}^{D_{i}}\binom{D}{j}.
$$
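As a check, this rewritten formula reproduces the counts quoted in the corrected figure 4.7 caption (figure 1.1 above); the sketch below assumes D is a multiple of D_i so that D/D_i is an integer.

```python
from math import comb

# Maximum number of linear regions (rewritten equation 4.17).
def max_regions(Di, D, K):
    return (D // Di + 1) ** (Di * (K - 1)) * sum(comb(D, j) for j in range(Di + 1))

print(max_regions(Di=1, D=10, K=5))    # 161051 regions, as in figure 1.1a
print(max_regions(Di=10, D=50, K=5))   # > 10**40 regions, as in figure 1.1b
```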
" I #
(yi − f[xi , ϕ])2
X 1
ϕ̂ = argmin − log √ exp −
ϕ i=1 2πσ 2 2σ 2
" I #
(yi − f[xi , ϕ])2
X 1
= argmin − log √ −
ϕ i=1 2πσ 2 2σ 2
" I #
X (yi − f[xi , ϕ])2
= argmin − −
ϕ i=1
2σ 2
" I #
X
2
= argmin (yi − f[xi , ϕ]) ,
ϕ i=1
although the value of σ 2 does not actually matter or change the position of the
maximum.
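The chain above says that, for any fixed σ², the normal negative log-likelihood and the sum-of-squares loss have the same minimiser. The sketch below confirms this on made-up data with a one-parameter model f[x, ϕ] = ϕ·x; all names and numbers are illustrative.

```python
import numpy as np

# Negative log-likelihood (fixed sigma^2) and least squares share the same argmin.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=50)

phis = np.linspace(0.0, 4.0, 401)      # candidate values of phi for f[x, phi] = phi * x
sigma2 = 0.25
nll = [np.sum(0.5 * np.log(2 * np.pi * sigma2) + (y - p * x) ** 2 / (2 * sigma2)) for p in phis]
sse = [np.sum((y - p * x) ** 2) for p in phis]
print(phis[np.argmin(nll)], phis[np.argmin(sse)])   # identical minimisers
```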
• Section 5.5 The likelihood that input x has label y = k (figure 5.10) is hence:
• Section 5.6 Removed i index from this paragraph for consistency. Independence
implies that we treat the probability Pr(y|f[x, ϕ]) as a product of univariate terms
for each element y_d ∈ y:

$$
Pr(\mathbf{y}|\text{f}[\mathbf{x},\boldsymbol{\phi}]) = \prod_{d} Pr(y_{d}|\text{f}_{d}[\mathbf{x},\boldsymbol{\phi}]),
$$

where f_d[x, ϕ] is the dth set of network outputs, which describe the parameters of the
distribution over y_d. For example, to predict multiple continuous variables y_d ∈ ℝ,
we use a normal distribution for each y_d, and the network outputs f_d[x, ϕ]
predict the means of these distributions. To predict multiple discrete variables y_d ∈
{1, 2, . . . , K}, we use a categorical distribution for each y_d. Here, each set of network
outputs f_d[x, ϕ] predicts the K values that contribute to the categorical distribution
for y_d.
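In code, this independence assumption just means the loss is a sum of per-dimension terms. The sketch below shows the continuous case with unit-variance normal distributions; the function name and shapes are illustrative, not from the book.

```python
import torch

# Factorised likelihood of section 5.6: sum of independent per-dimension normal
# negative log-likelihoods (unit variance, so additive constants are dropped).
def multivariate_regression_loss(y, f):
    # y, f: tensors of shape [batch, D]; f holds the predicted means f_d[x, phi].
    return (0.5 * (y - f) ** 2).sum(dim=-1).mean()

y = torch.randn(8, 3)
f = torch.randn(8, 3)
print(multivariate_regression_loss(y, f))
```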
• Problem 5.8. Construct a loss function for making multivariate predictions y ∈ ℝ^{D_i}
based on independent normal distributions...
• Notes page 94. However, this is strange since SGD is a special case of Adam
(when β = γ = 0)
• Section 7.3. The final derivatives from the term f_0 = β_0 + ω_0 · x_i are:
• Section 7.4. Similarly, the derivative for the weights matrix Ω_k is given by
• Section 7.5.1 and the second moment E[h_j^2] will be half the variance σ_f^2
• Figure 7.8. Not wrong, but changed to nn.init.kaiming_normal_(layer_in.weight)
for compatibility with the text and to avoid a deprecation warning.
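For reference, the non-deprecated call looks like this in PyTorch; the layer below is just a stand-in for whichever layer figure 7.8 actually constructs.

```python
import torch.nn as nn

# He (Kaiming) initialisation with the current, non-deprecated PyTorch API.
layer_in = nn.Linear(100, 100)              # stand-in for the layer in figure 7.8
nn.init.kaiming_normal_(layer_in.weight)    # trailing underscore = in-place version
nn.init.zeros_(layer_in.bias)
```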
• Problem 7.13. For the same function as in problem 7.12, compute the derivative...
• Section 8.3.3 (i.e., with four hidden units and four linear regions in the range of
the data) + minor changes in text to accommodate extra words
• Figure 8.9 number of hidden units / linear regions in range of data
• Section 8.4.1 When the number of parameters is very close to the number of training
data examples (figure 8.11b)
• Figure 9.5 legend: Effect of learning rate (LR) and batch size for 4000 training and
4000 test examples from MNIST-1D (see figure 8.1) for a neural network with two
hidden layers. a) Performance is better for large learning rates than for intermediate
or small ones. In each case, the number of iterations is 6000/LR, so each solution
has the opportunity to move the same distance.
• Figure 9.11 legend: a-c) Two sets of parameters (cyan and gray curves) sampled
from the posterior
• Page 156 Notes: Wrong marginal reference — Appendix B.3.7 Spectral Norm
• Section 10.2.1 Not wrong, but could be disambiguated: The size of the region over
which inputs are combined is termed the kernel size.
• Figure 10.3. The dilation rates are wrong by one, so they should be 1, 1, 1, and 2 in
panels a, b, c, and d, respectively.
• Section 10.2.3 The number of zeros we intersperse between the weights determines
the dilation rate.
• Section 10.2.4 With kernel size three, stride one, and dilation rate one.
• Section 10.2.7 The convolutional network has 2,050 parameters, and the fully con-
nected network has 59,065 parameters.
• Figure 10.8 Number of parameters also wrong in figure 10.8 (correct version in this
document). Recalculated curve is slightly different.
• Section 10.5.3 The first part of the network is a smaller version of VGG (fig-
ure 10.17) that contains thirteen rather than sixteen convolutional layers.
• Section 12.5 The previous section described the transformer layer... a series of
transformer layers...
• Figure 12.8 has some minor mistakes in the calculation. The corrected version is
shown at the end of this document.
• Figure 12.8 legend. At each iteration, the sub-word tokenizer looks for the most
commonly occurring adjacent pair of tokens
• Figure 12.12 are passed through a series of transformer layers... and those of tokens
earlier
• Section 13.5.1 Given I training graphs {X_i, A_i} and their labels y_i, the parameters
Φ = {β_k, Ω_k}_{k=0}^{K} can be learned using SGD...
• Figure 15.3 legend: At the end is a tanh function that maps the...
• Section 15.1.3: At the final layer, the 64×64×3 signal is passed through a tanh
function to generate an image x*...
$$
\begin{aligned}
L[\boldsymbol{\phi}]
&= \frac{1}{J}\sum_{j=1}^{J}\log\Bigl[1-\text{sig}\bigl[\text{f}[x_{j}^{*},\boldsymbol{\phi}]\bigr]\Bigr]
 + \frac{1}{I}\sum_{i=1}^{I}\log\Bigl[\text{sig}\bigl[\text{f}[x_{i},\boldsymbol{\phi}]\bigr]\Bigr] \\
&\approx \mathbb{E}_{x^{*}}\Bigl[\log\bigl[1-\text{sig}[\text{f}[x^{*},\boldsymbol{\phi}]]\bigr]\Bigr]
 + \mathbb{E}_{x}\Bigl[\log\bigl[\text{sig}[\text{f}[x,\boldsymbol{\phi}]]\bigr]\Bigr] \\
&= \int Pr(x^{*})\log\Bigl[1-\text{sig}[\text{f}[x^{*},\boldsymbol{\phi}]]\Bigr]dx^{*}
 + \int Pr(x)\log\Bigl[\text{sig}[\text{f}[x,\boldsymbol{\phi}]]\Bigr]dx.
\end{aligned}
$$
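The first line of this expression is straightforward to write as code; the sketch below computes exactly that quantity from batches of discriminator outputs (the tensors are illustrative stand-ins for f[x*_j, ϕ] and f[x_i, ϕ]).

```python
import torch

# Sample-based objective in the first line above: f_generated holds discriminator
# outputs f[x*_j, phi] on generated examples, f_real holds f[x_i, phi] on real ones.
def discriminator_objective(f_generated, f_real):
    term_generated = torch.log(1.0 - torch.sigmoid(f_generated)).mean()
    term_real = torch.log(torch.sigmoid(f_real)).mean()
    return term_generated + term_real   # maximised over phi (negate for gradient descent)

print(discriminator_objective(torch.randn(16), torch.randn(16)))
```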
• Equation 16.2 (last line). For some reason, this didn't print properly, although it
looks fine in my original pdf. Should be:

$$
\begin{aligned}
\hat{\boldsymbol{\phi}}
&= \underset{\boldsymbol{\phi}}{\mathrm{argmax}}\left[\prod_{i=1}^{I} Pr(x_{i}|\boldsymbol{\phi})\right] \\
&= \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\sum_{i=1}^{I}-\log\bigl[Pr(x_{i}|\boldsymbol{\phi})\bigr]\right] \\
&= \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\sum_{i=1}^{I}\left(\log\left[\left|\frac{\partial\text{f}[z_{i},\boldsymbol{\phi}]}{\partial z_{i}}\right|\right]-\log\bigl[Pr(z_{i})\bigr]\right)\right],
\end{aligned}
$$
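To see the last line in action, here is a minimal sketch for a one-dimensional affine flow x = f[z, ϕ] = ϕ_1·z + ϕ_2 with a standard normal base density; this particular flow and the variable names are illustrative choices, not the book's model.

```python
import numpy as np

# Negative log-likelihood of a 1D affine flow, following the last line of equation 16.2:
# sum_i ( log|df/dz at z_i| - log Pr(z_i) ).
def flow_nll(x, phi1, phi2):
    z = (x - phi2) / phi1                                   # invert x = phi1 * z + phi2
    log_det = np.log(np.abs(phi1))                          # log |df/dz|
    log_prior = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)     # log Pr(z), standard normal
    return np.sum(log_det - log_prior)

x = np.random.default_rng(2).normal(3.0, 2.0, size=1000)
print(flow_nll(x, phi1=2.0, phi2=3.0))   # lower than for badly chosen parameters
```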
• Equation 16.25. ϕ should change to ϕ̂ on the left-hand side. Correct version is:

$$
\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\sum_{i=1}^{I}\mathrm{KL}\Bigl[\delta\bigl[x-\text{f}[z_{i},\boldsymbol{\phi}]\bigr]\,\Big|\Big|\,q(x)\Bigr]\right].
$$
• Equation 16.26. ϕ should change to ϕ̂ on the left-hand side. Correct version is:

$$
\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\mathrm{KL}\left[\frac{1}{I}\sum_{i=1}^{I}\delta[x-x_{i}]\,\Bigg|\Bigg|\,Pr(x|\boldsymbol{\phi})\right]\right].
$$
$$
\begin{aligned}
&\log\left[\frac{Pr(x,z_{1\ldots T}|\boldsymbol{\phi}_{1\ldots T})}{q(z_{1\ldots T}|x)}\right] \\
&\quad= \log\left[\frac{Pr(x|z_{1},\boldsymbol{\phi}_{1})}{q(z_{1}|x)}\right]
 + \log\left[\frac{\prod_{t=2}^{T}Pr(z_{t-1}|z_{t},\boldsymbol{\phi}_{t})\cdot q(z_{t-1}|x)}{\prod_{t=2}^{T}q(z_{t-1}|z_{t},x)\cdot q(z_{t}|x)}\right]
 + \log\bigl[Pr(z_{T})\bigr] \\
&\quad= \log\bigl[Pr(x|z_{1},\boldsymbol{\phi}_{1})\bigr]
 + \log\left[\frac{\prod_{t=2}^{T}Pr(z_{t-1}|z_{t},\boldsymbol{\phi}_{t})}{\prod_{t=2}^{T}q(z_{t-1}|z_{t},x)}\right]
 + \log\left[\frac{Pr(z_{T})}{q(z_{T}|x)}\right] \\
&\quad\approx \log\bigl[Pr(x|z_{1},\boldsymbol{\phi}_{1})\bigr]
 + \sum_{t=2}^{T}\log\left[\frac{Pr(z_{t-1}|z_{t},\boldsymbol{\phi}_{t})}{q(z_{t-1}|z_{t},x)}\right],
\end{aligned}
$$
$$
L[\boldsymbol{\phi}_{1\ldots T}]
= \sum_{i=1}^{I}\Biggl[-\log\Bigl[\mathrm{Norm}_{x_{i}}\bigl[\text{f}_{1}[z_{i1},\boldsymbol{\phi}_{1}],\,\sigma_{1}^{2}\mathbf{I}\bigr]\Bigr]
 + \sum_{t=2}^{T}\frac{1}{2\sigma_{t}^{2}}\left\|\frac{1}{\sqrt{1-\beta_{t}}}z_{it}
 - \frac{\beta_{t}}{\sqrt{1-\alpha_{t}}\sqrt{1-\beta_{t}}}\epsilon_{it}
 - \text{f}_{t}[z_{it},\boldsymbol{\phi}_{t}]\right\|^{2}\Biggr] \tag{1.4}
$$
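One t ≥ 2 term of this loss is easy to transcribe directly; the sketch below does exactly that for a single training example, with all tensors treated as illustrative stand-ins for z_it, ε_it, and the network prediction f_t[z_it, ϕ_t].

```python
import torch

# One t >= 2 term of the diffusion loss in equation 1.4, transcribed directly.
def diffusion_step_loss(z_t, eps_t, f_t_pred, beta_t, alpha_t, sigma_t):
    target = z_t / torch.sqrt(1 - beta_t) \
             - beta_t * eps_t / (torch.sqrt(1 - alpha_t) * torch.sqrt(1 - beta_t))
    return ((target - f_t_pred) ** 2).sum() / (2 * sigma_t ** 2)

z_t, eps_t, f_t_pred = torch.randn(64), torch.randn(64), torch.randn(64)
print(diffusion_step_loss(z_t, eps_t, f_t_pred,
                          beta_t=torch.tensor(0.02),
                          alpha_t=torch.tensor(0.5),
                          sigma_t=torch.tensor(0.1)))
```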
• Section 20.2.2 Another possible explanation for the ease with which models are
trained is that some regularization methods like L2 regularization (weight decay)
make the loss surface flatter and more convex.
• Section 20.2.4 For example, Du et al. (2019a) show that residual networks converge
to zero training loss when the width of the network D (i.e., the number of hidden
units) is Ω[I^4 K^2], where I is the amount of training data, and K is the depth of
the network.
• Section 20.5.1 In general, the smaller the model, the larger the proportion of weights
that can...
• Section 21.7 the Conference on AI, Ethics, and Society
• Appendix A. The notation {0, 1, 2, . . .} denotes the set of non-negative integers.
• Appendix A ...big-O notation, which represents an upper bound...
• Appendix A. f[n] < c·g[n] for all n > n_0
• Equation B.18.

$$
\begin{aligned}
y_{1} &= \phi_{10} + \phi_{11}z_{1} + \phi_{12}z_{2} + \phi_{13}z_{3} \\
y_{2} &= \phi_{20} + \phi_{21}z_{1} + \phi_{22}z_{2} + \phi_{23}z_{3} \\
y_{3} &= \phi_{30} + \phi_{31}z_{1} + \phi_{32}z_{2} + \phi_{33}z_{3}.
\end{aligned} \tag{1.5}
$$
$$
D_{KL}\Bigl[\mathrm{Norm}[\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{1}]\,\Big|\Big|\,\mathrm{Norm}[\boldsymbol{\mu}_{2},\boldsymbol{\Sigma}_{2}]\Bigr]
= \frac{1}{2}\left(\log\left[\frac{|\boldsymbol{\Sigma}_{2}|}{|\boldsymbol{\Sigma}_{1}|}\right] - D
+ \mathrm{tr}\bigl[\boldsymbol{\Sigma}_{2}^{-1}\boldsymbol{\Sigma}_{1}\bigr]
+ (\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})^{\mathsf{T}}\boldsymbol{\Sigma}_{2}^{-1}(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})\right).
$$
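This closed form is easy to check numerically; below is a small numpy transcription of the formula (the dimensions and example inputs are arbitrary).

```python
import numpy as np

# KL divergence between two multivariate normals, transcribed from the formula above.
def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    D = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)) - D
                  + np.trace(Sigma2_inv @ Sigma1) + diff @ Sigma2_inv @ diff)

mu = np.zeros(2)
print(gaussian_kl(mu, np.eye(2), mu, 2.0 * np.eye(2)))   # small and positive
```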