UDL Errata
Much gratitude to everyone who has pointed out mistakes. If you find a problem not
listed here, please contact me via GitHub or by mailing me at [email protected].
1.1 Errors
These are things that might genuinely confuse you.
• Figure 4.7b had the wrong calculated numbers in it (but the pattern is the same). The
correct version is in figure 1.1 of this document.
• Section 7.5.1 The expectation (mean) E[f_i′] of the intermediate values f_i′ is:
• Equation 15.7 The optimal discriminator for an example x̃ depends on the underlying probabilities:

$$
Pr(\text{real}|\tilde{x}) = \text{sig}\bigl[\text{f}[\tilde{x},\boldsymbol{\phi}]\bigr]
= \frac{Pr(\tilde{x}|\text{real})}{Pr(\tilde{x}|\text{generated}) + Pr(\tilde{x}|\text{real})}
= \frac{Pr(x)}{Pr(x^{*}) + Pr(x)}, \tag{1.1}
$$

where on the right-hand side, we evaluate x̃ against the generated distribution
Pr(x*) and the real distribution Pr(x).
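As a quick numeric sanity check of this corrected formula, the short Python sketch below plugs in made-up density values for Pr(x) and Pr(x*) at a single point; the numbers are purely illustrative.

```python
# Numeric check of the corrected equation 15.7 at one point x~ (made-up densities).
pr_real = 0.30        # Pr(x): real density evaluated at x~
pr_generated = 0.10   # Pr(x*): generated density evaluated at x~

pr_real_given_x = pr_real / (pr_generated + pr_real)
print(pr_real_given_x)  # 0.75 = sig[f[x~, phi]] for the optimal discriminator
```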
• Equation 15.9. The first integrand should be with respect to x*. Correct version is:

$$
\begin{aligned}
D_{JS}\Bigl[Pr(x^{*})\,\Big|\Big|\,Pr(x)\Bigr]
&= \frac{1}{2} D_{KL}\biggl[Pr(x^{*})\,\bigg|\bigg|\,\frac{Pr(x^{*})+Pr(x)}{2}\biggr]
 + \frac{1}{2} D_{KL}\biggl[Pr(x)\,\bigg|\bigg|\,\frac{Pr(x^{*})+Pr(x)}{2}\biggr] \\
&= \underbrace{\frac{1}{2}\int Pr(x^{*})\log\!\left[\frac{2\,Pr(x^{*})}{Pr(x^{*})+Pr(x)}\right]dx^{*}}_{\text{quality}}
 + \underbrace{\frac{1}{2}\int Pr(x)\log\!\left[\frac{2\,Pr(x)}{Pr(x^{*})+Pr(x)}\right]dx}_{\text{coverage}}.
\end{aligned}
$$
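For readers who want to see the quality and coverage terms separately, here is a minimal numpy sketch that evaluates the two integrals on a grid; the two densities are arbitrary normals used purely as an example, not distributions from the book.

```python
import numpy as np

# Discretised evaluation of the corrected equation 15.9 for two example densities.
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
p_gen = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)    # Pr(x*), generated
p_real = np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)   # Pr(x), real
mix = p_gen + p_real

quality = 0.5 * np.sum(p_gen * np.log(2 * p_gen / mix)) * dx
coverage = 0.5 * np.sum(p_real * np.log(2 * p_real / mix)) * dx
print(quality, coverage, quality + coverage)   # D_JS lies between 0 and log(2)
```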
• Equation 15.12.

$$
D_{w}\Bigl[Pr(x)\,\big|\big|\,q(x)\Bigr]
= \max_{\text{f}}\left[\sum_{i} Pr(x=i)f_{i} - \sum_{j} q(x=j)f_{j}\right], \tag{1.2}
$$
Figure 1.1 Corrected version of figure 4.7: The maximum number of linear regions
for neural networks increases rapidly with the network depth. a) Network with
D_i = 1 input. Each curve represents a fixed number of hidden layers K, as
we vary the number of hidden units D per layer. For a fixed parameter budget
(horizontal position), deeper networks produce more linear regions than shallower
ones. A network with K = 5 layers and D = 10 hidden units per layer has 471
parameters (highlighted point) and can produce 161,051 regions. b) Network with
D_i = 10 inputs. Each subsequent point along a curve represents ten hidden units.
Here, a model with K = 5 layers and D = 50 hidden units per layer has 10,801
parameters (highlighted point) and can create more than 10^40 linear regions.
• Section 15.2.4 Consider distributions Pr(x = i) and q(x = j) defined over K bins.
Assume there is a cost C_{ij} associated with moving one unit of mass from bin i in
the first distribution to bin j in the second;
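To make the discrete transport problem in this entry concrete, here is a minimal Python sketch that solves it as a linear program over the transport plan; the two distributions and the cost C_ij = |i − j| are made-up examples, not values from the book.

```python
import numpy as np
from scipy.optimize import linprog

# Discrete earth mover's (Wasserstein) distance between two example distributions over K bins.
K = 5
p = np.array([0.1, 0.4, 0.3, 0.1, 0.1])
q = np.array([0.3, 0.1, 0.1, 0.2, 0.3])
C = np.abs(np.arange(K)[:, None] - np.arange(K)[None, :])   # cost C_ij = |i - j|

# Minimise sum_ij P_ij * C_ij over transport plans P with row sums p and column sums q.
A_eq = np.zeros((2 * K, K * K))
for i in range(K):
    A_eq[i, i * K:(i + 1) * K] = 1.0    # mass leaving bin i of p
    A_eq[K + i, i::K] = 1.0             # mass arriving at bin i of q
b_eq = np.concatenate([p, q])
result = linprog(C.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("discrete Wasserstein distance:", result.fun)
```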
• Equation 15.14. Missing bracket, and we don't need to use x* notation here. Correct
version is:

$$
D_{w}\Bigl[Pr(x), q(x)\Bigr]
= \min_{\pi[\bullet,\bullet]}\left[\iint \pi(x_{1}, x_{2})\cdot\|x_{1}-x_{2}\|\,dx_{1}\,dx_{2}\right].
$$
• Equation 15.15. We don't need to use x* notation here, and the second term on the
right-hand side should have a q(x) term, not Pr(x). Correct version is:

$$
D_{w}\Bigl[Pr(x), q(x)\Bigr]
= \max_{\text{f}[x]}\left[\int Pr(x)\,\text{f}[x]\,dx - \int q(x)\,\text{f}[x]\,dx\right].
$$
$$
\text{f}[h_{d},\boldsymbol{\phi}] = \left(\sum_{k=1}^{b-1}\phi_{k}\right) + (h_{K}-b)\,\phi_{b}.
$$
• Equation 17.34.

$$
\begin{aligned}
\frac{\partial}{\partial\boldsymbol{\phi}}\,\mathbb{E}_{Pr(x|\boldsymbol{\phi})}\bigl[\text{f}[x]\bigr]
&= \mathbb{E}_{Pr(x|\boldsymbol{\phi})}\left[\text{f}[x]\,\frac{\partial}{\partial\boldsymbol{\phi}}\log\bigl[Pr(x|\boldsymbol{\phi})\bigr]\right] \\
&\approx \frac{1}{I}\sum_{i=1}^{I}\text{f}[x_{i}]\,\frac{\partial}{\partial\boldsymbol{\phi}}\log\bigl[Pr(x_{i}|\boldsymbol{\phi})\bigr].
\end{aligned}
$$
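Since this is the score-function (REINFORCE) estimator, a minimal sketch may help; the choice of Pr(x|ϕ) = Norm[ϕ, 1] and f[x] = x² below is purely illustrative, and in that case the exact gradient of E[x²] = ϕ² + 1 with respect to ϕ is 2ϕ.

```python
import numpy as np

# Monte Carlo score-function estimate of d/dphi E_{Pr(x|phi)}[f[x]] (equation 17.34),
# for the illustrative case Pr(x|phi) = Norm[phi, 1] and f[x] = x^2.
rng = np.random.default_rng(0)
phi, I = 1.5, 100_000
x = rng.normal(phi, 1.0, size=I)         # samples x_i ~ Pr(x|phi)
score = x - phi                          # d/dphi log Norm[x_i; phi, 1]
estimate = np.mean(x ** 2 * score)       # (1/I) sum_i f[x_i] * score_i
print(estimate, "vs. exact gradient", 2 * phi)
```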
• Figure 19.11 is wrong in that only the state-action values corresponding to the
current state-action pair should be modified. Correct version above.
• Equation B.4. Square root sign should cover x. Correct version is:

$$
x! \approx \sqrt{2\pi x}\left(\frac{x}{e}\right)^{x}.
$$
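A quick numeric check of the corrected approximation (the sample values of x are arbitrary):

```python
import math

# Stirling's approximation x! ~ sqrt(2*pi*x) * (x/e)^x versus the exact factorial.
for x in (5, 10, 20):
    approx = math.sqrt(2 * math.pi * x) * (x / math.e) ** x
    print(x, math.factorial(x), round(approx))
```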
$$
x = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2}z.
$$
1.2 Minor errors
These are things that are wrong and need to be fixed, but that will probably not affect
your understanding (e.g., math symbols that are in bold but should not be).
• Figure 3.5 legend: The universal approximation theorem proves that, with enough
hidden units, there exists a shallow neural network that can describe any given
continuous function defined on a compact subset of ℝ^{D_i} to arbitrary precision.
• Notes page 38 Most of these are attempts to avoid the dying ReLU problem while
limiting the gradient for negative values.
• Figure 4.1 legend: The first network maps inputs x ∈ [−1, 1] to outputs y ∈
[−1, 1] using a function comprising three linear regions that are chosen so that they
alternate the sign of their slope (fourth linear region is outside range of graph).
$$
\begin{aligned}
h &= \text{a}[\theta_{0} + \theta x] \\
h' &= \text{a}[\psi_{0} + \Psi h] \\
y' &= \phi_{0}' + \phi' h', \\
y &= \phi_{0}' + \phi' h'
\end{aligned}
$$
• Equation 4.17 is not technically wrong, but the product is unnecessary and it's
unclear if the last term should be included in it (no). Better written as:

$$
N_{r} = \left(\frac{D}{D_{i}}+1\right)^{D_{i}(K-1)}\cdot\sum_{j=0}^{D_{i}}\binom{D}{j}.
$$
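As a check, this rewritten formula reproduces the counts quoted in the corrected figure 4.7 caption (figure 1.1 above); the sketch below assumes D is a multiple of D_i so that D/D_i is an integer.

```python
from math import comb

# Maximum number of linear regions (rewritten equation 4.17).
def max_regions(Di, D, K):
    return (D // Di + 1) ** (Di * (K - 1)) * sum(comb(D, j) for j in range(Di + 1))

print(max_regions(Di=1, D=10, K=5))    # 161051 regions, as in figure 1.1a
print(max_regions(Di=10, D=50, K=5))   # > 10**40 regions, as in figure 1.1b
```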
" I #
(yi − f[xi , ϕ])2
X 1
ϕ̂ = argmin − log √ exp −
ϕ i=1 2πσ 2 2σ 2
" I #
(yi − f[xi , ϕ])2
X 1
= argmin − log √ −
ϕ i=1 2πσ 2 2σ 2
" I #
X (yi − f[xi , ϕ])2
= argmin − −
ϕ i=1
2σ 2
" I #
X
2
= argmin (yi − f[xi , ϕ]) ,
ϕ i=1
although the value of σ 2 does not actually matter or change the position of the
maximum.
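The chain above says that, for any fixed σ², the normal negative log-likelihood and the sum-of-squares loss have the same minimiser. The sketch below confirms this on made-up data with a one-parameter model f[x, ϕ] = ϕ·x; all names and numbers are illustrative.

```python
import numpy as np

# Negative log-likelihood (fixed sigma^2) and least squares share the same argmin.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=50)

phis = np.linspace(0.0, 4.0, 401)      # candidate values of phi for f[x, phi] = phi * x
sigma2 = 0.25
nll = [np.sum(0.5 * np.log(2 * np.pi * sigma2) + (y - p * x) ** 2 / (2 * sigma2)) for p in phis]
sse = [np.sum((y - p * x) ** 2) for p in phis]
print(phis[np.argmin(nll)], phis[np.argmin(sse)])   # identical minimisers
```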
• Section 5.5 The likelihood that input x has label y = k (figure 5.10) is hence:
• Section 5.6 Removed i index from this paragraph for consistency. Independence
implies that we treat the probability Pr(y|f[x, ϕ]) as a product of univariate terms
for each element y_d ∈ y:

$$
Pr(\mathbf{y}|\text{f}[\mathbf{x},\boldsymbol{\phi}]) = \prod_{d} Pr(y_{d}|\text{f}_{d}[\mathbf{x},\boldsymbol{\phi}]),
$$

where f_d[x, ϕ] is the dth set of network outputs, which describe the parameters of the
distribution over y_d. For example, to predict multiple continuous variables y_d ∈ ℝ,
we use a normal distribution for each y_d, and the network outputs f_d[x, ϕ]
predict the means of these distributions. To predict multiple discrete variables y_d ∈
{1, 2, . . . , K}, we use a categorical distribution for each y_d. Here, each set of network
outputs f_d[x, ϕ] predicts the K values that contribute to the categorical distribution
for y_d.
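In code, this independence assumption just means the loss is a sum of per-dimension terms. The sketch below shows the continuous case with unit-variance normal distributions; the function name and shapes are illustrative, not from the book.

```python
import torch

# Factorised likelihood of section 5.6: sum of independent per-dimension normal
# negative log-likelihoods (unit variance, so additive constants are dropped).
def multivariate_regression_loss(y, f):
    # y, f: tensors of shape [batch, D]; f holds the predicted means f_d[x, phi].
    return (0.5 * (y - f) ** 2).sum(dim=-1).mean()

y = torch.randn(8, 3)
f = torch.randn(8, 3)
print(multivariate_regression_loss(y, f))
```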
• Problem 5.8. Construct a loss function for making multivariate predictions y ∈ ℝ^{D_i}
based on independent normal distributions...
• Notes page 94. However, this is strange since SGD is a special case of Adam
(when β = γ = 0)
• Section 7.3. The final derivatives from the term f_0 = β_0 + ω_0 · x_i are:
• Section 7.4. Similarly, the derivative for the weights matrix Ω_k is given by
• Section 7.5.1 and the second moment E[h_j^2] will be half the variance σ_f^2
• Figure 7.8. Not wrong, but changed to nn.init.kaiming_normal_(layer_in.weight)
for compatibility with the text and to avoid a deprecation warning.
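For reference, the non-deprecated call looks like this in PyTorch; the layer below is just a stand-in for whichever layer figure 7.8 actually constructs.

```python
import torch.nn as nn

# He (Kaiming) initialisation with the current, non-deprecated PyTorch API.
layer_in = nn.Linear(100, 100)              # stand-in for the layer in figure 7.8
nn.init.kaiming_normal_(layer_in.weight)    # trailing underscore = in-place version
nn.init.zeros_(layer_in.bias)
```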
• Problem 7.13. For the same function as in problem 7.12, compute the derivative...
• Section 8.3.3 (i.e., with four hidden units and four linear regions in the range of
the data) + minor changes in text to accommodate extra words
• Figure 8.9 number of hidden units / linear regions in range of data
• Section 8.4.1 When the number of parameters is very close to the number of training
data examples (figure 8.11b)
• Figure 9.5 legend: Effect of learning rate (LR) and batch size for 4000 training and
4000 test examples from MNIST-1D (see figure 8.1) for a neural network with two
hidden layers. a) Performance is better for large learning rates than for intermediate
or small ones. In each case, the number of iterations is 6000/LR, so each solution
has the opportunity to move the same distance.
• Figure 9.11 legend: a-c) Two sets of parameters (cyan and gray curves) sampled
from the posterior
• Page 156 Notes: Wrong marginal reference — Appendix B.3.7 Spectral Norm
• Section 10.2.1 Not wrong, but could be disambiguated: The size of the region over
which inputs are combined is termed the kernel size.
• Figure 10.3. The dilation rates are wrong by one, so they should be 1, 1, 1, and 2 in
panels a, b, c, and d, respectively.
• Section 10.2.3 The number of zeros we intersperse between the weights determines
the dilation rate.
• Section 10.2.4 With kernel size three, stride one, and dilation rate one.
• Section 10.2.7 The convolutional network has 2,050 parameters, and the fully con-
nected network has 59,065 parameters.
• Figure 10.8 Number of parameters also wrong in figure 10.8 (correct version in this
document). Recalculated curve is slightly different.
• Section 10.5.3 The first part of the network is a smaller version of VGG (fig-
ure 10.17) that contains thirteen rather than sixteen convolutional layers.
• Section 12.5 The previous section described the transformer layer... a series of
transformer layers...
• Figure 12.8 has some minor mistakes in the calculation. The corrected version is
shown at the end of this document.
• Figure 12.8 legend. At each iteration, the sub-word tokenizer looks for the most
commonly occurring adjacent pair of tokens
• Figure 12.12 are passed through a series of transformer layers... and those of tokens
earlier
• Section 13.5.1 Given I training graphs {X_i, A_i} and their labels y_i, the parameters
Φ = {β_k, Ω_k}_{k=0}^{K} can be learned using SGD...
• Figure 15.3 legend: At the end is a tanh function that maps the...
• Section 15.1.3: At the final layer, the 64×64×3 signal is passed through a tanh
function to generate an image x*...
$$
\begin{aligned}
L[\boldsymbol{\phi}]
&= \frac{1}{J}\sum_{j=1}^{J}\log\Bigl[1-\text{sig}\bigl[\text{f}[x_{j}^{*},\boldsymbol{\phi}]\bigr]\Bigr]
 + \frac{1}{I}\sum_{i=1}^{I}\log\Bigl[\text{sig}\bigl[\text{f}[x_{i},\boldsymbol{\phi}]\bigr]\Bigr] \\
&\approx \mathbb{E}_{x^{*}}\Bigl[\log\bigl[1-\text{sig}[\text{f}[x^{*},\boldsymbol{\phi}]]\bigr]\Bigr]
 + \mathbb{E}_{x}\Bigl[\log\bigl[\text{sig}[\text{f}[x,\boldsymbol{\phi}]]\bigr]\Bigr] \\
&= \int Pr(x^{*})\log\Bigl[1-\text{sig}[\text{f}[x^{*},\boldsymbol{\phi}]]\Bigr]dx^{*}
 + \int Pr(x)\log\Bigl[\text{sig}[\text{f}[x,\boldsymbol{\phi}]]\Bigr]dx.
\end{aligned}
$$
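The first line of this expression is straightforward to write as code; the sketch below computes exactly that quantity from batches of discriminator outputs (the tensors are illustrative stand-ins for f[x*_j, ϕ] and f[x_i, ϕ]).

```python
import torch

# Sample-based objective in the first line above: f_generated holds discriminator
# outputs f[x*_j, phi] on generated examples, f_real holds f[x_i, phi] on real ones.
def discriminator_objective(f_generated, f_real):
    term_generated = torch.log(1.0 - torch.sigmoid(f_generated)).mean()
    term_real = torch.log(torch.sigmoid(f_real)).mean()
    return term_generated + term_real   # maximised over phi (negate for gradient descent)

print(discriminator_objective(torch.randn(16), torch.randn(16)))
```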
• Equation 16.2 (last line). For some reason, this didn't print properly, although it
looks fine in my original pdf. Should be:

$$
\begin{aligned}
\hat{\boldsymbol{\phi}}
&= \underset{\boldsymbol{\phi}}{\mathrm{argmax}}\left[\prod_{i=1}^{I} Pr(x_{i}|\boldsymbol{\phi})\right] \\
&= \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\sum_{i=1}^{I}-\log\bigl[Pr(x_{i}|\boldsymbol{\phi})\bigr]\right] \\
&= \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\sum_{i=1}^{I}\left(\log\left[\left|\frac{\partial\text{f}[z_{i},\boldsymbol{\phi}]}{\partial z_{i}}\right|\right]-\log\bigl[Pr(z_{i})\bigr]\right)\right],
\end{aligned}
$$
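To see the last line in action, here is a minimal sketch for a one-dimensional affine flow x = f[z, ϕ] = ϕ_1·z + ϕ_2 with a standard normal base density; this particular flow and the variable names are illustrative choices, not the book's model.

```python
import numpy as np

# Negative log-likelihood of a 1D affine flow, following the last line of equation 16.2:
# sum_i ( log|df/dz at z_i| - log Pr(z_i) ).
def flow_nll(x, phi1, phi2):
    z = (x - phi2) / phi1                                   # invert x = phi1 * z + phi2
    log_det = np.log(np.abs(phi1))                          # log |df/dz|
    log_prior = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)     # log Pr(z), standard normal
    return np.sum(log_det - log_prior)

x = np.random.default_rng(2).normal(3.0, 2.0, size=1000)
print(flow_nll(x, phi1=2.0, phi2=3.0))   # lower than for badly chosen parameters
```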
• Equation 16.25. ϕ should change to ϕ̂ on the left-hand side. Correct version is:

$$
\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\sum_{i=1}^{I}\mathrm{KL}\Bigl[\delta\bigl[x-\text{f}[z_{i},\boldsymbol{\phi}]\bigr]\,\Big|\Big|\,q(x)\Bigr]\right].
$$
• Equation 16.26. ϕ should change to ϕ̂ on the left-hand side. Correct version is:

$$
\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\left[\mathrm{KL}\left[\frac{1}{I}\sum_{i=1}^{I}\delta[x-x_{i}]\,\Bigg|\Bigg|\,Pr(x|\boldsymbol{\phi})\right]\right].
$$
$$
\begin{aligned}
&\log\left[\frac{Pr(x,z_{1\ldots T}|\boldsymbol{\phi}_{1\ldots T})}{q(z_{1\ldots T}|x)}\right] \\
&\quad= \log\left[\frac{Pr(x|z_{1},\boldsymbol{\phi}_{1})}{q(z_{1}|x)}\right]
 + \log\left[\frac{\prod_{t=2}^{T}Pr(z_{t-1}|z_{t},\boldsymbol{\phi}_{t})\cdot q(z_{t-1}|x)}{\prod_{t=2}^{T}q(z_{t-1}|z_{t},x)\cdot q(z_{t}|x)}\right]
 + \log\bigl[Pr(z_{T})\bigr] \\
&\quad= \log\bigl[Pr(x|z_{1},\boldsymbol{\phi}_{1})\bigr]
 + \log\left[\frac{\prod_{t=2}^{T}Pr(z_{t-1}|z_{t},\boldsymbol{\phi}_{t})}{\prod_{t=2}^{T}q(z_{t-1}|z_{t},x)}\right]
 + \log\left[\frac{Pr(z_{T})}{q(z_{T}|x)}\right] \\
&\quad\approx \log\bigl[Pr(x|z_{1},\boldsymbol{\phi}_{1})\bigr]
 + \sum_{t=2}^{T}\log\left[\frac{Pr(z_{t-1}|z_{t},\boldsymbol{\phi}_{t})}{q(z_{t-1}|z_{t},x)}\right],
\end{aligned}
$$
$$
L[\boldsymbol{\phi}_{1\ldots T}]
= \sum_{i=1}^{I}\Biggl[-\log\Bigl[\mathrm{Norm}_{x_{i}}\bigl[\text{f}_{1}[z_{i1},\boldsymbol{\phi}_{1}],\,\sigma_{1}^{2}\mathbf{I}\bigr]\Bigr]
 + \sum_{t=2}^{T}\frac{1}{2\sigma_{t}^{2}}\left\|\frac{1}{\sqrt{1-\beta_{t}}}z_{it}
 - \frac{\beta_{t}}{\sqrt{1-\alpha_{t}}\sqrt{1-\beta_{t}}}\epsilon_{it}
 - \text{f}_{t}[z_{it},\boldsymbol{\phi}_{t}]\right\|^{2}\Biggr] \tag{1.4}
$$
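One t ≥ 2 term of this loss is easy to transcribe directly; the sketch below does exactly that for a single training example, with all tensors treated as illustrative stand-ins for z_it, ε_it, and the network prediction f_t[z_it, ϕ_t].

```python
import torch

# One t >= 2 term of the diffusion loss in equation 1.4, transcribed directly.
def diffusion_step_loss(z_t, eps_t, f_t_pred, beta_t, alpha_t, sigma_t):
    target = z_t / torch.sqrt(1 - beta_t) \
             - beta_t * eps_t / (torch.sqrt(1 - alpha_t) * torch.sqrt(1 - beta_t))
    return ((target - f_t_pred) ** 2).sum() / (2 * sigma_t ** 2)

z_t, eps_t, f_t_pred = torch.randn(64), torch.randn(64), torch.randn(64)
print(diffusion_step_loss(z_t, eps_t, f_t_pred,
                          beta_t=torch.tensor(0.02),
                          alpha_t=torch.tensor(0.5),
                          sigma_t=torch.tensor(0.1)))
```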
• Section 20.2.2 Another possible explanation for the ease with which models are
trained is that some regularization methods like L2 regularization (weight decay)
make the loss surface flatter and more convex.
• Section 20.2.4 For example, Du et al. (2019a) show that residual networks converge
to zero training loss when the width of the network D (i.e., the number of hidden
units) is Ω[I^4 K^2], where I is the amount of training data, and K is the depth of
the network.
• Section 20.5.1 In general, the smaller the model, the larger the proportion of weights
that can...
• Section 21.7 the Conference on AI, Ethics, and Society
• Appendix A. The notation {0, 1, 2, . . .} denotes the set of non-negative integers.
• Appendix A ...big-O notation, which represents an upper bound...
• Appendix A. f[n] < c·g[n] for all n > n_0
• Equation B.18.

$$
\begin{aligned}
y_{1} &= \phi_{10} + \phi_{11}z_{1} + \phi_{12}z_{2} + \phi_{13}z_{3} \\
y_{2} &= \phi_{20} + \phi_{21}z_{1} + \phi_{22}z_{2} + \phi_{23}z_{3} \\
y_{3} &= \phi_{30} + \phi_{31}z_{1} + \phi_{32}z_{2} + \phi_{33}z_{3}.
\end{aligned} \tag{1.5}
$$
$$
D_{KL}\Bigl[\mathrm{Norm}[\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{1}]\,\Big|\Big|\,\mathrm{Norm}[\boldsymbol{\mu}_{2},\boldsymbol{\Sigma}_{2}]\Bigr]
= \frac{1}{2}\left(\log\left[\frac{|\boldsymbol{\Sigma}_{2}|}{|\boldsymbol{\Sigma}_{1}|}\right] - D
+ \mathrm{tr}\bigl[\boldsymbol{\Sigma}_{2}^{-1}\boldsymbol{\Sigma}_{1}\bigr]
+ (\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})^{\mathsf{T}}\boldsymbol{\Sigma}_{2}^{-1}(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})\right).
$$
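This closed form is easy to check numerically; below is a small numpy transcription of the formula (the dimensions and example inputs are arbitrary).

```python
import numpy as np

# KL divergence between two multivariate normals, transcribed from the formula above.
def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    D = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)) - D
                  + np.trace(Sigma2_inv @ Sigma1) + diff @ Sigma2_inv @ diff)

mu = np.zeros(2)
print(gaussian_kl(mu, np.eye(2), mu, 2.0 * np.eye(2)))   # small and positive
```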