L11 - Regularization
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2022
Contents 2
• However, those functions might generalize badly.
[Bishop, Figure 1.5: training and test root-mean-square error E_RMS as a function of the polynomial order M (model complexity).]
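A small numerical illustration in the spirit of that figure (a sketch only; the sin(2πx) target, the noise level, and the sample sizes are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x), as in Bishop's polynomial-fitting example
def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

for M in [1, 3, 9]:  # polynomial order M plays the role of model complexity
    coeffs = np.polyfit(x_train, y_train, M)
    def rms(x, y):
        return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    print(f"M={M}: train E_RMS={rms(x_train, y_train):.3f}, "
          f"test E_RMS={rms(x_test, y_test):.3f}")

# Large M drives the training error toward 0 while the test error
# can grow: those high-complexity fits generalize badly.
```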
The Bias-Variance Decomposition 4
• The more complex the model f̂(x; D) is, the more data points it can capture, and the lower its bias can be.
  ◦ However, higher complexity will make the model "move" more to capture the data points, and hence its variance will be larger.
[Figures: test- and training-sample prediction error versus model complexity, from high bias / low variance to low bias / high variance (Hastie et al., Figure 2.11); expected prediction error, squared bias, and variance of k-NN regression as k varies (Hastie et al., Figure 7.3).]
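For reference, a standard form of the decomposition behind this tradeoff (assuming y = f(x) + ε with zero-mean noise of variance σ², training set D, and estimator f̂(x; D)):

$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}(x;D))^2\big]
= \underbrace{\big(f(x) - \mathbb{E}_{D}[\hat{f}(x;D)]\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}_{D}\big[\big(\hat{f}(x;D) - \mathbb{E}_{D}[\hat{f}(x;D)]\big)^2\big]}_{\text{Variance}}
+ \sigma^2 .
$$

The bias term shrinks as the model class grows richer, while the variance term grows, which is exactly the tension pictured above.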
Regularization: introduction 7
• L1-norm: $\|w\|_1 = \sum_{i=1}^{n} |w_i|$
• L2-norm: $\|w\|_2 = \sqrt{\sum_{i=1}^{n} w_i^2}$
• Lp-norm: $\|w\|_p = \big(|w_1|^p + \dots + |w_n|^p\big)^{1/p}$
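A minimal NumPy sketch of these norms (the vector w and the choice p = 3 are illustrative):

```python
import numpy as np

w = np.array([0.5, -2.0, 3.0])

l1 = np.sum(np.abs(w))                  # L1-norm: sum of absolute values
l2 = np.sqrt(np.sum(w ** 2))            # L2-norm: square root of the sum of squares
p = 3
lp = np.sum(np.abs(w) ** p) ** (1 / p)  # Lp-norm for a chosen p

# Cross-check against NumPy's built-in vector norms
assert np.isclose(l1, np.linalg.norm(w, 1))
assert np.isclose(l2, np.linalg.norm(w, 2))
assert np.isclose(lp, np.linalg.norm(w, p))
```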
Regularization in Ridge regression 12
$$
p(w_i \mid \lambda) = \frac{\lambda}{2}\, e^{-\lambda |w_i|}
$$
• The larger λ, the more this prior concentrates its mass around w_i = 0.
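Taking the negative logarithm of this density makes the role of λ explicit:

$$
-\log p(w_i \mid \lambda) = \lambda\, |w_i| - \log\frac{\lambda}{2},
$$

so under MAP estimation each weight adds a penalty growing with λ|w_i|, which pushes w_i toward 0 as λ increases.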
Regularization in SVM 14
$$
w^* = \arg\max_{w \in \boldsymbol{W}} \big[\log \Pr(\boldsymbol{D} \mid w) + \log \Pr(w)\big] = \arg\max_{w \in \boldsymbol{W}} \log \Pr(w \mid \boldsymbol{D})
$$
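As one standard instance of this MAP view (the zero-mean Gaussian prior $\Pr(w) \propto e^{-\frac{\lambda}{2}\|w\|_2^2}$ is an assumption here, not stated on the slide):

$$
w^* = \arg\max_{w \in \boldsymbol{W}} \Big[\log \Pr(\boldsymbol{D} \mid w) - \frac{\lambda}{2}\|w\|_2^2\Big]
    = \arg\min_{w \in \boldsymbol{W}} \Big[-\log \Pr(\boldsymbol{D} \mid w) + \frac{\lambda}{2}\|w\|_2^2\Big],
$$

i.e. the L2 term used in Ridge regression and in the SVM objective can be read as the negative log of a Gaussian prior, up to an additive constant.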
• The weight vector w* (with components w_0 and one coefficient per feature: S1–S6, Age, Sex, BMI, BP) changes when λ changes in Ridge regression.
• w* goes to 0 as λ increases.
[Figure: Ridge coefficient paths, each component of w* plotted against λ.]
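A minimal sketch of this shrinkage (an assumption: the feature names above match the diabetes regression data bundled with scikit-learn; the λ values are illustrative):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Features: age, sex, bmi, bp, s1, ..., s6
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Refit Ridge regression while increasing the regularization constant
for lam in [0.01, 1.0, 100.0, 10000.0]:
    w = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>8}: largest |w_i| = {np.abs(w).max():.2f}")

# The printed magnitudes shrink toward 0 as lambda grows.
```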
Regularization: practical effectiveness 22
• Why??
Bias-Variance tradeoff: revisit 23
• Lower bias, higher variance as model complexity increases.
• Modern phenomenon:
  ◦ Very rich models such as neural networks (GPT-3, ResNets, VGG, StyleGAN, DALLE-3, …) are trained to exactly fit the training data, but often obtain high accuracy on test data [Belkin et al., 2019; Zhang et al., 2021].
  ◦ Bias ≅ 0
• Why???
[Figure: test- and training-sample error as a function of model complexity (Hastie et al., Figure 2.11), alongside risk (error) versus model complexity for very rich models.]
Regularization: summary 24
• Advantages:
  ◦ Helps avoid overfitting.
• Limitations:
  ◦ Selecting a good regularization constant takes time, e.g. by cross-validation (see the sketch below).
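A minimal sketch of choosing λ by cross-validation with scikit-learn's RidgeCV (the α grid and the diabetes data are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)

# Search a logarithmic grid of candidate regularization constants
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("selected regularization constant:", model.alpha_)
```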
• Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–15854.
• Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448–456).
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.
• Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, 58(1), 267–288.
• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
• Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.