STAT 479: Machine Learning
Lecture Notes
Sebastian Raschka
Department of Statistics
University of Wisconsin–Madison
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/
Fall 2018
Contents
8 Model Evaluation 1: Overfitting and Underfitting
  8.1 Overview
  8.2 Overfitting and Underfitting
  8.3 Bias and Variance
  8.4 Bias-Variance Decomposition of the Squared Loss
  8.5 Bias-Variance Decomposition of the 0-1 Loss
  8.6 Conclusion
8.1 Overview
• In this lecture, we discuss some of the basic terms and machine learning fundamentals
that are relevant for model evaluation, namely, bias and variance, and overfitting and
underfitting.
Figure 1: Overview of the topics covered in this lecture, in the context of the broader model evaluation topics that we will cover later.
• In other words, we want a model that generalizes well to unseen data, which we can
measure, for example, by using an independent test set – while it sounds like this
should be very straightforward, there are some pitfalls which we will discuss in the
next lecture.
• Some of the evaluation metrics we can use to measure the performance on the test set
are the prediction accuracy and misclassification error in the context of classification
models – we say that a good model has a “high generalization accuracy” or “low
generalization error” (or, simply “good generalization performance”).
• The assumptions we generally make are the following:
– i.i.d. assumption: inputs are independent, and training and test examples are
identically distributed (drawn from the same probability distribution).
– For some random model that has not been fit to the training set, we expect both
the training and test error to be equal.
– The training error or accuracy provides an (optimistically) biased estimate of the
generalization performance.
8.2 Overfitting and Underfitting

Now, overfitting and underfitting are two terms that we can use to diagnose a machine learning model based on the training and test set performance. I.e., a model that suffers from underfitting does not perform well on either the training or the test set. In contrast, a model that overfits (e.g., from fitting the noise in the training dataset) can usually be recognized by a high training set accuracy but a low test set accuracy. Intuitively, as a rule of thumb, the larger the hypothesis space a model has access to, the higher the risk of overfitting (see the short sketch below).
A more technical term for the size of the hypothesis space is the so-called capacity. There are different measures, such as the VC dimension1,2, that can be used to calculate the capacity of a model for specific models and datasets (however, topics from learning theory such as the VC dimension are beyond the scope of this course).
Figure 2: Illustration of overfitting and underfitting in relation to the training and test error.
1 VC dimension stands for Vapnik-Chervonenkis dimension.
2 Vladimir N Vapnik and A Ya Chervonenkis. “On the uniform convergence of relative frequencies of
events to their probabilities”. In: Measures of complexity. Springer, 2015, pp. 11–30.
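To make the diagnosis concrete, here is a minimal sketch (not from the original notes; the synthetic dataset and the depth values are arbitrary choices) that varies the capacity of a decision tree via its maximum depth and compares training and test accuracy: a very shallow tree tends to score poorly on both sets (underfitting), while an unconstrained tree tends to show a large train-test gap (overfitting).

```python
# Minimal sketch: diagnosing underfitting/overfitting from the gap
# between training and test accuracy (dataset and depths are arbitrary).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

for max_depth in [1, 3, 10, None]:  # increasing model capacity
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=123)
    tree.fit(X_train, y_train)
    print(f"max_depth={max_depth}: "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")
```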
8.3 Bias and Variance

Often, researchers use the terms bias and variance or "bias-variance tradeoff" to describe the performance of a model, i.e., you may stumble upon talks, books, or articles where people say that a model has a high variance or a high bias. So, what does that mean? In general, we might say that "high variance" is related to overfitting, and "high bias" is related to underfitting. However, in this lecture, we are going to define these terms more precisely.
Figure 3: Results from searching the terms "model has high variance" and "model has high bias" on Google Scholar.
Note that the so-called bias-variance decomposition we are talking about in this lecture was initially formulated for regression losses (i.e., the mean squared error3); however, we are also going to look into formulations for the 0-1 loss that we use to measure the misclassification error (or accuracy).
• Why are we doing this? The decomposition of the loss into bias and variance helps us understand learning algorithms, as these concepts are related to underfitting and overfitting.
• Thinking back to the ensemble lecture, the bias-variance decomposition and tradeoff help explain why ensemble methods might perform better than single models (i.e., why bagging reduces the variance, and why a boosting model has a lower bias than individual weak learners like decision tree stumps); see the simulation sketch below.
3 In particular, in statistics, we evaluate the goodness of an estimator in relation to the true parameter or function.
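Relating to the ensemble point above, the following small simulation sketch (my own construction, not part of the original notes) illustrates the variance-reduction effect of bagging: we repeatedly draw training sets from a "true function + noise" distribution and compare how much the predictions of a single unpruned tree vary across training sets with those of a bagged ensemble of trees.

```python
# Sketch: bagging reduces the variance of predictions across training sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(123)
x_test = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)

single_preds, bagged_preds = [], []
for _ in range(100):  # 100 different training sets
    x = rng.uniform(0, 2 * np.pi, 100).reshape(-1, 1)
    y = np.sin(x.ravel()) + rng.normal(0, 0.3, 100)  # true function + noise
    tree = DecisionTreeRegressor().fit(x, y)
    bag = BaggingRegressor(DecisionTreeRegressor(),
                           n_estimators=50, random_state=1).fit(x, y)
    single_preds.append(tree.predict(x_test))
    bagged_preds.append(bag.predict(x_test))

# Variance over training sets, averaged over the test points:
print("single tree:", np.mean(np.var(single_preds, axis=0)))
print("bagged trees:", np.mean(np.var(bagged_preds, axis=0)))
```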
Figure 4: Illustration of bias: low bias (accurate) vs. high bias (not accurate).
To use the more formal terms for bias and variance, assume we have a point estimator θ̂ of some parameter or function θ. Then, the bias is commonly defined as the difference between the expected value of the estimator and the parameter that we want to estimate:

Bias(θ̂) = E[θ̂] − θ. (1)
If the bias is larger than zero, we also say that the estimator is positively biased; if the bias is smaller than zero, the estimator is negatively biased; and if the bias is exactly zero, the estimator is unbiased. Similarly, we define the variance as the expected value of the squared estimator minus the squared expectation of the estimator:

Var(θ̂) = E[θ̂²] − (E[θ̂])². (2)
Note that in the context of this lecture, it will be more convenient to write the variance in its alternative form:

Var(θ̂) = E[(θ̂ − E[θ̂])²]. (3)
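As a quick numerical sanity check (a sketch of my own, not from the original notes), we can estimate the bias and both variance formulas, Eq. (2) and Eq. (3), for a concrete estimator: the maximum likelihood estimator of the variance of a normal distribution, which divides by n and is known to be negatively biased, with bias −θ/n.

```python
# Sketch: empirical bias and variance of a point estimator (the MLE of
# the variance of a normal distribution, which is negatively biased).
import numpy as np

rng = np.random.RandomState(123)
theta, n = 2.0, 10  # true variance and sample size

estimates = np.array([np.var(rng.normal(0.0, np.sqrt(theta), n), ddof=0)
                      for _ in range(50_000)])

bias = estimates.mean() - theta                        # Eq. (1)
var_2 = (estimates**2).mean() - estimates.mean()**2    # Eq. (2)
var_3 = ((estimates - estimates.mean())**2).mean()     # Eq. (3)

print(f"bias ~ {bias:.3f} (theory: {-theta/n:.3f})")
print(f"variance via Eq. (2): {var_2:.4f}, via Eq. (3): {var_3:.4f}")
```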
High bias

Figure 5: Suppose there is an unknown target function or "true function" that we want to approximate. Now, suppose we have different training sets drawn from an unknown distribution defined as "true function + noise." This plot shows different linear regression models, each fit to a different training set. (There are two points where the bias is zero.)
High variance

Figure 6: Suppose there is an unknown target function or "true function" that we want to approximate. Now, suppose we have different training sets drawn from an unknown distribution defined as "true function + noise." This plot shows different unpruned decision tree models, each fit to a different training set. Note that these hypotheses fit the training data very closely. However, if we consider the expectation over training sets, the average hypothesis would fit the true function perfectly (given that the noise is unbiased and has an expected value of 0). However, as we can see, the variance is very high, since on average, a prediction differs a lot from the expectation value of the prediction. (What happens if we take the average? Does this remind you of something?)
8.4 Bias-Variance Decomposition of the Squared Loss

We can decompose a loss function such as the squared loss into three terms: a variance, a bias, and a noise term (and the same is true for the decomposition of the 0-1 loss later). However, for simplicity, we will ignore the noise term in this lecture (some of the literature referenced in later sections includes the noise term, if you are eager to learn more about the bias-variance decomposition).
Before we introduce the bias-variance decomposition of the 0-1 loss for classification, let us
start with the decomposition of the squared loss as an easy warm-up exercise to get familiar
with the overall concept.
The previous section already listed the common formal definitions of bias and variance; however, let us define them again for convenience. Recall that in the context of these machine learning lecture notes, ŷ denotes a model's prediction and y the true target value, so that

Bias = E[ŷ] − y and Var(ŷ) = E[(ŷ − E[ŷ])²]. (4)

Note that unless noted otherwise, the expectation is over training sets!
To get started with the squared error loss decomposition into bias and variance, let us do some algebraic manipulation, i.e., adding and subtracting the expected value of ŷ and then expanding the expression using the quadratic formula (a + b)² = a² + b² + 2ab:

S = (y − ŷ)²

(y − ŷ)² = (y − E[ŷ] + E[ŷ] − ŷ)² (5)
= (y − E[ŷ])² + (E[ŷ] − ŷ)² + 2(y − E[ŷ])(E[ŷ] − ŷ).
Next, we just use the expectation on both sides, and we are already done:

E[S] = E[(y − ŷ)²] = (y − E[ŷ])² + E[(E[ŷ] − ŷ)²] = [Bias]² + Variance. (6)
You may wonder what happened to the "2ab" term (2(y − E[ŷ])(E[ŷ] − ŷ)) when we applied the expectation. It turns out that it evaluates to zero and hence vanishes from the equation, which can be shown as follows (note that y and E[ŷ] are constants with respect to the expectation over training sets, so the factor 2(y − E[ŷ]) can be pulled out):

E[2(y − E[ŷ])(E[ŷ] − ŷ)] = 2(y − E[ŷ]) E[E[ŷ] − ŷ] = 2(y − E[ŷ])(E[ŷ] − E[ŷ]) = 0. (7)
So, this is the canonical decomposition of the squared error loss into bias and variance. The next section will discuss some approaches that have been proposed to decompose the 0-1 loss that we commonly use for measuring classification accuracy or error.
Figure 7: A sketch of variance and bias in relation to the training error and generalization error, showing how high variance relates to overfitting and how high bias relates to underfitting.
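The decomposition can also be verified numerically. Below is a simulation sketch (my own construction; the true function, model, and hyperparameters are arbitrary choices) that fits a model to many training sets drawn from "true function + noise" and checks, at a single test point, that the average squared loss against the noise-free target equals the squared bias plus the variance.

```python
# Sketch: checking E[(y - yhat)^2] = Bias^2 + Variance at one test point,
# with the expectation taken over training sets (the noise term is avoided
# by evaluating against the noise-free target, as in the lecture).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(123)
x0, y_true = np.array([[1.5]]), np.sin(1.5)  # test point, noise-free target

preds = []
for _ in range(2000):  # many different training sets
    x = rng.uniform(0, 2 * np.pi, 30).reshape(-1, 1)
    y = np.sin(x.ravel()) + rng.normal(0, 0.3, 30)  # true function + noise
    preds.append(DecisionTreeRegressor(max_depth=3).fit(x, y).predict(x0)[0])

preds = np.array(preds)
loss = np.mean((y_true - preds) ** 2)      # E[(y - yhat)^2]
bias_sq = (y_true - preds.mean()) ** 2     # (y - E[yhat])^2
variance = preds.var()                     # E[(yhat - E[yhat])^2]
print(f"loss = {loss:.4f}, bias^2 + variance = {bias_sq + variance:.4f}")
```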
8.5 Bias-Variance Decomposition of the 0-1 Loss

Note that decomposing the 0-1 loss into bias and variance components is not as straightforward as for the squared error loss. To quote Pedro Domingos, a well-known machine learning researcher and professor at the University of Washington: "several authors have proposed bias-variance decompositions related to zero-one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996; Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has significant shortcomings."4 In fact, the paper this quote was taken from may offer the most intuitive and general formulation at this point. However, we will first go over the Kong & Dietterich formulation5 of the 0-1 loss decomposition, which is the same as Domingos's but excludes the noise term (for simplicity).
The table below summarizes the relevant terms we used for the squared loss in relation to the 0-1 loss. Recall that the 0-1 loss, L, is 0 if a class label is predicted correctly, and 1 otherwise. The main prediction for the squared error loss is simply the average over the predictions, E[ŷ] (the expectation is over training sets); for the 0-1 loss, Kong & Dietterich and Domingos defined it as the mode. I.e., if a model predicts the label 1 more than 50% of the time (considering all possible training sets), then the main prediction is 1, and 0 otherwise.

                    Squared loss              0-1 loss
Loss                (y − ŷ)²                  1 if y ≠ ŷ, else 0
Main prediction     mean (average), E[ŷ]      mode of the predictions
Bias                (y − E[ŷ])²               1 if y ≠ main prediction, else 0
Variance            E[(ŷ − E[ŷ])²]            P(ŷ ≠ main prediction)
4 Pedro Domingos. "A unified bias-variance decomposition". In: Proceedings of the 17th International Conference on Machine Learning. 2000.
5 Thomas G. Dietterich and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Tech. rep. Department of Computer Science, Oregon State University, 1995.
Hence, as a result of using the mode to define the main prediction of the 0-1 loss, the bias is 1 if the main prediction does not agree with the true label y, and 0 otherwise:

Bias = 1 if y ≠ E[ŷ], and 0 otherwise. (8)
The variance of the 0-1 loss is defined as the probability that the predicted label does not match the main prediction:

Variance = P(ŷ ≠ E[ŷ]). (9)
Next, let us take a look at what happens to the loss if the bias is 0. Given the general definition of the loss, loss = bias + variance, if the bias is 0, then we define the loss as the variance:

Loss = 0 + Variance = P(ŷ ≠ E[ŷ]) = P(ŷ ≠ y). (10)

(Here, the last equality holds because zero bias means y = E[ŷ].)
In other words, if a model has zero bias, its loss is entirely defined by the variance, which is intuitive if we think of variance as being related to overfitting.
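As a small worked example (hypothetical numbers, not from the original notes), suppose we record the label a classifier predicts for one test point when trained on ten different training sets; the sketch below computes the main prediction (mode), the bias, the variance, and the average 0-1 loss, illustrating that with zero bias the loss equals the variance.

```python
# Sketch: Kong & Dietterich-style bias/variance of the 0-1 loss at one
# test point, using hypothetical predictions from ten training sets.
import numpy as np

y_true = 1
preds = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # hypothetical labels

main_pred = np.bincount(preds).argmax()      # mode over training sets: 1
bias = int(main_pred != y_true)              # Eq. (8): 0 here
variance = np.mean(preds != main_pred)       # Eq. (9): 0.3
loss = np.mean(preds != y_true)              # average 0-1 loss: 0.3

print(bias, variance, loss)  # zero bias: loss == variance
```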
The more surprising scenario occurs if the bias is equal to 1. If the bias is equal to 1, as explained by Pedro Domingos, increasing the variance can decrease the loss, which is an interesting observation. This can be seen by first rewriting the 0-1 loss function as

Loss = P(ŷ ≠ y) = 1 − P(ŷ = y). (11)
(Note that we have not done anything new yet.) Now, if we look at the previous equation for the bias: if the bias is 1, we have y ≠ E[ŷ]. If y is not equal to the main prediction, but ŷ is also not equal to y, then ŷ must be equal to the main prediction (since there are only two possible labels). Using the "inverse" ("1 minus"), we can then write the loss as

Loss = 1 − P(ŷ = y) = 1 − P(ŷ ≠ E[ŷ]) = 1 − Variance. (12)
Since the bias is 1, the loss is hence defined as "loss = bias − variance" (or "loss = 1 − variance"). This might be quite unintuitive at first, but the explanation Kong, Dietterich, and Domingos offer is that if a model has a very high bias such that its main prediction is always wrong, increasing the variance can be beneficial, since increasing the variance would push the decision boundary, which might lead to some correct predictions just by chance. In other words, for scenarios with high bias, increasing the variance can improve (decrease) the loss!
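Reusing the earlier sketch with hypothetical numbers for the high-bias case: if the main prediction is always wrong (bias = 1), every prediction that deviates from the main prediction happens to be correct, so the loss comes out as 1 − variance.

```python
# Sketch: the bias = 1 case, where loss = bias - variance.
import numpy as np

y_true = 1
preds = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 1])  # hypothetical labels

main_pred = np.bincount(preds).argmax()      # mode: 0 (always wrong)
bias = int(main_pred != y_true)              # 1
variance = np.mean(preds != main_pred)       # 0.3
loss = np.mean(preds != y_true)              # 0.7 = 1 - 0.3

print(bias, variance, loss)  # more variance would decrease the loss
```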
8.6 Conclusion
In this lecture, we decomposed the squared error loss into variance and bias terms and discussed how these components relate to overfitting and underfitting. Then, we referred to a bias-variance decomposition that Kong & Dietterich defined for the 0-1 loss. Pedro Domingos later generalized this further, including the noise term, which we did not discuss in this lecture. However, interested students are encouraged to read the original paper: Pedro Domingos. "A unified bias-variance decomposition". In: Proceedings of the 17th International Conference on Machine Learning. 2000.
Now, we should be more familiar with the terms bias and variance (or, as statistics students, consider this a refresher) and with how they relate to overfitting and underfitting when we say that a model has a high variance or a high bias, respectively.
In the next lecture, we will take a closer look at the holdout method for model evaluation
(and estimating the generalization performance). Also, we will discuss several methods for
constructing confidence intervals.