w4 Generalisation

This document discusses generalization, overfitting, bias-variance tradeoff, and regularization techniques like ridge and lasso regression. It explains that the goal of learning is to generalize well to novel examples, but overfitting can occur if the hypothesis is too complex. Regularization helps reduce overfitting by penalizing complex models, balancing bias and variance. Ridge regression minimizes error while shrinking coefficients, using an L2 penalty. Lasso similarly shrinks coefficients but uses an L1 penalty.

Uploaded by

Swastik Sindhani


GENERALISATION

Dr. Srikanth Allamsetty

Over-fitting, Bias and Variance Relationship
Generalization
• Given a collection of examples (x(i), f(x(i))), i = 1, 2, …, N, of a function f(x),
return a function h(x) that approximates f(x).
• h(x) is called the hypothesis function and f(x) is called the true function.
• It is not easy to tell whether any particular h(.) is a good approximation of f(.).
• A good approximation will generalize well; that is, it will predict novel patterns
correctly.
• Generalization performance is, thus, the fundamental problem in inductive
learning (inductive learning, also known as concept learning, is how AI
systems infer a general rule from specific observations).
• The off-training-set error, i.e. the error on points not in the training set, will be
used as a measure of generalization performance (generally called the testing error).
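The off-training-set error described above can be illustrated with a minimal sketch. The true function (a sine), the noise level, the 40/20 split and the cubic hypothesis are all assumptions made for illustration; none of them come from the slides.

```python
import numpy as np

def mean_squared_error(h, x, y):
    """Average squared difference between predictions h(x) and targets y."""
    return float(np.mean((h(x) - y) ** 2))

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # assumed "true" function, illustration only

# Draw N = 60 noisy examples (x(i), f(x(i))) and hold 20 of them out.
x = rng.uniform(0.0, 1.0, size=60)
y = f(x) + rng.normal(0.0, 0.1, size=60)
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]          # points NOT in the training set

# Hypothesis h(x): a cubic polynomial fitted to the training set only.
coeffs = np.polyfit(x_train, y_train, deg=3)
h = lambda x: np.polyval(coeffs, x)

train_error = mean_squared_error(h, x_train, y_train)
test_error = mean_squared_error(h, x_test, y_test)   # off-training-set (testing) error
print(f"training error = {train_error:.4f}, testing error = {test_error:.4f}")
```

Only the testing error is computed on points the hypothesis has never seen, which is why it is the quantity used to judge generalization.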
Generalization
• If one algorithm appears to outperform another in a specific situation, it is as a
result of its fit to the specific learning problem, not the general supremacy of
the algorithm.
• Even if the algorithms are widely used and grounded in terms of theory, they
will not perform well on certain problems.
Bias and variance
• A measure of how close the mapping function h(x; Ɗj) is to the desired one is,
therefore, given by the error function
$[h(x; \mathcal{D}_j) - f(x)]^2$
• The value of this quantity will depend on the dataset Ɗj on which it is trained.
• We write the average over the complete ensemble of datasets as
$E_{\mathcal{D}}\{[h(x; \mathcal{D}) - f(x)]^2\}$
where $E_{\mathcal{D}}$ denotes the expected value, i.e. the arithmetic mean over a large number of independent realizations.

• It may be that the hypothesis function h(.) is on average, different from the
regression function f (x). This is called bias.
• It may be that the hypothesis function is very sensitive to the particular dataset
Ɗj, so that for a given x, it is larger than the required value for some datasets,
and smaller for other datasets. This is called variance.
Bias and variance

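$E_{\mathcal{D}}\{[h(x; \mathcal{D}) - f(x)]^2\} = \{E_{\mathcal{D}}[h(x; \mathcal{D})] - f(x)\}^2 + E_{\mathcal{D}}\{[h(x; \mathcal{D}) - E_{\mathcal{D}}[h(x; \mathcal{D})]]^2\}$
(first term: bias², measured w.r.t. the original function; second term: variance, the range of the results of the hypothesis function)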

• The bias measures the level to which the average (over all datasets) of the
hypothesis function is different from the desired function f (x).
• The bias term reflects the approximation error that the hypothesis h(.) is expected to
have on average when trained on datasets of the same finite size.
• In general, the higher the complexity of the hypothesis function (a more flexible function
with a large number of adjustable parameters), the lower the approximation error.
Bias and variance


• The variance is a measure of the level to which the hypothesis function h(x; Ɗj) is sensitive to
the specific selection of the dataset.
• The variance term reflects how well a model trained on one data sample generalizes
to other data samples.
• Low variance means that the estimate of f (x) based on a data sample does not change much
on the average as the data sample varies.
• Unfortunately, the higher the complexity of the hypothesis function (which results in low
bias/low approximation error), the higher is the variance.
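The bias and variance described above can be estimated numerically by training the same hypothesis class on many datasets. The following is a minimal sketch under stated assumptions: the true function, the noise level, the dataset size and the two polynomial degrees are hypothetical choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)      # assumed true function f(x)
x_grid = np.linspace(0.05, 0.95, 50)     # points at which bias and variance are measured

def bias2_and_variance(degree, n_datasets=200, n_points=25, noise=0.3):
    """Estimate squared bias and variance of a polynomial hypothesis of a given degree."""
    preds = np.empty((n_datasets, x_grid.size))
    for j in range(n_datasets):          # each iteration plays the role of one dataset D_j
        x = rng.uniform(0.0, 1.0, n_points)
        y = f(x) + rng.normal(0.0, noise, n_points)
        preds[j] = np.polyval(np.polyfit(x, y, degree), x_grid)
    mean_h = preds.mean(axis=0)          # average hypothesis over the ensemble of datasets
    bias2 = np.mean((mean_h - f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return float(bias2), float(variance)

b1, v1 = bias2_and_variance(degree=1)    # simple hypothesis
b5, v5 = bias2_and_variance(degree=5)    # more complex hypothesis
print(f"degree 1: bias^2 = {b1:.4f}, variance = {v1:.4f}")
print(f"degree 5: bias^2 = {b5:.4f}, variance = {v5:.4f}")
```

In this setup the straight line should show the larger squared bias and the 5th-order polynomial the larger variance, which is the pattern the bullets describe.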
Bias and variance
• To minimize the overall mean-square error, we need a hypothesis that
results in both low bias and low variance.
• Because lowering one tends to raise the other, this is known as the
bias-variance dilemma or bias-variance trade-off.
[Figure: fit with a 5th-order polynomial (more terms).]
Over-fitting
[Figure: Price vs. Size in three panels: high bias (underfit), "just right", and high variance (overfit).]
Over-fitting
• Occam's Razor Principle: The simpler explanations are more reasonable,
and any unnecessary complexity should be shaved off.
• 'Simpler' may imply needing fewer parameters, less training time, fewer
attributes for data representation, and lower computational complexity.
• Occam's razor principle suggests hypothesis functions that avoid overfitting
of the training data.
• A learning machine is said to overfit the training examples if some other
learning machine that fits the training examples less well actually
performs better over the total distribution of patterns (i.e., including
patterns beyond the training set).
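The definition of overfitting above compares two learning machines. A minimal sketch of that comparison follows; the sine target, the noise level, the 12-point training sets and the choice of degree 3 versus degree 9 are all assumptions made for illustration. Results are averaged over 20 hypothetical training sets so that the comparison does not hinge on one lucky draw.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)       # assumed true function
x_dense = np.linspace(0.0, 1.0, 400)      # stands in for "the total distribution of patterns"

def errors(degree, x_train, y_train):
    """Return (training error, error over the dense grid) for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    dist_err = np.mean((np.polyval(coeffs, x_dense) - f(x_dense)) ** 2)
    return train_err, dist_err

results = {3: [], 9: []}                  # two learning machines: cubic and 9th-order
for _ in range(20):
    x_train = rng.uniform(0.0, 1.0, 12)
    y_train = f(x_train) + rng.normal(0.0, 0.3, 12)
    for degree in results:
        results[degree].append(errors(degree, x_train, y_train))

summary = {d: np.mean(r, axis=0) for d, r in results.items()}
for d, (train_err, dist_err) in summary.items():
    print(f"degree {d}: mean training error = {train_err:.4f}, "
          f"mean error over the distribution = {dist_err:.4f}")
```

Here the 9th-order polynomial fits the training examples better, yet the cubic does better beyond them, so the 9th-order machine overfits in the sense of the definition.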
Over-fitting

[Figure: complexity of the … / regression model (a larger number of weights in a neural network, a larger number of nodes in a
decision tree, a larger number of rules in a fuzzy logic model, etc.).]
Addressing overfitting:
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm (later in course).
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the
parameters β.
― Works well when we have a lot of features, each of which
contributes a bit to predicting y.
Regularization

[Figure: Price vs. Size of house, two panels.]

Over-fitting can be avoided if we penalize some of the coefficients and make them really small.
Regularization
• This is a form of regression that constrains (regularizes), or shrinks,
the coefficient estimates towards zero.
• In other words, this technique discourages learning a more
complex or flexible model, so as to avoid the risk of overfitting.
Regularization
• Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
• Here Y represents the learned relation and β represents the coefficient
estimates for different variables or predictors (X).
• The fitting procedure involves a loss function, known as residual sum of
squares or RSS.
• The coefficients are chosen, such that they minimize this loss function.
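As a minimal sketch of the fitting procedure just described, the snippet below computes the RSS for the linear model Y ≈ β0 + β1X1 + … + βpXp and finds the coefficients that minimize it. The synthetic data and the "true" coefficients are assumptions made for illustration; here each row of `X` is one example.

```python
import numpy as np

def rss(beta0, beta, X, y):
    """Residual sum of squares: sum_i (y_i - beta0 - sum_j beta_j * x_ij)^2."""
    residuals = y - (beta0 + X @ beta)
    return float(residuals @ residuals)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))                        # 50 examples, p = 3 predictors
beta_true = np.array([2.0, -1.0, 0.5])              # hypothetical coefficients
y = 1.0 + X @ beta_true + rng.normal(0.0, 0.1, 50)  # intercept 1.0 plus noise

# Least squares: prepend a column of ones so the intercept beta0 is estimated too.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
beta0_hat, beta_hat = coef[0], coef[1:]

print("estimated intercept:", round(float(beta0_hat), 3))
print("estimated coefficients:", np.round(beta_hat, 3))
print("RSS at the estimate:", round(rss(beta0_hat, beta_hat, X, y), 4))
```

Any other choice of coefficients, including the ones used to generate the data, gives an RSS at least as large as the fitted one on this training set.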
Regularization

• Now, this will adjust the coefficients based on your training data.
• If there is noise in the training data (or more terms in the equation),
then the estimated coefficients won’t generalize well to the future data.
• This is where regularization comes in and shrinks or regularizes these
learned estimates towards zero.
• Regularization significantly reduces the variance of the model without a
substantial increase in its bias.
RIDGE Regression
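Basics

$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$

• The expression above is the ridge regression objective, where the RSS is modified by adding the
shrinkage quantity.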
• Now, the coefficients are estimated by minimizing this function.
• Here, λ is the tuning parameter that decides how much we want to penalize
the flexibility of our model.
• An increase in the flexibility of a model is represented by an increase in its
coefficients, so if we want to minimize the above function, these coefficients
need to be small.
Basics

• This is how the Ridge regression technique prevents coefficients from rising
too high.
• Also, notice that we shrink the estimated association of each variable with
the response, except the intercept β0. This intercept is a measure of the
mean value of the response when xi1 = xi2 = … = xip = 0.
• When λ = 0, the penalty term has no effect, and the estimates produced by
ridge regression will be equal to the least squares estimates.
• However, as λ → ∞, the impact of the shrinkage penalty grows, and the
ridge regression coefficient estimates will approach zero.
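RIDGE vs Ordinary Least Squares

• Solving this for β gives the ridge regression estimates
$\hat{\beta}^{\mathrm{ridge}} = (X X^T + \lambda I)^{-1} X y$
where I is the identity matrix (in this notation, X stores one example per column).
• You may remember that the optimal solution in the case of
Ordinary Least Squares (OLS) is
$\beta^* = (X X^T)^{-1} X y$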
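RIDGE vs Ordinary Least Squares
• The λ parameter is the regularization penalty. Notice that:
$\lim_{\lambda \to 0} \hat{\beta}^{\mathrm{ridge}} = \beta^*$ (the OLS solution) and $\lim_{\lambda \to \infty} \hat{\beta}^{\mathrm{ridge}} = 0$
• So, setting λ to 0 is the same as using OLS, while the larger its value,
the more strongly the size of the coefficients is penalized.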
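The relationship between the ridge and OLS solutions can be checked numerically. This is a minimal sketch that assumes the same convention as the OLS formula (XXT)–1Xy, i.e. `X` has shape (p, n) with one example per column, and that the data are centred so no intercept is needed. The data and the values of λ are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS solution (X X^T)^{-1} X y, with X of shape (p, n)."""
    return np.linalg.solve(X @ X.T, X @ y)

def ridge(X, y, lam):
    """Ridge solution (X X^T + lam * I)^{-1} X y, with X of shape (p, n)."""
    p = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

rng = np.random.default_rng(4)
p, n = 3, 40
X = rng.normal(size=(p, n))                     # columns are examples
beta_true = np.array([2.0, -1.0, 0.5])          # hypothetical coefficients
y = X.T @ beta_true + rng.normal(0.0, 0.1, n)

beta_ols = ols(X, y)
for lam in [0.0, 10.0, 1e9]:
    beta_ridge = ridge(X, y, lam)
    print(f"lambda = {lam:g}: ||beta|| = {np.linalg.norm(beta_ridge):.6f}")
```

With λ = 0 the two estimates coincide, and as λ grows the norm of the ridge estimate shrinks towards zero.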
Constraints
• As can be seen, selecting a good value of λ is critical.
• Cross-validation comes in handy for this purpose.
• A larger λ makes our predictions less sensitive to the
independent variables.
• The penalty used by this method is the squared L2 norm of the coefficients,
so ridge regression is also known as L2 regularization.
• The coefficients that are produced by the standard least squares method are
scale equivariant, i.e. if we multiply an input by c then the corresponding
coefficient is scaled by a factor of 1/c.
• Therefore, regardless of how the predictor is scaled, the product of
predictor and coefficient (Xjβj) remains the same.
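Choosing λ by cross-validation, as suggested above, might look like the following sketch. The 5-fold split, the candidate grid of λ values and the synthetic data are assumptions for illustration; rows of `X` are examples, and the intercept is left unpenalized by centring the data before solving.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge fit with an unpenalized intercept (rows of X are examples)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

def cv_error(X, y, lam, k=5):
    """Mean squared prediction error averaged over k held-out folds."""
    indices = np.arange(len(y))
    fold_errors = []
    for held_out in np.array_split(indices, k):
        train = np.setdiff1d(indices, held_out)
        beta0, beta = ridge_fit(X[train], y[train], lam)
        pred = beta0 + X[held_out] @ beta
        fold_errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.0]                 # hypothetical: only 3 useful predictors
y = X @ true_beta + rng.normal(0.0, 1.0, 60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]  # hypothetical candidate grid
scores = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
print("cross-validated error per lambda:", {k: round(v, 3) for k, v in scores.items()})
print("selected lambda:", best_lam)
```

The selected value is simply the candidate with the lowest held-out error; a finer grid around it could be searched in a second pass.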
Standardizing the predictors
• However, this is not the case with ridge regression, and therefore we need to
standardize the predictors, i.e. bring them to the same scale, before
performing ridge regression.
• The formula used to do this is given below.

$\tilde{x}_{ij} = \dfrac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i'=1}^{n} (x_{i'j} - \bar{x}_j)^2}}$

j = 1, 2, 3, …, p (no. of features)

i = 1, 2, 3, …, n (no. of examples)
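The effect of predictor scale can be demonstrated with a short sketch. It assumes the common standardization that divides each predictor by its standard deviation (computed with the 1/n convention); the data, the factor of 1000 and λ = 10 are hypothetical. Rows of `X` are examples and no intercept is fitted.

```python
import numpy as np

def standardize(X):
    """Divide each column by its standard deviation (np.std uses 1/n by default)."""
    return X / X.std(axis=0)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0.0, 0.5, 50)

# Re-express the first predictor in different units (e.g. km -> m).
X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0

# OLS predictions do not depend on the units; ridge predictions do.
ols_same = np.allclose(X @ ols(X, y), X_scaled @ ols(X_scaled, y))
ridge_same = np.allclose(X @ ridge(X, y, 10.0), X_scaled @ ridge(X_scaled, y, 10.0))

# After standardization, ridge gives the same predictions for both versions.
Z, Z_scaled = standardize(X), standardize(X_scaled)
ridge_same_std = np.allclose(Z @ ridge(Z, y, 10.0), Z_scaled @ ridge(Z_scaled, y, 10.0))

print("OLS predictions unchanged by rescaling:  ", ols_same)
print("ridge predictions unchanged by rescaling:", ridge_same)
print("ridge unchanged after standardizing:     ", ridge_same_std)
```

The rescaled predictor has a much smaller coefficient, so the same λ penalizes it far less; standardizing first removes that dependence on units.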
LASSO Regression
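Basics

$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$

• Lasso is a variation in which the above function is minimized.

• It's clear that this variation differs from ridge regression only in
how it penalizes large coefficients.
• It uses |βj| (modulus) instead of the squares of β as its penalty.
• In statistics, this is known as the L1 norm.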
Basics

• Lasso stands for Least Absolute Shrinkage and Selection Operator.
• It also adds a penalty term to the cost function.
• This term is the sum of the absolute values of the coefficients.
LASSO vs RIDGE
• Lasso regression differs from ridge regression in that it uses
absolute values within the penalty function, rather than squares.
• Penalizing the sum of the absolute values of the estimates (or,
equivalently, constraining it) causes some of the parameter estimates
to turn out exactly zero.
• The more penalty is applied, the more the estimates are shrunk towards
zero.
• This helps with variable selection from the given set of variables.
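To see estimates turn out exactly zero, the lasso objective RSS + λ Σ|βj| can be minimized by coordinate descent with soft-thresholding. This solver is one common choice and is not named in the slides; the data, λ = 80 and the iteration count are hypothetical. Rows of `X` are examples, and no intercept is fitted because the data are generated with zero mean.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values within [-t, t] become exactly zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize sum((y - X @ beta)**2) + lam * sum(|beta|) one coordinate at a time."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(p):
            # Residual with predictor j's current contribution added back in.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            # The threshold is lam / 2 because the RSS term carries no 1/2 factor.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])   # only two relevant predictors
y = X @ beta_true + rng.normal(0.0, 0.5, 100)

lam = 80.0
beta_lasso = lasso_coordinate_descent(X, y, lam)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)   # ridge, for comparison

print("lasso:", np.round(beta_lasso, 3))
print("ridge:", np.round(beta_ridge, 3))
```

The lasso sets the coefficients of the irrelevant predictors to exactly zero, while the ridge estimates for the same predictors are small but non-zero.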
LASSO vs RIDGE
• Ridge regression can be thought of as minimizing the RSS subject to the
constraint that the sum of squares of the coefficients is less than or equal to s.
• The lasso can be thought of as minimizing the RSS subject to the constraint
that the sum of the moduli of the coefficients is less than or equal to s.
• Here, s is a constant that exists for each value of the shrinkage factor λ.
• Consider a problem with 2 parameters.
• Then, according to the above formulation, the lasso constraint is
|β1| + |β2| ≤ s and the ridge constraint is β1² + β2² ≤ s.
• This implies that the ridge regression coefficients have the smallest RSS (loss
function) of all points that lie within the circle given by β1² + β2² ≤ s.
• Similarly, the lasso coefficients have the smallest RSS (loss function) of
all points that lie within the diamond given by |β1| + |β2| ≤ s.
LASSO vs RIDGE

• The figure shows the constraint regions (green areas) for lasso (left)
and ridge regression (right), along with contours of the RSS (red ellipses).
• All points on a given ellipse share the same value of RSS.
LASSO vs RIDGE

• For a very large value of s, the green regions will contain the center of
the ellipse, making the coefficient estimates of both regression techniques
equal to the least squares estimates.
• But in the case shown, the lasso and ridge regression coefficient
estimates are given by the first point at which an ellipse contacts the
constraint region.
LASSO vs RIDGE

• Since ridge regression has a circular constraint with no sharp points, this
intersection will not generally occur on an axis, and so the ridge
regression coefficient estimates will generally all be non-zero.
• However, the lasso constraint has corners at each of the axes, and so the
ellipse will often intersect the constraint region at an axis.
LASSO vs RIDGE

• When this occurs, one of the coefficients will equal zero.

• In higher dimensions (with many more than 2 parameters), many of
the coefficient estimates may equal zero simultaneously.
• That means, in the case of the lasso, the L1 penalty has the effect of
forcing some of the coefficient estimates to be exactly equal to zero
when the tuning parameter λ is sufficiently large.
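The difference between the two penalties is easiest to see in a special case that the slides do not cover and that is assumed here purely for illustration: orthonormal predictors (XᵀX = I) and no intercept. In that case both estimators have closed forms in terms of the least squares coefficients. The example coefficients and λ = 1 are hypothetical.

```python
import numpy as np

def ridge_orthonormal(beta_ols, lam):
    """Ridge under orthonormal predictors: every coefficient is scaled by 1 / (1 + lam)."""
    return beta_ols / (1.0 + lam)

def lasso_orthonormal(beta_ols, lam):
    """Lasso under orthonormal predictors: soft-thresholding at lam / 2."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical least squares coefficients
lam = 1.0

beta_ridge = ridge_orthonormal(beta_ols, lam)
beta_lasso = lasso_orthonormal(beta_ols, lam)

print("ridge:", beta_ridge)
print("lasso:", beta_lasso)
print("non-zero ridge coefficients:", np.count_nonzero(beta_ridge))
print("non-zero lasso coefficients:", np.count_nonzero(beta_lasso))
```

Ridge shrinks every coefficient proportionally and so never reaches zero, whereas the lasso subtracts a fixed amount and zeroes any coefficient whose magnitude is below λ/2.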
LASSO vs RIDGE
• This sheds light on the obvious disadvantage of ridge regression, which is
model interpretability.
• It will shrink the coefficients of the least important predictors very close to
zero.
• But it will never make them exactly zero.
• In other words, the final model will include all predictors.
• The lasso method, by contrast, also performs variable selection and is said
to yield sparse models.
• You may visit
https://www.pluralsight.com/guides/linear-lasso-ridge-regression-scikit-learn
to see a numerical example.
What does Regularization achieve?
• The tuning parameter λ, used in the regularization techniques
described above, controls the impact on bias and variance.
• As the value of λ rises, it reduces the values of the coefficients and thus
reduces the variance.
• Up to a point, this increase in λ is beneficial, as it only reduces the
variance (hence avoiding overfitting) without losing any important
properties in the data.
• But beyond a certain value, the model starts losing important properties,
giving rise to bias in the model and thus underfitting.
• Therefore, the value of λ should be carefully selected.
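The effect of λ summarized above can be traced by sweeping it over several orders of magnitude. This is a minimal sketch with ridge regression on synthetic data; the data sizes, the noise level and the grid of λ values are assumptions for illustration. Rows of `X` are examples and no intercept is fitted.

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, beta):
    return float(np.mean((y - X @ beta) ** 2))

rng = np.random.default_rng(8)
n_train, n_test, p = 30, 500, 20
beta_true = rng.normal(0.0, 0.5, p)              # hypothetical coefficients
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ beta_true + rng.normal(0.0, 1.0, n_train)
y_test = X_test @ beta_true + rng.normal(0.0, 1.0, n_test)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1e4]
norms, train_errors, test_errors = [], [], []
for lam in lambdas:
    beta = ridge(X_train, y_train, lam)
    norms.append(float(np.linalg.norm(beta)))
    train_errors.append(mse(X_train, y_train, beta))
    test_errors.append(mse(X_test, y_test, beta))
    print(f"lambda = {lam:>8g}: ||beta|| = {norms[-1]:.3f}, "
          f"train MSE = {train_errors[-1]:.3f}, test MSE = {test_errors[-1]:.3f}")
```

The coefficient norm falls and the training error rises steadily with λ, while the test error is lowest at an intermediate value and deteriorates once λ is so large that the model underfits.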
Debugging a learning algorithm:
Suppose you have implemented regularized linear regression to predict
housing prices. However, when you test your hypothesis in a new set of
houses, you find that it makes unacceptably large errors in its
prediction. What should you try next?
- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features
- Try decreasing λ
- Try increasing λ
