Handout 41100 PolynomialRegression
Max H. Farrell
BUS 41100
August 28, 2015
In class we talked about polynomial regression, and the point was made that we always keep “lower order” terms whenever we add higher-order polynomial terms to the model. This handout explains the intuition and interpretation behind this rule, with examples. The bottom line: by assuming a certain coefficient is exactly equal to zero, you are making a strong assumption about how Y responds to X, one that you have no business making.
Contents
1 Building Intuition with the Intercept
1.1 Example: House Prices
1.2 Example: Wage Data
1 Building Intuition with the Intercept

Let’s return to simple linear regression and consider leaving out the intercept. This will give us good intuition for what will happen when we run polynomial regression but exclude lower order terms.
The general model is Y = β0 + β1 X + ε, so that E[Y |X] = β0 + β1 X. Remember that β0, β1, and σ² are unknown. In
particular, we don’t know if β0 = 0 or not. But suppose we force least squares to set b0 = 0. What
have we done?
If b0 = 0, the intercept of the line is at zero: that is, when X = 0, we predict Ŷ = 0. So that
means that we force our prediction to be exactly zero at the value X = 0, no matter what. This is
a geometric assumption: the graph of the line must pass through (0,0).
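If you want to see this restriction in R, here is a minimal sketch with simulated data (the numbers and object names are made up for illustration, not taken from any of the course datasets). Writing the model as y ~ x - 1 inside lm() is exactly the assumption b0 = 0:

    # Minimal sketch with simulated data (not a course dataset)
    set.seed(41100)
    x <- runif(100, min = 1, max = 4)
    y <- 10 + 5 * x + rnorm(100, sd = 2)   # the true intercept is 10, not 0

    fit.general <- lm(y ~ x)       # estimates both b0 and b1
    fit.origin  <- lm(y ~ x - 1)   # forces b0 = 0 (equivalently, y ~ 0 + x)

    coef(fit.general)   # b0 near 10, b1 near 5
    coef(fit.origin)    # only b1, pushed above 5 to compensate for the missing intercept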
Remember how we usually don’t put much stock in the intercept? Now we’re giving it a strong
interpretation, and requiring a lot of knowledge about it! Do you have a very good reason to believe
that is true? Usually not. Moreover, in general there is nothing special about the point Y = 0 or
X = 0. These will change if we measure the variables differently.
Remember that our goal is to extract the general trend in how Y changes with X. We used to
interpret b1 as the change in Y as X increases. If the intercept is zero, we don’t have this anymore!
We can only say that b1 measures the change in Y as X increases, assuming the intercept is fixed at
zero. That is because setting b0 = 0 only gives the “right” answer for b1 if the true β0 = 0 too.
That is, you must assume that Y = β1 X + ε. This is a very strong assumption! Let’s consider
some examples.
1.1 Example: House Prices

Return to the house price data from Lecture 2. We have data on house prices (in thousands of
dollars) and size (in thousands of square feet). What is our goal for this analysis? We want to find
out how price increases with size. So, I want to be able to answer questions like: On average, how
much more expensive is a 3,000 sq. ft. house than a 2,000 sq. ft. house? This is exactly how we
interpret b1 from the linear regression
price_i = b0 + b1 size_i + e_i.
What if we force b0 = 0? Now we are assuming β0 = 0, i.e. that a zero square foot house costs
nothing. Is that reasonable? Let’s look at the data:
[Figure: Price ($ in 1000s) versus size, with two fitted lines: the general line and the line forcing the intercept to 0.]
The most important thing is that the slope estimate went up! This makes it look like square footage is much more expensive: the “no intercept” model says that every extra 1,000 square feet costs $53k, compared to only $35k from the general regression. Which model is better? (Notice the R² is higher for the no-intercept fit, but who cares?!? Remember R² doesn’t mean anything.)
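One more reason not to compare those two R² values: when the intercept is dropped, R computes R² against an uncentered total sum of squares (sums of y² rather than deviations from ȳ), so the two numbers are not even on the same scale. A quick check, again with made-up data:

    # R^2 is computed differently once the intercept is dropped
    set.seed(1)
    x <- runif(50, 1, 4)
    y <- 100 + 35 * x + rnorm(50, sd = 20)
    summary(lm(y ~ x))$r.squared       # usual R^2, relative to mean(y)
    summary(lm(y ~ x - 1))$r.squared   # uncentered version; typically much closer to 1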
What happened? By forcing the intercept to be zero, we had to crank the line way up, artificially.
Note that I had to manually expand the range of the graph, so we could see both intercepts.
Here’s the main question: which one of these would you say better captures the general trend in
the response of price to size?
Now suppose I told you that all house sales are subject to a flat tax of $5,000. Then, only (price
- 5000) is under control of the buyer and seller, so we shift all the prices down by 5000. This
shouldn’t affect the slope of the line at all. But if you are still forcing the intercept to be zero, the
slope will have to change!¹
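Here is that thought experiment in R, as a sketch with simulated house-price-like data (made-up numbers, in the same units as the Lecture 2 data: thousands of dollars and thousands of square feet):

    # Shifting every price down by a constant (the $5,000 flat tax)
    set.seed(2)
    size  <- runif(80, 0.8, 4)                     # 1000s of sq. ft.
    price <- 40 + 35 * size + rnorm(80, sd = 15)   # $ in 1000s
    price.net <- price - 5                         # prices net of a $5,000 tax

    coef(lm(price ~ size))["size"]           # slope with an intercept ...
    coef(lm(price.net ~ size))["size"]       # ... unchanged by the shift
    coef(lm(price ~ size - 1))["size"]       # no-intercept slope ...
    coef(lm(price.net ~ size - 1))["size"]   # ... changes when all prices shift down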
1.2 Example: Wage Data

Let’s return to the wage data first used at the end of Lecture 1. Our goal is to find how the
number of hours worked (hours) responds to hourly pay rate (pay.rate). What’s different about
this example? Here, it makes sense to assume that β0 = 0, because that means that if you get paid
nothing, you work zero hours. Let’s see what the data turns up:
The interpretation of the intercept from the general regression makes no sense: you work almost 2000 hours if you are paid nothing! But the interpretation of the slope is fine: for each extra dollar per hour, you work an extra 80 hours per year. How about the no-intercept fit? Assuming you work 0 hours if you aren’t paid, you will work an extra 750 hours for each additional dollar per hour. So someone working for $1/hour (in 1966, remember) will work 750 hours, someone making $2/hour will work 1500 hours, and so on. Which is more reasonable to you? Look at the picture:
¹ Since b0 = Ȳ − b1 X̄, setting b0 = 0 lets us immediately solve for b1 = Ȳ/X̄. So our forecast for Ŷ at the average X, i.e. at X̄, is Ŷ(X = X̄) = b0 + b1 X̄ = 0 + (Ȳ/X̄)X̄ = Ȳ. So we fit a line that goes through the point (0, 0) and the point (X̄, Ȳ)! If we shift all the prices down by 5,000, the line will rotate toward being flatter, because the new b1 is the old b1 minus 5,000/X̄.
[Figure: Hours worked versus pay rate, with two fitted lines: the general line and the line forcing the intercept to 0.]
What’s going on here? Again, the restriction of b0 = 0 is forcing the line to slope up too fast.
Why? Look how far out of the sample you are “predicting” by assuming that β0 = 0 (no pay = no
work). The data we have are for working people (everyone has positive hours and a positive wage),
so these data don’t tell us anything about people who don’t work or aren’t paid. And yet we are
making a very strong assumption. What about social security or disability (no work, but positive
pay)? What about working odd jobs (no formal hours or pay)?
And again, there’s nothing special about Y = 0 or X = 0. It might make sense to measure pay
rate as hourly wage above minimum wage, or measure hours per year relative to a standard work
week. Both of these would change the “zero” point.
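The same check works in R: re-centering pay.rate moves the “zero” point, which leaves the general regression’s slope alone but changes the no-intercept fit. A sketch with made-up data (the $1 subtracted below is a hypothetical minimum wage, just for illustration):

    # Measuring pay relative to a different "zero" point
    set.seed(3)
    pay.rate <- runif(60, 1, 3)
    hours    <- 1900 + 80 * pay.rate + rnorm(60, sd = 150)
    pay.above.min <- pay.rate - 1   # hypothetical re-centering

    coef(lm(hours ~ pay.rate))["pay.rate"]                 # slope ...
    coef(lm(hours ~ pay.above.min))["pay.above.min"]       # ... identical after re-centering
    coef(lm(hours ~ pay.rate - 1))["pay.rate"]             # no-intercept slope ...
    coef(lm(hours ~ pay.above.min - 1))["pay.above.min"]   # ... very different after re-centering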
Intuitively, the same problem, a geometric one, will crop up for polynomial regression.
For now, let’s stick to squared terms. We are considering fitting
y_i = b0 + b1 x_i + b2 x_i² + e_i
and setting b1 = 0, that is, leaving out the linear term. Just like forcing the intercept to be zero
was a restriction on the graph of a line, this will also be a geometric/graphical restriction. Recall
the equation of a parabola:
y = a(x + v_x)² + v_y.
The point (x = −v_x, y = v_y) is the vertex (the bottom or the top of the parabola). If a > 0 the parabola opens upward (like the letter “U”); if a < 0 the parabola opens downward. Multiplying this out we get
y = (v_y + a v_x²) + (2a v_x) x + a x²,
so v_y + a v_x² plays the role of β0, 2a v_x plays the role of β1, and a plays the role of β2.
So if we force least squares to fit b1 = 0 then we are assuming the vertex of the parabola is at x = 0
(the bottom if a > 0, the top if a < 0). Suppose a > 0 (that is, β2 > 0) so the parabola opens
upward. Then the minimum response of Y to X occurs at X = 0, by construction. This is again a
very strong assumption! This is something you are forcing about the shape of how Y responds to
X: when X = 0, Y responds very little.
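In R, leaving out the linear term means fitting lm(y ~ I(x^2)) instead of lm(y ~ x + I(x^2)). Here is a sketch with simulated data (a made-up curve whose true minimum is at x = 3) showing what the restriction forces:

    # True curve has its minimum at x = 3, not at x = 0
    set.seed(4)
    x <- runif(100, 0, 6)
    y <- 2 * (x - 3)^2 + 5 + rnorm(100)

    full  <- lm(y ~ x + I(x^2))   # estimates b0, b1, b2
    nolin <- lm(y ~ I(x^2))       # forces b1 = 0

    # The vertex of a fitted parabola b0 + b1*x + b2*x^2 is at x = -b1 / (2*b2)
    -coef(full)["x"] / (2 * coef(full)["I(x^2)"])   # close to 3, the true minimum
    # For the restricted fit, b1 = 0 by construction, so its vertex sits at x = 0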
Again, there’s nothing special about the point X = 0. Suppose you measure the variable X, but I instead measure X + C for some number C. For example, if X is wage, I measure it as dollars/hour above minimum wage. This should not affect how Y responds to X at all; that is, the predicted values shouldn’t change. But if we assume there is no linear term, the response would be
Y = β0 + β2 (X + C)² = (β0 + β2 C²) + (2β2 C) X + β2 X²,
and now the term 2β2 C is playing the role of β1: a linear term shows up after all, even though we assumed it away.
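You can see the same algebra numerically in R (a sketch with made-up data): shifting X by a constant C leaves the full quadratic’s fitted values untouched, but changes the fit that omits the linear term.

    # Shift x by a constant C and compare fitted values
    set.seed(5)
    x <- runif(100, 0, 6)
    y <- 4 + 2 * x^2 + rnorm(100)   # a curve with no linear term in x
    C <- 1.5
    xC <- x + C

    full.x  <- lm(y ~ x + I(x^2))
    full.xC <- lm(y ~ xC + I(xC^2))
    max(abs(fitted(full.x) - fitted(full.xC)))   # essentially zero: same fitted curve

    rest.x  <- lm(y ~ I(x^2))
    rest.xC <- lm(y ~ I(xC^2))
    max(abs(fitted(rest.x) - fitted(rest.xC)))   # not zero: the restriction depends on where "zero" is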
Let’s return to the call center data from week 3. The goal is to predict productivity (measured by
calls per day) using work experience (months of employment). In class we had a quadratic fit:
y_i = b0 + b1 months_i + b2 months_i² + e_i.
There’s really no great reason to leave out the linear term here. Conceptually, what would that
mean in this example? It means that people’s productivity gain is the slowest when they are brand
new. Is that reasonable from an intuitive level? Probably not. And we have no data at months=0,
so it does not make sense to impose that the minimum productivity is there.
We can also leave out the intercept and the linear term, setting b0 = b1 = 0. This implies that
the parabola goes through (0,0), so that employees with zero experience make zero calls. Is that
reasonable? Probably not; even on your first day you could presumably make one call.
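For reference, the three curves in the figure below come from fits of this form. The sketch uses simulated stand-ins for calls and months (not the actual week 3 data), just to show the three model formulas:

    # Sketch only: simulated stand-in for the week 3 call center data
    set.seed(6)
    months <- sample(1:30, 70, replace = TRUE)
    calls  <- 32 - 25 * exp(-0.15 * months) + rnorm(70, sd = 2)   # productivity that levels off

    quad.full  <- lm(calls ~ months + I(months^2))   # general quadratic
    quad.nolin <- lm(calls ~ I(months^2))            # forces b1 = 0
    quad.none  <- lm(calls ~ I(months^2) - 1)        # forces b0 = b1 = 0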
I’m omitting the R output here, but here is the graph.² As you can see, it has all the same issues
as before. The blue and green curves do not fit the data at all: they are doing a terrible job of
extracting the general trend of how Y changes with X.
² It looks like the general curve goes through (0,0) too, but it does not: the intercept is b0 = −0.1404712.
[Figure: calls versus months, with three fitted curves: the general quadratic, the fit forcing b1 = 0, and the fit forcing b0 = b1 = 0.]
The story is the same for higher order polynomials, but more intricate. The graphical/geometric
interpretations of the above two cases are pretty clear. But what does it mean to leave the linear
term out of a cubic fit? To really understand it, you have to go back to the equation for a cubic
curve and figure out exactly what restriction you are imposing. I will not delve into the details.
The message should be clear: you are making a strong restriction on how Y responds to X.
In multiple linear regression we interpret each coefficient conditional on what else is in the model. In the first section, when we interpret b1 from a linear model, the interpretation depends on what we assume about the intercept. If we force b0 = 0, then the slope is interpreted conditional on this choice. In the quadratic model, forcing b1 = 0 implies a very specific mechanism for changes in Y.

In multiple linear regression, say with two variables X1 and X2, we estimate y_i = b0 + b1 x1,i + b2 x2,i + e_i. The interpretation of b2 is conditional on X1 being in the model. So b2 measures the change in Y as X2 increases controlling for X1, holding it fixed at any given value (this is where the term “controlling for” comes from in the popular press).
For example, suppose X1 = education and X2 = experience, and our goal is to predict Y = wages.
If we run the full model, then b1 measures the return to education holding experience constant.
That is, it gives the wage difference between two people who have exactly the same number of years
on the job, but one graduated college and the other only finished high school. If we set b2 = 0 (that is, regress wages on only education), then b1 measures the returns to education without holding experience fixed.
So now it gives the wage difference between two people where one graduated college and the other
only finished high school, no matter how many years on the job they have.
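To close, here is the education/experience story as an R sketch (simulated data, not a real wage dataset; the coefficients are made up). Dropping experience changes what the education coefficient measures, just like dropping lower order polynomial terms changes what the remaining coefficients measure.

    # Simulated wages: education and experience are correlated
    set.seed(7)
    n <- 500
    education  <- rnorm(n, mean = 14, sd = 2)               # years of schooling
    experience <- 20 - 0.6 * education + rnorm(n, sd = 3)   # more schooling, less time on the job
    wages      <- 10 + 2 * education + experience + rnorm(n, sd = 5)

    coef(lm(wages ~ education + experience))["education"]   # near 2: return holding experience fixed
    coef(lm(wages ~ education))["education"]                 # different: absorbs part of the experience effect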