Linear Regression
CS 360 Lecture 10
Solving unconstrained convex optimization
How to solve convex optimization problems
Assume there are no constraints. Then the problem
arg min_x f(x), where f(x) is a convex function,
is equivalent to solving
∇f(x) = 0, where
∇f(x) = (∂f(x)/∂x_0, ∂f(x)/∂x_1, …, ∂f(x)/∂x_p)
Example
Objective: f(x) = x^2 + 1
Solve: df(x)/dx = 2x = 0
⇒ x = 0
Example in 2D
Objective: f(x) = x_1^2 + x_2^2 + x_1 x_2 − x_1 + 1
Solve:
∂f(x)/∂x_1 = 2x_1 + x_2 − 1
∂f(x)/∂x_2 = 2x_2 + x_1
∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2) = (0, 0)
Setting both partial derivatives to zero gives 2x_1 + x_2 = 1 and x_1 + 2x_2 = 0, so x_1 = 2/3 and x_2 = −1/3.
Optimization using SciPy
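The slide's SciPy demo is not reproduced here. A minimal sketch of how the 2D objective above could be minimized with scipy.optimize.minimize (the function name f and the starting point x0 are illustrative choices, not from the slides):

import numpy as np
from scipy.optimize import minimize

# Convex objective from the 2D example: f(x) = x1^2 + x2^2 + x1*x2 - x1 + 1
def f(x):
    return x[0]**2 + x[1]**2 + x[0]*x[1] - x[0] + 1

x0 = np.zeros(2)        # arbitrary initial guess
res = minimize(f, x0)   # default quasi-Newton (BFGS) solver with numerical gradients
print(res.x)            # approximately [2/3, -1/3], matching the closed-form solution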
Gradient Descent
Gradient
Each arrow is the gradient evaluated at that point. The gradients point in the direction of steepest ascent.
Gradient Descent Algorithm
● Choose initial guess x_0
● While not converged (η is the learning rate):
  x_{t+1} = x_t − η ∇f(x_t)
Choice of learning rate
Debugging gradient descent
Example in 2D (revisited)
Objective: f(x) = x_1^2 + x_2^2 + x_1 x_2 − x_1 + 1
Solve:
∂f(x)/∂x_1 = 2x_1 + x_2 − 1
∂f(x)/∂x_2 = 2x_2 + x_1
∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2)
Example in 2D (revisited)
Algorithm:
∇f(x) = (2x_1 + x_2 − 1, 2x_2 + x_1)
x_{t+1} ← x_t − η ∇f(x_t)
Try it for η = 0.3
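A minimal sketch of this iteration in NumPy (the iteration budget and the convergence tolerance are my own illustrative choices):

import numpy as np

def grad_f(x):
    # Gradient of f(x) = x1^2 + x2^2 + x1*x2 - x1 + 1
    return np.array([2*x[0] + x[1] - 1, 2*x[1] + x[0]])

eta = 0.3               # learning rate from the slide
x = np.zeros(2)         # initial guess x_0 = (0, 0)
for _ in range(100):    # fixed iteration budget (assumed)
    step = eta * grad_f(x)
    x = x - step
    if np.linalg.norm(step) < 1e-8:   # simple convergence check (assumed)
        break
print(x)                # converges to about (2/3, -1/3)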
Caveat: Perils of non-convex functions
Example
Find the minimum of the function g(u, v) = (u + 2v^2 + 10 sin u + 2u^2 − uv − 2)/30
Linear Regression
Example
Suppose that we are statistical consultants hired by a client to provide advice on how to
improve sales of a particular product.
The Advertising data set consists of the sales of that product in 200 different markets,
along with advertising budgets for the product in each of those markets for three
different media: TV, radio, and newspaper.
Linear Regression
We assume the model:
y = w_0 + w_1 x_1 + ⋯ + w_d x_d + ε
We interpret w_j as the average effect on y of a one-unit increase in x_j, holding all other predictors fixed. In the advertising example, the model becomes
sales = w_0 + w_1 × TV + w_2 × Radio + w_3 × Newspaper + ε
Estimate the parameters
Minimize the mean squared error/loss:
MSE = L(w) = (1/n) Σ_{i=1}^n e_i^2 = (1/n) Σ_{i=1}^n (y_i − w_0 − w_1 x_{i,1} − ⋯ − w_d x_{i,d})^2
Let ŷ_i be the prediction using the estimated parameters:
ŷ_i = w_0 + w_1 x_{i,1} + ⋯ + w_d x_{i,d}
Estimate the parameters
Matrix representation: suppose we have training data {(x_i, y_i)}_{i=1}^n, where x_i = (1, x_{i,1}, …, x_{i,d}). Stacking the observations, the model is
y = Xw
where y = (y_1, …, y_n)^T, X is the n × (d+1) matrix whose i-th row is x_i (a column of ones followed by the d predictor values), and w = (w_0, w_1, …, w_d)^T. The loss function is
MSE = (1/n) Σ_{i=1}^n (y_i − w_0 − w_1 x_{i,1} − ⋯ − w_d x_{i,d})^2 = (1/n) (y − Xw)^T (y − Xw)
Estimate the parameters
● Differentiate the MSE w.r.t. w and set the gradient to zero:
  −2 X^T (y − Xŵ) = 0
  ŵ = (X^T X)^{−1} X^T y
● When X^T X is not invertible, the least squares solution is NOT unique.
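A minimal NumPy sketch of this closed-form solution on synthetic data (the data-generating weights and noise level below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend the intercept column
true_w = np.array([2.0, 0.5, -1.0, 0.3])                     # assumed ground-truth weights
y = X @ true_w + 0.1 * rng.normal(size=n)                    # noisy targets

# Normal equations: w_hat = (X^T X)^{-1} X^T y
# (solving the linear system is preferred over explicitly inverting X^T X)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to true_w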
Interpretation of the Minimizer: Projection onto the Column Space
min_w ‖y − Xw‖_2^2   ⟺   min_{z ∈ Col(X)} ‖y − z‖_2^2
The minimizer z should be y's projection onto the column space of X, i.e., z = Hy, where H is the projection matrix for the column space of X.
ŵ = (X^T X)^{−1} X^T y
Xŵ = X (X^T X)^{−1} X^T y
The equation for the projection matrix is H = X (X^T X)^{−1} X^T, so
Hy = X (X^T X)^{−1} X^T y = Xŵ
Advertising example
The linear regression model is easy to interpret:
y = 2.939 + 0.046·TV + 0.189·Radio − 0.001·Newspaper
Multicollinearity
● Multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others.
  ○ The coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data.
● The parameter estimate for each individual predictor may not be interpretable, since the variable may be redundant with the others.
● With perfect multicollinearity, the data matrix X is not full rank, and therefore X^T X cannot be inverted: the ordinary least squares estimator (X^T X)^{−1} X^T y does not exist.
Estimate the parameters – Gradient Descent
Goal: find w that minimizes the loss function L(w).
Iterative approach:
● Begin with some initial value w^(0), for example w^(0) = (0, 0, …, 0)^T
● Repeat until convergence: evaluate the partial derivatives of L(w) at the current value of w and update w using
  w^(t+1) ← w^(t) − η ∂L(w)/∂w |_{w = w^(t)}
Estimate the parameters – Gradient Descent
Loss function: L(w) = (1/n) Σ_{i=1}^n e_i^2 = (1/n) Σ_{i=1}^n (w_0 + w_1 x_{i1} + ⋯ + w_d x_{id} − y_i)^2
The partial derivative: ∂L(w)/∂w_j = (2/n) Σ_{i=1}^n (w_0 + w_1 x_{i1} + ⋯ + w_d x_{id} − y_i) x_{ij}
Update rule: w_j^(t+1) ← w_j^(t) − η ∂L(w)/∂w_j |_{w = w^(t)}
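A minimal sketch of these updates in NumPy (the learning rate and iteration count are illustrative assumptions; X is assumed to carry a leading column of ones, as in the earlier closed-form sketch):

import numpy as np

def gd_linear_regression(X, y, eta=0.05, n_iters=2000):
    # Gradient descent on the MSE loss L(w) = (1/n) * ||Xw - y||^2
    n, p = X.shape
    w = np.zeros(p)                          # w^(0) = (0, ..., 0)
    for _ in range(n_iters):
        residual = X @ w - y                 # (w^T x_i - y_i) for every i
        grad = (2.0 / n) * (X.T @ residual)  # dL/dw_j = (2/n) * sum_i residual_i * x_ij
        w -= eta * grad
    return w

# Usage (with the synthetic X, y from the closed-form sketch): gd_linear_regression(X, y)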
Probabilistic interpretation
Assuming Gaussian noise, we have
p(ε_i) = (1/√(2πσ^2)) exp(−ε_i^2 / (2σ^2))
so that
p(y_i | x_i; w) = P(Y = y_i | X = x_i; w)
               = P(ε = y_i − w^T x_i | X = x_i; w)
               = (1/√(2πσ^2)) exp(−(y_i − w^T x_i)^2 / (2σ^2))
Probabilistic interpretation
Given that the ε_i's are independent, we can derive the likelihood of the model:
f(w) = Π_{i=1}^N p(y_i | x_i; w) = Π_{i=1}^N (1/√(2πσ^2)) exp(−(y_i − w^T x_i)^2 / (2σ^2))
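Taking the logarithm makes the link to least squares explicit; a short derivation added here for completeness, in the notation above:
log f(w) = Σ_{i=1}^N log p(y_i | x_i; w) = −(N/2) log(2πσ^2) − (1/(2σ^2)) Σ_{i=1}^N (y_i − w^T x_i)^2
Maximizing the likelihood over w is therefore equivalent to minimizing Σ_i (y_i − w^T x_i)^2, i.e., maximum likelihood under Gaussian noise recovers the least squares / MSE objective.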
Supervised Learning Summary
Model: y_i = w_0 + w_1 x_{i,1} + ⋯ + w_d x_{i,d}
Objective/loss function: SSE/MSE
Generalization
Assumption: our data is generated independently and identically distributed (i.i.d.) from some unknown distribution P:
(x_i, y_i) ~ P(X, Y)
The goal is to minimize the expected error (true risk) under P:
R(w) = ∫ P(x, y) (y − w^T x)^2 dx dy = E_{X,Y}[(y − w^T x)^2]
Estimate the true risk by the empirical risk on a sample data set D:
R_D(w) = (1/|D|) Σ_{(x,y)∈D} (y − w^T x)^2
What happens if we optimize on training data?
● Suppose we are given training data D
● Parameter estimation: ŵ_D = arg min_w R_D(w)
What if we evaluate performance on training data?
With ŵ_D = arg min_w R_D(w) and w* = arg min_w R(w), in general it holds that
E_D[R_D(ŵ_D)] ≤ E_D[R(ŵ_D)]
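A small simulation sketch of this gap (every data-generating choice below is an illustrative assumption): fit on a training sample D, then compare the empirical risk on D with the risk on a large fresh sample from the same P.

import numpy as np

rng = np.random.default_rng(1)
d = 5
w_true = rng.normal(size=d + 1)                       # assumed ground-truth linear model

def sample(n):
    # Draw (x, y) ~ P with Gaussian features and unit-variance Gaussian noise
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
    return X, X @ w_true + rng.normal(size=n)

def mse(X, y, w):
    return float(np.mean((y - X @ w) ** 2))

X_train, y_train = sample(30)                          # small training set D
w_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)   # minimizer of R_D

X_test, y_test = sample(100_000)                       # large sample approximating R
print("R_D(w_hat):", mse(X_train, y_train, w_hat))     # training error, typically smaller
print("R(w_hat)  :", mse(X_test, y_test, w_hat))       # approximate true risk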
Other Considerations in Regression Model
Qualitative/Categorical predictors
● Categorical variable with more than two levels, e.g. ethnicity (consider three levels: Asian, Caucasian, African American):
  x_1 = 1 if the person is Asian, 0 if the person is not Asian
  x_2 = 1 if the person is Caucasian, 0 if the person is not Caucasian
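A minimal sketch of this dummy coding with pandas (the column name and the drop_first baseline choice are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({"ethnicity": ["Asian", "Caucasian", "African American", "Asian"]})

# drop_first=True keeps two indicator columns for the three levels, matching the
# x_1 (Asian) and x_2 (Caucasian) coding above; the dropped level is the baseline.
dummies = pd.get_dummies(df["ethnicity"], drop_first=True)
print(dummies)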
Potential Problem – Outliers
[Figure: scatter plot of y versus x illustrating an outlier.]
Potential Problem – High Leverage Points
[Figure: scatter plot of y versus x with a high leverage point.]
How does a high leverage point influence the regression model?
Loss function for regression
[Figure: comparison of regression losses. Squared error punishes small errors less and pulls the fit towards the mean; absolute error pulls the fit towards the median.]
Loss function for regression
Huber Loss (Smooth Mean Absolute Error)
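The slide's plot is not reproduced here; for reference, the standard Huber loss with threshold δ (a tunable parameter, not specified on the slide) is
L_δ(e) = (1/2) e^2 for |e| ≤ δ, and L_δ(e) = δ(|e| − δ/2) for |e| > δ.
It matches the squared error for small residuals and grows only linearly for large ones, so it keeps the smoothness of the MSE near zero while being more robust to outliers, striking a balance between the mean-seeking and median-seeking losses above.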
Potential Problem – Collinearity
[Figure: contour plots of the MSE in the (w_1, w_2) plane.]
● Consider the level set {w | (w − ŵ)^T X^T X (w − ŵ) = C}. It is the equation of an ellipsoid (X full rank). The lengths of the axes scale with the eigenvalues of (X^T X)^{−1}.
● When two predictors are highly correlated, the contours of the MSE run along a narrow valley: there is a broad range of coefficient estimates with nearly the same loss.
● With x_1 and x_2 (perfectly) linearly related, X^T X has a 0 eigenvalue.
● So the level set {w | (w − ŵ)^T X^T X (w − ŵ) = C} is no longer an ellipsoid. It is a degenerate ellipsoid, and the contour lines become pairs of lines in this case.
The Level Set and the Ellipsoid
{w | (w − ŵ)^T X^T X (w − ŵ) = C}
It is essentially a quadratic form (z^T M z). The matrix X^T X is P.S.D. (positive semi-definite).
Potential Problem – Overfitting
[Figure: training MSE and testing MSE as a function of model complexity.]
In linear regression, the more features X_j we include in the model, the lower the training MSE will be. Adding too many features to the model may lead to overfitting.
More features
Ref: Artificial Intelligence in Corneal Diagnosis: Where Are We?
Deciding on important variables
Subset selection:
We identify a subset of the p predictors that we believe to be related to the response.
We then fit a model using least squares on the reduced set of variables.
Forward Stepwise Selection
● Forward stepwise selection begins with a model containing no predictors, and then
adds predictors to the model, one-at-a-time, until all of the predictors are in the
model.
● In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model.
Forward Stepwise Feature Selection
● Let M_0 denote the null model, which contains no predictors.
● For k = 0, 1, …, p − 1:
  ○ Consider all p − k models that augment the predictors in M_k with one additional predictor.
  ○ Choose the best among these p − k models, and call it M_{k+1}. Here best is defined as having the smallest SSE.
● Select a single best model from among M_1, …, M_p using cross-validated prediction error (a code sketch of the procedure follows below).
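A minimal sketch of forward stepwise selection with SSE as the within-step criterion (the cross-validation step for picking among M_1, …, M_p is omitted; the helper name sse_of_subset is mine, not from the slides):

import numpy as np

def sse_of_subset(X, y, subset):
    # Least squares fit on the chosen columns (plus an intercept); return the SSE.
    cols = [np.ones((X.shape[0], 1))] + ([X[:, list(subset)]] if subset else [])
    Xs = np.hstack(cols)
    w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ w
    return float(r @ r)

def forward_stepwise(X, y):
    # Return the sequence of models M_1, ..., M_p as lists of column indices.
    p = X.shape[1]
    current, models = [], []
    for _ in range(p):
        remaining = [j for j in range(p) if j not in current]
        # add the predictor whose inclusion gives the smallest SSE
        best_j = min(remaining, key=lambda j: sse_of_subset(X, y, current + [j]))
        current = current + [best_j]
        models.append(list(current))
    return models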
Backward Stepwise Selection
● Like forward stepwise selection, backward stepwise selection provides an efficient
alternative to best subset selection.
● However, unlike forward stepwise selection, it begins with the full least squares
model containing all p predictors, and then iteratively removes the least useful
predictor, one-at-a-time.
Backward Stepwise Selection
● Let M_p denote the full model, which contains all p predictors.
● For k = p, p − 1, …, 1:
  ○ Consider all k models that contain all but one of the predictors in M_k, for a total of k − 1 predictors.
  ○ Choose the best among these k models, and call it M_{k−1}. Here best is defined as having the smallest SSE.
● Select a single best model from among M_1, …, M_p using cross-validated prediction error (a sketch mirroring the forward procedure follows below).
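The backward variant is the mirror image; a sketch reusing the sse_of_subset helper from the forward example above (again my own illustrative code, not the course's):

def backward_stepwise(X, y):
    # Start from all p predictors and drop the least useful one at a time.
    p = X.shape[1]
    current = list(range(p))
    models = [list(current)]                  # M_p
    while current:
        # remove the predictor whose deletion gives the smallest SSE
        worst_j = min(current, key=lambda j: sse_of_subset(X, y, [c for c in current if c != j]))
        current.remove(worst_j)
        models.append(list(current))          # M_{k-1}
    return models                             # M_p, M_{p-1}, ..., M_0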
Potential problem: synergy between two predictors
y = w_0 + w_1 x_1 + w_2 x_2
y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2, where w_3 x_1 x_2 is the interaction term.
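A minimal sketch of adding the interaction column by hand before a least squares fit (variable names are illustrative):

import numpy as np

def design_with_interaction(x1, x2):
    # Build the columns [1, x1, x2, x1*x2] so the fit can pick up the synergy term w_3.
    n = len(x1)
    return np.column_stack([np.ones(n), x1, x2, x1 * x2])

# w = np.linalg.lstsq(design_with_interaction(x1, x2), y, rcond=None)[0]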
Moving Beyond Linearity
The world is not linear
Polynomial Regression
Polynomial function of different degrees:
y = w_0 + w_1 x + w_2 x^2 + ⋯ + w_D x^D
  = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ⋯ + w_D φ_D(x)
  = w^T φ(x)
Polynomial Regression in sklearn
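The slide's code is not reproduced here; a minimal scikit-learn sketch (the degree and the synthetic sine data are illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(x).ravel() + 0.1 * rng.normal(size=100)    # a non-linear target (assumed)

# PolynomialFeatures builds phi(x) = (1, x, x^2, x^3); LinearRegression then fits w.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[0.5]])))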
Piecewise Polynomials
Divide the domain of 𝑋 into contiguous intervals
Represent the function by a separate polynomial in each interval.
In each interval, fit a separate polynomial regression model: y = w_0 + w_1 x + w_2 x^2 + ⋯ + w_D x^D
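A minimal NumPy sketch of this idea, fitting an independent polynomial on each interval (the knot locations and the degree are illustrative assumptions; each interval needs at least degree + 1 points):

import numpy as np

def fit_piecewise_poly(x, y, knots, degree=3):
    # Fit a separate degree-`degree` polynomial on each interval defined by the knots.
    edges = np.concatenate(([-np.inf], np.asarray(knots, dtype=float), [np.inf]))
    coefs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        coefs.append(np.polyfit(x[mask], y[mask], degree))   # least squares per interval
    return edges, coefs

def predict_piecewise(x_new, edges, coefs):
    idx = np.searchsorted(edges, x_new, side="right") - 1    # which interval each point falls in
    return np.array([np.polyval(coefs[i], xi) for i, xi in zip(idx, x_new)])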