
Linear Regression

CS 360 Lecture 10

1
Solving unconstrained convex optimization

2
How to solve convex optimization problems
Assume no constraints; then the problem

$\arg\min_x f(x)$, where $f(x)$ is a convex function,

is equivalent to solving

$\nabla f(x) = 0$, where $\nabla f(x) = \left( \frac{\partial}{\partial x_0} f(x), \frac{\partial}{\partial x_1} f(x), \ldots, \frac{\partial}{\partial x_p} f(x) \right)$

3
Example
Objective: $f(x) = x^2 + 1$

Solve: $\frac{df(x)}{dx} = 2x = 0 \;\Rightarrow\; x = 0$

4
Example in 2D
Objective: $f(x) = x_1^2 + x_2^2 + x_1 x_2 - x_1 + 1$

Solve:
$\frac{\partial f(x)}{\partial x_1} = 2x_1 + x_2 - 1$
$\frac{\partial f(x)}{\partial x_2} = 2x_2 + x_1$

$\nabla f(x) = \left( \frac{\partial}{\partial x_1} f(x), \frac{\partial}{\partial x_2} f(x) \right) = (0, 0)$

5
Optimization using Scipy
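The slide's SciPy demo is not reproduced in this text, so below is a plausible minimal sketch (not necessarily the lecture's exact code) that solves the 2D example above with `scipy.optimize.minimize`; the function and variable names are illustrative.

```python
# A plausible minimal sketch (not necessarily the lecture's code): minimize the
# 2D example f(x) = x1^2 + x2^2 + x1*x2 - x1 + 1 with scipy.optimize.minimize.
import numpy as np
from scipy.optimize import minimize

def f(x):
    x1, x2 = x
    return x1**2 + x2**2 + x1*x2 - x1 + 1

def grad_f(x):
    x1, x2 = x
    return np.array([2*x1 + x2 - 1, 2*x2 + x1])

result = minimize(f, x0=np.zeros(2), jac=grad_f, method="BFGS")
print(result.x)     # approximately ( 2/3, -1/3 )
print(result.fun)   # approximately 2/3
```

Passing the analytic gradient via `jac` is optional; if it is omitted, SciPy approximates the gradient numerically.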

8
Gradient Descent

9
Gradient Descent

10
Gradient

Each arrow is the gradient evaluated at that point. The gradients point towards the direction of
steepest ascent.
11
Gradient Descent Algorithm
● Choose initial guess 𝑥0
● While not converged (𝜂 is the learning rate)
$x_{t+1} = x_t - \eta \, \nabla f(x_t)$ (a short sketch in code follows)
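A minimal sketch of this update rule (not the lecture's implementation); `gradient_descent`, `eta`, and `tol` are illustrative names, and convergence is declared when the update becomes negligible.

```python
# A minimal gradient-descent loop: step along the negative gradient until the
# update is negligible. Names (eta, tol) are illustrative, not from the lecture.
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:   # converged: the update is tiny
            break
    return x

# Demo on the earlier example f(x) = x^2 + 1, whose gradient is 2x.
print(gradient_descent(lambda x: 2 * x, x0=[5.0]))   # approaches 0
```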

12
Choice of learning rate

13
Debugging gradient descent

14
Example in 2D (revisited)
Objective: $f(x) = x_1^2 + x_2^2 + x_1 x_2 - x_1 + 1$

Solve:
$\frac{\partial f(x)}{\partial x_1} = 2x_1 + x_2 - 1$
$\frac{\partial f(x)}{\partial x_2} = 2x_2 + x_1$

$\nabla f(x) = \left( \frac{\partial}{\partial x_1} f(x), \frac{\partial}{\partial x_2} f(x) \right)$

15
Example in 2D (revisited)
Algorithm:

$\nabla f(x) = \begin{pmatrix} 2x_1 + x_2 - 1 \\ 2x_2 + x_1 \end{pmatrix}$

$x_{t+1} \leftarrow x_t - \eta \, \nabla f(x_t)$

Try it for $\eta = 0.3$ (a sketch follows):
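A quick sketch of that run, starting from an assumed initial guess $x_0 = (0, 0)$ (the starting point is not specified on the slide):

```python
# Gradient descent on the 2D example with eta = 0.3 (illustrative sketch).
import numpy as np

def grad_f(x):
    x1, x2 = x
    return np.array([2 * x1 + x2 - 1, 2 * x2 + x1])

eta = 0.3
x = np.zeros(2)                    # assumed starting point (0, 0)
for t in range(100):
    x = x - eta * grad_f(x)
print(x)   # converges to the analytic minimizer (2/3, -1/3)
```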

16
Caveat: Perils of non-convex functions

Non-convex functions can lead to local minima or singularities.


17
Caveat: local minima

Non-convex functions can lead to local minima.


18
Caveat: singularities

19
Example
Find the minimum of the function $g(u, v) = \left( u + 2v^2 + 10\sin u + 2u^2 - uv - 2 \right)/30$

20
Example
Find the minimum of the function $g(u, v) = \left( u + 2v^2 + 10\sin u + 2u^2 - uv - 2 \right)/30$

21
Linear Regression

22
Example
Suppose that we are statistical consultants hired by a client to provide advice on how to
improve sales of a particular product.
The Advertising data set consists of the sales of that product in 200 different markets,
along with advertising budgets for the product in each of those markets for three
different media: TV, radio, and newspaper.

23
Linear Regression
We assume the model:
$y = w_0 + w_1 x_1 + \cdots + w_d x_d + \varepsilon$

We interpret 𝑤𝑗 as the average effect on y of a one unit increase in 𝑥𝑗 , holding all other
predictors fixed. In the advertising example, the model becomes

Sales = 𝑤0 + 𝑤1𝑇𝑉 + 𝑤2 𝑅𝑎𝑑𝑖𝑜 + 𝑤3 𝑁𝑒𝑤𝑠𝑝𝑎𝑝𝑒𝑟 + noise

𝜀 is the noise term, typically i.i.d. for every sample.

24
Estimate the parameters
Minimize the mean squared error/loss
$\mathrm{MSE} = L(w) = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_{i,1} - \cdots - w_d x_{i,d} \right)^2$

Let $\hat{y}_i$ be the prediction using the estimated parameters:
$\hat{y}_i = w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d}$

$e_i$ is the error (residual) on sample $i$, i.e. $e_i = y_i - \hat{y}_i$

25
Estimate the parameters
Matrix representation
Suppose we have training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i = (1, x_{i,1}, \ldots, x_{i,d})$. In matrix form $y = Xw$, with

$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} \text{---}\; x_1 \;\text{---} \\ \vdots \\ \text{---}\; x_n \;\text{---} \end{pmatrix}
  = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ 1 & x_{2,1} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,d} \end{pmatrix}$

where $w = (w_0, w_1, \ldots, w_d)^T$. The loss function is

$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_{i,1} - \cdots - w_d x_{i,d} \right)^2 = \frac{1}{n} (y - Xw)^T (y - Xw)$
26
Estimate the parameters
● Differentiate the MSE w.r.t. $w$ and set it to zero:

  $-2 X^T (y - X\hat{w}) = 0$

● If $X^T X$ is nonsingular, then the unique solution is given by

  $\hat{w} = (X^T X)^{-1} X^T y$

● When $X^T X$ is not invertible, the least squares solution is NOT unique (a numerical sketch follows).
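A minimal NumPy sketch of the closed-form solution on synthetic data (shapes, seeds, and names are illustrative, not from the lecture); `np.linalg.lstsq` is shown alongside as the safer call when $X^T X$ is singular or ill-conditioned.

```python
# A minimal NumPy sketch of the closed-form solution on synthetic data
# (illustrative shapes and names). lstsq is the safer call when X^T X is
# singular or ill-conditioned: it returns a minimum-norm least squares solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept column first
w_true = np.array([2.0, 0.5, -1.0, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)         # (X^T X)^{-1} X^T y, needs X^T X invertible
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # works even when X^T X is singular
print(w_hat)
print(w_lstsq)   # both close to w_true here
```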

27
Interpretation of the Minimizer: Projection onto the Column Space
$\min_w \| y - Xw \|_2^2 \;=\; \min_{z \in \mathrm{Col}(X)} \| y - z \|_2^2$

The minimizer $z$ should be $y$'s projection onto the column space of $X$,
i.e., $z = Hy$ where $H$ is the projection matrix for the column space of $X$.

$\hat{w} = (X^T X)^{-1} X^T y \quad\Rightarrow\quad X\hat{w} = X (X^T X)^{-1} X^T y$

The equation for the projection matrix: $H = X (X^T X)^{-1} X^T$, so
$Hy = X (X^T X)^{-1} X^T y = X\hat{w}$

28
Advertising example
The linear regression model is easy to interpret:
$y = 2.939 + 0.046 \cdot \text{TV} + 0.189 \cdot \text{Radio} - 0.001 \cdot \text{Newspaper}$

29
Multicollinearity
● Multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a
multiple regression model can be linearly predicted from the others.
○ The coefficient estimates of the multiple regression may change erratically in response to small changes
in the model or the data.

● The parameter estimate for each individual predictor may not be interpretable, since the
variable may be redundant given the others.

● With perfect multicollinearity, the data matrix $X$ is not full rank, and therefore $X^T X$
cannot be inverted: the ordinary least squares estimator $(X^T X)^{-1} X^T y$ does not exist.
(A small numerical illustration follows.)
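A small synthetic illustration of that instability (not from the lecture): with two nearly collinear predictors, the individual least-squares coefficients swing wildly between refits, even though their sum stays stable.

```python
# A synthetic illustration (not from the lecture): with two nearly collinear
# predictors, individual coefficients swing wildly between refits, while their
# sum stays stable.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # x2 is almost exactly x1
X = np.column_stack([np.ones(n), x1, x2])

for trial in range(3):
    y = 3 * x1 + 0.1 * rng.normal(size=n)    # same signal, fresh noise each trial
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w, "sum of the two slopes:", w[1] + w[2])   # the sum stays near 3
```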

30
Estimate the parameters – Gradient Descent
Goal: find $w$ that minimizes the loss function $L(w)$

Iterative approach:
● Begin with some initial value $w^{(0)}$, for example $w^{(0)} = (0, 0, \ldots, 0)^T$
● Repeat until convergence:
  evaluate the partial derivatives of $L(w)$ at the current value of $w$ and update $w$ using
  $w^{(t+1)} \leftarrow w^{(t)} - \eta \left. \frac{\partial L(w)}{\partial w} \right|_{w = w^{(t)}}$

31
Estimate the parameters – Gradient Descent
Loss function: $L(w) = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} \left( w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d} - y_i \right)^2$

The partial derivative: $\frac{\partial L(w)}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} \left( w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d} - y_i \right) x_{i,j}$

$w_j^{(t+1)} \leftarrow w_j^{(t)} - \eta \left. \frac{\partial L(w)}{\partial w_j} \right|_{w_j = w_j^{(t)}}$

Or, written in vector form:

$w^{(t+1)} \leftarrow w^{(t)} - \eta \left. \frac{\partial L(w)}{\partial w} \right|_{w = w^{(t)}}, \qquad
\frac{\partial L(w)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right) x_i$
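A minimal sketch of batch gradient descent for linear regression on synthetic data (the data-generating process and names are illustrative, not the lecture's code).

```python
# Batch gradient descent for linear regression on synthetic data
# (illustrative names and data; not the lecture's code).
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept column plus d features
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d + 1)
eta = 0.1
for t in range(2000):
    grad = (2 / n) * X.T @ (X @ w - y)   # (2/n) * sum_i (w^T x_i - y_i) x_i
    w = w - eta * grad
print(w)   # close to w_true
```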
32
Gradient Descent and Stochastic Gradient Descent
Gradient Descent:
Repeat until convergence:
$w^{(t+1)} \leftarrow w^{(t)} - \eta \, \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right) x_i$

Stochastic Gradient Descent:
Repeat until convergence (with $i$ a randomly chosen sample at each step):
$w^{(t+1)} \leftarrow w^{(t)} - \eta \left( w^T x_i - y_i \right) x_i$
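And a matching sketch of stochastic gradient descent, updating on one randomly chosen sample per step (again on illustrative synthetic data).

```python
# Stochastic gradient descent on synthetic linear-regression data: one randomly
# chosen sample per update (illustrative names; not the lecture's code).
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(3)
eta = 0.01
for t in range(50_000):
    i = rng.integers(n)                           # pick one sample at random
    w = w - eta * 2 * (X[i] @ w - y[i]) * X[i]    # gradient of the i-th squared error
print(w)   # noisy but close to w_true
```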

33
Gradient Descent

SGD requires more iterations, but every iteration is much cheaper.


SGD typically has a lower overall cost. Minibatch-SGD may achieve a better trade-off.
34
Probabilistic interpretation
Assumptions: $y_i = w^T x_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. (independently and identically distributed)

So we have
$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\varepsilon_i^2}{2\sigma^2} \right)$

$p(y_i \mid x_i; w) = P(Y = y_i \mid X = x_i; w) = P(\varepsilon = y_i - w^T x_i \mid X = x_i; w)
= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right)$

35
Probabilistic interpretation
Since the $\varepsilon_i$'s are independent, we can derive the likelihood of the model:

$f(w) = \prod_{i=1}^{N} p(y_i \mid x_i; w) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right)$

The log-likelihood is

$\ell(w) = \log f(w) = \sum_{i=1}^{N} \left[ -\log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{(y_i - w^T x_i)^2}{2\sigma^2} \right]
= -N \log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - w^T x_i \right)^2$

Minimizing the MSE is equivalent to maximizing the likelihood.


36
Model evaluation

37
Supervised Learning Summary
Model: $y_i = w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d}$

Objective / Loss function: SSE / MSE

Method: Exact solution, Gradient Descent

Evaluation metric: Empirical risk = Mean squared error

38
Generalization
Assumption: our data is generated independently and identically distributed (i.i.d.) from some unknown distribution $P$:
$(x_i, y_i) \sim P(X, Y)$

The goal is to minimize the expected error (true risk) under $P$:
$R(w) = \int P(x, y) \left( y - w^T x \right)^2 dx \, dy = E_{X,Y}\!\left[ \left( y - w^T x \right)^2 \right]$

Estimate the true risk by the empirical risk on a sample data set $D$:
$\hat{R}_D(w) = \frac{1}{|D|} \sum_{(x, y) \in D} \left( y - w^T x \right)^2$

39
What happens if we optimize on training data?
● Suppose we are given training data $D$.
  Parameter estimation: $\hat{w}_D = \arg\min_w \hat{R}_D(w)$

● Ideally, we want to solve: $w^* = \arg\min_w R(w)$

40
What if we evaluate performance on training data
With $\hat{w}_D = \arg\min_w \hat{R}_D(w)$ and $w^* = \arg\min_w R(w)$,

in general it holds that $E_D\!\left[ \hat{R}_D(\hat{w}_D) \right] \le E_D\!\left[ R(\hat{w}_D) \right]$

41
Other Considerations in Regression Model
Qualitative/Categorical predictors

● Predictors with two levels, e.g. gender


$x = \begin{cases} 1 & \text{if the person is female} \\ 0 & \text{if the person is male} \end{cases}$

42
Other Considerations in Regression Model
Qualitative/Categorical predictors
● Categorical variable with more than two levels, e.g. ethnicity (consider three levels:
Asian, Caucasian, African American)
$x_1 = \begin{cases} 1 & \text{if the person is Asian} \\ 0 & \text{if the person is not Asian} \end{cases}
\qquad
x_2 = \begin{cases} 1 & \text{if the person is Caucasian} \\ 0 & \text{if the person is not Caucasian} \end{cases}$

$y^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \varepsilon^{(i)} =
\begin{cases}
w_0 + w_1 + \varepsilon^{(i)} & \text{if the person is Asian} \\
w_0 + w_2 + \varepsilon^{(i)} & \text{if the person is Caucasian} \\
w_0 + \varepsilon^{(i)} & \text{if the person is African American}
\end{cases}$
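A small pandas sketch of this dummy encoding (hypothetical column name); `drop_first=True` leaves one level out as the baseline, which for these labels happens to be African American, matching the slide.

```python
# A small pandas sketch of this dummy encoding (hypothetical column name).
# drop_first=True leaves one level out as the baseline; with these labels the
# dropped (baseline) level is "African American", matching the slide.
import pandas as pd

df = pd.DataFrame({"ethnicity": ["Asian", "Caucasian", "African American", "Asian"]})
dummies = pd.get_dummies(df["ethnicity"], drop_first=True)
print(dummies)   # columns: Asian, Caucasian
```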

43
Potential Problem – Outliers

How does an outlier influence the regression model?

How do we identify outliers?
Residual plot, data distribution

How do we deal with outliers?
Delete the outlier
More robust models

(Figures: scatter plot of $y$ vs. $x$ with the fitted line, and residual plot of $y - \hat{y}$ vs. $x$.)
44
Potential Problem – High Leverage Points
How does a high leverage point influence the regression model?

How do we identify a high leverage point?
It has an unusual value for the predictor $x$

How do we deal with high leverage points?
Delete the high leverage point
More robust models

(Figures: scatter plot of $y$ vs. $x$ with the fitted line, and residual plot of $y - \hat{y}$ vs. $x$.)

45
Loss function for regression
(Figure: comparison of regression loss functions. Squared loss punishes small errors less and pulls the fit towards the mean; absolute loss pulls it towards the median.)

46
Loss function for regression
Huber Loss (Smooth Mean Absolute Error)
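The Huber loss itself is easy to write down; a small sketch with the threshold `delta` as a free parameter (names are illustrative).

```python
# A small sketch of the Huber loss: quadratic for small residuals, linear for
# large ones, with threshold delta (names are illustrative).
import numpy as np

def huber_loss(residuals, delta=1.0):
    r = np.abs(residuals)
    quadratic = 0.5 * r**2                # used where |r| <= delta
    linear = delta * (r - 0.5 * delta)    # used where |r| >  delta
    return np.where(r <= delta, quadratic, linear)

print(huber_loss(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))
```

For fitting, scikit-learn's `HuberRegressor` (in `sklearn.linear_model`) minimizes a Huber-type loss for linear models.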

47
Potential Problem – Collinearity

(Figure: MSE contour plots in the $(w_1, w_2)$ plane for weakly vs. highly correlated predictors.)

● Consider the level set $\{w \mid (w - \hat{w})^T X^T X (w - \hat{w}) = C\}$. It is the equation of an ellipsoid ($X$ full rank). The lengths of the axes scale with the eigenvalues of $(X^T X)^{-1}$.
● When two predictors are highly correlated, the contours of the MSE run along a narrow valley, so there is a broad range for the coefficient estimates.
● With $x_1$ and $x_2$ (perfectly) linearly related, $X^T X$ has a 0 eigenvalue.
● So the level set $\{w \mid (w - \hat{w})^T X^T X (w - \hat{w}) = C\}$ is no longer an ellipsoid. It is a degenerate ellipsoid, whose contour lines are pairs of lines in this case.

48
The Level Set and the Ellipsoid
$\{w \mid (w - \hat{w})^T X^T X (w - \hat{w}) = C\}$

It is essentially a quadratic form ($z^T M z$). The matrix $X^T X$ is P.S.D.

● Let's consider the 2-d case for simplicity:

  2-d ellipse equation: $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$

  Spectral decomposition: $X^T X = Q \Lambda Q^T$. Consider $(x, y) = Q^T (w - \hat{w})$; then

  $(w - \hat{w})^T X^T X (w - \hat{w}) = (x, y) \, \Lambda \, (x, y)^T = \lambda_1 x^2 + \lambda_2 y^2$

49
Potential Problem – Overfitting
(Figure: Error vs. model complexity — training MSE decreases steadily, while testing MSE eventually rises again.)

In linear regression, the more features $X_j$ we include in the model, the lower the training MSE will be.
Adding too many features to the model may lead to overfitting.

50
More features

51
Ref: Artificial Intelligence in Corneal Diagnosis: Where Are We?
Deciding on important variables
Subset selection:
We identify a subset of the p predictors that we believe to be related to the response.
We then fit a model using least squares on the reduced set of variables.

52
Forward Stepwise Selection
● Forward stepwise selection begins with a model containing no predictors, and then
adds predictors to the model, one-at-a-time, until all of the predictors are in the
model.
● In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model.

53
Forward Stepwise Feature Selection
● Let $M_0$ denote the null model, which contains no predictors.
● For $k = 0, 1, \ldots, p - 1$:
  ○ Consider all $p - k$ models that augment the predictors in $M_k$ with one additional predictor.
  ○ Choose the best among these $p - k$ models, and call it $M_{k+1}$. Here best is defined as having the smallest SSE.
● Select a single best model from among $M_1, \ldots, M_p$ using cross-validated prediction error (a sketch in code follows).
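A minimal sketch of the greedy search on synthetic data (illustrative names, not the lecture's code); it ranks candidates by training SSE as in the step above and omits the final cross-validation step for brevity.

```python
# A greedy forward-selection sketch on synthetic data (illustrative names; the
# final cross-validation step from the slide is omitted for brevity).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=n)   # only features 0 and 3 matter

def sse(features):
    """Training SSE of a least-squares fit on the chosen feature subset."""
    model = LinearRegression().fit(X[:, features], y)
    return np.sum((y - model.predict(X[:, features])) ** 2)

selected, remaining = [], list(range(p))
for _ in range(p):
    best = min(remaining, key=lambda j: sse(selected + [j]))  # M_{k+1}: smallest SSE
    selected.append(best)
    remaining.remove(best)
    print(selected)   # features 0 and 3 are typically added first
```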

54
Backward Stepwise Selection
● Like forward stepwise selection, backward stepwise selection provides an efficient
alternative to best subset selection.
● However, unlike forward stepwise selection, it begins with the full least squares
model containing all p predictors, and then iteratively removes the least useful
predictor, one-at-a-time.

55
Backward Stepwise Selection
● Let 𝑀𝑝 denote the full model, which contains all p predictors.
● For $k = p, p - 1, \ldots, 1$:
  ○ Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k - 1$ predictors.
  ○ Choose the best among these $k$ models, and call it $M_{k-1}$. Here best is defined as having the smallest SSE.
● Select a single best model from among $M_1, \ldots, M_p$ using cross-validated prediction error.

56
Potential problem: synergy between two predictors

When there is a synergy or interaction effect between two predictors


57
Potential problem: synergy between two predictors

Without an interaction term: $y = w_0 + w_1 x_1 + w_2 x_2$

With an interaction term: $y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2$, where $w_3 x_1 x_2$ is the interaction term.
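A tiny sketch of fitting the interaction model by adding the $x_1 x_2$ column by hand (synthetic data, illustrative names).

```python
# A tiny sketch: add the x1*x2 column by hand and fit the usual linear model
# (synthetic data, illustrative names).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=(2, 300))
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + 0.1 * rng.normal(size=300)

X_inter = np.column_stack([x1, x2, x1 * x2])     # includes the interaction term
model = LinearRegression().fit(X_inter, y)
print(model.intercept_, model.coef_)             # close to 1 and [2, 3, 4]
```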

58
Moving Beyond Linearity

60
The world is not linear

61
Polynomial Regression
Polynomial function of different degrees

$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_D x^D
   = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \cdots + w_D \phi_D(x)
   = w^T \phi(x)$

62
Polynomial Regression in sklearn

Create a matrix containing powers of X
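A plausible sketch of what this slide shows (not necessarily its exact code): `PolynomialFeatures` builds the matrix of powers of $X$, and `LinearRegression` is fit on it via a pipeline.

```python
# A plausible sketch (not necessarily the slide's exact code): PolynomialFeatures
# creates the matrix of powers of X, and LinearRegression is fit on it.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
model.fit(X, y)
print(model.predict(X[:5]))
```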

63
Piecewise Polynomials
Divide the domain of $X$ into contiguous intervals.
Represent the function by a separate polynomial in each interval.
In each interval, fit a separate polynomial regression model $y = w_0 + w_1 x + w_2 x^2 + \cdots + w_D x^D$.

The resulting function can be discontinuous at the interval boundaries (a sketch in code follows).
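A minimal sketch with a single knot at $x = 0$, fitting an independent cubic on each interval with `np.polyfit` (synthetic data, illustrative names).

```python
# A minimal piecewise-polynomial sketch with a single knot at x = 0: fit an
# independent cubic on each interval with np.polyfit (illustrative data).
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.where(x < 0, x**2, 1 + np.sin(x)) + 0.1 * rng.normal(size=200)

left, right = x < 0, x >= 0
coef_left = np.polyfit(x[left], y[left], 3)     # cubic fit on x < 0
coef_right = np.polyfit(x[right], y[right], 3)  # cubic fit on x >= 0

def predict(x_new):
    x_new = np.asarray(x_new)
    return np.where(x_new < 0, np.polyval(coef_left, x_new), np.polyval(coef_right, x_new))

print(predict([-1.0, 0.5, 2.0]))   # the prediction may jump at the knot x = 0
```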


64
