
Linear Regression

CS 360 Lecture 10

1
Solving unconstrained convex optimization

2
How to solve convex optimization problems
Assume no constraints; then the problem

$\arg\min_x f(x)$, where $f(x)$ is a convex function,

is equivalent to solving

$\nabla f(x) = 0$, where $\nabla f(x) = \left( \frac{\partial}{\partial x_0} f(x), \frac{\partial}{\partial x_1} f(x), \ldots, \frac{\partial}{\partial x_p} f(x) \right)$

3
Example
Objective: $f(x) = x^2 + 1$

Solve: $\frac{df(x)}{dx} = 2x = 0 \;\Rightarrow\; x = 0$

4
Example in 2D
Objective: $f(x) = x_1^2 + x_2^2 + x_1 x_2 - x_1 + 1$

Solve:
$\frac{\partial f(x)}{\partial x_1} = 2x_1 + x_2 - 1$
$\frac{\partial f(x)}{\partial x_2} = 2x_2 + x_1$

$\nabla f(x) = \left( \frac{\partial}{\partial x_1} f(x), \frac{\partial}{\partial x_2} f(x) \right) = (0, 0)$

5
Optimization using Scipy
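The slide's SciPy demo is not reproduced in this text, so below is a plausible minimal sketch (not necessarily the lecture's exact code) that solves the 2D example above with `scipy.optimize.minimize`; the function and variable names are illustrative.

```python
# A plausible minimal sketch (not necessarily the lecture's code): minimize the
# 2D example f(x) = x1^2 + x2^2 + x1*x2 - x1 + 1 with scipy.optimize.minimize.
import numpy as np
from scipy.optimize import minimize

def f(x):
    x1, x2 = x
    return x1**2 + x2**2 + x1*x2 - x1 + 1

def grad_f(x):
    x1, x2 = x
    return np.array([2*x1 + x2 - 1, 2*x2 + x1])

result = minimize(f, x0=np.zeros(2), jac=grad_f, method="BFGS")
print(result.x)     # approximately ( 2/3, -1/3 )
print(result.fun)   # approximately 2/3
```

Passing the analytic gradient via `jac` is optional; if it is omitted, SciPy approximates the gradient numerically.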

8
Gradient Descent

9
Gradient Descent

10
Gradient

Each arrow is the gradient evaluated at that point. The gradients point towards the direction of
steepest ascent.
11
Gradient Descent Algorithm
● Choose initial guess 𝑥0
● While not converged (𝜂 is the learning rate)
$x_{t+1} = x_t - \eta \, \nabla f(x_t)$ (a short sketch in code follows)
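A minimal sketch of this update rule (not the lecture's implementation); `gradient_descent`, `eta`, and `tol` are illustrative names, and convergence is declared when the update becomes negligible.

```python
# A minimal gradient-descent loop: step along the negative gradient until the
# update is negligible. Names (eta, tol) are illustrative, not from the lecture.
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:   # converged: the update is tiny
            break
    return x

# Demo on the earlier example f(x) = x^2 + 1, whose gradient is 2x.
print(gradient_descent(lambda x: 2 * x, x0=[5.0]))   # approaches 0
```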

12
Choice of learning rate

13
Debugging gradient descent

14
Example in 2D (revisited)
Objective: $f(x) = x_1^2 + x_2^2 + x_1 x_2 - x_1 + 1$

Solve:
$\frac{\partial f(x)}{\partial x_1} = 2x_1 + x_2 - 1$
$\frac{\partial f(x)}{\partial x_2} = 2x_2 + x_1$

$\nabla f(x) = \left( \frac{\partial}{\partial x_1} f(x), \frac{\partial}{\partial x_2} f(x) \right)$

15
Example in 2D (revisited)
Algorithm:

$\nabla f(x) = \begin{pmatrix} 2x_1 + x_2 - 1 \\ 2x_2 + x_1 \end{pmatrix}$

$x_{t+1} \leftarrow x_t - \eta \, \nabla f(x_t)$

Try it for $\eta = 0.3$ (a sketch follows):
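A quick sketch of that run, starting from an assumed initial guess $x_0 = (0, 0)$ (the starting point is not specified on the slide):

```python
# Gradient descent on the 2D example with eta = 0.3 (illustrative sketch).
import numpy as np

def grad_f(x):
    x1, x2 = x
    return np.array([2 * x1 + x2 - 1, 2 * x2 + x1])

eta = 0.3
x = np.zeros(2)                    # assumed starting point (0, 0)
for t in range(100):
    x = x - eta * grad_f(x)
print(x)   # converges to the analytic minimizer (2/3, -1/3)
```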

16
Caveat: Perils of non-convex functions

Non-convex functions can lead to local minima or singularities.


17
Caveat: local minima

Non-convex functions can lead to local minima.


18
Caveat: singularities

19
Example
Find the minimum of the function $g(u, v) = \left( u + 2v^2 + 10\sin u + 2u^2 - uv - 2 \right)/30$

20
Example
Find the minimum of the function $g(u, v) = \left( u + 2v^2 + 10\sin u + 2u^2 - uv - 2 \right)/30$

21
Linear Regression

22
Example
Suppose that we are statistical consultants hired by a client to provide advice on how to
improve sales of a particular product.
The Advertising data set consists of the sales of that product in 200 different markets,
along with advertising budgets for the product in each of those markets for three
different media: TV, radio, and newspaper.

23
Linear Regression
We assume the model:
$y = w_0 + w_1 x_1 + \cdots + w_d x_d + \varepsilon$

We interpret 𝑤𝑗 as the average effect on y of a one unit increase in 𝑥𝑗 , holding all other
predictors fixed. In the advertising example, the model becomes

Sales = 𝑤0 + 𝑤1𝑇𝑉 + 𝑤2 𝑅𝑎𝑑𝑖𝑜 + 𝑤3 𝑁𝑒𝑤𝑠𝑝𝑎𝑝𝑒𝑟 + noise

𝜀 is the noise term, typically i.i.d. for every sample.

24
Estimate the parameters
Minimize the mean squared error/loss
$\mathrm{MSE} = L(w) = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_{i,1} - \cdots - w_d x_{i,d} \right)^2$

Let $\hat{y}_i$ be the prediction using the estimated parameters:
$\hat{y}_i = w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d}$

$e_i$ is the error (residual) on sample $i$, i.e. $e_i = y_i - \hat{y}_i$

25
Estimate the parameters
Matrix representation
Suppose we have training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i = (1, x_{i,1}, \ldots, x_{i,d})$. In matrix form $y = Xw$, with

$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} \text{---}\; x_1 \;\text{---} \\ \vdots \\ \text{---}\; x_n \;\text{---} \end{pmatrix}
  = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \\ 1 & x_{2,1} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,d} \end{pmatrix}$

where $w = (w_0, w_1, \ldots, w_d)^T$. The loss function is

$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_0 - w_1 x_{i,1} - \cdots - w_d x_{i,d} \right)^2 = \frac{1}{n} (y - Xw)^T (y - Xw)$
26
Estimate the parameters
● Differentiate the MSE w.r.t. $w$ and set it to zero:

  $-2 X^T (y - X\hat{w}) = 0$

● If $X^T X$ is nonsingular, then the unique solution is given by

  $\hat{w} = (X^T X)^{-1} X^T y$

● When $X^T X$ is not invertible, the least squares solution is NOT unique (a numerical sketch follows).
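A minimal NumPy sketch of the closed-form solution on synthetic data (shapes, seeds, and names are illustrative, not from the lecture); `np.linalg.lstsq` is shown alongside as the safer call when $X^T X$ is singular or ill-conditioned.

```python
# A minimal NumPy sketch of the closed-form solution on synthetic data
# (illustrative shapes and names). lstsq is the safer call when X^T X is
# singular or ill-conditioned: it returns a minimum-norm least squares solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept column first
w_true = np.array([2.0, 0.5, -1.0, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)         # (X^T X)^{-1} X^T y, needs X^T X invertible
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # works even when X^T X is singular
print(w_hat)
print(w_lstsq)   # both close to w_true here
```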

27
Interpretation of the Minimizer: Projection onto the Column Space
$\min_w \| y - Xw \|_2^2 \;=\; \min_{z \in \mathrm{Col}(X)} \| y - z \|_2^2$

The minimizer $z$ should be $y$'s projection onto the column space of $X$,
i.e., $z = Hy$ where $H$ is the projection matrix for the column space of $X$.

$\hat{w} = (X^T X)^{-1} X^T y \quad\Rightarrow\quad X\hat{w} = X (X^T X)^{-1} X^T y$

The equation for the projection matrix: $H = X (X^T X)^{-1} X^T$, so
$Hy = X (X^T X)^{-1} X^T y = X\hat{w}$

28
Advertising example
The linear regression model is easy to interpret:
$y = 2.939 + 0.046 \cdot \text{TV} + 0.189 \cdot \text{Radio} - 0.001 \cdot \text{Newspaper}$

29
Multicollinearity
● Multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a
multiple regression model can be linearly predicted from the others.
○ The coefficient estimates of the multiple regression may change erratically in response to small changes
in the model or the data.

● The parameter estimate for each individual predictor may not be interpretable, since the
variable may be redundant given the others.

● With perfect multicollinearity, the data matrix $X$ is not full rank, and therefore $X^T X$
cannot be inverted: the ordinary least squares estimator $(X^T X)^{-1} X^T y$ does not exist.
(A small numerical illustration follows.)
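A small synthetic illustration of that instability (not from the lecture): with two nearly collinear predictors, the individual least-squares coefficients swing wildly between refits, even though their sum stays stable.

```python
# A synthetic illustration (not from the lecture): with two nearly collinear
# predictors, individual coefficients swing wildly between refits, while their
# sum stays stable.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # x2 is almost exactly x1
X = np.column_stack([np.ones(n), x1, x2])

for trial in range(3):
    y = 3 * x1 + 0.1 * rng.normal(size=n)    # same signal, fresh noise each trial
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w, "sum of the two slopes:", w[1] + w[2])   # the sum stays near 3
```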

30
Estimate the parameters – Gradient Descent
Goal: find $w$ that minimizes the loss function $L(w)$

Iterative approach:
● Begin with some initial value $w^{(0)}$, for example $w^{(0)} = (0, 0, \ldots, 0)^T$
● Repeat until convergence:
  evaluate the partial derivatives of $L(w)$ at the current value of $w$ and update $w$ using
  $w^{(t+1)} \leftarrow w^{(t)} - \eta \left. \frac{\partial L(w)}{\partial w} \right|_{w = w^{(t)}}$

31
Estimate the parameters – Gradient Descent
Loss function: $L(w) = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} \left( w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d} - y_i \right)^2$

The partial derivative: $\frac{\partial L(w)}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} \left( w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d} - y_i \right) x_{i,j}$

$w_j^{(t+1)} \leftarrow w_j^{(t)} - \eta \left. \frac{\partial L(w)}{\partial w_j} \right|_{w_j = w_j^{(t)}}$

Or, written in vector form:

$w^{(t+1)} \leftarrow w^{(t)} - \eta \left. \frac{\partial L(w)}{\partial w} \right|_{w = w^{(t)}}, \qquad
\frac{\partial L(w)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right) x_i$
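A minimal sketch of batch gradient descent for linear regression on synthetic data (the data-generating process and names are illustrative, not the lecture's code).

```python
# Batch gradient descent for linear regression on synthetic data
# (illustrative names and data; not the lecture's code).
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept column plus d features
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d + 1)
eta = 0.1
for t in range(2000):
    grad = (2 / n) * X.T @ (X @ w - y)   # (2/n) * sum_i (w^T x_i - y_i) x_i
    w = w - eta * grad
print(w)   # close to w_true
```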
32
Gradient Descent and Stochastic Gradient Descent
Gradient Descent:
Repeat until convergence:
$w^{(t+1)} \leftarrow w^{(t)} - \eta \, \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right) x_i$

Stochastic Gradient Descent:
Repeat until convergence (with $i$ a randomly chosen sample at each step):
$w^{(t+1)} \leftarrow w^{(t)} - \eta \left( w^T x_i - y_i \right) x_i$
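And a matching sketch of stochastic gradient descent, updating on one randomly chosen sample per step (again on illustrative synthetic data).

```python
# Stochastic gradient descent on synthetic linear-regression data: one randomly
# chosen sample per update (illustrative names; not the lecture's code).
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(3)
eta = 0.01
for t in range(50_000):
    i = rng.integers(n)                           # pick one sample at random
    w = w - eta * 2 * (X[i] @ w - y[i]) * X[i]    # gradient of the i-th squared error
print(w)   # noisy but close to w_true
```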

33
Gradient Descent

SGD requires more iterations, but every iteration is much cheaper.


SGD typically has a lower overall cost. Minibatch-SGD may achieve a better trade-off.
34
Probabilistic interpretation
Assumptions: $y_i = w^T x_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. (independently and identically distributed)

So we have
$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\varepsilon_i^2}{2\sigma^2} \right)$

$p(y_i \mid x_i; w) = P(Y = y_i \mid X = x_i; w) = P(\varepsilon = y_i - w^T x_i \mid X = x_i; w)
= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right)$

35
Probabilistic interpretation
Since the $\varepsilon_i$'s are independent, we can derive the likelihood of the model:

$f(w) = \prod_{i=1}^{N} p(y_i \mid x_i; w) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right)$

The log-likelihood is

$\ell(w) = \log f(w) = \sum_{i=1}^{N} \left[ -\log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{(y_i - w^T x_i)^2}{2\sigma^2} \right]
= -N \log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - w^T x_i \right)^2$

Minimizing the MSE is equivalent to maximizing the likelihood.


36
Model evaluation

37
Supervised Learning Summary
Model: $y_i = w_0 + w_1 x_{i,1} + \cdots + w_d x_{i,d}$

Objective / Loss function: SSE / MSE

Method: Exact solution, Gradient Descent

Evaluation metric: Empirical risk = Mean squared error

38
Generalization
Assumption: our data is generated independently and identically distributed (i.i.d.) from some unknown distribution $P$:
$(x_i, y_i) \sim P(X, Y)$

The goal is to minimize the expected error (true risk) under $P$:
$R(w) = \int P(x, y) \left( y - w^T x \right)^2 dx \, dy = E_{X,Y}\!\left[ \left( y - w^T x \right)^2 \right]$

Estimate the true risk by the empirical risk on a sample data set $D$:
$\hat{R}_D(w) = \frac{1}{|D|} \sum_{(x, y) \in D} \left( y - w^T x \right)^2$

39
What happens if we optimize on training data?
● Suppose we are given training data $D$.
  Parameter estimation: $\hat{w}_D = \arg\min_w \hat{R}_D(w)$

● Ideally, we want to solve: $w^* = \arg\min_w R(w)$

40
What if we evaluate performance on training data
With $\hat{w}_D = \arg\min_w \hat{R}_D(w)$ and $w^* = \arg\min_w R(w)$,

in general it holds that $E_D\!\left[ \hat{R}_D(\hat{w}_D) \right] \le E_D\!\left[ R(\hat{w}_D) \right]$

41
Other Considerations in Regression Model
Qualitative/Categorical predictors

● Predictors with two levels, e.g. gender


$x = \begin{cases} 1 & \text{if the person is female} \\ 0 & \text{if the person is male} \end{cases}$

42
Other Considerations in Regression Model
Qualitative/Categorical predictors
● Categorical variable with more than two levels, e.g. ethnicity (consider three levels:
Asian, Caucasian, African American)
$x_1 = \begin{cases} 1 & \text{if the person is Asian} \\ 0 & \text{if the person is not Asian} \end{cases}
\qquad
x_2 = \begin{cases} 1 & \text{if the person is Caucasian} \\ 0 & \text{if the person is not Caucasian} \end{cases}$

$y^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \varepsilon^{(i)} =
\begin{cases}
w_0 + w_1 + \varepsilon^{(i)} & \text{if the person is Asian} \\
w_0 + w_2 + \varepsilon^{(i)} & \text{if the person is Caucasian} \\
w_0 + \varepsilon^{(i)} & \text{if the person is African American}
\end{cases}$
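A small pandas sketch of this dummy encoding (hypothetical column name); `drop_first=True` leaves one level out as the baseline, which for these labels happens to be African American, matching the slide.

```python
# A small pandas sketch of this dummy encoding (hypothetical column name).
# drop_first=True leaves one level out as the baseline; with these labels the
# dropped (baseline) level is "African American", matching the slide.
import pandas as pd

df = pd.DataFrame({"ethnicity": ["Asian", "Caucasian", "African American", "Asian"]})
dummies = pd.get_dummies(df["ethnicity"], drop_first=True)
print(dummies)   # columns: Asian, Caucasian
```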

43
Potential Problem – Outliers

How does an outlier influence the regression model?

How do we identify outliers?
Residual plot, data distribution

How do we deal with outliers?
Delete the outlier
More robust models

(Figures: scatter plot of $y$ vs. $x$ with the fitted line, and residual plot of $y - \hat{y}$ vs. $x$.)
44
Potential Problem – High Leverage Points
How does a high leverage point influence the regression model?

How do we identify a high leverage point?
It has an unusual value for the predictor $x$

How do we deal with high leverage points?
Delete the high leverage point
More robust models

(Figures: scatter plot of $y$ vs. $x$ with the fitted line, and residual plot of $y - \hat{y}$ vs. $x$.)

45
Loss function for regression
(Figure: comparison of regression loss functions. Squared loss punishes small errors less and pulls the fit towards the mean; absolute loss pulls it towards the median.)

46
Loss function for regression
Huber Loss (Smooth Mean Absolute Error)
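The Huber loss itself is easy to write down; a small sketch with the threshold `delta` as a free parameter (names are illustrative).

```python
# A small sketch of the Huber loss: quadratic for small residuals, linear for
# large ones, with threshold delta (names are illustrative).
import numpy as np

def huber_loss(residuals, delta=1.0):
    r = np.abs(residuals)
    quadratic = 0.5 * r**2                # used where |r| <= delta
    linear = delta * (r - 0.5 * delta)    # used where |r| >  delta
    return np.where(r <= delta, quadratic, linear)

print(huber_loss(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))
```

For fitting, scikit-learn's `HuberRegressor` (in `sklearn.linear_model`) minimizes a Huber-type loss for linear models.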

47
Potential Problem – Collinearity

(Figure: MSE contour plots in the $(w_1, w_2)$ plane for weakly vs. highly correlated predictors.)

● Consider the level set $\{w \mid (w - \hat{w})^T X^T X (w - \hat{w}) = C\}$. It is the equation of an ellipsoid ($X$ full rank). The lengths of the axes scale with the eigenvalues of $(X^T X)^{-1}$.
● When two predictors are highly correlated, the contours of the MSE run along a narrow valley, so there is a broad range for the coefficient estimates.
● With $x_1$ and $x_2$ (perfectly) linearly related, $X^T X$ has a 0 eigenvalue.
● So the level set $\{w \mid (w - \hat{w})^T X^T X (w - \hat{w}) = C\}$ is no longer an ellipsoid. It is a degenerate ellipsoid, whose contour lines are pairs of lines in this case.

48
The Level Set and the Ellipsoid
$\{w \mid (w - \hat{w})^T X^T X (w - \hat{w}) = C\}$

It is essentially a quadratic form ($z^T M z$). The matrix $X^T X$ is P.S.D.

● Let's consider the 2-d case for simplicity:

  2-d ellipse equation: $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$

  Spectral decomposition: $X^T X = Q \Lambda Q^T$. Consider $(x, y) = Q^T (w - \hat{w})$; then

  $(w - \hat{w})^T X^T X (w - \hat{w}) = (x, y) \, \Lambda \, (x, y)^T = \lambda_1 x^2 + \lambda_2 y^2$

49
Potential Problem – Overfitting
(Figure: Error vs. model complexity — training MSE decreases steadily, while testing MSE eventually rises again.)

In linear regression, the more features $X_j$ we include in the model, the lower the training MSE will be.
Adding too many features to the model may lead to overfitting.

50
More features

51
Ref: Artificial Intelligence in Corneal Diagnosis: Where Are We?
Deciding on important variables
Subset selection:
We identify a subset of the p predictors that we believe to be related to the response.
We then fit a model using least squares on the reduced set of variables.

52
Forward Stepwise Selection
● Forward stepwise selection begins with a model containing no predictors, and then
adds predictors to the model, one-at-a-time, until all of the predictors are in the
model.
● In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model.

53
Forward Stepwise Feature Selection
● Let $M_0$ denote the null model, which contains no predictors.
● For $k = 0, 1, \ldots, p - 1$:
  ○ Consider all $p - k$ models that augment the predictors in $M_k$ with one additional predictor.
  ○ Choose the best among these $p - k$ models, and call it $M_{k+1}$. Here best is defined as having the smallest SSE.
● Select a single best model from among $M_1, \ldots, M_p$ using cross-validated prediction error (a sketch in code follows).
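A minimal sketch of the greedy search on synthetic data (illustrative names, not the lecture's code); it ranks candidates by training SSE as in the step above and omits the final cross-validation step for brevity.

```python
# A greedy forward-selection sketch on synthetic data (illustrative names; the
# final cross-validation step from the slide is omitted for brevity).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=n)   # only features 0 and 3 matter

def sse(features):
    """Training SSE of a least-squares fit on the chosen feature subset."""
    model = LinearRegression().fit(X[:, features], y)
    return np.sum((y - model.predict(X[:, features])) ** 2)

selected, remaining = [], list(range(p))
for _ in range(p):
    best = min(remaining, key=lambda j: sse(selected + [j]))  # M_{k+1}: smallest SSE
    selected.append(best)
    remaining.remove(best)
    print(selected)   # features 0 and 3 are typically added first
```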

54
Backward Stepwise Selection
● Like forward stepwise selection, backward stepwise selection provides an efficient
alternative to best subset selection.
● However, unlike forward stepwise selection, it begins with the full least squares
model containing all p predictors, and then iteratively removes the least useful
predictor, one-at-a-time.

55
Backward Stepwise Selection
● Let 𝑀𝑝 denote the full model, which contains all p predictors.
● For $k = p, p - 1, \ldots, 1$:
  ○ Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k - 1$ predictors.
  ○ Choose the best among these $k$ models, and call it $M_{k-1}$. Here best is defined as having the smallest SSE.
● Select a single best model from among $M_1, \ldots, M_p$ using cross-validated prediction error.

56
Potential problem: synergy between two predictors

When there is a synergy or interaction effect between two predictors


57
Potential problem: synergy between two predictors

Without an interaction term: $y = w_0 + w_1 x_1 + w_2 x_2$

With an interaction term: $y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2$, where $w_3 x_1 x_2$ is the interaction term.
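A tiny sketch of fitting the interaction model by adding the $x_1 x_2$ column by hand (synthetic data, illustrative names).

```python
# A tiny sketch: add the x1*x2 column by hand and fit the usual linear model
# (synthetic data, illustrative names).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=(2, 300))
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + 0.1 * rng.normal(size=300)

X_inter = np.column_stack([x1, x2, x1 * x2])     # includes the interaction term
model = LinearRegression().fit(X_inter, y)
print(model.intercept_, model.coef_)             # close to 1 and [2, 3, 4]
```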

58
Moving Beyond Linearity

60
The world is not linear

61
Polynomial Regression
Polynomial function of different degrees

$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_D x^D
   = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \cdots + w_D \phi_D(x)
   = w^T \phi(x)$

62
Polynomial Regression in sklearn

Create a matrix containing powers of X
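A plausible sketch of what this slide shows (not necessarily its exact code): `PolynomialFeatures` builds the matrix of powers of $X$, and `LinearRegression` is fit on it via a pipeline.

```python
# A plausible sketch (not necessarily the slide's exact code): PolynomialFeatures
# creates the matrix of powers of X, and LinearRegression is fit on it.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
model.fit(X, y)
print(model.predict(X[:5]))
```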

63
Piecewise Polynomials
Divide the domain of $X$ into contiguous intervals.
Represent the function by a separate polynomial in each interval.
In each interval, fit a separate polynomial regression model $y = w_0 + w_1 x + w_2 x^2 + \cdots + w_D x^D$.

The resulting function can be discontinuous at the interval boundaries (a sketch in code follows).
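A minimal sketch with a single knot at $x = 0$, fitting an independent cubic on each interval with `np.polyfit` (synthetic data, illustrative names).

```python
# A minimal piecewise-polynomial sketch with a single knot at x = 0: fit an
# independent cubic on each interval with np.polyfit (illustrative data).
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.where(x < 0, x**2, 1 + np.sin(x)) + 0.1 * rng.normal(size=200)

left, right = x < 0, x >= 0
coef_left = np.polyfit(x[left], y[left], 3)     # cubic fit on x < 0
coef_right = np.polyfit(x[right], y[right], 3)  # cubic fit on x >= 0

def predict(x_new):
    x_new = np.asarray(x_new)
    return np.where(x_new < 0, np.polyval(coef_left, x_new), np.polyval(coef_right, x_new))

print(predict([-1.0, 0.5, 2.0]))   # the prediction may jump at the knot x = 0
```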


64
