Statistics and Probability in Decision Modeling: Linear Regression

This document discusses linear regression analysis. It presents data on sunshine hours and concert attendance measured in hundreds. It calculates the mean and standard deviation of concert attendance. It defines covariance and correlation coefficient, and shows how to calculate them. It explains that linear regression finds the line of best fit that minimizes the sum of squared errors between actual and predicted values of the dependent variable. It indicates this analysis will be applied to the sunshine and concert attendance data to determine the relationship between the two variables.

Uploaded by

Sahil Goutham

Inspire…Educate…Transform.

Statistics and Probability in Decision Modeling
Linear Regression

Dr. L. Srinivasa Varadharajan


[email protected]

Thanks to Dr. Sridhar Pappu for the material.


The BEST GLOBAL DESTINATION for individuals and organizations to learn and adopt disruptive technologies for solving business and society’s challenges
Analyzing relationships between attributes
COVARIANCE, CORRELATION AND R-SQUARED

Concert attendance (100s) 22 33 30 42 38 49 42 55

• The band makes a loss if fewer than 3500 people attend.

• On any given day, how many people would you expect to attend the concert?

Image Source: http://blurtonline.com/wp-content/uploads/2013/06/Shaky-Knees-1514.jpeg; last accessed: May 1, 2014

Sunshine (hours) 1.9 2.5 3.2 3.8 4.7 5.5 5.9 7.2
Concert attendance (100s) 22 33 30 42 38 49 42 55

• Based on predicted hours of sunshine, can we predict ticket sales?


• Are sunshine and concert attendance related?

Some Statistics
Sunshine (hours) 1.9 2.5 3.2 3.8 4.7 5.5 5.9 7.2
Concert attendance (100s) 22 33 30 42 38 49 42 55

• Mean Concert Attendance (100s): $\bar{x} = 38.88$

• Standard Deviation (100s): $s = 10.56$
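
The two figures above can be reproduced with a short Python sketch. The variable names are illustrative and not from the slides, and the sketch assumes the sample form of the standard deviation (dividing by n - 1), which is the form that matches the value quoted above.

```python
import math

# Concert attendance in hundreds, copied from the table above.
attendance = [22, 33, 30, 42, 38, 49, 42, 55]

n = len(attendance)
mean_attendance = sum(attendance) / n

# Sample standard deviation: squared deviations divided by n - 1.
squared_devs = [(y - mean_attendance) ** 2 for y in attendance]
sd_attendance = math.sqrt(sum(squared_devs) / (n - 1))

print(f"mean = {mean_attendance:.2f}, s = {sd_attendance:.2f}")
```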

Covariance

Independent variable: x Dependent variable: y

$s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, $s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$

Covariance between x and y:

$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$
* Height and weight data generated randomly using Excel.
Oil prices from http://www.macrotrends.net/1369/crude-oil-price-history-chart
Potato prices from https://data.gov.in/catalog/dailyweekly-retail-prices-potato
Last accessed: October 28, 2017
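
As a rough illustration of the variance and covariance definitions, the sketch below applies them to the sunshine and concert attendance data from the earlier slides. The helper names (`mean`, `sample_covariance`) are made up for this example, and the variance is obtained by taking the covariance of a variable with itself.

```python
# Data from the earlier slides.
sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]   # x, hours
attendance = [22, 33, 30, 42, 38, 49, 42, 55]         # y, hundreds


def mean(values):
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


var_x = sample_covariance(sunshine, sunshine)        # variance of x
var_y = sample_covariance(attendance, attendance)    # variance of y
cov_xy = sample_covariance(sunshine, attendance)     # covariance of x and y

print(f"var_x = {var_x:.3f}, var_y = {var_y:.3f}, cov_xy = {cov_xy:.3f}")
```

A positive covariance here is consistent with attendance tending to rise with sunshine hours.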

Correlation Coefficient
Covariance between x and y:

$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$

Correlation between x and y:

$r = \frac{s_{xy}}{s_x s_y}$

Equivalently:

$r = \frac{1}{n - 1} \sum \frac{(x - \bar{x})}{s_x} \frac{(y - \bar{y})}{s_y}$

We need to find the equation of the line.
y = a + bx

Here a is the intercept and b is the slope of the line.

Sunshine (hours) 1.9 2.5 3.2 3.8 4.7 5.5 5.9 7.2
Concert attendance (100s) 22 33 30 42 38 49 42 55

• Line of best fit

We need to minimize errors.
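$\hat{y} = a + bx$

The line of best fit is the line that minimizes the distances (residuals) between the actual values ($y$) and the estimated values ($\hat{y}$) for the same x.

[Figure: actual values (y) and predicted values ($\hat{y}$) based on the line of best fit]

We could do that by minimizing $\sum (y_i - \hat{y}_i)$, where $y_i$ is the actual value and $\hat{y}_i$ its estimate. $(y_i - \hat{y}_i)$ is also known as the residual.

But positive and negative residuals cancel: $\sum (y_i - \hat{y}_i) = 0$ for every line that passes through $(\bar{x}, \bar{y})$, so this sum cannot distinguish a good fit from a poor one.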
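
To see why the plain sum of residuals is not a useful criterion, the sketch below tries a few arbitrary slopes for lines forced through the point of means of the sunshine data. The slopes 0, 5 and 20 are illustrative choices, not values from the slides. In each case the residuals should sum to zero (up to floating-point rounding), while the sum of squared residuals differs from line to line.

```python
sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]   # x, hours
attendance = [22, 33, 30, 42, 38, 49, 42, 55]         # y, hundreds

x_bar = sum(sunshine) / len(sunshine)
y_bar = sum(attendance) / len(attendance)


def residuals(slope):
    # A line through (x_bar, y_bar) has intercept y_bar - slope * x_bar.
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(sunshine, attendance)]


def residual_sum(slope):
    return sum(residuals(slope))


def squared_residual_sum(slope):
    return sum(e ** 2 for e in residuals(slope))


for slope in (0.0, 5.0, 20.0):  # arbitrary illustrative slopes
    print(slope, round(residual_sum(slope), 9), round(squared_residual_sum(slope), 2))
```

Only the squared version separates the candidate lines, which is the motivation for minimizing squared errors.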

Just as we did when finding variance, we find the sum of squared errors or SSE.
$SSE = \sum (y_i - \hat{y}_i)^2$

The value of b, the slope, that minimizes the SSE (or Mean Squared Error, MSE) is given by

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{s_{xy}}{s_x^2} = \frac{r s_y}{s_x}$

Sunshine (hours) 1.9 2.5 3.2 3.8 4.7 5.5 5.9 7.2 ($\bar{x} = 4.34$)
Concert attendance (100s) 22 33 30 42 38 49 42 55 ($\bar{y} = 38.88$)

The value of b, the slope, that minimizes the SSE is given by $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$.

How do you calculate a in $\hat{y}_i = a + b x_i$?

The line of best fit must pass through $(\bar{x}, \bar{y})$. Substituting into the equation $\bar{y} = a + b\bar{x}$ gives $a = \bar{y} - b\bar{x}$.

This method of fitting the line of best fit is called Least Squares
Regression or Ordinary Least Squares Regression or OLS
Regression.
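
A minimal sketch of the least squares fit described above, applied to the sunshine and attendance data. The function name `fit_line` and the 6-hour forecast are assumptions made for illustration and do not come from the slides.

```python
sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]   # x, hours
attendance = [22, 33, 30, 42, 38, 49, 42, 55]         # y, hundreds


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope that minimizes the SSE
    a = y_bar - b * x_bar      # the line passes through (x_bar, y_bar)
    return a, b


a, b = fit_line(sunshine, attendance)

forecast_hours = 6.0           # hypothetical forecast, not from the slides
predicted = a + b * forecast_hours

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted attendance (100s) at {forecast_hours} h: {predicted:.2f}")
```

The prediction is only as reliable as the fit itself, which is the question the next slides take up.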

But how do you know how accurate this line is?

[Figure: two scatter plots, one labelled "Accurate Linear Correlation" and one labelled "No Linear Correlation"]

The fit of the line is given by the correlation coefficient.

Correlation Coefficient
The correlation coefficient, r, is a number between -1 and 1 that tells us how well a regression line fits the data.

• r = 1: Positive Linear Correlation
• r = -1: Negative Linear Correlation
• r = 0: No Correlation

It gives the strength and direction of the relationship between two variables.

Correlation Coefficient
$r = \frac{b s_x}{s_y}$, where b is the slope of the line of best fit, $s_x$ is the standard deviation of the x values in the sample, and $s_y$ is the standard deviation of the y values in the sample.

$s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ and $s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$.
Sunshine (hours) 1.9 2.5 3.2 3.8 4.7 5.5 5.9 7.2
Concert attendance (100s) 22 33 30 42 38 49 42 55

Find r for this data. r = 0.916
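
One way to check this value is to compute r both from the covariance and from the slope, as in the sketch below. The variable names are illustrative, and the standard deviations use the sample (n - 1) form.

```python
import math

sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]   # x, hours
attendance = [22, 33, 30, 42, 38, 49, 42, 55]         # y, hundreds

n = len(sunshine)
x_bar = sum(sunshine) / n
y_bar = sum(attendance) / n

sxx = sum((x - x_bar) ** 2 for x in sunshine)
syy = sum((y - y_bar) ** 2 for y in attendance)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sunshine, attendance))

s_x = math.sqrt(sxx / (n - 1))   # sample standard deviation of x
s_y = math.sqrt(syy / (n - 1))   # sample standard deviation of y
cov_xy = sxy / (n - 1)           # sample covariance
b = sxy / sxx                    # least squares slope

r_from_cov = cov_xy / (s_x * s_y)
r_from_slope = b * s_x / s_y

print(round(r_from_cov, 3), round(r_from_slope, 3))
```

Both routes should agree, since they are algebraic rearrangements of the same quantities.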

Covariance and Correlation - Summary
$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$, $r = \frac{s_{xy}}{s_x s_y}$
• If both x and y are a large distance away from their respective means, the resulting product, and hence its contribution to the covariance, will be even larger.
– The product will be positive if both are below the mean or both are above.
– If one is above and the other below, the product will be negative.
• If even one of them is very close to the mean, the contribution to the covariance will be small.
• Cov(x,x)=Var(x)

• The value of covariance by itself doesn’t say much. It only shows whether the variables move together (positive value) or opposite to each other (negative value).
– It is affected by scale (measuring height in ft vs mm).
– Covariance values are not intuitive to compare between two pairs of variables (how does height-weight covariance compare with oil price ($) and potato price (Rupee) covariance?).
– Its units are unintuitive.


• To know the strength of how the variables move together, covariance is standardized to the dimensionless quantity, correlation.
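
The scale dependence of covariance, and the scale independence of correlation, can be illustrated with the sunshine data. The sketch below is an assumption-laden example: it re-expresses sunshine in minutes instead of hours (a unit change chosen for illustration, analogous to ft vs mm) and uses a made-up helper, `cov_and_corr`.

```python
import math


def cov_and_corr(xs, ys):
    """Return (sample covariance, correlation coefficient) for two lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    cov = sxy / (n - 1)
    r = sxy / math.sqrt(sxx * syy)   # same as cov / (s_x * s_y)
    return cov, r


sunshine_hours = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]
attendance = [22, 33, 30, 42, 38, 49, 42, 55]

# Same measurements, different unit (illustrative choice).
sunshine_minutes = [60 * h for h in sunshine_hours]

cov_h, r_h = cov_and_corr(sunshine_hours, attendance)
cov_m, r_m = cov_and_corr(sunshine_minutes, attendance)

print(f"hours:   cov = {cov_h:.2f}, r = {r_h:.3f}")
print(f"minutes: cov = {cov_m:.2f}, r = {r_m:.3f}")
```

The covariance grows by the unit conversion factor while r stays put, which is why correlation is the quantity to compare across data sets.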

R-squared
The coefficient of determination is given by $r^2$ or $R^2$. It is the percentage of variation in the y variable that is explainable by the x variable.

For example, what percentage of the variation in open-air concert attendance is explainable by the number of hours of predicted sunshine?

If $r^2 = 0$, it means you can’t predict the y value from the x value.

If $r^2 = 1$, it means you can predict the y value from the x value without any errors.

Usually, $r^2$ is between these two extremes.

R-squared
SST (recall Sum of Squares Total from ANOVA): this is the total variation in the data. The horizontal line at $\bar{y}$ indicates expected concert attendance when sunshine is not considered. This “model” has large residuals.

$SST = \sum (y_i - \bar{y})^2$

R-squared
SSE (recall Sum of Squares Within from ANOVA, the inherent noise): this is the unexplained variation in the data. The regression line indicates expected concert attendance when sunshine is considered. This “model” has small residuals.

$SSE = \sum (y_i - \hat{y}_i)^2$

R-squared
Total Variation (not considering sunshine): $SST = \sum (y_i - \bar{y})^2$
Unexplained Variation (considering sunshine): $SSE = \sum (y_i - \hat{y}_i)^2$
Explained Variation: $SSR = \sum (\hat{y}_i - \bar{y})^2$

$SST = SSR + SSE \Rightarrow \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = R^2$
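
The decomposition can be checked numerically on the sunshine and attendance data. The following is a sketch with illustrative variable names; it fits the least squares line first and then computes the three sums of squares from the fitted values.

```python
sunshine = [1.9, 2.5, 3.2, 3.8, 4.7, 5.5, 5.9, 7.2]   # x, hours
attendance = [22, 33, 30, 42, 38, 49, 42, 55]         # y, hundreds

n = len(sunshine)
x_bar = sum(sunshine) / n
y_bar = sum(attendance) / n

# Least squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sunshine, attendance))
sxx = sum((x - x_bar) ** 2 for x in sunshine)
b = sxy / sxx
a = y_bar - b * x_bar

predicted = [a + b * x for x in sunshine]

sst = sum((y - y_bar) ** 2 for y in attendance)                       # total
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(attendance, predicted))  # unexplained
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)                # explained

r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r_squared:.3f}")
```

Under these assumptions, roughly 84% of the variation in attendance is explained by sunshine hours, which agrees with squaring r = 0.916.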

Covariance, Correlation and R-squared
How do the interest rates of federal funds and the commodities futures index
co-vary and correlate?
Day Interest Rate Futures Index
1 7.43 221
2 7.48 222
3 8.00 226
4 7.75 225
5 7.60 224
6 7.63 223
7 7.68 223
8 7.67 226
9 7.59 226
10 8.07 235
11 8.03 233
12 8.00 241

Covariance, Correlation and R-squared
Day    Interest Rate (x)    Futures Index (y)    $x - \bar{x}$    $y - \bar{y}$    $(x - \bar{x})(y - \bar{y})$
1      7.43                 221                  -0.314           -6.083           1.911
2      7.48                 222                  -0.264           -5.083           1.343
3      8.00                 226                  0.256            -1.083           -0.277
4      7.75                 225                  0.006            -2.083           -0.012
5      7.60                 224                  -0.144           -3.083           0.445
6      7.63                 223                  -0.114           -4.083           0.466
7      7.68                 223                  -0.064           -4.083           0.262
8      7.67                 226                  -0.074           -1.083           0.080
9      7.59                 226                  -0.154           -1.083           0.167
10     8.07                 235                  0.326            7.917            2.580
11     8.03                 233                  0.286            5.917            1.691
12     8.00                 241                  0.256            13.917           3.560
Mean   7.74                 227.08                                 Sum              12.216
StDev  0.22                 6.07

$Cov = \frac{12.216}{11} = 1.111$

$r = \frac{1.111}{0.22 \times 6.07} = 0.815$ (computed with the unrounded standard deviations)

$R^2 = 0.815^2 = 0.665$
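
The worked example can be reproduced with a few lines of Python. This is a sketch with illustrative variable names; it uses unrounded intermediate values, which is what recovers r = 0.815 (plugging in the rounded standard deviations 0.22 and 6.07 would give a slightly different figure).

```python
import math

# Data from the table above.
interest_rate = [7.43, 7.48, 8.00, 7.75, 7.60, 7.63,
                 7.68, 7.67, 7.59, 8.07, 8.03, 8.00]      # x
futures_index = [221, 222, 226, 225, 224, 223,
                 223, 226, 226, 235, 233, 241]            # y

n = len(interest_rate)
x_bar = sum(interest_rate) / n
y_bar = sum(futures_index) / n

sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(interest_rate, futures_index))
sxx = sum((x - x_bar) ** 2 for x in interest_rate)
syy = sum((y - y_bar) ** 2 for y in futures_index)

cov = sxy / (n - 1)                  # sample covariance
s_x = math.sqrt(sxx / (n - 1))       # sample standard deviations
s_y = math.sqrt(syy / (n - 1))
r = cov / (s_x * s_y)
r_squared = r ** 2

print(f"cov = {cov:.3f}, r = {r:.3f}, R^2 = {r_squared:.3f}")
```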

Covariance, Correlation and R-squared - SUMMARY
• Covariance
– Tells you the direction of the relationship between 2 variables.
• Correlation Coefficient
– Tells you the direction AND strength of the linear relationship between 2 variables.
• $R^2$
– Tells you what percentage of the variation in y can be explained by the model (or equivalently, by the independent variable(s)).

Welcome to the Learning Models
• Linear regression: A regression model where the class/dependent/target variable is numeric

• Logistic regression: A classification model where the class/dependent/target variable is categorical

HYDERABAD
2nd Floor, Jyothi Imperial, Vamsiram Builders, Old Mumbai Highway, Gachibowli, Hyderabad - 500 032
+91-9701685511 (Individuals)
+91-9618483483 (Corporates)

PUNE
Kirloskar - Pune
S. L. Kirloskar Center for Executive Education, Kirloskar Corporate Office, 8th Floor, Cello Platina, Model Colony, Shivaji Nagar – 411005

BENGALURU
Floors 1-3, L77, 15th Cross Road, 3A Main Road, Sector 6, HSR Layout, Bengaluru – 560 102
+91-9502334561 (Individuals)
+91-9502799088 (Corporates)

MUMBAI
Kanakia Wall Street, 4th Floor, Andheri-Kurla Road, Chakala, Andheri East, Mumbai - 400093

Web: http://www.insofe.edu.in
Facebook: https://www.facebook.com/insofe
Twitter: https://twitter.com/Insofeedu
YouTube: http://www.youtube.com/InsofeVideos
SlideShare: http://www.slideshare.net/INSOFE
LinkedIn: http://www.linkedin.com/company/international-school-of-engineering

This presentation may contain references to findings of various reports available in the public domain. INSOFE makes no representation as to their accuracy or that the organization subscribes to those findings.
