Correlation and Regression Analysis

IOE pulchowk

Uploaded by

basnetaayush407
PROBABILITY AND STATISTICS
Syllabus
• Correlation and Regression (6 hours)
– Least square method
– An analysis of variance of Linear Regression
model
– Inference concerning Least square method
• Multiple correlation and regression
Variable: a symbolic name associated with a value, where the associated value may change.
Quantitative variable: a variable that is measured on a numeric (quantitative) scale. A country’s population, a person’s shoe size, and a car’s speed are all quantitative variables.
Variables that are not quantitative are known as qualitative variables.
Correlation:
Correlation is a statistical technique used to determine
the degree to which two variables are related.
• A scatter diagram (also known as a scatter plot, scatter graph, or correlation chart) is a tool for analyzing the relationship between two variables and determining how closely they are related.
• One variable is plotted on the horizontal axis and the
other is plotted on the vertical axis. The pattern of
their intersecting points can graphically show
relationship patterns.
Scatter Diagram
• Rectangular coordinates
• Two quantitative variables
• One variable is called independent (X) and the second is called dependent (Y)
• Points are not joined
• No frequency table
• Most common way of visualizing the association between two quantitative variables
• What we have to look for in a scatter plot:
i) Linearity (straight line)
ii) Spread
iii) Outliers
iv) Correlation
[Figure: points (*) plotted on rectangular axes, X horizontal and Y vertical]
Scatter Plots
The pattern of data is indicative of the type of relationship between
two variables:
Positive Relationship
Negative Relationship
No Relationship
• Positive correlation: the correlation is said to be positive if the values of the two variables change in the same direction.
Ex. Publicity expenditure & sales, height & weight, study time & grades.
• Negative correlation: the correlation is said to be negative if the values of the two variables change in opposite directions.
Ex. Price & quantity demanded, alcohol consumption & driving ability.
[Figure: scatter plots showing a positive relationship, a negative relationship, and no relationship; axis labels: Strength, Age of buildings]
Linear and Non-Linear Correlation
Simple Correlation Coefficient
• The most common measure of correlation; also called the Pearson coefficient of correlation
• It is an index of the relationship between two variables
• It reflects the degree of linear relationship between two variables
• It is symmetric in nature
• The value of r ranges between -1 and +1
• The value of r denotes the strength of the association, as illustrated by the following scale:

Value of r          Strength of association
-1                  perfect correlation
-1 to -0.75         strong
-0.75 to -0.25      intermediate
-0.25 to 0          weak
0                   no relation
0 to 0.25           weak
0.25 to 0.75        intermediate
0.75 to 1           strong
+1                  perfect correlation
Assumptions:
Two variables should be measured at the interval or ratio level (i.e.,
continuous)
There is a linear relationship between two variables.
There should be no significant outliers.
Variables should be approximately normally distributed.
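How to compute the simple correlation coefficient (r)

$$ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} $$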
EXAMPLE:
Calculate the simple correlation coefficient between ice-
cream’s sales unit and profit.
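Serial no.   Sales unit (x)   Profit in Rs. (y)
1            70               120
2            60               80
3            80               120
4            50               100
5            60               115
6            90               135

Working table:

Serial no.   Sales unit (x)   Profit (y)   xy            x²            y²
1            70               120          8400          4900          14400
2            60               80           4800          3600          6400
3            80               120          9600          6400          14400
4            50               100          5000          2500          10000
5            60               115          6900          3600          13225
6            90               135          12150         8100          18225
Total        ∑x = 410         ∑y = 670     ∑xy = 46850   ∑x² = 29100   ∑y² = 76650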
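The working-table calculation can be sketched in a few lines of Python. This is only an illustration of the raw-sums form r = (∑xy − ∑x∑y/n) / √((∑x² − (∑x)²/n)(∑y² − (∑y)²/n)); the function name `pearson_r` and the variable names are assumptions made for this sketch, not part of the slides.

```python
from math import sqrt

# Data from the ice-cream example (x = sales units, y = profit in Rs.).
sales_units = [70, 60, 80, 50, 60, 90]
profit = [120, 80, 120, 100, 115, 135]


def pearson_r(x, y):
    """Simple correlation coefficient computed from the raw column sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


r = pearson_r(sales_units, profit)
# r * r is the coefficient of determination.
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Swapping the two arguments gives the same value, which is the symmetry property listed earlier.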
Interpretation:
The square of r is the coefficient of determination. For example, if r = 0.7, then r² = 0.7 × 0.7 = 0.49, i.e. 49%.
About 49% of the total variation in variable 1 is explained by variable 2, and the remaining 51% is due to unknown factors.
Q1. The following are the numbers of minutes it took 10 machines to assemble a piece of machinery in the morning, x, and in the late afternoon, y:

x   11.1   10.3   12.0   15.1   13.7   18.5   17.3   14.2   14.8   15.3
y   10.9   14.2   13.8   21.5   13.2   21.1   16.4   19.3   17.4   19.0

a) Calculate the simple correlation coefficient and the coefficient of determination, and interpret the results.
Partial Correlation Coefficient
Partial correlation estimates the relationship between two variables
while removing the influence of a third variable from the
relationship.
Examples: The relationship between a boy and a girl after removing the influence of video games.
The relationship between unit sales of ice cream and profit after removing the influence of daily temperature.
Assumptions
• You have one (dependent) variable and one (independent) variable and these are both measured
on a continuous scale (i.e., they are measured on an interval or ratio scale).
• You have one or more control variables, also known as covariates (i.e., control variables are just
variables that you are using to adjust the relationship between the other two variables; that is, your
dependent and independent variables). These control variables are also measured on
a continuous scale (i.e., they are continuous variables).
• There needs to be a linear relationship between all three variables. That is, all possible pairs of
variables must show a linear relationship.
• There should be no significant outliers.
• Your variables should be approximately normally distributed.
HOW TO COMPUTE THE PARTIAL CORRELATION COEFFICIENT (r)

$$ r_{AB.C} = \frac{r_{AB} - r_{AC}\, r_{BC}}{\sqrt{\left(1 - r_{AC}^2\right)\left(1 - r_{BC}^2\right)}} $$

Where,
• rAB = simple correlation coefficient between A and B
• rAC = simple correlation coefficient between A and C
• rBC = simple correlation coefficient between B and C
Note: rAB = rBA, rAC = rCA, rBC = rCB
The above formula gives the partial correlation coefficient between variables A and B, holding variable C constant.
• What will be the formula if we wanted to calculate partial
correlation coefficient between B and C assuming A as
constant?
• What will be the formula if we wanted to calculate partial
correlation coefficient between A and C assuming B as
constant?
Note: a) The coefficient always lies between -1 and +1. b) rBC.A = rCB.A and so on, i.e. the order of the subscripts on the left-hand side of the dot does not affect the value. c) The square of the partial correlation coefficient gives the coefficient of partial determination.
Interpretation: for example, if rAB.C = 0.5, the coefficient of partial determination is 0.25, i.e. about 25% of the total variation in variable A is explained by variable B, holding variable C constant.
1. Find the partial correlation coefficient between ice-cream sales units and profit, holding daily temperature constant.

Serial no.   Sales unit   Profit   Daily temperature (°C)
1            70           120      25
2            60           80       20
3            80           120      30
4            50           100      27
5            60           115      21
6            90           135      32
2. Find the partial correlation coefficient between ice-cream sales units and daily temperature, holding profit constant.
3. Find the partial correlation coefficient between ice-cream profit and daily temperature, holding sales units constant.
4. Also calculate the coefficient of partial determination for questions 1, 2 and 3, and interpret the results.
Note: First calculate the simple correlation coefficients between sales units and daily temperature, profit and daily temperature, and sales units and profit.
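A minimal sketch of how these partial correlations could be computed, assuming the standard first-order formula r_AB.C = (r_AB − r_AC·r_BC) / √((1 − r_AC²)(1 − r_BC²)). The helper names `pearson_r` and `partial_r` are hypothetical, chosen only for this illustration.

```python
from math import sqrt

# Ice-cream data from the table above.
sales = [70, 60, 80, 50, 60, 90]
profit = [120, 80, 120, 100, 115, 135]
temperature = [25, 20, 30, 27, 21, 32]  # degrees Celsius


def pearson_r(x, y):
    """Simple correlation coefficient using deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation of A and B with C held constant (assumed formula)."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# The three simple correlation coefficients asked for in the note.
r_sp = pearson_r(sales, profit)
r_st = pearson_r(sales, temperature)
r_pt = pearson_r(profit, temperature)

r_sp_t = partial_r(r_sp, r_st, r_pt)  # question 1: sales and profit, temperature constant
r_st_p = partial_r(r_st, r_sp, r_pt)  # question 2: sales and temperature, profit constant
r_pt_s = partial_r(r_pt, r_sp, r_st)  # question 3: profit and temperature, sales constant

for label, value in [("r_sp.t", r_sp_t), ("r_st.p", r_st_p), ("r_pt.s", r_pt_s)]:
    # The square is the coefficient of partial determination (question 4).
    print(f"{label} = {value:.4f}, squared = {value ** 2:.4f}")
```

Note that the second and third arguments of `partial_r` are always the correlations of each of the two variables with the variable being held constant.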
MULTIPLE CORRELATION COEFFICIENT
• The multiple correlation coefficient measures the correlation of one variable with several other variables taken together.
• The multiple correlation coefficient is denoted as RA.BCDE……K,
• which denotes that A is correlated with B, C, D, up to K variables.
• Its value lies between 0 and 1.
How To Compute Multiple Correlation Coefficient (R)

$$ R_{A.BC} = \sqrt{\frac{r_{AB}^2 + r_{AC}^2 - 2\, r_{AB}\, r_{AC}\, r_{BC}}{1 - r_{BC}^2}} $$

Here A is the dependent variable and B and C are the independent variables.
• What will be the formula if we wanted to calculate the multiple correlation coefficient assuming B as the dependent variable?
• What will be the formula if we wanted to calculate the multiple correlation coefficient assuming C as the dependent variable?
Note: a) The coefficient always lies between 0 and +1, i.e. it is always non-negative. b) RA.BC = RA.CB and so on, i.e. the order of the subscripts on the right-hand side of the dot does not affect the value. c) The square of the multiple correlation coefficient gives the coefficient of multiple determination.
Coefficient of Multiple Determination
• The square of the multiple correlation coefficient is called the coefficient of multiple determination.
• It is denoted by R²1.23, R²2.13, R²3.12.
• Suppose the multiple correlation coefficient of the yield of wheat (X1) on the joint effect of fertilizer (X2) and quality of seeds (X3) is R1.23 = 0.9. Then R²1.23 = 0.81, i.e. 81%, which is interpreted as: 81% of the variation in the yield of wheat is explained by fertilizer and quality of seeds, and the remaining 19% by unknown factors.
Limitations of Correlation
• We are only considering LINEAR relationships
• Correlation (r) is NOT resistant to outliers
• There may be variables other than x which are not studied, yet
do influence the response variable
• A strong correlation does NOT imply cause and effect
relationship
1. Find the multiple correlation coefficient between daily temperature and profit, taking ice-cream sales units as the dependent variable.

Serial no.   Sales unit (X1)   Profit (X2)   Daily temperature in °C (X3)
1            70                120           25
2            60                80            20
3            80                120           30
4            50                100           27
5            60                115           21
6            90                135           32
2. Find the multiple correlation coefficient between profit and ice-cream sales units, taking daily temperature as the dependent variable.
3. Find the multiple correlation coefficient between daily temperature and sales units, taking profit as the dependent variable.
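The multiple correlation exercises can be sketched as follows, assuming the usual three-variable formula R_A.BC = √((r_AB² + r_AC² − 2·r_AB·r_AC·r_BC) / (1 − r_BC²)). The function names `pearson_r` and `multiple_r` are hypothetical and exist only in this sketch.

```python
from math import sqrt

# X1 = sales units, X2 = profit, X3 = daily temperature (degrees Celsius).
x1 = [70, 60, 80, 50, 60, 90]
x2 = [120, 80, 120, 100, 115, 135]
x3 = [25, 20, 30, 27, 21, 32]


def pearson_r(x, y):
    """Simple correlation coefficient using deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


def multiple_r(r_ab, r_ac, r_bc):
    """Multiple correlation of A (dependent) on B and C (assumed formula)."""
    r_squared = (r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc ** 2)
    return sqrt(r_squared)


r12 = pearson_r(x1, x2)
r13 = pearson_r(x1, x3)
r23 = pearson_r(x2, x3)

R1_23 = multiple_r(r12, r13, r23)  # sales units as the dependent variable
R2_13 = multiple_r(r12, r23, r13)  # profit as the dependent variable
R3_12 = multiple_r(r13, r23, r12)  # daily temperature as the dependent variable

for label, value in [("R1.23", R1_23), ("R2.13", R2_13), ("R3.12", R3_12)]:
    # The square is the coefficient of multiple determination.
    print(f"{label} = {value:.4f}, squared = {value ** 2:.4f}")
```

In each call the first two arguments are the correlations of the dependent variable with the two independent variables, and the third is the correlation between the independent variables.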
REGRESSION ANALYSIS
• Regression analysis is a very powerful tool in statistical analysis for predicting the value of one variable, given the value of another variable, when those variables are related to each other.
• It investigates the relationship between a dependent variable (target) and one or more independent variables (predictors).
• Regression analysis is a mathematical measure of the average relationship between two or more variables.
• It is a statistical tool used to predict the value of an unknown variable from a known variable.
Regression Equation
The regression equation describes the regression line mathematically:
– Intercept
– Slope
(The regression coefficient is the constant in the regression equation that tells about the change in the value of the dependent variable corresponding to a unit change in the independent variable.)
[Figure: scatter plot with fitted regression line; horizontal axis: Wt (kg), 60 to 120; vertical axis: Height (cm), 80 to 220]
Linear Regression
• It is one of the most widely known modeling techniques. Linear regression is usually among the first few topics people pick up while learning predictive modeling.
• In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.
• Linear regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line.
REGRESSION EQUATION:
The algebraic expression of a regression line is called a regression equation.
For two variables, one dependent and one independent, there are two regression equations.
The regression equation of y on x is given by y = a + bx, in which y is the dependent variable and x is the independent variable.
The regression equation of x on y is given by x = a' + b'y, in which x is the dependent variable and y is the independent variable. The constants a' and b' are in general different from a and b.

Note: (Regression equation of y on x) ≠ (Regression equation of x on y).
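To illustrate that the two regression equations differ, here is a small sketch that fits both lines to the ice-cream data used earlier (x = sales units, y = profit). The helper `fit_line` and its argument names are assumptions made for this example; it uses the least-squares slope (sum of cross-deviations divided by the sum of squared deviations of the independent variable).

```python
# Ice-cream data used earlier: x = sales units, y = profit.
x = [70, 60, 80, 50, 60, 90]
y = [120, 80, 120, 100, 115, 135]


def fit_line(independent, dependent):
    """Return (intercept, slope) of the least-squares line dependent = a + b * independent."""
    n = len(independent)
    mean_i = sum(independent) / n
    mean_d = sum(dependent) / n
    s_id = sum((i - mean_i) * (d - mean_d) for i, d in zip(independent, dependent))
    s_ii = sum((i - mean_i) ** 2 for i in independent)
    slope = s_id / s_ii
    intercept = mean_d - slope * mean_i
    return intercept, slope


a_yx, b_yx = fit_line(x, y)  # regression of y on x: y = a + b x
a_xy, b_xy = fit_line(y, x)  # regression of x on y: x = a' + b' y

print(f"y on x: y = {a_yx:.4f} + {b_yx:.4f} x")
print(f"x on y: x = {a_xy:.4f} + {b_xy:.4f} y")
# Rearranging the y-on-x line for x would give slope 1 / b_yx, which is not b_xy.
print(f"1 / b_yx = {1 / b_yx:.4f} versus b_xy = {b_xy:.4f}")
```

The two lines coincide only when the correlation is perfect; otherwise simply solving the y-on-x equation for x does not give the x-on-y regression line.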


Simple Linear Regression
• A regression model that gives a straight-line relationship
between two or more variables is called a linear regression
model.
• Simple linear regression is a statistical method that allows
us to summarize and study relationships between two
continuous (quantitative) variables.
• One variable, denoted x, is regarded as
the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as
the response, outcome, or dependent variable.
• Simple linear regression gets its adjective "simple" because it
concerns the study of only one predictor variable
Dependent variables
• The single variable being explained/predicted by the
regression model
• Denoted by y- variable
Independent variable
• The explanatory variable(s) used to predict the
dependent variable
• Denoted by x- variables
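ESTIMATION OF COEFFICIENTS USING THE LEAST SQUARES METHOD (OLS):

The regression equation of y on x is given by y = a + bx, in which y is the dependent variable and x is the independent variable.
The values of a and b are determined using the principle of least squares, by minimizing the error sum of squares.
Here,
error (e) = y − ŷ = y − (a + bx), so that

$$ S = \sum e^2 = \sum (y - a - bx)^2 $$

Differentiating S partially with respect to a and b and setting each derivative equal to zero, we get the two normal equations:

$$ \sum y = na + b \sum x \qquad \text{(i)} $$

$$ \sum xy = a \sum x + b \sum x^2 \qquad \text{(ii)} $$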
 Shortcut Method:
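Here, u = x − A and v = y − B, where A and B are assumed means of x and y; then the regression equation of v on u is
v = a + bu
and the values of a and b are calculated from

$$ \sum v = na + b \sum u \qquad \text{(i)} $$

$$ \sum uv = a \sum u + b \sum u^2 \qquad \text{(ii)} $$

Then substitute u = x − A and v = y − B to get the equation in the form y = a + bx.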
 Step Deviation Method:
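Here, u = (x − A)/h and v = (y − B)/k, where A and B are assumed means and h and k are common factors of the deviations,
and the values of a and b are calculated from

$$ \sum v = na + b \sum u \qquad \text{(i)} $$

$$ \sum uv = a \sum u + b \sum u^2 \qquad \text{(ii)} $$

Then substitute the expressions for u and v to get the equation in the form y = a + bx.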


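OR, USING FORMULAS, WE CAN OBTAIN THE ESTIMATED VALUES OF a AND b AS FOLLOWS:

$$ b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}, \qquad a = \bar{y} - b\,\bar{x} $$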
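A minimal sketch of the least-squares estimates, assuming the two normal equations ∑y = na + b∑x and ∑xy = a∑x + b∑x² and applying them to the ice-cream data from the correlation example (x = sales units, y = profit). The function name `least_squares_line` is hypothetical.

```python
# Ice-cream data from the correlation example: x = sales units, y = profit.
x = [70, 60, 80, 50, 60, 90]
y = [120, 80, 120, 100, 115, 135]


def least_squares_line(x, y):
    """Return (a, b) for y = a + b x by solving the two normal equations."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Normal equations:
    #   sum_y  = n * a     + b * sum_x
    #   sum_xy = a * sum_x + b * sum_x2
    # Eliminating a between them gives b; a then follows from the first equation.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = least_squares_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"y = {a:.4f} + {b:.4f} x")
# Both normal equations are equivalent to these two sums being zero.
print(f"sum of residuals       = {sum(residuals):.6f}")
print(f"sum of x times residual = {sum(xi * e for xi, e in zip(x, residuals)):.6f}")
```

Checking that the residuals sum to zero, and that they are uncorrelated with x, is a quick way to confirm that the fitted a and b satisfy the normal equations.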
STANDARD ERROR AND COEFFICIENT OF DETERMINATION
Meaning of a and b
Suppose the fitted model is
ŷ = 1.5050 + 0.2525x
where y = food expenditure (per month), x = income (per month), a = 1.5050 and b = 0.2525.
a) Meaning of a
For x = 0,
ŷ = 1.5050 + 0.2525(0) = 1.5050
This means that a household with no income (x = 0) is expected to spend $1.5050 per month on food.
b) Meaning of b
The value of b in a regression model gives the change in the predicted value of y (dependent variable) due to a change of one unit in x (independent variable).
When x = 50,
ŷ = 1.5050 + 0.2525(50) = 14.1300
This means that, on average, a $1 increase in the income of a household increases its food expenditure by $0.2525.
Suppose instead that the slope is negative:
ŷ = 1.5050 - 0.2525(50) = -11.12
Then, how do you interpret b?
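The interpretation of a and b can be checked numerically with a short sketch. The function name `predict_food_expenditure` is hypothetical; the coefficients are the ones from the fitted model above.

```python
def predict_food_expenditure(income, a=1.5050, b=0.2525):
    """Predicted monthly food expenditure for a given monthly income."""
    return a + b * income


at_zero = predict_food_expenditure(0)    # equals the intercept a
at_50 = predict_food_expenditure(50)
at_51 = predict_food_expenditure(51)

print(f"x = 0  -> {at_zero:.4f}")
print(f"x = 50 -> {at_50:.4f}")
# A one-unit increase in x changes the prediction by exactly b.
print(f"change from x = 50 to x = 51: {at_51 - at_50:.4f}")

# With a negative slope, each extra unit of x lowers the prediction by |b|.
negative_case = predict_food_expenditure(50, b=-0.2525)
print(f"negative slope, x = 50 -> {negative_case:.4f}")
```

The sign of b gives the direction of the change in the predicted y, and its magnitude gives the size of the change per unit of x.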
Multiple Regression
• It is the logical extension of simple linear
regression
• Multiple regression extends linear
regression to allow for 2 or more
independent variables.
• There is still only one dependent variable.
THE MULTIPLE REGRESSION MODEL

Idea: examine the linear relationship between 1 dependent variable (Y) and 2 or more independent variables (Xi).

Multiple regression model with k independent variables:

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + e_i $$

where β0 is the Y-intercept, β1, ..., βk are the population slopes, and e_i is the random error.
MULTIPLE REGRESSION EQUATION

The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

$$ \hat{Y}_i = a + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} $$

where Ŷi is the estimated (or predicted) value of Y, a is the estimated intercept, and b1, ..., bk are the estimated slope coefficients.
We will always use software to obtain the regression slope coefficients and other regression summary measures.
MULTIPLE REGRESSION EQUATION
Two independent variables model (continued)

$$ \hat{Y} = a + b_1 X_1 + b_2 X_2 $$

[Figure: fitted regression plane over the axes X1 and X2 with vertical axis Y, with the slope for variable X1 and the slope for variable X2 marked]
EXAMPLE: 2 INDEPENDENT VARIABLES

• A distributor of frozen dessert pies wants to evaluate factors thought to influence demand
– Dependent variable: Pie sales (units per week)
– Independent variables: Price (in $), Advertising ($100’s)
• Data are collected for 15 weeks
Y       X1     X2     YX1     YX2     X1X2     Y²        X1²      X2²
350     5.5    3.3    1925    1155    18.15    122500    30.25    10.89
460     7.5    3.3    3450    1518    24.75    211600    56.25    10.89
350     8      3      2800    1050    24       122500    64       9
430     8      3      3440    1290    24       184900    64       9
350     6.8    3      2380    1050    20.4     122500    46.24    9
380     7.5    4      2850    1520    30       144400    56.25    16
430     4.5    3      1935    1290    13.5     184900    20.25    9
470     6.4    3.7    3008    1739    23.68    220900    40.96    13.69
450     7      3.5    3150    1575    24.5     202500    49       12.25
490     5      4      2450    1960    20       240100    25       16
340     7.2    3.5    2448    1190    25.2     115600    51.84    12.25
300     7.9    3.2    2370    960     25.28    90000     62.41    10.24
440     5.9    4      2596    1760    23.6     193600    34.81    16
450     5      3.5    2250    1575    17.5     202500    25       12.25
300     7      2.7    2100    810     18.9     90000     49       7.29
Total
5990    99.2   50.7   39152   20442   333.46   2448500   675.26   173.75
Let the linear estimate equation be

$$ \hat{Y} = a + b_1 X_1 + b_2 X_2 $$

The normal equations are as follows:

$$ \sum Y = na + b_1 \sum X_1 + b_2 \sum X_2 $$

$$ \sum X_1 Y = a \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2 $$

$$ \sum X_2 Y = a \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2 $$
5990 = 15a + 99.2b1 + 50.7b2
39152 = 99.2a + 675.26b1 + 333.46b2
20442 = 50.7a + 333.46b1 + 173.75b2
Solving the above three simultaneous equations, we obtain the best estimate values of a, b1 and b2:

a ≈ 282.77, b1 ≈ -17.47, and b2 ≈ 68.67
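The three normal equations can also be built and solved directly from the 15 rows of the table. The sketch below assumes Y = pie sales, X1 = price and X2 = advertising, in the column order of the table, and uses a small hand-written Gaussian elimination (the helper `solve_linear_system` is hypothetical) instead of a statistics package.

```python
# Columns of the data table: Y, X1 and X2 for the 15 weeks.
y = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
x1 = [5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0]
x2 = [3.3, 3.3, 3.0, 3.0, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]


def dot(u, v):
    """Sum of element-wise products, e.g. the column total of X1*X2."""
    return sum(ui * vi for ui, vi in zip(u, v))


def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest pivot into position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


n = len(y)
# Coefficient matrix and right-hand side of the three normal equations.
coefficients = [
    [n, sum(x1), sum(x2)],
    [sum(x1), dot(x1, x1), dot(x1, x2)],
    [sum(x2), dot(x1, x2), dot(x2, x2)],
]
totals = [sum(y), dot(x1, y), dot(x2, y)]

a, b1, b2 = solve_linear_system(coefficients, totals)
print(f"Y-hat = {a:.3f} + ({b1:.3f}) X1 + ({b2:.3f}) X2")
```

Building the matrix from the raw columns, rather than typing the totals by hand, avoids transcription slips when copying the sums into the equations.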
