Correlation and Regression Analysis
STATISTICS
Syllabus
• Correlation and Regression (6 hours)
– Least squares method
– Analysis of variance of the linear regression model
– Inference concerning the least squares method
• Multiple correlation and regression
Variable: a symbolic name associated with a value, where the
associated value may change.
Quantitative variable: a variable that is measured on a
numeric or quantitative scale. A country’s population, a
person’s shoe size, and a car’s speed are all quantitative
variables.
Qualitative variable: a variable that is not quantitative.
Correlation:
Correlation is a statistical technique used to determine
the degree to which two variables are related.
• A scatter diagram (also known as a scatter plot, scatter
graph, or correlation chart) is a tool for analyzing the
relationship between two variables and for determining
how closely the two variables are related.
• One variable is plotted on the horizontal axis and the
other is plotted on the vertical axis. The pattern of
the plotted points shows the relationship between the
variables graphically.
Scatter Diagram
Rectangular coordinates
Two quantitative variables
One variable is called independent (X) and the
second is called dependent (Y)
Points are not joined
No frequency table
Most common way of visualizing the association
between two quantitative variables
[Figure: points (*) plotted with X on the horizontal axis and Y on the vertical axis]
What to look for in a scatter plot
i) Linearity (Straight line)
ii) Spread
iii) Outliers
iv) Correlation
Scatter Plots
The pattern of data is indicative of the type of relationship between
two variables:
[Figure: three scatter plots showing a positive relationship, a negative relationship, and no relationship]
• Positive correlation: The correlation is positive if the values of the two
variables change in the same direction.
Ex. Pub. exp. & sales, height & weight, study time & grades.
• Negative correlation: The correlation is negative if the values of the two
variables change in opposite directions.
Ex. Price & qty. demanded, alcohol consumption & driving ability.
[Figure: example scatter plots of a positive relationship, a negative relationship, and no relationship; axis labels: strength, age of buildings]
Linear and Non-linear Correlation
Simple Correlation Coefficient
The most common measure of correlation; also called Pearson
coefficient of correlation
Is an index of relationship between two variables
Reflects the degree of linear relationship between two variables
It is symmetric in nature
The value of r ranges between −1 and +1
The value of r denotes the strength of the association:
r = −1: perfect negative correlation
r = 0: no relation
r = +1: perfect positive correlation
Assumptions:
Two variables should be measured at the interval or ratio level (i.e.,
continuous)
There is a linear relationship between two variables.
There should be no significant outliers.
Variables should be approximately normally distributed.
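How to compute the simple correlation coefficient (r)
$$ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} $$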
EXAMPLE:
Calculate the simple correlation coefficient between ice-cream
sales units and profit.
Serial no.   Sales units   Profit (Rs.)
1            70            120
2            60            80
3            80            120
4            50            100
5            60            115
6            90            135
Serial no.   Sales units (x)   Profit (y)   xy      x²      y²
1
2
3
4
5
6
Total        ∑x =              ∑y =         ∑xy =   ∑x² =   ∑y² =
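The working table can also be filled in by a short program. The following is a minimal sketch in plain Python, assuming the computational form r = (∑xy − ∑x∑y/n) / √[(∑x² − (∑x)²/n)(∑y² − (∑y)²/n)]; the variable names are illustrative and not part of the original slides.

```python
from math import sqrt

# Data from the ice-cream example.
sales = [70, 60, 80, 50, 60, 90]          # x: sales units
profit = [120, 80, 120, 100, 115, 135]    # y: profit (Rs.)

n = len(sales)

# Column totals of the working table.
sum_x = sum(sales)
sum_y = sum(profit)
sum_xy = sum(x * y for x, y in zip(sales, profit))
sum_x2 = sum(x * x for x in sales)
sum_y2 = sum(y * y for y in profit)

# Computational form of the Pearson correlation coefficient.
numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

print("sum x =", sum_x, "sum y =", sum_y)
print("sum xy =", sum_xy, "sum x2 =", sum_x2, "sum y2 =", sum_y2)
print("r =", round(r, 4))
print("r squared =", round(r * r, 4))
```

The totals printed by the script are the entries of the "Total" row, so a hand calculation can be checked column by column.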
Interpretation:
The square of r is the coefficient of determination. For example,
if r = 0.7, then r² = 0.7 × 0.7 = 0.49, i.e. 49%.
About 49% of the total variation in one variable is explained
by the other variable, and the remaining 51% is due to
other factors.
Q1. The following are the numbers of minutes it took 10 machines to
assemble a piece of machinery in the morning, x, and in the late
afternoon, y:
x   11.1  10.3  12.0  15.1  13.7  18.5  17.3  14.2  14.8  15.3
y   10.9  14.2  13.8  21.5  13.2  21.1  16.4  19.3  17.4  19.0
a) Calculate the simple correlation coefficient and the coefficient of
determination, and interpret the results.
Partial Correlation Coefficient
Partial correlation estimates the relationship between two variables
while removing the influence of a third variable from the
relationship.
Examples: The relationship between a guy and a girl after removing the
influence of video games.
The relationship between unit sales of ice cream and profit after removing
the influence of daily temperature.
Assumptions
• You have one (dependent) variable and one (independent) variable and these are both measured
on a continuous scale (i.e., they are measured on an interval or ratio scale).
• You have one or more control variables, also known as covariates (i.e., control variables are just
variables that you are using to adjust the relationship between the other two variables; that is, your
dependent and independent variables). These control variables are also measured on
a continuous scale (i.e., they are continuous variables).
• There needs to be a linear relationship between all three variables. That is, all possible pairs of
variables must show a linear relationship.
• There should be no significant outliers.
• Your variables should be approximately normally distributed.
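HOW TO COMPUTE THE PARTIAL CORRELATION
COEFFICIENT (R)
$$ r_{AB.C} = \frac{r_{AB} - r_{AC}\, r_{BC}}{\sqrt{(1 - r_{AC}^2)(1 - r_{BC}^2)}} $$
Where,
• rAB = simple correlation coeff. between A and B
• rAC = simple correlation coeff. between A and C
• rBC = simple correlation coeff. between B and C
Note: rAB = rBA, rAC = rCA, rBC = rCB
In the above formula, C is the variable whose influence is removed (held constant).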
3. Find the partial correlation coefficient between ice-cream profit and daily
temperature, holding sales units constant.
Serial no.   Profit (X1)   Sales units (X2)   Daily temperature in °C (X3)
1            120           70                 25
2            80            60                 20
3            120           80                 30
4            100           50                 27
5            115           60                 21
6            135           90                 32
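One way to work this exercise is sketched below in plain Python. It assumes the standard first-order partial correlation formula r13.2 = (r13 − r12·r23) / √[(1 − r12²)(1 − r23²)], with X1 = profit, X2 = sales units, and X3 = daily temperature; the helper names `pearson` and `partial_corr` are illustrative, not from the slides.

```python
from math import sqrt

def pearson(x, y):
    """Simple (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / sqrt(sxx * syy)

def partial_corr(r_ab, r_ac, r_bc):
    """Correlation between A and B with the influence of C removed."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Data from the table in this exercise.
profit = [120, 80, 120, 100, 115, 135]   # X1
sales = [70, 60, 80, 50, 60, 90]         # X2
temp = [25, 20, 30, 27, 21, 32]          # X3 (degrees Celsius)

r12 = pearson(profit, sales)
r13 = pearson(profit, temp)
r23 = pearson(sales, temp)

# Profit vs. temperature, holding sales units constant:
# A = X1, B = X3, C = X2.
r13_2 = partial_corr(r13, r12, r23)

print("r12 =", round(r12, 4), "r13 =", round(r13, 4), "r23 =", round(r23, 4))
print("r13.2 =", round(r13_2, 4))
```

Comparing r13 with r13.2 shows how much of the apparent profit and temperature relationship is carried by sales units.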
• Find the multiple correlation coefficient with daily temperature as the
dependent variable and profit and unit sales of ice cream as the
independent variables.
• Find the multiple correlation coefficient with profit as the dependent
variable and daily temperature and unit sales as the independent variables.
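The slides do not give a formula for the multiple correlation coefficient. A minimal sketch follows, assuming the standard three-variable formula R1.23 = √[(r12² + r13² − 2·r12·r13·r23) / (1 − r23²)], where variable 1 is the dependent variable; the function names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Simple (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / sqrt(sxx * syy)

def multiple_corr(r_dep_a, r_dep_b, r_ab):
    """Multiple correlation of a dependent variable on predictors a and b.

    r_dep_a, r_dep_b: simple correlations of the dependent variable
    with each predictor; r_ab: simple correlation between the predictors.
    """
    num = r_dep_a ** 2 + r_dep_b ** 2 - 2 * r_dep_a * r_dep_b * r_ab
    return sqrt(num / (1 - r_ab ** 2))

# Ice-cream data from the partial correlation exercise.
profit = [120, 80, 120, 100, 115, 135]   # X1
sales = [70, 60, 80, 50, 60, 90]         # X2
temp = [25, 20, 30, 27, 21, 32]          # X3

r12 = pearson(profit, sales)
r13 = pearson(profit, temp)
r23 = pearson(sales, temp)

# Daily temperature as the dependent variable (predictors: profit, sales).
R3_12 = multiple_corr(r13, r23, r12)
# Profit as the dependent variable (predictors: sales, temperature).
R1_23 = multiple_corr(r12, r13, r23)

print("R3.12 =", round(R3_12, 4))
print("R1.23 =", round(R1_23, 4))
```

A multiple correlation coefficient is never smaller in magnitude than either simple correlation of the dependent variable with a single predictor, which gives a quick sanity check on the result.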
REGRESSION ANALYSIS
Regression analysis is a powerful statistical tool for predicting the value of
one variable from the value of another variable when the two variables are
related to each other.
It investigates the relationship between a dependent variable (target) and
one or more independent variables (predictors).
Regression analysis is a mathematical measure of the average relationship between
two or more variables.
It is used to predict the value of an unknown variable from a
known variable.
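Regression Equation
The regression equation of y on x is
y = a + bx
where a is the intercept and b is the slope.
By the least squares method, a and b are chosen to minimize $S = \sum (y - a - bx)^2$.
Differentiating S with respect to ‘a’ and ‘b’ and setting each derivative to zero, we get two equations:
$$ \sum y = na + b \sum x $$ ………….. (i)
$$ \sum xy = a \sum x + b \sum x^2 $$ …….. (ii)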
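A least squares fit of y = a + bx can be sketched by solving the two normal equations ∑y = na + b∑x and ∑xy = a∑x + b∑x² directly. The example below reuses the ice-cream data (sales units as x, profit as y) purely for illustration; that choice of data and the variable names are assumptions, not part of this slide.

```python
# Ice-cream data from the correlation example, reused as an illustration.
x = [70, 60, 80, 50, 60, 90]          # sales units
y = [120, 80, 120, 100, 115, 135]     # profit (Rs.)

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# Eliminating a from the two normal equations gives the slope b;
# substituting b back into equation (i) gives the intercept a.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print("a =", round(a, 4), "b =", round(b, 4))

# Both normal equations should hold for the fitted a and b.
print("eq (i) residual :", sum_y - (n * a + b * sum_x))
print("eq (ii) residual:", sum_xy - (a * sum_x + b * sum_x2))

# Predicted profit for a hypothetical sales figure of 75 units.
print("predicted y at x = 75:", round(a + b * 75, 2))
```

Printing the residuals of the two normal equations is a cheap way to confirm the arithmetic when the same calculation is done by hand.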
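Shortcut Method:
Here, u = x − A and v = y − B, where A and B are assumed means of x and y; then the regression equation of v on u is
v = a + bu
and the values of a and b are calculated from
$$ \sum v = na + b \sum u $$ ………….. (i)
$$ \sum uv = a \sum u + b \sum u^2 $$ …….. (ii)
Then substitute the values of ‘u’ and ‘v’ to get the equation y = a + bx.
Step Deviation Method:
Here, u = (x − A)/h and v = (y − B)/k, where h and k are common factors of the deviations,
and the values of a and b are calculated from
$$ \sum v = na + b \sum u $$ ………….. (i)
$$ \sum uv = a \sum u + b \sum u^2 $$ ….. (ii)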
MULTIPLE REGRESSION MODEL
$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + e_i $$
MULTIPLE REGRESSION EQUATION
$$ \hat{Y}_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} $$
We will always use software to obtain the regression slope
coefficients and other regression summary measures.
Two independent variables model (continued)
$$ \hat{Y} = a + b_1 X_1 + b_2 X_2 $$
[Figure: regression plane plotted against the Y, X1, and X2 axes, showing the slope for variable X1 and the slope for variable X2]
EXAMPLE:
2 INDEPENDENT VARIABLES
Advertising ($100’s)
• Data are collected for 15 weeks
        Y      X1     X2     YX1     YX2     X1X2     Y²        X1²      X2²
        350    5.5    3.3    1925    1155    18.15    122500    30.25    10.89
Total   5990   99.2   50.7   39152   20442   333.46   2448500   675.26   173.75
Let the estimated linear equation be
$$ \hat{Y} = a + b_1 X_1 + b_2 X_2 $$
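For two independent variables, the least squares conditions give three normal equations: ∑Y = na + b1∑X1 + b2∑X2, ∑YX1 = a∑X1 + b1∑X1² + b2∑X1X2, and ∑YX2 = a∑X2 + b1∑X1X2 + b2∑X2². The sketch below solves them with Cramer's rule using only the column totals from the table, with n = 15 taken from the 15 weeks of data. The helper names `det3` and `solve3` are illustrative, and the result is only as reliable as the totals printed on the slide.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(A, rhs):
    """Solve a 3x3 linear system A * z = rhs by Cramer's rule."""
    d = det3(A)
    solution = []
    for col in range(3):
        Ac = [row[:] for row in A]        # copy, then swap in the rhs column
        for i in range(3):
            Ac[i][col] = rhs[i]
        solution.append(det3(Ac) / d)
    return solution

# Column totals from the table (n = 15 weeks).
n = 15
sum_y, sum_x1, sum_x2 = 5990, 99.2, 50.7
sum_yx1, sum_yx2, sum_x1x2 = 39152, 20442, 333.46
sum_x1sq, sum_x2sq = 675.26, 173.75

# Coefficient matrix and right-hand side of the three normal equations.
A = [[n,      sum_x1,   sum_x2],
     [sum_x1, sum_x1sq, sum_x1x2],
     [sum_x2, sum_x1x2, sum_x2sq]]
rhs = [sum_y, sum_yx1, sum_yx2]

a, b1, b2 = solve3(A, rhs)
print("a =", round(a, 3), "b1 =", round(b1, 3), "b2 =", round(b2, 3))
```

Cramer's rule is used here only because the system is 3×3; for more predictors a general linear solver or statistical software is the more practical choice.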