3 STAT-602 Regression & Correlation

Regression analysis is used to determine the relationship between variables and develop models to approximate their relationships. Simple linear regression analyzes the relationship between a single independent variable (X) and dependent variable (Y). The sample regression equation estimates the population regression equation and is expressed as Y=b0 + b1X, where b0 is the Y-intercept and b1 is the slope. Coefficient of determination (R2) measures how well the regression line approximates the real data points, with a value closer to 1 indicating a better fit.

Uploaded by

jazib

Regression Analysis

The objective of many investigations is to understand and explain the relationship among variables.
Frequently, one wants to know how and to what extent a certain variable (response variable) is related to a
set of other variables (explanatory variables).
Regression analysis helps us to determine the nature and the strength of relationship among variables.
Types of relationship:
i) Deterministic relationship, also called functional relationship
ii) Probabilistic relationship, also called statistical relationship
In a deterministic relationship the relationship between two variables is known exactly, such as
a) Area of a circle: $A = \pi r^2$
b) $F = k\,(m_1 m_2 / r^2)$ (Newton's law of gravitation)
c) The relationship between dollar sales (Y) of a product sold at a fixed price and the number of units sold.
In a statistical relationship the relation between variables is not known exactly, and we have to approximate
the relationship and develop models that characterize its main features. Regression analysis is
concerned with developing such "approximating" models.
For example, in business research the sales of a product are related to its advertising expenditure, and it
is usually required to build a model relating sales to advertising expenditure.
Regression investigates the dependence of one variable, called the dependent variable and denoted by Y,
on one or more variables, called independent variables and denoted by X's. It provides an equation for
estimating or predicting the average value of the dependent variable from known values of the
independent variables. When we study the dependence of a variable on a single independent variable, it is
called simple regression, whereas the dependence of a variable on two or more independent variables is
called multiple regression.

Regressor:- The variable that forms the basis of estimation or prediction is called the regressor. It is also
called the independent, explanatory, controlled, or predictor variable, and is usually denoted by X.
Regressand:- The variable whose values depend upon the known values of the independent variable is
called the regressand. It is also called the response, dependent, or random variable, and is usually denoted
by Y.
In simple regression, the dependence of the response variable (Y) is investigated on only one
regressor (X). If the relationship between these variables can be described by a straight line, it is termed
simple linear regression.
The population simple linear regression model is defined as:

$$Y = \beta_0 + \beta_1 X + \varepsilon \qquad \text{(Population Regression Model)}$$

$$Y = \beta_0 + \beta_1 X \qquad \text{(Population Regression Line)}$$

where $\beta_0$ and $\beta_1$ are the population regression coefficients and $\varepsilon_i$ is a random error peculiar to the i-th
observation. Thus, each response is expressed as the sum of a value predicted from the corresponding X,
plus a random error.
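
The model above can be illustrated with a small simulation. This is only a sketch: the coefficient values, the error standard deviation, the normal error distribution, and the function name `simulate_population_model` are all assumptions made for illustration and do not come from the notes.

```python
import random

def simulate_population_model(xs, beta0, beta1, sigma, seed=0):
    """Draw one Y per X from Y = beta0 + beta1*X + eps.

    Assumption (not from the notes): eps is normal with mean 0 and
    standard deviation sigma.
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

xs = [1, 2, 3, 4, 5]
beta0, beta1 = 2.0, 0.5                      # hypothetical population coefficients

line = [beta0 + beta1 * x for x in xs]       # points on the population regression line
ys = simulate_population_model(xs, beta0, beta1, sigma=0.3)
errors = [y - m for y, m in zip(ys, line)]   # the random errors eps_i

for x, m, y, e in zip(xs, line, ys, errors):
    print(f"X={x}  line={m:.2f}  Y={y:.3f}  error={e:+.3f}")
```

Each simulated response is the value on the line plus its own random error, which is exactly the decomposition described above.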
The sample regression equation is an estimate of the population regression equation. Like any other
estimate, there is an uncertainty associated with it.

$$\hat{Y} = b_0 + b_1 X \qquad \text{(Sample Regression Line)}$$

where
b0: Y-intercept
b1: slope of the regression line
b0 and b1 are also called regression coefficients. X is the independent variable and Y is the dependent variable.
This model is said to be simple (because there is only one independent variable), linear in the parameters,
and linear in the independent variable (X appears in the first power, not as X² or X³).

STAT-602 [Muhammad Imran Khan is thankful for the contributors of these notes] Page 1
How to identify the relationship between variables
To begin a regression analysis, a useful tool is to plot Y versus X. This plot is called a scatter plot
and may suggest what type of mathematical function would be appropriate for summarizing the data.
A variety of functions are useful in fitting models to data.

LEAST SQUARE LINE


A least squares line is described in terms of its Y-intercept (the height at which it crosses the Y-axis)
and its slope (the change in Y per unit change in X). The line can be expressed by the following relation:

$$Y = a + bX \quad \text{or} \quad \hat{Y} = b_0 + b_1 X \qquad \text{(estimated regression of Y on X)}$$

where

$$b = \frac{S(XY)}{S(XX)} \quad \text{(called the slope of the line)}, \qquad a = \bar{Y} - b\,\bar{X} \quad \text{(called the intercept of the line)}$$

In other words,

$$b_1 = \frac{S_{XY}}{S_X^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$
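
The two formulas can be sketched as a small function. The name `fit_line` and the toy data are assumptions chosen for illustration; the points lie exactly on Y = 1 + 2X so the result is easy to check by hand.

```python
def fit_line(x, y):
    """Return (b0, b1) of the least squares line Y-hat = b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S(XY): sum of cross products of deviations from the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # S(XX): sum of squared deviations of X from its mean
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Hypothetical toy data lying exactly on Y = 1 + 2X
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```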

Example:- The following data are sparrow wing lengths (cm) at various ages (days) after hatching.

| Wing length (Y), cm | Age (X), days | XY | X² | Y² | Ŷ | e = Y - Ŷ | e² |
|---|---|---|---|---|---|---|---|
| 1.4 | 3 | 4.2 | 9 | 1.96 | 1.525 | -0.125 | 0.015625 |
| 1.5 | 4 | 6.0 | 16 | 2.25 | 1.795 | -0.295 | 0.087025 |
| 2.2 | 5 | 11.0 | 25 | 4.84 | 2.065 | 0.135 | 0.018225 |
| 2.4 | 6 | 14.4 | 36 | 5.76 | 2.335 | 0.065 | 0.004225 |
| 3.1 | 8 | 24.8 | 64 | 9.61 | 2.875 | 0.225 | 0.050625 |
| 3.2 | 9 | 28.8 | 81 | 10.24 | 3.145 | 0.055 | 0.003025 |
| 3.2 | 10 | 32.0 | 100 | 10.24 | 3.415 | -0.215 | 0.046225 |
| 3.9 | 11 | 42.9 | 121 | 15.21 | 3.685 | 0.215 | 0.046225 |
| 4.1 | 12 | 49.2 | 144 | 16.81 | 3.955 | 0.145 | 0.021025 |
| 4.7 | 14 | 65.8 | 196 | 22.09 | 4.495 | 0.205 | 0.042025 |
| 4.5 | 15 | 67.5 | 225 | 20.25 | 4.765 | -0.265 | 0.070225 |
| 5.2 | 16 | 83.2 | 256 | 27.04 | 5.035 | 0.165 | 0.027225 |
| 5.0 | 17 | 85.0 | 289 | 25.00 | 5.305 | -0.305 | 0.093025 |
| ΣY = 44.4 | ΣX = 130 | ΣXY = 514.80 | ΣX² = 1562 | ΣY² = 171.30 | 44.395 | 0.005 | 0.525 |
(i):- Draw a scatter plot for the data.
(ii):- Fit a simple linear regression and interpret the parameters.
(iii):- Calculate the coefficient of determination and interpret it.
(iv):- Estimate the value of Y when X=13.
(v):- Estimate and interpret the simple correlation coefficient.

Solution:-
[Figure: scatter plot "Wing length vs. days"; x-axis: age (days), y-axis: wing length (cm)]
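$$\bar{X} = 10, \qquad \bar{Y} = 3.415$$

$$S(XY) = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 70.8$$

$$S(XX) = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} = 262$$

$$S(YY) = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = 19.6569$$

$$b_1 = \frac{S(XY)}{S(XX)} = 0.270 \text{ cm/day}$$

$$b_0 = \bar{Y} - b_1 \bar{X} = 0.715 \text{ cm}$$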
So estimated simple linear regression equation is
$\hat{Y} = 0.715 + 0.270\,X$
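
As a check on the hand calculation, the same quantities can be computed from the sparrow data using the computational formulas. This is a sketch; the variable names are assumptions. Note that carrying full precision gives an intercept of about 0.713, whereas the notes obtain 0.715 by using the rounded slope 0.270 and the rounded mean 3.415.

```python
# Sparrow data from the example: age in days (X), wing length in cm (Y)
age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

n = len(age)
sum_x, sum_y = sum(age), sum(wing)
sum_xy = sum(x * y for x, y in zip(age, wing))
sum_xx = sum(x * x for x in age)

# Computational forms of S(XY) and S(XX)
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n

b1 = s_xy / s_xx                   # slope, cm per day
b0 = sum_y / n - b1 * sum_x / n    # intercept, cm

print(f"S(XY) = {s_xy:.1f}, S(XX) = {s_xx:.0f}")
print(f"b1 = {b1:.3f} cm/day, b0 = {b0:.3f} cm")
```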
Interpretation of estimated regression parameters
• The value b1 = 0.270 indicates that the average wing length is expected to increase by 0.270 cm
with each one-day increase in age.
The observed range of age (the explanatory variable) in the experiment was 3 to 17 days (i.e., the scope of
the model); it would therefore be an unreasonable extrapolation to expect this rate of increase in wing
length to continue beyond 17 days. It is safe to use the results of a regression only within the
range of the observed values of the independent variable (i.e., within the scope of the model).
• In the regression equation, b0 = 0.715 is the estimated average wing length when age = 0 days. In this
example the scope of the model does not cover X = 0, so b0 does not have any particular meaning as a
separate term in the regression equation.
NOTE: Interpolation and Extrapolation
Interpolation is making a prediction within the range of values of the predictor in the sample used to
generate the model. Interpolation is generally safe. Extrapolation is making a prediction outside the range
of values of the predictor in the sample used to generate the model. The more removed the prediction is
from the range of values used to fit the model, the riskier the prediction becomes because there is no way
to check that the relationship continues to be linear.
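
One way to make this distinction explicit is to have the prediction step report whether the requested X lies outside the fitted range. The sketch below uses the coefficients and age range from the sparrow example; the function name `predict` and the idea of returning a flag are illustrative assumptions, not part of the notes.

```python
def predict(x_new, b0, b1, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line Y-hat = b0 + b1*X.

    is_extrapolation is True when x_new lies outside the observed
    range [x_min, x_max] of the predictor.
    """
    y_hat = b0 + b1 * x_new
    is_extrapolation = not (x_min <= x_new <= x_max)
    return y_hat, is_extrapolation

# Coefficients and observed age range (3 to 17 days) from the sparrow example
B0, B1, X_MIN, X_MAX = 0.715, 0.270, 3, 17

inside = predict(13, B0, B1, X_MIN, X_MAX)    # interpolation
outside = predict(30, B0, B1, X_MIN, X_MAX)   # extrapolation, treat with caution

print(inside)
print(outside)
```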
Total variation:- S(YY) = 19.6569
Explained variation (variation in Y due to X, also called variation due to regression):
b1 S(XY) = 0.27023 (70.80) = 19.1322, using the unrounded slope b1 = 70.8/262
Unexplained variation: total variation - explained variation = 19.6569 - 19.1322 = 0.5247
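
The decomposition can be verified numerically from the sparrow data. In this sketch (variable names are assumptions) the unexplained variation is computed directly from the residuals rather than by subtraction, which confirms that total = explained + unexplained.

```python
age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(wing) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, wing))
s_xx = sum((x - x_bar) ** 2 for x in age)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

total_ss = sum((y - y_bar) ** 2 for y in wing)    # S(YY)
explained_ss = b1 * s_xy                          # variation due to regression
# Unexplained variation: sum of squared residuals e = Y - Y-hat
residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, wing))

print(f"total       = {total_ss:.4f}")
print(f"explained   = {explained_ss:.4f}")
print(f"unexplained = {residual_ss:.4f}")
```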
Goodness of Fit
An important part of any statistical procedure that builds models from data is establishing how well the
model actually fits. This topic encompasses detecting possible violations of the required
assumptions in the data being analyzed and checking how close the observed data points are to the fitted line.
A commonly used measure of the goodness of fit of a linear model is R², called the coefficient of
determination. If all the observations fall on the regression line, R² is 1. If there is no linear relationship
between Y and X, R² is 0. R² = 0 does not necessarily mean that there is no association between the
variables; it indicates only that there is no linear relationship.
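
The last point can be seen with a small constructed example. In the sketch below the data sets are hypothetical: in the first, Y is completely determined by X through Y = X², yet R² is 0 because the relationship has no linear component over a range symmetric about zero; in the second, the points lie exactly on a line and R² is 1. The helper name `r_squared` is an assumption.

```python
def r_squared(x, y):
    """R^2 of the least squares line, computed as S(XY)^2 / (S(XX) * S(YY))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy ** 2 / (s_xx * s_yy)

x = [-2, -1, 0, 1, 2]

curved = [xi ** 2 for xi in x]         # exact but nonlinear relationship
straight = [3 + 2 * xi for xi in x]    # exact linear relationship

print(r_squared(x, curved))
print(r_squared(x, straight))
```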
The coefficient of determination tells us the proportion of variation in the dependent variable explained
by the independent variable:

$$R^2 = \frac{\text{Reg. SS}}{\text{Total SS}} \times 100 = \frac{19.1322}{19.6569} \times 100 = 97.33\%$$

The value of R² indicates that about 97% of the variation in the dependent variable is explained by the
linear relationship with X; the remainder is due to other, unknown factors.
Finding the value of Y when X=13
$\hat{Y}_{13} = 0.715 + 0.270\,(13) = 4.225$ cm

CORRELATION ANALYSIS
SIMPLE CORRELATION
Q.1. The following data represent the wing length and tail length of sparrows.

| Wing length (X) | Tail length (Y) | XY | X² | Y² |
|---|---|---|---|---|
| 10.4 | 7.4 | 76.96 | 108.16 | 54.76 |
| 10.8 | 7.6 | 82.08 | 116.64 | 57.76 |
| 11.1 | 7.9 | 87.69 | 123.21 | 62.41 |
| 10.2 | 7.2 | 73.44 | 104.04 | 51.84 |
| 10.3 | 7.4 | 76.22 | 106.09 | 54.76 |
| 10.2 | 7.1 | 72.42 | 104.04 | 50.41 |
| 10.7 | 7.4 | 79.18 | 114.49 | 54.76 |
| 10.5 | 7.2 | 75.60 | 110.25 | 51.84 |
| 10.8 | 7.8 | 84.24 | 116.64 | 60.84 |
| 11.2 | 7.7 | 86.24 | 125.44 | 59.29 |
| 10.6 | 7.8 | 82.68 | 112.36 | 60.84 |
| 11.4 | 8.3 | 94.62 | 129.96 | 68.89 |
| ΣX = 128.2 | ΣY = 90.8 | ΣXY = 971.37 | ΣX² = 1371.32 | ΣY² = 688.40 |
(a) Find the coefficient of correlation between wing length and tail length.
(b) Test the hypothesis $H_0: \rho_{12} = 0$

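Solution
(a) Coefficient of correlation between wing length and tail length

$$\bar{X} = 10.68, \qquad \bar{Y} = 7.57$$

$$S_{XY} = \sum XY - n\,\bar{X}\,\bar{Y} = 1.32$$

$$S_X^2 = \sum X^2 - n(\bar{X})^2 = 1.72$$

$$S_Y^2 = \sum Y^2 - n(\bar{Y})^2 = 1.35$$

$$r = \frac{S_{XY}}{\sqrt{S_X^2\, S_Y^2}} = 0.866$$

The three sums are computed with unrounded means. The value r = 0.866 comes from the rounded figures 1.32, 1.72 and 1.35; carrying full precision throughout gives r ≈ 0.870.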
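
The calculation in part (a) can be reproduced at full precision, which yields r ≈ 0.870 rather than the 0.866 obtained from rounded intermediate values. The notes do not give a solution to part (b); the sketch below assumes the usual t statistic for testing a zero population correlation, t = r√(n-2)/√(1-r²) with n-2 degrees of freedom. Variable names are illustrative.

```python
import math

# Wing length (X) and tail length (Y) of the 12 sparrows in Q.1
wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

n = len(wing)
x_bar = sum(wing) / n
y_bar = sum(tail) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(wing, tail))
s_xx = sum((x - x_bar) ** 2 for x in wing)
s_yy = sum((y - y_bar) ** 2 for y in tail)

# (a) Pearson correlation coefficient
r = s_xy / math.sqrt(s_xx * s_yy)

# (b) Assumed test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.3f}")
print(f"t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

The statistic would then be compared with the tabulated t value for 10 degrees of freedom at the chosen significance level.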
PARTIAL CORRELATION
Q.:- Suppose that X1 = fish length, X2 = fish weight, X3 = fish age, and r12 = 0.60, r13 = 0.70, r23 = 0.65, n = 15.
(a) Find the partial correlation coefficient between X1 and X2 with the effect of X3 held constant.

$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} = \frac{(0.60) - (0.70)(0.65)}{\sqrt{(1 - 0.70^2)(1 - 0.65^2)}} = 0.27$$
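
The partial correlation formula is easy to wrap in a function and check against the value above. The name `partial_corr` is an assumption made for this sketch.

```python
import math

def partial_corr(r12, r13, r23):
    """Partial correlation between X1 and X2 with X3 held constant."""
    numerator = r12 - r13 * r23
    denominator = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return numerator / denominator

# Correlations from the fish example
r12_3 = partial_corr(0.60, 0.70, 0.65)
print(f"r12.3 = {r12_3:.2f}")
```

The drop from r12 = 0.60 to r12.3 ≈ 0.27 shows how much of the length-weight association is shared with age.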

MULTIPLE CORRELATION
Q.:- Suppose that X1 = fish length, X2 = fish weight, X3 = fish age, and r12 = 0.60, r13 = 0.70, r23 = 0.65, n = 15.
(a) Find the multiple correlation coefficient between X1 and the joint effect of X2 and X3.

$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{(0.60)^2 + (0.70)^2 - 2(0.60)(0.70)(0.65)}{1 - (0.65)^2}} = 0.73$$
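
The same check can be made for the multiple correlation coefficient. The function name `multiple_corr` is an assumption made for this sketch.

```python
import math

def multiple_corr(r12, r13, r23):
    """Multiple correlation of X1 with X2 and X3 jointly."""
    numerator = r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23
    denominator = 1 - r23 ** 2
    return math.sqrt(numerator / denominator)

# Correlations from the fish example
R1_23 = multiple_corr(0.60, 0.70, 0.65)
print(f"R1.23 = {R1_23:.2f}")
```

As expected, R1.23 ≈ 0.73 is at least as large as either simple correlation, r12 = 0.60 or r13 = 0.70.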
