ECONOMETRICS II Notes 1

Coefficient of Multiple Determination (or the Squared Multiple Correlation Coefficient), R²

When there is more than one explanatory variable we talk of multiple correlation. The square of the multiple correlation coefficient is called the coefficient of multiple determination. The coefficient of multiple determination, R², is defined as the proportion of the total variation in Y explained by the multiple regression of Y on X1 and X2. As in the two-variable model, R² shows the percentage of the total variation in Y explained by the regression plane, that is, by changes in X1 and X2.

R^2 = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\sum e_i^2}{\sum y_i^2} = \frac{\sum y_i^2 - \sum e_i^2}{\sum y_i^2}

We have established that

e_i = y_i - \hat{y}_i

and

\hat{y}_i = \hat{b}_1 x_1 + \hat{b}_2 x_2

The residual sum of squares is

\sum e_i^2 = \sum e_i (y_i - \hat{y}_i) = \sum e_i (y_i - \hat{b}_1 x_1 - \hat{b}_2 x_2)
= \sum e_i y_i - \hat{b}_1 \sum e_i x_1 - \hat{b}_2 \sum e_i x_2

But from our normal equations (the least-squares first-order conditions)

-2 \sum x_1 (y_i - \hat{b}_1 x_1 - \hat{b}_2 x_2) = 0 \qquad -2 \sum x_2 (y_i - \hat{b}_1 x_1 - \hat{b}_2 x_2) = 0

and since \hat{y}_i = \hat{b}_1 x_1 + \hat{b}_2 x_2, we can observe that

\sum [y_i - (\hat{b}_1 x_1 + \hat{b}_2 x_2)] x_1 = \sum (y_i - \hat{y}_i) x_1 = 0
\sum [y_i - (\hat{b}_1 x_1 + \hat{b}_2 x_2)] x_2 = \sum (y_i - \hat{y}_i) x_2 = 0

or

\sum e_i x_1 = 0 \qquad \sum e_i x_2 = 0
Therefore

\sum e_i^2 = \sum e_i y_i = \sum (y_i - \hat{y}_i) y_i = \sum y_i (y_i - \hat{b}_1 x_1 - \hat{b}_2 x_2)
= \sum y_i^2 - \hat{b}_1 \sum y_i x_1 - \hat{b}_2 \sum y_i x_2

Substituting into the formula for R² gives

R^2 = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = 1 - \frac{\sum e_i^2}{\sum y_i^2} = \frac{\hat{b}_1 \sum y_i x_1 + \hat{b}_2 \sum y_i x_2}{\sum y_i^2}

The value of R² lies between 0 and 1. The higher the R², the greater the percentage of the variation of Y explained by the regression plane, that is, the better the goodness of fit of the regression plane to the sample observations. The closer R² is to zero, the worse the fit.
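As a quick numerical illustration, the following is a minimal sketch assuming hypothetical data (the y, x1, x2 values are illustrative, not from these notes); it fits Y on X1 and X2 by OLS and checks that the explained-to-total and one-minus-residual forms of R² agree.

```python
# A minimal sketch, assuming hypothetical data (y, x1, x2 are illustrative).
import numpy as np

y  = np.array([40.0, 44.0, 46.0, 48.0, 52.0, 58.0, 60.0, 68.0, 74.0, 80.0])
x1 = np.array([ 6.0, 10.0, 12.0, 14.0, 16.0, 18.0, 22.0, 24.0, 26.0, 32.0])
x2 = np.array([ 4.0,  4.0,  5.0,  7.0,  9.0, 12.0, 14.0, 20.0, 21.0, 24.0])

X = np.column_stack([np.ones_like(y), x1, x2])   # intercept, X1, X2
b, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS coefficient estimates
y_hat = X @ b
e = y - y_hat

tss = np.sum((y - y.mean()) ** 2)        # total variation in Y
ess = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the plane
rss = np.sum(e ** 2)                     # unexplained (residual) variation

print(ess / tss, 1 - rss / tss)          # the two expressions for R^2 agree
```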

The above formula does not take into account the loss of degrees of freedom from the introduction of additional explanatory variables into the function. Inclusion of additional explanatory variables is likely to increase the numerator Σŷᵢ² for the same total sum of squares, hence R² also increases. To take this into consideration we use the 'adjusted' R², denoted R̄².

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}

where n = the number of observations and k = the number of parameters estimated.
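A minimal sketch of the adjustment; the R², n and k values used below are borrowed, purely for illustration, from the Cobb-Douglas fit later in these notes.

```python
# Adjusted R-bar-squared from R^2, n and k (k counts the estimated parameters).
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k)

print(adjusted_r2(r2=0.781, n=30, k=3))   # about 0.765
```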

Example: Compute the R2 and the adjusted R2 for the corn-fertilizer-insecticide example.
Test of the Overall Significance of the Regression

The overall significance of the regression can be tested with the ratio of the explained to the unexplained variance. This ratio follows an F distribution with k − 1 and n − k degrees of freedom.

F_{k-1,\,n-k} = \frac{\sum \hat{y}_i^2 / (k-1)}{\sum e_i^2 / (n-k)} = \frac{R^2 / (k-1)}{(1-R^2) / (n-k)}

If the calculated F ratio exceeds the tabular value of F at the specified level of significance and degrees of freedom, we reject the null hypothesis that the regression parameters are all equal to zero and conclude that R² is significantly different from zero.
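A sketch of this test computed from R² using the right-hand form of the formula above; the R², n and k values are again borrowed from the Cobb-Douglas example later in these notes, and the 5% critical value is taken from scipy rather than printed tables.

```python
# Overall F test of the regression from R^2 (illustrative values).
from scipy import stats

R2, n, k = 0.781, 30, 3
F = (R2 / (k - 1)) / ((1 - R2) / (n - k))          # observed F ratio
F_crit = stats.f.ppf(0.95, dfn=k - 1, dfd=n - k)   # 5% critical value
print(F, F_crit, F > F_crit)   # F > F_crit: reject H0 that all slope parameters are zero
```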

The theory of correlation

Correlation analysis has serious limitations and throws little light on the nature of the relationships existing between variables. Regression analysis implies (but does not prove) causality between the independent variable X and the dependent variable Y. Correlation analysis, however, implies no causality or dependence but refers simply to the type and degree of association between two variables. For example, X and Y may be highly correlated because of another variable that strongly affects both. Thus correlation analysis is a much less powerful tool than regression analysis and is seldom used by itself in the real world. Correlation is mainly used to determine the degree of association found in regression analysis. It is given as the coefficient of determination, which is the square of the correlation coefficient. The correlation coefficient is therefore a crucial statistic of regression analysis, hence students must familiarize themselves with it.

Correlation may be defined as the degree of relationship existing between two or more
variables.

• Simple correlation: the degree of relationship existing between two variables.
• Multiple correlation: the degree of relationship existing among more than two variables.
• Linear correlation: occurs when all the points (X, Y) on a scatter diagram seem to cluster near a straight line.
• Nonlinear correlation: occurs when all the points seem to lie near a curve.
• Positive and negative correlation: two variables may have a positive correlation, a negative correlation, or they may be uncorrelated. This holds for both linear and nonlinear correlation.
Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a
perfect negative correlation while a value of +1.00 represents a perfect positive
correlation. A value of 0.00 represents a lack of correlation.

The most widely used type of correlation coefficient is Pearson r, also called linear or product-moment correlation.

Simple Linear Correlation (Pearson r). Pearson correlation (hereafter called correlation) assumes that the two variables are measured on at least interval scales, and it determines the extent to which values of the two variables are "proportional" to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds, or centimeters and kilograms, are used as measurement units. Proportional means linearly related; that is, the correlation is high if it can be "summarized" by a straight line (sloped upwards or downwards).
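A short sketch of this unit-invariance property, using hypothetical height and weight figures (not from the text).

```python
# Pearson r is unchanged by a linear change of measurement units.
import numpy as np

height_in = np.array([60, 62, 65, 67, 70, 72, 74], dtype=float)         # inches
weight_lb = np.array([110, 125, 140, 150, 165, 180, 195], dtype=float)  # pounds

r_imperial = np.corrcoef(height_in, weight_lb)[0, 1]
r_metric = np.corrcoef(height_in * 2.54, weight_lb * 0.4536)[0, 1]      # cm, kg
print(r_imperial, r_metric)   # identical values
```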

Partial Correlation Coefficients


The partial correlation coefficient measures the net correlation between the dependent variable and one independent variable after excluding the common influence of (i.e., holding constant) the other independent variable in the model. For example, r_{YX1.X2} is the partial correlation between Y and X1 after removing the influence of X2 from both Y and X1.

r_{YX_1 \cdot X_2} = \frac{r_{YX_1} - r_{YX_2}\, r_{X_1 X_2}}{\sqrt{1 - r_{X_1 X_2}^2}\;\sqrt{1 - r_{YX_2}^2}}

r_{YX_2 \cdot X_1} = \frac{r_{YX_2} - r_{YX_1}\, r_{X_1 X_2}}{\sqrt{1 - r_{X_1 X_2}^2}\;\sqrt{1 - r_{YX_1}^2}}

Partial correlation coefficients range in value from -1 to +1 (as do simple correlation coefficients), they have the sign of the corresponding estimated parameters, and they are useful in determining the relative importance of the different explanatory variables in a multiple regression.
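A minimal sketch of the partial-correlation formulas above; the simple correlations passed in at the bottom are hypothetical illustrative values.

```python
# First-order partial correlation from the three simple correlations.
from math import sqrt

def partial_corr(r_y1, r_y2, r_12):
    """r_{YX1.X2}: correlation of Y and X1 holding X2 constant."""
    return (r_y1 - r_y2 * r_12) / (sqrt(1 - r_12 ** 2) * sqrt(1 - r_y2 ** 2))

print(partial_corr(r_y1=0.80, r_y2=0.60, r_12=0.50))   # r_{YX1.X2}
print(partial_corr(r_y1=0.60, r_y2=0.80, r_12=0.50))   # swapping roles gives r_{YX2.X1}
```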

Statistical inference in the Multiple Regression Model; formulas for the General
case of k Explanatory variables

The analysis of variance (ANOVA) is a statistical method developed by R.A. Fisher for the analysis of experimental data. It was first applied to the analysis of agricultural experiments (the use of various fertilizers, various seeds), but its application soon expanded to many other fields of scientific research.

The method of ANOVA breaks down total variance of a variable into additive
components which may be attributed to various, separate factors. These factors are the
‘causes’ or ‘sources’ of variation of the variable being analyzed. When the method is
applied to the experimental data, it assumes a certain design of the experiment, which
determines the number of relevant factors (or causes) of variation and the logical
significance of each one of them.

EXAMPLE: Assume that we have ten plots of land on which we cultivate maize, and we want to study the yield per unit of land. We use different seed varieties, different fertilizers and different systems of irrigation. We would therefore logically attribute the variation in yield to each of the three factors.
X1=type of seed variety
X2=type of fertilizer
X3=type of irrigation system
With the ANOVA method we may break down total variation in yield into three separate
components: a component due to X1, another due to X2 and a third due to X3.

By this definition the ANOVA method is conceptually the same as regression analysis. In regression analysis the aim is also to determine the factors which cause the variation of the dependent variable. The total variation in the dependent variable is split into two components: the variation explained by the regression line (or regression plane), and the unexplained variation, shown by the scatter of points around the regression line. Furthermore, the squared multiple correlation coefficient was seen to represent the proportion of the total variation explained by the regression line (or regression plane), and R² was found to be decomposable into additive components, each corresponding to a relevant explanatory variable. However, there are significant differences between the two methods. The main difference is that regression analysis provides numerical values for the influence of the various explanatory factors on the dependent variable, in addition to the information concerning the breakdown of the total variance of Y into additive components, while ANOVA provides only the latter type of information.

The objective of both the ANOVA and regression analysis is to determine factors which
cause the variation of the dependent variable Y. This resemblance has led to the
combination of the two methods in most scientific fields.

Importance of ANOVA in Regression Analysis

ANOVA is used to conduct tests of significance in the:

i) Overall significance of regression.


ii) Significance of improvement in fit as a result of additional explanatory
variables in the function (same as t test).
iii) Equality of coefficients obtained from different samples.
iv) Extra sample performance of regression (or stability of regression
coefficients)
v) Restrictions imposed on coefficients of a function.

The Method of ANOVA as a Statistical Method

The aim of ANOVA is to split the total variation of a variable (around its mean) into components which may be attributed to specific (additive) causes. To simplify the ANOVA we assume there is only one systematic factor which influences the variable being studied. Any variation not accounted for by this (explanatory) factor is assumed to be random (or chance) variation, due to various random happenings. We have a series of values of a variable Y and the corresponding values of the (explanatory) variable X. The ANOVA method concentrates only on the Y values and studies their variation. The values of X are used only for dividing the values of Y into sub-groups, or sub-samples; for example, one group (or sample) corresponding to large values of X and one group corresponding to small values of X.

For each sub-sample we estimate the mean value of Y, obtaining a set of means. If X (which is the basis of the classification of the Y's into the sub-samples) is an important cause of variation in Y (an important explanatory variable), the difference between the means of the sub-samples will be large; this would be shown by a large dispersion of the sub-sample means Ȳᵢ around the common mean Ȳ, that is, by a large variance of the distribution of the means. On the contrary, if X is not an important source of variation in Y, the difference between the means of the sub-samples will be small, a fact that would be reflected in a small variance of the distribution of the sample means Ȳᵢ around the common mean Ȳ:

• The importance of X as a cause of variation (in Y) is judged from the difference between the means of the sub-samples (the Ȳᵢ's), formed on the basis of the values of X.
• The difference between the means is reflected in the value of the variance of the distribution of the sample means.

We study and test the difference between the means using two estimates of the population variance of Y, σ²_Y. One estimate is obtained by pooling the variances of the sub-samples, and the other is obtained from the sampling distribution of the means (the distribution of Ȳ). ANOVA ultimately estimates these two variances, and their comparison is used to establish whether the difference between the means is statistically significant or is due to chance.

The comparison of any two variances is implemented with the F statistic and the F tables. The F statistic is the ratio of any two independent estimates of variance obtained from sample data. Each estimate involves some loss of degrees of freedom. F is often called the variance ratio, the letter F standing for the name of Fisher, who developed the statistic.

If the variance estimates are close to each other, their ratio will approach the value of one (1).

The greater the discrepancy (difference) between the two variances, the greater the value of the F ratio. Thus, in general, high values of F suggest that the difference between the two variances is significant, i.e. they lead to rejection of the null hypothesis, which assumes no significant difference between the two variances.

Example:

Suppose three different types of petrol are used for running a car: Regular rated at 90
octane, Super rated at 95 octane and Unleaded-Regular at 100 octane. We wish to test
whether these different types of petrol give the same consumption per mile, that is, we
want to compare the consumption performance of the three brands of petrol. Suppose
that we use each brand for ten days and we measure the miles per gallon of petrol. Thus
we obtain three samples of size 10 for each brand.

Sample 1 (Regular)      Sample 2 (Super)        Sample 3 (U-Regular)
n1 = 10                 n2 = 10                 n3 = 10
--------------------------------------------------------------------
32                      35                      44
30                      38                      46
35                      37                      47
33                      40                      47
35                      41                      46
34                      35                      43
29                      37                      47
32                      41                      45
36                      36                      48
34                      40                      47
--------------------------------------------------------------------
∑Y1i = 330              ∑Y2i = 380              ∑Y3i = 460              Total: ∑j∑i Yji = 1170, N = n1 + n2 + n3 = 30
Ȳ1 = 330/10 = 33        Ȳ2 = 380/10 = 38        Ȳ3 = 460/10 = 46        Ȳ = 1170/30 = 39
S1² = ∑(Y1i − Ȳ1)²/n1   S2² = ∑(Y2i − Ȳ2)²/n2   S3² = ∑(Y3i − Ȳ3)²/n3
    = 46/10 = 4.6           = 50/10 = 5.0           = 22/10 = 2.2

The data above may be interpreted as three random samples of size 10 each (n1 = n2 = n3 = 10), with means Ȳ1 = 33, Ȳ2 = 38 and Ȳ3 = 46 miles per gallon of petrol. Our problem is to establish whether the difference between these means is significant or whether it may be attributed to chance.

We want to test whether there is any significant difference between the means of the populations, μ1, μ2 and μ3; that is, we test the null hypothesis

H_0: μ1 = μ2 = μ3

against the alternative hypothesis

H_1: the μ_j are not all equal.

If the three means are the same, that is, if the null hypothesis is true, then the three populations may be considered as one large population with mean μ (= μ1 = μ2 = μ3) and standard deviation σ, that is

Y ∼ N(μ, σ²)

An estimate of the common mean μ may be computed from the enlarged sample of n1 + n2 + n3 = 30 observations:

\hat{\mu} = \frac{\sum_j \sum_i Y_{ji}}{N} = \frac{1170}{30} = 39

We then obtain estimates of the population variance σ², which can be derived in two ways.

First, an unbiased estimate of the population variance may be obtained from the expression

\hat{\sigma}^2 = \frac{\sum_{j}^{k} n_j (\bar{Y}_j - \bar{Y})^2}{k-1}

where k is the number of samples.
Secondly, an estimate of the population variance σ² may be obtained by pooling together the various sample variances. Using the double summation we obtain the following expression:

\hat{\hat{\sigma}}^2 = \frac{\sum_{j}^{k} \sum_{i}^{n_j} (Y_{ji} - \bar{Y}_j)^2}{N-k}

This estimate of the population variance is obtained from the sample variances, which reflect the variation within each sample.
We now have two unbiased estimates of the population variance σ².

Estimate (1) reflects the variation between the sample means, and depends on the validity of the null hypothesis.

Estimate (2) reflects the variation of the Y's within the samples, and is independent of the null hypothesis.

From our example we have the following results:

(1) The ‘between’ variance estimate is

\hat{\sigma}^2 = \frac{\sum_{j}^{k} n_j (\bar{Y}_j - \bar{Y})^2}{k-1}
= \frac{n_1 (\bar{Y}_1 - \bar{Y})^2 + n_2 (\bar{Y}_2 - \bar{Y})^2 + n_3 (\bar{Y}_3 - \bar{Y})^2}{3-1}
= \frac{10(33-39)^2 + 10(38-39)^2 + 10(46-39)^2}{3-1} = \frac{860}{2} = 430

(2) The ‘within’ variance estimate is


\hat{\hat{\sigma}}^2 = \frac{\sum_{j}^{k} \sum_{i}^{n_j} (Y_{ji} - \bar{Y}_j)^2}{N-k}
= \frac{\sum_{i=1}^{10} (Y_{1i} - \bar{Y}_1)^2 + \sum_{i=1}^{10} (Y_{2i} - \bar{Y}_2)^2 + \sum_{i=1}^{10} (Y_{3i} - \bar{Y}_3)^2}{30-3}
= \frac{46 + 50 + 22}{27} = \frac{118}{27} \approx 4.37
(3) Therefore the observed variance ratio is

F^* = \frac{\hat{\sigma}^2}{\hat{\hat{\sigma}}^2} = \frac{430}{4.37} = 98.39 \approx 98.4

(4) The theoretical value of F at the 5% level of significance with v1 = k − 1 = 2 and v2 = N − k = 27 degrees of freedom is found from the F tables: F_{0.05} = 3.35.
(5) Since F* > F_{0.05} we reject the null hypothesis, that is, we accept that there is a significant difference in the average mileage obtained from the three types of petrol.
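The petrol example can be reproduced directly from the table data above; the sketch below recomputes the 'between' and 'within' estimates and the F ratio, and checks the F statistic against scipy's one-way ANOVA routine.

```python
# Reproducing the petrol ANOVA from the sample data in the table.
import numpy as np
from scipy import stats

regular  = np.array([32, 30, 35, 33, 35, 34, 29, 32, 36, 34], dtype=float)
super_   = np.array([35, 38, 37, 40, 41, 35, 37, 41, 36, 40], dtype=float)
unleaded = np.array([44, 46, 47, 47, 46, 43, 47, 45, 48, 47], dtype=float)
samples = [regular, super_, unleaded]

N = sum(len(s) for s in samples)                 # 30
k = len(samples)                                 # 3
grand_mean = np.concatenate(samples).mean()      # 39

between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples) / (k - 1)  # 430
within = sum(((s - s.mean()) ** 2).sum() for s in samples) / (N - k)             # 118/27, about 4.37
F_star = between / within                                                        # about 98.4

F_crit = stats.f.ppf(0.95, dfn=k - 1, dfd=N - k)   # about 3.35
print(F_star, F_crit, F_star > F_crit)             # reject H0: the means are not all equal

print(stats.f_oneway(regular, super_, unleaded).statistic)   # same F statistic
```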

Regression Analysis and Analysis of Variance

To illustrate the similarities between regression analysis and the ANOVA method we
work out the above example with the method of OLS and compare the results.

We quantify the octane rating of the three brands of petrol by treating it as a variable rather than as a qualitative attribute.
Tests of Restrictions Imposed on the Relationship between Two or More Parameters of a Function

The Cobb-Douglas function is one of the most popular forms of production function in the theory of production:

X = b_0 L^{b_1} K^{b_2}

where X = output, L = labour input and K = capital input. This function is homogeneous of degree (b_1 + b_2), so that if
(b_1 + b_2) = 1, we have constant returns to scale;
(b_1 + b_2) < 1, we have decreasing returns to scale;
(b_1 + b_2) > 1, we have increasing returns to scale.
Assume that by fitting a regression to a sample of 30 observations on X, L and K for the tea industry we obtain

\hat{X} = \hat{b}_0 L^{0.7321} K^{0.3422} \qquad s_{\hat{b}_1} = 0.03 \quad s_{\hat{b}_2} = 0.02

R^2 = 0.781 \qquad \sum y^2 = 180 \qquad \sum e_1^2 = 4.64

We observe that (\hat{b}_1 + \hat{b}_2) = 1.0743; that is, our estimates suggest that over the sample period the tea industry experienced increasing returns to scale. We want to test the statistical reliability of this result, i.e. we test the hypothesis

H_0: (b_1 + b_2) = 1

against the alternative hypothesis

H_1: (b_1 + b_2) \neq 1

We test the hypothesis using the F ratio as follows.

Step 1. We perform the regression with the restriction (b_1 + b_2) = 1. From the restriction we obtain b_2 = 1 - b_1, so that by substitution in the production function we find

X = b_0 L^{b_1} K^{(1 - b_1)}
Dividing through by K we find

\frac{X}{K} = b_0 \left(\frac{L}{K}\right)^{b_1}

Fitting a regression to this expression we get b_1^* = 0.7431. Substituting into the restriction we find b_2^* = 1 - b_1^* = 0.2569. Thus the restricted production function is

X = b_0^* L^{0.7431} K^{0.2569} \qquad R^2 = 0.650 \quad \sum y^2 = 180 \quad \sum e_2^2 = 6.45
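A sketch, under stated assumptions, of how the restricted regression of Step 1 could be fitted in practice: taking logs gives ln(X/K) = ln(b0) + b1 ln(L/K), a simple linear regression. The output, labour and capital figures below are hypothetical, since the tea-industry data are not reproduced in these notes.

```python
# Fitting the restricted Cobb-Douglas form by regressing ln(X/K) on ln(L/K).
import numpy as np

X_out = np.array([100.0, 120.0, 150.0, 160.0, 190.0, 220.0, 250.0, 300.0])
L     = np.array([ 40.0,  45.0,  55.0,  60.0,  70.0,  80.0,  95.0, 110.0])
K     = np.array([ 60.0,  70.0,  85.0,  90.0, 105.0, 120.0, 140.0, 165.0])

y = np.log(X_out / K)                         # dependent variable of the restricted form
x = np.log(L / K)
A = np.column_stack([np.ones_like(x), x])     # columns: intercept (= ln b0), slope (= b1)
(ln_b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)

b2 = 1 - b1                                   # recovered from the restriction b1 + b2 = 1
print(np.exp(ln_b0), b1, b2)
```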

Step 2: We have two unexplained sums of squares:

\sum e_1^2 = sum of squared residuals from the unrestricted function, and
\sum e_2^2 = sum of squared residuals from the restricted function.

Tintner (1952) suggested that the statistic

F^* = \frac{\sum e_2^2 - \sum e_1^2}{\sum e_1^2}\,(n-k)

has an F distribution with v_1 = 1 and v_2 = n - k degrees of freedom.
Step 3: The observed F* value is compared with the theoretical (tabular) value F_{0.05} with v_1 = 1 and v_2 = (n - k) df. If F* > F_{0.05} we reject our basic hypothesis, that is, we accept that (b_1 + b_2) \neq 1.

In our example

F^* = \frac{6.45 - 4.64}{4.64}(30-3) = \frac{1.81}{4.64}(27) = 10.53

F_{0.05} = 4.20

Hence we conclude that (b_1 + b_2) \neq 1.
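A minimal sketch of the same calculation, using the residual sums of squares quoted above; the critical value is taken from scipy rather than printed tables.

```python
# F test of the single restriction b1 + b2 = 1 (values quoted in the notes).
from scipy import stats

rss_restricted, rss_unrestricted = 6.45, 4.64
n, k = 30, 3                                   # observations, parameters in the unrestricted fit

F_star = (rss_restricted - rss_unrestricted) / rss_unrestricted * (n - k)   # about 10.53
F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - k)                                # about 4.2
print(F_star, F_crit, F_star > F_crit)         # reject H0: b1 + b2 = 1
```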
In general, if we have c restrictions and we want to test the null hypothesis

H_0: all the restrictions are true

against the alternative hypothesis

H_1: not all the restrictions are true,

we proceed as follows.

Step 1: Apply OLS to the original unrestricted relationship and obtain the sum of squared residuals \sum e_U^2 with (n - k) degrees of freedom.

Step 2: Apply OLS to the restricted relationship (i.e. the function which incorporates the restrictions) and obtain the sum of (restricted) squared residuals \sum e_R^2 with n - (k - c) = n - k + c degrees of freedom.

Step 3: Compute the F* ratio

F^* = \frac{\left(\sum e_R^2 - \sum e_U^2\right)/c}{\sum e_U^2/(n-k)}
Step 4: Find the critical value of F at the chosen level of significance (from the F tables) with v_1 = c and v_2 = (n - k), where c is the number of restrictions and (n - k) is the number of degrees of freedom in the unrestricted estimation. If F* > F we reject the null hypothesis and conclude that the restrictions are not supported by the sample data. If F* < F we accept the null hypothesis and conclude that the restrictions are compatible with what is observed in the real world.
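A minimal sketch of the general test as a function of the two residual sums of squares, n, k and the number of restrictions c; with c = 1 it reproduces the Cobb-Douglas result above.

```python
# General restricted-vs-unrestricted F test.
def restriction_F(rss_restricted, rss_unrestricted, n, k, c):
    """F* with v1 = c and v2 = n - k degrees of freedom."""
    return ((rss_restricted - rss_unrestricted) / c) / (rss_unrestricted / (n - k))

print(restriction_F(6.45, 4.64, n=30, k=3, c=1))   # about 10.53
```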
