ECONOMETRICS II Notes 1
The Coefficient of Multiple Determination (Squared Multiple Correlation Coefficient), R2
When there is more than one explanatory variable we talk of multiple correlation. The square of the multiple correlation coefficient is called the coefficient of multiple determination, or squared multiple correlation coefficient. The coefficient of multiple determination, R2, is defined as the proportion of the total variation in Y explained by the multiple regression of Y on X1 and X2. As in the two-variable model, R2 shows the percentage of the total variation in Y explained by the regression plane, that is, by changes in X1 and X2.
R2 = ∑ŷi² / ∑yi² = ∑(Ŷi − Ȳ)² / ∑(Yi − Ȳ)² = 1 − ∑ei² / ∑yi²
where
ei = yi − ŷi and ŷi = b̂1x1 + b̂2x2 (in deviation form).
∑ei² = ∑ei(yi − ŷi)
= ∑ei(yi − b̂1x1 − b̂2x2)
= ∑eiyi − b̂1∑eix1 − b̂2∑eix2
From the OLS normal equations, −2∑x1(yi − b̂1x1 − b̂2x2) = 0 and −2∑x2(yi − b̂1x1 − b̂2x2) = 0, so we can observe that
∑eix1 = 0 and ∑eix2 = 0.
Therefore
∑ei² = ∑eiyi
= ∑(yi − ŷi)yi
= ∑yi(yi − b̂1x1 − b̂2x2)
= ∑yi² − b̂1∑yix1 − b̂2∑yix2
Substituting back,
R2 = ∑ŷi² / ∑yi² = 1 − ∑ei² / ∑yi² = (b̂1∑yix1 + b̂2∑yix2) / ∑yi²
The value of R2 lies between 0 and 1. The higher the R2, the greater the percentage of the variation of Y explained by the regression plane, that is, the better the goodness of fit of the regression plane to the sample observations. The closer R2 is to zero, the worse the fit.
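To make the algebra concrete, here is a minimal Python sketch with purely hypothetical data (the arrays Y, X1 and X2 below are illustrative, not taken from the notes). It fits the two-regressor model in deviation form and checks that the three expressions for R2 above coincide.

import numpy as np

# Hypothetical data (illustrative only): Y regressed on X1 and X2
Y  = np.array([40.0, 44, 46, 48, 52, 58, 60, 68, 74, 80])
X1 = np.array([ 6.0, 10, 12, 14, 16, 18, 22, 24, 26, 32])
X2 = np.array([ 4.0,  4,  5,  7,  9, 12, 14, 20, 21, 24])

# Deviation form, as in the notes: y = Y - Ybar, x1 = X1 - X1bar, x2 = X2 - X2bar
y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()

# OLS estimates of b1, b2 (no intercept needed in deviation form)
X = np.column_stack([x1, x2])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b_hat            # fitted values (deviations)
e = y - y_hat                # residuals

r2_explained = (y_hat**2).sum() / (y**2).sum()                        # ESS / TSS
r2_residual  = 1 - (e**2).sum() / (y**2).sum()                        # 1 - RSS / TSS
r2_coeffs    = (b_hat[0]*(y*x1).sum() + b_hat[1]*(y*x2).sum()) / (y**2).sum()
print(r2_explained, r2_residual, r2_coeffs)   # all three forms agree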
The above formula does not take into account the loss of degrees of freedom from the introduction of additional explanatory variables into the function. Inclusion of additional explanatory variables is likely to increase the numerator ∑ŷi² for the same TSS, hence R2 also increases. To take this into consideration we use the 'adjusted' R2, denoted R̄2:
R̄2 = 1 − (1 − R2)(n − 1)/(n − k)
where n = number of observations and k = number of parameters estimated (including the intercept).
Example: Compute the R2 and the adjusted R2 for the corn-fertilizer-insecticide example.
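The corn-fertilizer-insecticide data are not reproduced in this section, so the short sketch below only illustrates the adjustment formula with hypothetical values of R2, n and k.

def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2: penalizes R2 for the degrees of freedom used by the
    k estimated parameters (intercept included)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Hypothetical figures only, not the answer to the corn-fertilizer-insecticide example
print(adjusted_r2(r2=0.95, n=10, k=3))   # about 0.936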
Test of the Overall Significance of the Regression
The overall significance of the regression can be tested with the ratio of the explained to the unexplained variance. This ratio follows an F distribution with k − 1 and n − k degrees of freedom:
F(k−1, n−k) = [∑ŷi² / (k − 1)] / [∑ei² / (n − k)] = [R2 / (k − 1)] / [(1 − R2) / (n − k)]
If the calculated F ratio exceeds the tabular value of F at the specified level of significance and degrees of freedom, we accept the hypothesis that the regression parameters are not all equal to zero and that R2 is significantly different from zero.
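As a sketch of the mechanics (with hypothetical R2, n and k; scipy is used only to look up the tabular F value):

from scipy.stats import f

R2, n, k = 0.95, 10, 3                           # hypothetical values

F_stat = (R2 / (k - 1)) / ((1 - R2) / (n - k))   # explained vs unexplained variance
F_crit = f.ppf(0.95, dfn=k - 1, dfd=n - k)       # tabular F at the 5% level

# Reject H0 (all slope coefficients equal to zero) if F_stat > F_crit
print(F_stat, F_crit)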
Correlation analysis has serious limitations and throws little light on the nature of the relationships existing between variables. Regression analysis implies (but does not prove) causality between the independent variable X and the dependent variable Y. Correlation analysis, by contrast, implies no causality or dependence but refers simply to the type and degree of association between two variables. For example, X and Y may be highly correlated because of another variable that strongly affects both. Thus correlation analysis is a much less powerful tool than regression analysis and is seldom used by itself in the real world. Correlation is mainly used to measure the degree of association found in regression analysis; it is given as the coefficient of determination, which is the square of the correlation coefficient. The correlation coefficient is therefore a crucial statistic in regression analysis, and students must familiarize themselves with it.
Correlation may be defined as the degree of relationship existing between two or more variables.
The most widely-used type of correlation coefficient is Pearson r, also called linear or
product- moment correlation.
When there are two explanatory variables, the partial correlation coefficients (the correlation between Y and one regressor, holding the other regressor constant) are given by:
rYX1.X2 = (rYX1 − rYX2 rX1X2) / (√(1 − r²X1X2) √(1 − r²YX2))
rYX2.X1 = (rYX2 − rYX1 rX1X2) / (√(1 − r²X1X2) √(1 − r²YX1))
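A short Python sketch of these partial-correlation formulas, computed from the three pairwise (simple) correlations; the data are hypothetical.

import numpy as np

# Hypothetical data for Y, X1, X2
Y  = np.array([12.0, 14, 15, 18, 20, 22, 25, 27, 30, 33])
X1 = np.array([ 2.0,  3,  4,  5,  6,  8,  9, 10, 12, 14])
X2 = np.array([ 1.0,  1,  2,  2,  3,  3,  4,  5,  5,  6])

# Simple (pairwise) correlation coefficients
r_yx1  = np.corrcoef(Y,  X1)[0, 1]
r_yx2  = np.corrcoef(Y,  X2)[0, 1]
r_x1x2 = np.corrcoef(X1, X2)[0, 1]

# Partial correlations: association of Y with one regressor, holding the other fixed
r_yx1_x2 = (r_yx1 - r_yx2 * r_x1x2) / np.sqrt((1 - r_x1x2**2) * (1 - r_yx2**2))
r_yx2_x1 = (r_yx2 - r_yx1 * r_x1x2) / np.sqrt((1 - r_x1x2**2) * (1 - r_yx1**2))
print(r_yx1_x2, r_yx2_x1)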
Statistical Inference in the Multiple Regression Model: Formulas for the General Case of k Explanatory Variables
The analysis of variance (ANOVA) is a statistical method developed by R.A. Fisher for the analysis of experimental data. It was first applied to the analysis of agricultural experiments (use of various fertilizers, various seeds), but its application soon expanded to many other fields of scientific research.
The method of ANOVA breaks down the total variance of a variable into additive components which may be attributed to various separate factors. These factors are the 'causes' or 'sources' of variation of the variable being analyzed. When the method is applied to experimental data, it assumes a certain design of the experiment, which determines the number of relevant factors (or causes) of variation and the logical significance of each one of them.
EXAMPLE: assume that we have ten plots of land on which we cultivate Maize, and we
want to study the yield per unit of land. We use different seed varieties, different
fertilizers and different systems of irrigation. We therefore would logically attribute the
variation in yield to each of the three factors.
X1=type of seed variety
X2=type of fertilizer
X3=type of irrigation system
With the ANOVA method we may break down total variation in yield into three separate
components: a component due to X1, another due to X2 and a third due to X3.
From this definition the ANOVA method is conceptually the same as regression analysis. In regression analysis, too, the aim is to determine the factors which cause the variation of the dependent variable. The total variation in the dependent variable is split into two components: the variation explained by the regression line (or regression plane), and the unexplained variation, shown by the scatter of points around the regression line. Furthermore, the multiple correlation coefficient was seen to represent the proportion of total variation explained by the regression line (or regression plane), and R2 was found to consist of additive components, each corresponding to a relevant explanatory variable.
However, there are significant differences between the two methods. The main
difference is that regression analysis provides numerical values for the influence of the
various explanatory factors on the dependent variable, in addition to the information
concerning the breaking down of total variance of Y into additive components, while the
ANOVA provides only the latter type of information.
The objective of both the ANOVA and regression analysis is to determine factors which
cause the variation of the dependent variable Y. This resemblance has led to the
combination of the two methods in most scientific fields.
The aim of ANOVA is to split the total variation of a variable (around its mean) into components which may be attributed to specific (additive) causes. To simplify the exposition we assume there is only one systematic factor which influences the variable being studied. Any variation not accounted for by this (explanatory) factor is assumed to be random (or chance) variation, due to various random happenings. We have a series of values of a variable Y and the corresponding values of the (explanatory) variable X. The ANOVA method concentrates only on the Y values and studies their variation. The values of X are used only for dividing the values of Y into sub-groups or sub-samples; for example, one group (or sample) corresponding to large values of X and one group corresponding to small values of X.
For each sub-sample we estimate the mean value of Y, obtaining a set of means Ȳi. If X (which is the basis of the classification of the Y's into the sub-samples) is an important cause of variation in Y (an important explanatory variable), the differences between the means of the sub-samples will be large; this would be shown by a large dispersion of the sub-sample means Ȳi around the common mean Ȳ, that is, by a large variance of the distribution of the means. On the contrary, if X is not an important source of variation in Y, the differences between the means of the sub-samples will be small, a fact that would be reflected in a small variance of the distribution of the sub-sample means Ȳi around the common mean Ȳ.
We study and test the differences between the means using two estimates of the population variance of Y, σY². One estimate of σY² is obtained by pooling the variances of the sub-samples, and the other is obtained from the dispersion of the sub-sample means (the distribution of Ȳ). ANOVA thus ultimately produces two variance estimates, and their comparison is used to establish whether the differences between the means are statistically significant or are due to chance.
The comparison of any two variances is implemented with the F statistic and the F tables. The F statistic is the ratio of two independent estimates of variance obtained from sample data; each estimate involves some loss of degrees of freedom. F is often called the variance ratio, the letter F standing for Fisher, who introduced the statistic.
If the two variance estimates are close to each other their ratio will approach the value of one (1). The greater the discrepancy (difference) between the two variances, the greater is the value of the F ratio. Thus, in general, high values of F suggest that the difference between the two variances is significant, that is, they lead to rejection of the null hypothesis of no significant difference between the two variances.
Example:
Suppose three different types of petrol are used for running a car: Regular rated at 90
octane, Super rated at 95 octane and Unleaded-Regular at 100 octane. We wish to test
whether these different types of petrol give the same consumption per mile, that is, we
want to compare the consumption performance of the three brands of petrol. Suppose
that we use each brand for ten days and we measure the miles per gallon of petrol. Thus we obtain three samples of size 10, one for each brand.
Ȳ1 = ∑Y1i/n1 = 33,  Ȳ2 = ∑Y2i/n2 = 38,  Ȳ3 = ∑Y3i/n3 = 46,  Ȳ = ∑∑Yji/N = 39
S1² = ∑(Y1i − Ȳ1)²/n1 = 46/10 = 4.6,  S2² = ∑(Y2i − Ȳ2)²/n2 = 50/10 = 5.0,  S3² = ∑(Y3i − Ȳ3)²/n3 = 22/10 = 2.2
The data above may be interpreted as three random samples of size n1 = n2 = n3 = 10, with means Ȳ1 = 33, Ȳ2 = 38 and Ȳ3 = 46 miles per gallon of petrol. Our problem is to establish whether the differences between these means are significant or whether they may be attributed to chance.
We want to test whether there is any significant difference between the means of the populations, μ1, μ2 and μ3; that is, we test the null hypothesis
H0: μ1 = μ2 = μ3
If the three means are the same, that is if the null hypothesis is true, then the three populations may be considered as one large population with mean μ (= μ1 = μ2 = μ3) and standard deviation σ, that is
Y ~ N(μ, σ²)
An estimate of the common mean μ may be computed from the enlarged sample of N = n1 + n2 + n3 = 30 observations:
μ̂ = ∑Yi/N = (∑j ∑i Yji)/N = 1170/30 = 39
An unbiased estimate of the population variance σ² may be obtained in two ways.
First: an unbiased estimator of the population variance may be obtained from the expression
σ̂² = ∑j nj(Ȳj − Ȳ)² / (k − 1)
where k is the number of samples.
Second: an estimate of the population variance σ² may be obtained by pooling together the various sample variances. Using the double summation we obtain the following expression:
σ̂̂² = ∑j ∑i (Yji − Ȳj)² / (N − k)
This estimate of the population variance is obtained from the sample variances, which reflect the variation within each sample.
We now have two unbiased estimates of the population variance σ².
Estimate (1), σ̂², reflects the variation between the sample means, and depends on the validity of the null hypothesis.
Estimate (2), σ̂̂², reflects the variation of the Yi's within the samples, and is independent of the null hypothesis.
(1) σ̂² = ∑j nj(Ȳj − Ȳ)² / (k − 1)
= [n1(Ȳ1 − Ȳ)² + n2(Ȳ2 − Ȳ)² + n3(Ȳ3 − Ȳ)²] / (3 − 1)
= [10(33 − 39)² + 10(38 − 39)² + 10(46 − 39)²] / 2 = 860/2 = 430
(2) σ̂̂² = ∑j ∑i (Yji − Ȳj)² / (N − k)
= (46 + 50 + 22) / (30 − 3) = 118/27 ≈ 4.37
(3) Therefore the observed variance ratio is
F* = σ̂² / σ̂̂² = 430 / 4.37 = 98.39 ≈ 98.4
(4) The theoretical value of F at the 5% level of significance with v1 = k − 1 = 2 and v2 = N − k = 27 degrees of freedom is found from the F-tables:
F0.05 = 3.35
(5) Since F* > F0.05 we reject the null hypothesis, that is, we accept that there is a significant difference in the average mileage obtained from the three types of petrol.
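The same F ratio can be reproduced directly from the summary figures given above (group means 33, 38 and 46, within-group sums of squares 46, 50 and 22, and n = 10 per group); the sketch below is only a check of that arithmetic, since the raw daily observations are not listed in the notes.

import numpy as np
from scipy.stats import f

# Summary statistics from the petrol example
group_means = np.array([33.0, 38.0, 46.0])     # miles per gallon, one mean per brand
within_ss   = np.array([46.0, 50.0, 22.0])     # sum of (Yji - Ybar_j)^2 within each brand
n_per_group = 10
k = len(group_means)                           # number of samples (brands)
N = k * n_per_group                            # total observations

grand_mean = group_means.mean()                # = 39, since the group sizes are equal

# Estimate (1): variance between the sample means
between_var = (n_per_group * (group_means - grand_mean) ** 2).sum() / (k - 1)   # 430.0

# Estimate (2): pooled (within-sample) variance
within_var = within_ss.sum() / (N - k)                                          # about 4.37

F_star = between_var / within_var              # about 98.4
F_crit = f.ppf(0.95, dfn=k - 1, dfd=N - k)     # about 3.35
print(F_star, F_crit)                          # F_star > F_crit, so reject H0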
To illustrate the similarities between regression analysis and the ANOVA method we may work out the above example with the method of OLS and compare the results: we quantify the octane rating of the three brands of petrol, treating it as a variable rather than as a qualitative attribute.
Test of Restrictions Imposed on the Relationship between Two or More Parameters of a Function
In an earlier example we observed that (b1 + b2) = 1.0743 for the production function Y = b0 L^b1 K^b2; that is, our estimates suggest that over the sample period the tea industry experienced increasing returns to scale. We want to test the statistical reliability of this result, i.e. we test the hypothesis
H0: (b1 + b2) = 1
against the alternative hypothesis
H1: (b1 + b2) ≠ 1
We test the hypothesis using the F ratio as follows.
Step 1: We perform the regression subject to the restriction (b1 + b2) = 1. From the restriction we obtain b2 = 1 − b1, so that by substitution in the production function we find
Y = b0 L^b1 K^(1−b1)
Dividing through by K we find
Y/K = b0 (L/K)^b1
Fitting a regression to this expression we get b1* = 0.7431. Substituting into the restriction we find b2* = 1 − b1* = 0.2569.
Thus the restricted production function is
Y = b0* L^0.7431 K^0.2569
with R2 = 0.650, ∑y² = 180 and restricted residual sum of squares ∑e2² = 6.45.
Step 2: We compute the statistic
F* = [(∑e2² − ∑e1²) / ∑e1²] (n − k)
where ∑e1² is the residual sum of squares of the original unrestricted regression; for a single restriction this statistic has an F distribution with v1 = 1 and v2 = n − k degrees of freedom.
Step 3: The observed F* value is compared with the theoretical (tabular) value of F0.05 with v1 = 1 and v2 = (n − k) df. If F* > F0.05 we reject our basic hypothesis, that is, we accept that (b1 + b2) ≠ 1.
In our example
F* = [(6.45 − 4.64) / 4.64] (30 − 3) = (1.81 / 4.64)(27) = 10.53
and F0.05 (1, 27) = 4.20.
Since F* > F0.05, we conclude that (b1 + b2) ≠ 1.
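A minimal sketch of this restriction test using only the sums of squared residuals reported above (∑e2² = 6.45 restricted, ∑e1² = 4.64 unrestricted, n = 30 observations, k = 3 parameters in the unrestricted model); the critical value is looked up with scipy rather than from F-tables.

from scipy.stats import f

# Figures from the tea-industry production-function example
rss_restricted   = 6.45   # sum of squared residuals with (b1 + b2) = 1 imposed
rss_unrestricted = 4.64   # sum of squared residuals of the unrestricted regression
n, k, c = 30, 3, 1        # observations, parameters in unrestricted model, restrictions

F_star = ((rss_restricted - rss_unrestricted) / rss_unrestricted) * (n - k)   # about 10.53
F_crit = f.ppf(0.95, dfn=c, dfd=n - k)                                        # about 4.2

print(F_star, F_crit)     # F_star > F_crit, so the restriction (b1 + b2) = 1 is rejected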
In general, if we impose c restrictions and want to test the null hypothesis
H0: all the restrictions are true
against the alternative hypothesis
H1: not all the restrictions are true,
we proceed as follows.
Step 1: Apply OLS to the original unrestricted relationship and obtain the sum of (unrestricted) squared residuals, ∑eU², with n − k degrees of freedom.
Step 2: Apply OLS to the relationship incorporating the c restrictions and obtain the sum of (restricted) squared residuals, ∑eR², with n − (k − c) degrees of freedom.
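The two sums of squared residuals are then combined, in the same way as in the single-restriction example above, into the standard variance-ratio statistic
F* = [(∑eR² − ∑eU²) / c] / [∑eU² / (n − k)]
which under H0 follows an F distribution with v1 = c and v2 = n − k degrees of freedom; if F* exceeds the tabular F value, the restrictions are rejected.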