ECONOMETRICS 1 Notes
LUANAR
Open and Distance Learning (ODL)
LILONGWE UNIVERSITY OF AGRICULTURE
AND NATURAL RESOURCES
Econometrics I
COURSE CODE: AAE 316
By
Elisa Nathan Ebiyamu
TABLE OF CONTENTS
Distribution of the dependent variable .................................................................................. 28
T-DISTRIBUTION ............................................................................................................... 65
F DISTRIBUTION ............................................................................................................... 66
6.2 INTERVAL ESTIMATION .......................................................................................... 75
REFERENCES ............................................................................................................................. 95
1. UNIT ONE: MEANING OF ECONOMETRICS
INTRODUCTION
Economics suggests important relationships, often with policy implications, but virtually never
suggests quantitative magnitudes of causal effects. What is the quantitative effect of increasing
amount of fertilizer applied to maize on yield? Econometrics provides methods for estimating
causal effects, as well as for forecasting, using observational data. You may be wondering what
econometrics means. This unit introduces you to the meaning, types and importance of
econometrics.
OBJECTIVES
By the end of this unit, you should be able to:
a. define econometrics
b. state the importance of econometrics
c. list types of econometrics
d. give examples of applications and use of econometrics in real world
There are several aspects of the quantitative approach to economics, and no single one of these
aspects, taken by itself, should be confounded with econometrics. Thus, econometrics is by no
means the same as economic statistics. Nor is it identical with what we call general economic
theory, although a considerable portion of this theory has a definitely quantitative character. Nor
should econometrics be taken as synonymous with the application of mathematics to economics.
Experience has shown that each of these three viewpoints, that of statistics, economic theory, and
mathematics, is a necessary, but not by itself a sufficient, condition for a real understanding of
the quantitative relations in modern economic life. It is the unification of all three that is
powerful. And it is this unification that constitutes econometrics.
Econometrics may be defined as the social science in which the tools of economic theory,
mathematics, and statistical inference are applied to the analysis of economic phenomena.
Social science is, in its broadest sense, the study of society and the manner in which people
behave and influence the world around us. It tells us about the world beyond our immediate
experience, and can help explain how our own society works, from the causes of unemployment
to what helps economies grow.
Economic phenomena are situations or problems that economists deal with, for example,
explaining changes in commodity prices. Econometrics is concerned with the empirical
determination of economic laws.
Econometrics can also be defined as the application of statistical and mathematical methods to
the analysis of economic data, with the purpose of giving empirical content to economic theories
and verifying or refuting them.
The art of the econometrician consists in finding the set of assumptions that are both sufficiently
specific and sufficiently realistic to allow him to take the best possible advantage of the data
available to him. Econometrics is seen as the vehicle by which economics can claim scientific
validity.
The three aims of econometrics are the formulation and specification of econometric models, the
estimation and testing of models, and the use of models.
The economic models are formulated in an empirically testable form. Several econometric
models can be derived from an economic model. Such models differ due to different choices of
functional form, specification of the stochastic structure of the variables, and so on.
The models are estimated on the basis of observed set of data and are tested for their suitability.
This is the part of statistical inference of the modeling. Various estimation procedures are used to
know the numerical values of the unknown parameters of the model. Based on various
formulations of statistical models, a suitable and appropriate model is selected.
Use of models:
The obtained models are used for forecasting and policy formulation, which is an essential part
of any policy decision. Such forecasts help the policy makers to judge the goodness of the fitted
model and take necessary measures in order to re-adjust the relevant economic variables.
CHECK POINT
1. Define econometrics.
________________________________________________________________________
________________________________________________________________________
2. List three aims of econometrics.
a. __________________________________________________________________
b. __________________________________________________________________
c. __________________________________________________________________
3. Discuss whether econometrics is superior over economic theory, mathematics and
statistics.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Many economics students think that econometrics challenges their brains for nothing. However,
econometrics is a vital course in economics. Some of the reasons why econometrics is important
are outlined below.
Economics as a science involves knowing the theory and establishing a set of hypotheses, which
is understood by studying economics. Once the theory is known, the theory or hypotheses are
tested using various techniques. Testing of theories or hypotheses is achieved through studying
econometrics.
Econometrics contains statistical tools to help you defend or test assertions in economic theory.
For example, you may think that production in the economy follows a Cobb-Douglas form. But do data
support your hypothesis? Econometrics can help you in this case. The econometrician uses the
mathematical equations proposed by the mathematical economist but puts these equations in
such a form that they lend themselves to empirical testing.
Econometrics can be classified into two groups, theoretical econometrics and applied
econometrics, as shown in Figure 1.6.1 below.
[Figure 1.6.1: Classification of econometrics into theoretical and applied branches]
Theoretical Econometrics
Applied econometrics
In applied econometrics we use the tools of theoretical econometrics to study some special
field(s) of economics and business, such as the production function, investment function,
demand and supply functions, and portfolio theory.
Estimating the impact of immigration on native workers: Immigration increases the supply of
workers, so standard economic theory predicts that equilibrium wages will decrease for all
workers. However, since immigration can also have positive demand effects, econometric
estimates are necessary to determine the net impact of immigration in the labor market.
Identifying the factors that affect a firm’s entry into and exit from a market: The microeconomic field
of industrial organization, among many issues of interest, is concerned with firm concentration
and market power. Theory suggests that many factors, including existing profit levels, fixed costs
associated with entry/exit, and government regulations can influence market structure.
Econometric estimation helps determine which factors are the most important for firm entry and
exit.
Determining the influence of minimum-wage laws on employment levels: The minimum wage is
an example of a price floor, so higher minimum wages are supposed to create a surplus of labor
(higher levels of unemployment). However, the impact of price floors like the minimum wage
depends on the shapes of the demand and supply curves. Therefore, labor economists use
econometric techniques to estimate the actual effect of such policies.
Finding the relationship between management techniques and worker productivity: The use of
high-performance work practices (such as worker autonomy, flexible work schedules, and other
policies designed to keep workers happy) has become more popular among managers. At some
point, however, the cost of implementing these policies can exceed the productivity benefits.
Econometric models can be used to determine which policies lead to the highest returns and
improve managerial efficiency.
Measuring the association between insurance coverage and individual health outcomes: One of
the arguments for increasing the availability (and affordability) of medical insurance coverage is
that it should improve health outcomes and reduce overall medical expenditures. Health
economists may use econometric models with aggregate data (from countries) on medical
coverage rates and health outcomes or use individual-level data with qualitative measures of
insurance coverage and health status.
Deriving the effect of dividend announcements on stock market prices and investor behavior:
Dividends represent the distribution of company profits to its shareholders. Sometimes the
announcement of a dividend payment can be viewed as good news when shareholders seek
investment income, but sometimes it can be viewed as bad news when shareholders prefer
reinvestment of firm profits through retained earnings. The net effect of dividend announcements
can be estimated using econometric models and data of investor behavior.
Predicting revenue increases in response to a marketing campaign: The field of marketing has
become increasingly dependent on empirical methods. A marketing or sales manager may want
to determine the relationship between marketing efforts and sales. How much additional revenue
is generated from an additional dollar spent on advertising? Which type of advertising (radio,
TV, newspaper, and so on) yields the largest impact on sales? These types of questions can be
addressed with econometric techniques.
Calculating the impact of a firm’s tax credits on R&D expenditure: Tax credits for research and
development (R&D) are designed to provide an incentive for firms to engage in activities related
to product innovation and quality improvement. Econometric estimates can be used to determine
how changes in the tax credits influence R&D expenditure and how distributional effects may
produce tax-credit effects that vary by firm size.
This unit has defined Econometrics as the social science in which the tools of economic theory,
mathematics, and statistical inference are applied to the analysis of economic phenomena. You
have also learnt the aims and applications of econometrics.
e. __________________________________________________________________
__________________________________________________________________
3. Describe any two applications of econometrics in the real world.
a. __________________________________________________________________
__________________________________________________________________
b. __________________________________________________________________
__________________________________________________________________
2 UNIT TWO: VARIABLES
INTRODUCTION
In unit 1, you defined econometrics as the application of statistical and mathematical methods to
the analysis of economic data, with the purpose of giving empirical content to economic theories
and verifying or refuting them. Data are collected on certain characteristics, attributes or
variables from sampled respondents. This unit discusses the meaning, types and measurement levels of
from sampled respondents. This unit discusses the meaning, types and measurement levels of
variables.
OBJECTIVES
In microeconomics, you learnt about the theory of demand. This theory states that there is an
inverse relationship between price of a good or service and quantity demanded for the same
product. You also learnt that the demand function can be shifted due to changes in income, price
of substitutes or complements among other demand shifters. Just like quantity demanded, price
and income are random variables. A variable is an entity that can take different values.
The term random is a synonym for the term stochastic. The random or stochastic variable is a
variable that can take on any set of values, positive or negative, with a given probability. We
can also say that a random variable is a variable whose value is not known until it is observed.
The value of a random variable may result from an experiment and it is not perfectly predictable.
A variable X is said to be a random variable if for every real number a there exists a
probability P(X ≤ a) that X takes on a value less than or equal to a. We shall denote random
variables by capital letters X, Y, Z, and so on. We shall use lowercase letters, x, y, z, and so on,
to denote particular values of the random variables. A fixed variable, by contrast, is predictable.
If the random variable can assume only a particular finite set of values, it is said to be a discrete
random variable. The value of a discrete random variable is observed by counting.
A random variable is said to be continuous if it can assume any value in a certain range. The
value of a continuous random variable is usually obtained by measurement.
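The distinction between counting and measuring can be sketched in a few lines of Python; the variables and their ranges here are hypothetical, chosen only for illustration:

```python
import random

random.seed(42)  # make the draws reproducible

# Discrete random variable: number of children in a sampled household.
# Its value is observed by COUNTING, so it can only be a whole number.
children = random.randint(0, 6)

# Continuous random variable: maize yield (tonnes per hectare) of a plot.
# Its value is obtained by MEASUREMENT and can fall anywhere in a range.
yield_t_per_ha = random.uniform(0.5, 4.0)

print(children, round(yield_t_per_ha, 2))
```

Run this repeatedly without the seed: `children` only ever takes whole-number values, while `yield_t_per_ha` can land anywhere in its range.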
Examples of continuous random variables are income, weight and crop yield.
Not all random variables describe a characteristic using numerical values. Suppose you want to
report the sex of a respondent; the response is going to be either male or female. Neither of these
responses is a numerical value. Such variables are called qualitative variables. Examples include:
a. gender
b. marital status
c. location
d. participation in a project
Dummy Variable
In economics, qualitative variables are coded using discrete variables. When a discrete variable
is used to recode a qualitative characteristic, it is called a dummy variable. A dummy variable is
also called a design variable or indicator variable.
Example
Let us create a dummy variable for sex of household head that takes a value of 1 if the household
head is male and 0 if the household head is female.
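This recoding can be sketched in plain Python; the household data below are hypothetical:

```python
# Recode the qualitative variable "sex of household head" into a dummy
# variable: 1 if the head is male, 0 if the head is female.
sex_of_head = ["male", "female", "female", "male", "male"]

male_dummy = [1 if s == "male" else 0 for s in sex_of_head]

print(male_dummy)  # [1, 0, 0, 1, 1]
```

The dummy variable carries the same information as the qualitative variable, but in a numerical form that a regression model can use.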
Independent and dependent variables
The nature of the dependent variable is one of the factors that determine the choice of an
econometric model to be used in data analysis. For example, you may use a regression model if
the dependent variable is a continuous variable and you can use a probit model if the dependent
variable is dichotomous.
The variables that we will generally encounter fall into four broad levels or scales of
measurement. These are ratio, interval, ordinal and nominal.
Ratio Scale
For a variable X, taking two values 𝑥1 and 𝑥2, the ratio 𝑥1/𝑥2 and the distance (𝑥1 − 𝑥2)
are meaningful quantities. Also, there is a natural ordering (ascending or descending) of the
values along the scale. Therefore, comparisons such as 𝑥2 ≤ 𝑥1 or 𝑥2 ≥ 𝑥1 are meaningful. Most
economic variables belong to this category. Thus, it is meaningful to ask how big this year’s
GDP is compared with the previous year’s GDP. Personal income, measured in dollars, is a ratio
variable; someone earning $100,000 is making twice as much as another person earning $50,000.
Interval Scale
An interval scale variable satisfies the last two properties of the ratio scale variable but not the
first. Thus, the distance between two time periods, say (2000–1995) is meaningful, but not the
ratio of two time periods (2000/1995). Without a true zero, it is impossible to compute
ratios. With interval data, we can add and subtract, but cannot multiply or divide.
Examples are temperature and time. You may have heard from a weather report that at 11:00
a.m. on 25th December, 2017, Lilongwe reported a temperature of 20 degrees Celsius while
Ngabu reached 40 degrees Celsius. Temperature is not measured on a ratio scale since it does not
make sense to claim that Ngabu was 100 percent warmer than Lilongwe. This is mainly due to
the fact that the Celsius scale does not use 0 degrees as a natural base. Thus 0 degrees Celsius is
arbitrary and does not mean absence of heat energy or internal energy.
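A quick arithmetic sketch makes the point. Converting the temperatures above to the Kelvin scale, which does have a true zero, shows that the real ratio is nowhere near 2:

```python
# Why ratios mislead on an interval scale: 40 degrees C is not "twice as
# hot" as 20 degrees C, because 0 degrees C is an arbitrary zero point.
ngabu_c, lilongwe_c = 40.0, 20.0

celsius_ratio = ngabu_c / lilongwe_c                       # misleading: 2.0
kelvin_ratio = (ngabu_c + 273.15) / (lilongwe_c + 273.15)  # true zero: ~1.07

print(round(celsius_ratio, 2), round(kelvin_ratio, 2))  # 2.0 1.07
```

The distance (40 − 20 = 20 degrees) is meaningful on either scale; only the ratio changes, which is exactly what distinguishes interval from ratio measurement.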
Ordinal Scale
A variable belongs to this category only if it satisfies the third property of the ratio scale (i.e.,
natural ordering). For these variables the ordering exists but the distances between the categories
cannot be quantified. Students of economics will recall the indifference curves between two
goods. Each higher indifference curve indicates a higher level of utility, but one cannot quantify
by how much one indifference curve is higher than the others.
Examples are grading systems (A, B, C grades) and income classes (upper, middle, lower).
Nominal Scale
Variables in this category have none of the features of the ratio scale variables. Variables such as
gender (male, female) and marital status (married, unmarried, divorced, separated) simply denote
categories.
As we shall see, econometric techniques that may be suitable for ratio scale variables may not be
suitable for nominal scale variables. It is therefore important to know how the variables were
measured.
Data are observations that have been collected on variables. Data are sometimes used to calculate
statistics. The success of any econometric analysis ultimately depends on the availability of the
appropriate data. Data collection is a very important stage in research. If you are carrying out
research, make sure all necessary steps are followed to ensure that the data collected are of good
quality. The results of research are only as good as the quality of the data.
Types of Data
Four types of data may be available for empirical analysis: time series, cross-section, pooled data
and panel data.
A time series is a set of observations on the values that a variable takes at different times. Such
data may be collected at regular time intervals.
Examples include daily exchange rates, monthly inflation figures and annual GDP estimates.
Cross-Section Data
Cross-section data are data collected on one or more variables collected at the same point in
time.
Examples
a. Population Census data collected by the National Statistical Office (NSO) every 10 years
b. Integrated Household Survey data collected by NSO every 5 years
c. Base line survey data for a project
Pooled data
These are data with combined elements of both time series and cross-section data. In pooled data
we have a “time series of cross sections,” but the observations in each cross section do not
necessarily refer to the same unit.
Panel data
The U.S. Department of Commerce carries out a census of housing at periodic intervals. At each
periodic survey the same household (or the people living at the same address) is interviewed to
find out if there has been any change in the housing and financial conditions of that household
since the last survey. By interviewing the same households periodically, the data become panel
data, which provide very useful information on the dynamics of household behavior. Panel data
are data from samples of the same cross-sectional units observed at multiple points in time.
Each type of data is analysed using specific econometric models. There are some models that
can be used to analyse cross-section data but cannot be used to analyse time series data. For
example, you can use autoregressive (AR) models to analyse time series data, but you cannot
use these models to analyse cross-section data. Therefore, before choosing a model for analysing
data, a researcher needs to understand the type of data that is available for analysis.
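The three shapes can be sketched with hypothetical records of the form (unit, period, value):

```python
# Time series: ONE unit observed over MANY periods.
time_series = [("Malawi", 2015, 4.1), ("Malawi", 2016, 4.5), ("Malawi", 2017, 4.0)]

# Cross-section: MANY units observed in ONE period.
cross_section = [("HH_1", 2017, 120.0), ("HH_2", 2017, 95.5), ("HH_3", 2017, 140.2)]

# Panel: the SAME units observed in SEVERAL periods.
panel = [("HH_1", 2016, 110.0), ("HH_1", 2017, 120.0),
         ("HH_2", 2016, 90.0), ("HH_2", 2017, 95.5)]

units_in_panel = {unit for unit, _, _ in panel}
print(sorted(units_in_panel))  # ['HH_1', 'HH_2']
```

In pooled data, by contrast, the cross-sectional units in each period need not be the same households.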
We started this unit by defining a random variable as a variable that can take on any set of
values, positive or negative, with a given probability. We also classified variables into
quantitative variables and qualitative variables. It has been observed that quantitative variables
can be continuous or discrete. A dummy variable was defined as a numerical variable used to
represent a qualitative variable (subgroups). Variables can be measured on different levels
including ratio scales, interval scale, ordinal scale and nominal scale. Data are observations that
have been collected on variables. Data can be categorized as cross-section data, time series data,
pooled data or panel data.
2.6 UNIT EXERCISE
3. UNIT THREE: THE LINEAR REGRESSION MODEL
INTRODUCTION
In unit 2, you learnt about variables and data. It was noted that econometric analysis of data
partly depends on the type of data and nature of dependent variable. Economics students often
ask themselves questions like: what model can I use to analyse my data? How can I interpret the
results? In this unit we introduce to you one of the most widely used econometric models, the
Linear Regression Model (LRM). You will find this unit very important in your career and research.
OBJECTIVES
Figure 3.1 summarises the procedure in econometric analysis. Broadly speaking, traditional
econometric methodology proceeds along the following lines.
Figure 3.1: Econometric procedure
Statement of Theory or Hypothesis
This is where the researcher states what economic theory says about the inter-dependency of
economic variables of interest. Economists include variables in econometric models based on
theory. For example, the theory of demand states that quantity demanded for a product for a
given time period is inversely related to its own price. Economic theory also indicates the
direction of relationship between variables. However, the theory does not quantify the
dependency.
𝑌 = 𝛽0 + 𝛽1 𝑋 ……………………………………………………………...……… 3.1
In equation 3.1, Y is the dependent variable (consumption expenditure), X is the independent
variable (income), and 𝛽0 and 𝛽1 are the parameters of the model.
The mathematical model is deterministic or exact. It does not take into account the possibility of
error.
The purely mathematical model of the consumption function given in Eq. (3.1) is of limited
interest to the econometrician, for it assumes that there is an exact or deterministic relationship
between the two variables. But relationships between economic variables are generally inexact.
Thus, if we were to obtain data on consumption expenditure and disposable (i.e., after tax)
income of a sample of, say, 500 Malawian families and plot these data on a graph paper with
consumption expenditure on the vertical axis and disposable income on the horizontal axis, we
would not expect all 500 observations to lie exactly on the straight line of Eq. (3.1) because, in
addition to income, other variables affect consumption expenditure.
To allow for the inexact relationships between economic variables, the econometrician would
modify the deterministic consumption function in Eq. (3.1) as follows:
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀 ……………………………….…………………………………… 3.2
Where ε, known as the disturbance, or error, term, is a random (stochastic) variable that has well-
defined probabilistic properties. The disturbance term ε may well represent all those factors that
affect consumption but are not taken into account explicitly. The error term makes the
econometric model stochastic.
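A small simulation sketch (with hypothetical parameter values) shows what the disturbance term does to the 500-family example above: the simulated observations scatter around the deterministic line instead of lying exactly on it.

```python
import random

# Simulate Eq. (3.2): Y = b0 + b1*X + e. The parameter values, income
# range and error spread are hypothetical, chosen only for illustration.
random.seed(0)
b0_true, b1_true = -299.59, 0.72

incomes = [random.uniform(5000, 15000) for _ in range(500)]
consumption = [b0_true + b1_true * x + random.gauss(0, 200) for x in incomes]

# Unlike the deterministic model of Eq. (3.1), the observations do not
# lie exactly on the straight line: the deviations are the e's.
deviations = [y - (b0_true + b1_true * x) for x, y in zip(incomes, consumption)]
print(len(consumption), max(abs(d) for d in deviations) > 0)
```

Plotting `consumption` against `incomes` would reproduce the scatter described in the text: a cloud of points around, not on, the consumption line.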
Data collection
To estimate the econometric model given in Eq. (3.2), that is, to obtain the numerical values of
𝛽0 and 𝛽1, we need data. Data are collected on both the dependent and independent variables. As
noted in section 2.4, the quality of the estimates is only as good as the quality of the data. There
is need for selecting an adequate sample using an appropriate sampling technique. Data collection
officers (enumerators) should be trustworthy, well trained and preferably experienced.
Estimation of parameters
Now that we have the data, our next task is to estimate the parameters of the model. The
numerical estimates of the parameters give empirical content to the consumption function. The
actual mechanics of estimating the parameters will be discussed later in this module. For now,
note that the statistical technique of regression analysis is the main tool used to obtain the
estimates.
You may obtain estimates of 𝛽0 and 𝛽1 of −299.5913 and 0.7218 respectively. Thus, the
estimated model could be:
𝑌̂ = −299.5913 + 0.7218𝑋 ………………………………………………………...…… 3.3
6. Hypothesis Testing
Assuming that the fitted model is a reasonably good approximation of reality, we have to
develop suitable criteria of finding out whether the estimates obtained in, say, Equation 3.3 are in
accord with the expectations of the theory that is being tested. A theory or hypothesis that is not
verifiable by appeal to empirical evidence may not be acceptable as a part of scientific enquiry.
7. Forecasting or Prediction
If the chosen model does not refute the hypothesis or theory under consideration, we may use it
to predict the future value(s) of the dependent, or forecast, variable Y on the basis of the known
or expected future value(s) of the explanatory, or predictor, variable X.
Suppose we have the estimated consumption function given in Eq. (3.3). Suppose further the
government believes that consumer expenditure of about 8750 (billions of 2000 dollars) will
keep the unemployment rate at its current level of about 4.2 percent (early 2006). What level of
income will guarantee the target amount of consumption expenditure? If the regression results
given in Eq. (3.3) seem reasonable, simple arithmetic will show that
8750 = −299.5913 + 0.7218𝑋
which gives X = 12537, approximately. That is, an income level of about 12537 (billion) dollars,
given an MPC of about 0.72, will produce an expenditure of about 8750 billion dollars.
As these calculations suggest, an estimated model may be used for control, or policy, purposes.
By appropriate fiscal and monetary policy mix, the government can manipulate the control
variable X to produce the desired level of the target variable Y.
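The policy arithmetic above can be written out explicitly; the coefficient values are those reported for Eq. (3.3):

```python
# Find the income X that delivers a target consumption Y of 8750
# (billions of 2000 dollars), given Y-hat = -299.5913 + 0.7218*X.
b0_hat, b1_hat = -299.5913, 0.7218
target_consumption = 8750.0

# Solve 8750 = b0_hat + b1_hat * X for X.
required_income = (target_consumption - b0_hat) / b1_hat

print(round(required_income, 1))  # 12537.5, i.e. about 12537 billion dollars
```

The same one-line rearrangement works for any target value of Y, which is what makes an estimated model useful for control or policy purposes.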
CHECK POINT
1. In the econometric procedure, state four things that you would do before testing the
hypothesis.
a. __________________________________________________________________
b. __________________________________________________________________
c. __________________________________________________________________
d. __________________________________________________________________
2. What is the difference between a mathematical model and an econometric model?
________________________________________________________________________
________________________________________________________________________
3. Given that 𝑌̂𝑖 = −299.5913 + 0.7218Xi, what does the hat above the variable Y mean?
________________________________________________________________________
4. Suppose you are carrying out research. How would you ensure that good quality data are
collected?
a. __________________________________________________________________
b. __________________________________________________________________
c. __________________________________________________________________
d. __________________________________________________________________
A cock crows early in the morning and this is followed by the rising of the sun. Does it mean that
the crowing of a cock causes the sun to rise? Obviously not. Many times economists want to know
the effect of one variable on another variable. For example, what is the effect of one additional
year of education on earnings? How would tobacco output be affected by using an additional unit
of fertilizer? In econometrics, this is no longer a big challenge. We use regression
analysis to come up with the marginal effects of independent variables on the dependent variable.
Regression is a technique for determining the statistical relationship between two or more
variables where a change in a dependent variable is associated with, and depends on, a change in
one or more independent variables.
A regression model can also be defined as a mathematical equation that helps to predict or
forecast the value of the dependent variable based on the known values of independent variables.
A simple linear regression is a statistical equation that characterizes the relationship between a
dependent variable and only one independent variable.
For example, tobacco yield is a function of the quantity of fertilizer applied. Such a study is
known as simple, or two-variable, regression analysis.
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀 ………………………………………………………………… 3.5
However, if we are studying the dependence of one variable on more than one explanatory
variable, as in the crop-yield, rainfall, temperature, sunshine, and fertilizer example, it is known
as multiple regression analysis. Equation 3.6 is a multiple linear regression.
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜀 …………………………………… 3.6
A multivariate regression has more than one dependent variable and more than one
independent variable.
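As a preview of the estimation mechanics discussed later in this module, the textbook ordinary least squares (OLS) formulas for a simple linear regression can be sketched in plain Python; the fertilizer-yield data are hypothetical:

```python
# OLS estimates for Y = b0 + b1*X + e, using the textbook formulas:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
def ols_simple(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

fertilizer = [10, 20, 30, 40, 50]   # kg applied (hypothetical)
yield_kg = [55, 63, 72, 78, 90]     # tobacco yield (hypothetical)

b0, b1 = ols_simple(fertilizer, yield_kg)
print(round(b0, 2), round(b1, 2))  # 46.1 0.85
```

Here the slope estimate of 0.85 would be read as the marginal effect of one extra kilogram of fertilizer on yield.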
3.3 BASICS OF A SIMPLE LINEAR REGRESSION
As used in statistics, the term population refers to a complete list of possible measurements or
outcomes that are of interest.
For instance;
a. To study the effect of study hour on GPA among undergraduate students at LUANAR,
the population is a list of all undergraduate students at LUANAR.
Let us say we want to determine the effect of study hours on GPA. Then the population linear
regression can be written as in equation 3.7.
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀 ………………………………………………………………… 3.7
Where 𝑌 is GPA, 𝑋 is study duration in hours, 𝛽0 and 𝛽1 are population parameters, while ε is
the stochastic term.
If possible, the population parameters could be determined by collecting data on study duration
and GPA from all the students at LUANAR. However, it is very difficult and costly to study the
whole population. Therefore, we estimate the population parameters using data collected from a
sample of students selected from the population.
Sample
The term “sample” refers to a proportion of the population that is representative of the population
from which it was selected. You may refer to the Statistics for Economists and Research
Methods modules for details on sampling techniques that ensure that the sample is both adequate
and representative.
A sample regression function can be written as in equation 3.8.
𝑌̂ = 𝛽̂0 + 𝛽̂1 𝑋 ………………………………………………………………… 3.8
Where 𝑌̂ is the estimated (mean) value of Y, and 𝛽̂0 and 𝛽̂1 are the sample estimators of the
population parameters 𝛽0 and 𝛽1.
We can make the sample regression function stochastic by including the error term as in equation
3.9.
𝑌 = 𝛽̂0 + 𝛽̂1 𝑋 + 𝑒 ………………………………………………………………… 3.9
Where 𝑒 is the residual, the sample counterpart of the population disturbance ε. Thus the
population regression function 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖 is estimated by the corresponding sample
regression function.
Importance of the error term
1. Vagueness of theory
The theory, if any, determining the behavior of Y may be, and often is, incomplete. We might
know for certain that weekly income X influences weekly consumption expenditure Y, but we
might be ignorant or unsure about the other variables affecting Y. Therefore, 𝜀𝑖 may be used as a
substitute for all the excluded or omitted variables from the model.
2. Unavailability of data
Even if we know what some of the excluded variables are and therefore consider a multiple
regression rather than a simple regression, we may not have quantitative information about these
variables. It is a common experience in empirical analysis that the data we would ideally like to
have often are not available.
3. Core variables versus peripheral variables
Assume that we want to study the consumption-income relationship and economic theory guides us
that explanatory variables include income, number of children per household, religion, education
and geographical location. It is quite possible that the joint influence of all or some of these
variables may be so small and at best nonsystematic or random to the extent that it does not pay
to introduce them into the model explicitly. Their combined effect can be treated as a random
variable 𝜀𝑖 .
4. Intrinsic randomness in human behavior
Even if we succeed in introducing all the relevant variables into the model, there is bound to be
some “intrinsic” randomness in individual Y’s that cannot be explained no matter how hard we
try. The disturbances, the ε’s, may very well reflect this intrinsic randomness.
5. Poor proxy variables
The classical regression model assumes that the variables Y and X are measured accurately. In
practice, however, the data may be plagued by errors of measurement, and data on some variables
are not directly observable, so we use proxy variables. For example, expenditure may be used as
a proxy for income. Obviously expenditure may not always be equal to income, as some people
may be saving or donating to others. Therefore, the use of expenditure as a proxy for income
comes with the problem of errors of measurement. The disturbance term 𝜀𝑖 may in this case then
also represent the errors of measurement.
6. Wrong functional form
Even if we have theoretically correct variables explaining a phenomenon and even if we can
obtain data on these variables, very often we do not know the form of the functional relationship
between the regressand and the regressors. Is consumption expenditure a linear (in variables)
function of income or a nonlinear (in variables) function?
In two-variable models the functional form of the relationship can often be judged from the
scatter gram. But in a multiple regression model, it is not easy to determine the appropriate
functional form, for graphically we cannot visualize scatter grams in multiple dimensions.
Linearity in variables
A model may be linear in variables or linear in parameters. The first and perhaps more "natural"
meaning of linearity is that the conditional expectation of Y is a linear function of Xi as in
equation 3.10. Geometrically, the regression curve in this case is a straight line.
In this interpretation, a regression function such as 3.11 is not a linear function because the
variable X appears with a power or index of 2.
Linearity in parameters
The second interpretation of linearity is that the conditional expectation of Y, E(Y | Xi), is a
linear function of the parameters, the β’s; it may or may not be linear in the variable X. In this
interpretation equation 3.11 is a linear (in the parameter) regression model.
Of the two interpretations of linearity, linearity in the parameters is relevant for the development
of the regression theory to be presented shortly. Therefore, from now on, the term “linear”
regression will always mean a regression that is linear in the parameters the β’s (that is, the
parameters) are raised to the first power only. It may or may not be linear in the explanatory
variables, the X’s.
When 10 farmers apply 10 kilograms of fertilizer per 70 square metre plot, do you expect them to
have the same yield? Obviously, the output will be different. This means that for each value of
the independent variable, the observed values of the dependent variable are many and different.
They form a distribution that has a mean and variance. Figure 3.2 is an illustration of this
observation.
The regression line is a schedule that connects all the mean values of the dependent variable for
each level of the independent variable. Not all observed values will be along the regression line.
Some of the observed values will be above the line while others will be below it. The difference
between the observed value of the dependent variable and the fitted value for the same level of
independent variable is the error.
This unit has introduced you to the econometric procedure. You have learnt the eight steps of
econometric analysis. We have differentiated between a mathematical model and an econometric
model. In econometrics the error term is very important.
The unit also defined a regression. A linear regression is linear in parameters; it may be linear
or non-linear in variables. Another point is that the dependent variable has a distribution for
each value of the independent variable and that the mean value of the distribution lies on the
regression line. You may have noticed that we usually do not know the population parameters but
we use a sample to estimate them.
UNIT EXERCISE
4 Unit 4: ESTIMATION OF A SIMPLE LINEAR
REGRESSION
INTRODUCTION
In unit 3, you were introduced to linear regression. We said that it is one of the most commonly
used models in estimating population parameters using data collected from a sample. In this
unit, you will learn how to estimate a linear regression model. You will also learn the assumptions
of the classical linear regression. Furthermore, the unit provides a guide to estimation of a simple
linear regression as well as a multiple linear regression in STATA. We shall end the unit by
discussing more about dummy variables and carrying out a joint test.
OBJECTIVES
We can estimate the population parameters using sample statistics in various ways, and these
are;
By and large, it is the method of OLS that is used extensively in regression analysis primarily
because it is intuitively appealing and mathematically much simpler than the method of
maximum likelihood. Besides, as we will show later, in the linear regression context the
Ordinary Least Squares (OLS) and Maximum likelihood methods generally give similar results.
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖 ………………………………………………….………………….4.1
The estimated simple regression based on data collected from a sample is as follows;
Ŷᵢ = β̂0 + β̂1Xᵢ ………………………………………………………………………4.2
where β̂0 and β̂1 are the sample estimates of β0 and β1, so that
Yᵢ = β̂0 + β̂1Xᵢ + εᵢ ……………………………………………………………………4.3
Then,
Yᵢ = Ŷᵢ + εᵢ ………………………………………………………………………………4.4
This shows that the error term (also called the residual or disturbance) is just the difference
between the observed value and the fitted value.
εᵢ = Yᵢ − Ŷᵢ ………………………………………………………………………….4.5
Figure 4.1: Error as the vertical distance.
To fit the simple regression line to the data, the sum of squares of the error (vertical distances),
as illustrated in Figure 4.1, must be as small as possible.
You may recall from your introductory statistics class that the procedure is as follows;
Example
Compute the following
Working
The mean is 𝑌̅ = 1220/10 = 122.

Y      𝑌̅      (Y − 𝑌̅)    (Y − 𝑌̅)²
70     122    −52         2704
65     122    −57         3249
90     122    −32         1024
95     122    −27         729
110    122    −12         144
115    122    −7          49
120    122    −2          4
140    122    18          324
155    122    33          1089
260    122    138         19044

∑(Y − 𝑌̅)² = 28360

c. 𝜎 = √(∑(Y − 𝑌̅)²/N)
𝜎 = √(28360/10) = √2836
𝜎 = 53.25
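A quick check of the arithmetic above, using the data from the working and the population formula (dividing by N):

```python
# Verify the mean, population variance and standard deviation of Y.
y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 260]

n = len(y)
mean = sum(y) / n                        # 1220 / 10 = 122
sq_dev = [(yi - mean) ** 2 for yi in y]  # squared deviations from the mean
variance = sum(sq_dev) / n               # population variance, dividing by N
sigma = variance ** 0.5

print(mean)             # 122.0
print(variance)         # 2836.0
print(round(sigma, 2))  # 53.25
```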
The OLS selects estimates that minimize the error sum of squares over all sample data points.
Methods of calculus are used to derive the formulas for the parameters of a simple linear
regression model. Below is the step by step derivation of the estimators.

εᵢ = Yᵢ − Ŷᵢ
εᵢ = Yᵢ − β̂0 − β̂1Xᵢ
∑εᵢ² = ∑(Yᵢ − Ŷᵢ)²
∑εᵢ² = ∑(Yᵢ − β̂0 − β̂1Xᵢ)² ……………………………………………………………..4.7

To minimize the residual sum of squares, we set the first derivative of the sum of squares
with respect to each parameter equal to zero.

∂∑εᵢ²/∂β̂0 = 0
∂∑(Yᵢ − β̂0 − β̂1Xᵢ)²/∂β̂0 = 2∑(Yᵢ − β̂0 − β̂1Xᵢ)(−1) = 0
0 = ∑(−Yᵢ + β̂0 + β̂1Xᵢ)
0 = −∑Yᵢ + nβ̂0 + β̂1∑Xᵢ
∑Yᵢ = nβ̂0 + β̂1∑Xᵢ …………………………………….4.8

∂∑εᵢ²/∂β̂1 = 0
0 = 2∑(Yᵢ − β̂0 − β̂1Xᵢ)(−Xᵢ)
0 = −∑YᵢXᵢ + β̂0∑Xᵢ + β̂1∑Xᵢ²
∑YᵢXᵢ = β̂0∑Xᵢ + β̂1∑Xᵢ² ……………………………………..….4.9
Equations 4.8 and 4.9 are called the normal equations. Let us now solve the two equations
simultaneously to find β̂0 and β̂1. Multiply equation 4.8 by ∑Xᵢ, multiply equation 4.9 by n,
and subtract the first from the second:

n∑YᵢXᵢ − ∑Xᵢ∑Yᵢ = β̂1(n∑Xᵢ² − (∑Xᵢ)²)

β̂1 = (n∑YᵢXᵢ − ∑Xᵢ∑Yᵢ) / (n∑Xᵢ² − (∑Xᵢ)²) ……………………………….4.12

If we divide the numerator and the denominator of the right hand side of equation 4.12 by n²,

β̂1 = (∑YᵢXᵢ/n − X̄Ȳ) / (∑Xᵢ²/n − X̄²) ……………………..............4.13

From equation 4.8,
∑Yᵢ = nβ̂0 + β̂1∑Xᵢ
Therefore;
nβ̂0 = ∑Yᵢ − β̂1∑Xᵢ
β̂0 = ∑Yᵢ/n − β̂1(∑Xᵢ/n)
β̂0 = Ȳ − β̂1X̄ ………………………………4.14
Example
The data in Table 4.1 show values of the dependent variable Y for each value of
the independent variable X.
𝑿𝒊 𝒀𝒊
1 1
2 1
3 2
4 2
5 4
a. Calculate
i. ∑ 𝑋𝑖 iii. ∑ 𝑋𝑖 𝑌𝑖
ii. ∑ 𝑌𝑖 iv. ∑ 𝑋𝑖2
b. Calculate
i. The slope parameter 𝛽1
Working
xᵢ    yᵢ    xᵢ²    yᵢ²    xᵢyᵢ
1     1     1      1      1
2     1     4      1      2
3     2     9      4      6
4     2     16     4      8
5     4     25     16     20

∑xᵢ = 15   ∑yᵢ = 10   ∑xᵢ² = 55   ∑yᵢ² = 26   ∑xᵢyᵢ = 37
a. Sums
i. ∑ 𝑥𝑖 = 15
ii. ∑ 𝑦𝑖 = 10
iii. ∑ 𝑥𝑖2 = 55
iv. ∑ 𝑥𝑖 𝑦𝑖 = 37
b. Parameters
i. β̂1 = (n∑xᵢyᵢ − ∑xᵢ∑yᵢ) / (n∑xᵢ² − (∑xᵢ)²)
β̂1 = ((5)(37) − (15)(10)) / ((5)(55) − 15²)
β̂1 = 0.7

ȳ = ∑yᵢ/n = 10/5 = 2
x̄ = ∑xᵢ/n = 15/5 = 3

β̂0 = ȳ − β̂1x̄
β̂0 = 2 − (0.7)(3)
β̂0 = −0.1

c. The estimated simple linear regression is Ŷᵢ = −0.1 + 0.7xᵢ
d. The predicted value of Y when x = 6 is
Ŷᵢ = −0.1 + (0.7)(6)
Ŷᵢ = −0.1 + 4.2
Ŷᵢ = 4.1
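The formulas in equations 4.12 and 4.14 can be checked directly in code. This sketch reproduces the worked example's sums and estimates:

```python
# OLS estimates for the worked example, using the normal-equation formulas:
# beta1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2), beta0 = ybar - beta1*xbar.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

Sx = sum(x)                                  # 15
Sy = sum(y)                                  # 10
Sxy = sum(xi * yi for xi, yi in zip(x, y))   # 37
Sxx = sum(xi ** 2 for xi in x)               # 55

beta1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)  # slope
beta0 = Sy / n - beta1 * Sx / n                    # intercept

print(beta1)                         # 0.7
print(round(beta0, 1))               # -0.1
print(round(beta0 + beta1 * 6, 1))   # predicted Y at x = 6 -> 4.1
```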
Interpretation of parameters of a linear regression
Intercept parameter
This is the average value of the dependent variable when the value of the explanatory variable is
zero holding all other factors constant.
For example
The estimated simple linear regression in the example above is 𝑌̂i = −0.1 + 0.7𝑋𝑖 . The
intercept parameter −0.1 means that the average value of y is −𝟎. 𝟏 when the value of x = 0 .
Slope parameter
The slope parameter is interpreted as the marginal effect of the independent variable on the
dependent variable.
Example
The estimated simple linear regression in the example above is 𝑌̂i = −0.1 + 0.7𝑋𝑖 . The slope
parameter, = 0.7 means that a unit increase in the value of x will increase the value of y by 0.7
on average.
i. Linearity
The regression model is linear in the parameters, though it may or may not be linear in the
variables. This model can be extended to include more explanatory variables.
ii. Fixed regressors
Values taken by the regressor X are considered fixed in repeated samples. This assumption is
made to keep the regression simple for now; it is relaxed later.
iii. The error term is assumed to have a mean of zero.
iv. Homoskedasticity or equal variance of the error term
The variance of the error, or disturbance, term is the same regardless of the value of X. Put
simply, the variation around the regression line (which is the line of average relationship
between Y and X) is the same across the X values; it neither increases nor decreases as X varies.
Symbolically,
𝑽𝒂𝒓(𝜺) = 𝝈²
Violation of this assumption causes a problem of heteroskedasticity, or unequal spread or
unequal variance. In this case 𝑽𝒂𝒓(𝜺) ≠ 𝝈²
v. No autocorrelation between the disturbances
Given any two X values 𝒙𝒊 and 𝒙𝒋 (𝒊 ≠ 𝒋), the correlation between any two error terms 𝜺𝒊 and
𝜀𝑗 (𝑖 ≠ 𝑗) is zero. In short, the observations are sampled independently. Thus, 𝐶𝑜𝑣 (𝜀𝑖 , 𝜀𝑗 ) = 0.
But it should be added here that the justification of this assumption depends on the type of data
used in the analysis. If the data are cross-sectional and are obtained as a random sample from the
relevant population, this assumption can often be justified. However, if the data are time series,
the assumption of independence is difficult to maintain.
vi. The number of observations n must be greater than the number of parameters to be
estimated
Alternatively, the number of observations must be greater than the number of explanatory
variables. We need at least two pairs of observations to estimate the two unknowns. You may as
well recall from your high school mathematics that we need at least two equations to solve
simultaneously for values of two unknown variables.
vii. Variability of X
The X values in a given sample must not all be the same. Technically, the variance of X
must be a positive number. Furthermore, there can be no outliers in the values of the X
variable.
viii. No exact collinearity between the X variables
No exact linear relationship between X1 and X2: no X variable has a linear relationship with
another X variable. Informally, no collinearity means none of the regressors can be written as
an exact linear combination of the remaining regressors in the model. Formally, no collinearity
means that there exists no set of numbers, λ1 and λ2, not both zero, such that
λ1X1i + λ2X2i = 0.
ix. There is no specification bias. The model is correctly specified. We will discuss further
on model specification in unit 7 of this module.
Simple linear regression in STATA
Open the data editor in STATA and enter the data with a column for each variable.
Fertilizer Output
1 1
2 1
3 2
4 2
5 4
The top part of the table gives the analysis of variance. At the top right corner we find the F
statistic and its calculated P-value. A large calculated F value means that the variation in the
dependent variable due to error is very small. The P-value is compared to a predefined
significance level like 0.05. When the P-value is less than the level of significance, we
conclude that the model is significant.
We also have R-squared and Adjusted R-squared showing the proportion of variation in the
dependent variable explained by the independent variable(s) in the regression.
The lower part of the output shows the parameters together with their t-statistics and p-values.
The coefficient in the row of the constant is the intercept parameter that shows the value of the
dependent variable when the value of the independent variable is zero. The coefficient in the row
of fertilizer (the independent variable) is the slope parameter for fertilizer. It is the marginal
effect of an additional unit of fertilizer.
To test
𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0
Test statistic:
t = β̂1 / s(β̂1)
Reject 𝐻0 if;
t > t(α/2; n − k − 1) or t < −t(α/2; n − k − 1)
Alternatively, the P-value associated with t-statistic is compared to the level of significance like
0.05 such that if the P-value is less than the level of significance, we reject the null hypothesis.
The P-value associated with the F statistic is 0.0354, which is less than 0.05. This means the
model is significant at the 0.05 level of significance. Fertilizer explains 81 percent of the
variation in output. The coefficient (slope parameter) for fertilizer is 0.7 and the P-value for
fertilizer is 0.035, which is less than 0.05. This means that fertilizer has a significant
positive effect on the level of output. An additional unit of fertilizer will increase output
by 0.7 units on average.
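The decision rule above can be illustrated with the small example data used earlier in the unit. This is a sketch of the computation behind the reported t-statistic and p-value, not the actual Stata output:

```python
# t-test for H0: beta1 = 0 in the simple regression of y on x,
# using the five data points from the earlier worked example.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)                               # 10
beta1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx  # 0.7
beta0 = ybar - beta1 * xbar

resid = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in resid)       # error sum of squares
s2 = sse / (n - 2)                     # estimate of the error variance
se_beta1 = (s2 / Sxx) ** 0.5           # standard error of the slope

t = beta1 / se_beta1
t_crit = 3.182                         # tabulated t(0.025, df = 3)

print(round(t, 2))   # 3.66
print(t > t_crit)    # True -> reject H0 at the 5% level
```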
4.4 MULTIPLE REGRESSION MODEL
A simple linear regression for a demand function expresses quantity demanded as a function of
price only.
𝑄 = 𝛽0 + 𝛽1 𝑃𝑅𝐼𝐶𝐸 + 𝜀
However, demand for a commodity is likely to depend not only on its own price but also on the
prices of other competing or complementary goods, income of the consumer, social status, etc.
Therefore, we need to extend our simple two-variable regression model to cover models
involving more than two variables. Adding more independent variables leads us to the discussion
of multiple regression models, that is, models in which the dependent variable, or regressand, Y
depends on two or more explanatory variables, or regressors.
The simplest possible multiple regression model is the three-variable regression, with one
dependent variable and two explanatory variables:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + 𝜀𝑖
where Y is the dependent variable, the Xs are the explanatory variables (or regressors), and 𝜀𝑖
is the stochastic disturbance term for the ith observation.
8. No exact collinearity between the X variables, that is, no exact linear relationship
between X1 and X2.
9. There is no specification bias. The model is correctly specified.
Like in the simple linear regression, we use the F-Statistic to test the overall statistical
significance of the regression relation between the response variable Y and the set of variables.
Given the assumptions of the classical regression model, it follows that, on taking the conditional
expectation of Y on both sides of the three-variable multiple regression equation, we have
𝐸(𝑌𝑖 | 𝑋1𝑖 , 𝑋2𝑖 ) = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖
In words, the equation gives the conditional mean or expected value of Y conditional upon the given
or fixed values of X1 and X2. Therefore, as in the two-variable case, multiple regression analysis
is regression analysis conditional upon the fixed values of the regressors, and what we obtain is
the average or mean value of Y, or the mean response of Y, for the given values of the regressors.
The slope coefficients are partial coefficients. The meaning of a partial regression coefficient is
as follows: β1 measures the change in the mean value of Y, E(Y), per unit change in X1, holding the
value of X2 constant.
Let us consider a multiple regression model fitted to an extract from the data collected by the
National Statistical Office (NSO) from tobacco farmers during the third integrated survey. The
variables are tobacco output in kg, fertilizer (FERT) in kg, labour in man-days and the number of
schooling years, disregarding repetition, for the household head (edu).
The dependent variable in the model is output and the explanatory variables are labour, quantity
of fertilizer and number of schooling years (edu).
. reg output labor FERT edu
Interpretation
The P-value associated with the F statistic is 0.000 < 0.05. This means that the model as a whole
is significant at the 0.05 level of significance. However, the adjusted R-squared of 0.049 shows
that the independent variables in the model only explain about 5 per cent of the variation in the
dependent variable.
Labour has a significant positive effect on tobacco output. Increasing quantity of labour used in
the production of tobacco by one man-day increases tobacco output by 0.7 kilogram holding all
other factors fixed. The reader should interpret the remaining variables as part of practice.
a. Best
This means that the OLS estimators have the minimum (smallest) variance of all linear
unbiased estimators. This property is what makes the OLS estimators the best.
b. Linear
OLS estimators are linear functions of the dependent variable. The estimator is linear in
how it uses the values of the dependent variable, irrespective of how it uses the values of
the regressors.
c. Unbiased
In repeated sampling, the expected value of the estimator equals the true population
parameter. An unbiased estimator with minimum variance is efficient.
𝐸(𝛽̂0 ) = 𝛽0 and 𝐸(𝛽̂1 ) = 𝛽1
The Gauss-Markov theorem summarizes these properties by stating that OLS estimators are the best
linear unbiased estimators (BLUE).
This property extends to the entire 𝛽̂ vector; that is, 𝛽̂ is linear (each of its elements is a
linear function of Y, the dependent variable); E(𝛽̂ ) = 𝛽, that is, the expected value of each
element of 𝛽̂ is equal to the corresponding element of the true β; and in the class of all linear
unbiased estimators of β, the OLS estimator 𝛽̂ has minimum variance.
The OLS estimators are also consistent. This means that the estimators approach the real value of
the population parameter as sample size increases.
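Unbiasedness can be illustrated with a small simulation: average the OLS slope over many samples drawn from a known model. The model y = 1 + 2x + ε and all parameter values below are illustrative assumptions, not from the module:

```python
import random

# Monte Carlo sketch of unbiasedness: the average of the OLS slope over
# many simulated samples should be close to the true slope (2 here).
random.seed(0)

def ols_slope(x, y):
    """OLS slope of y on x, computed from deviations from the means."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

x = list(range(50))    # fixed regressor values, as in repeated sampling
slopes = []
for _ in range(2000):
    y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
    slopes.append(ols_slope(x, y))

avg = sum(slopes) / len(slopes)
print(round(avg, 2))   # close to the true slope 2
```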
CHECK POINT
4.6 DUMMY VARIABLES
In section 2.4 of this module, we said that when a discrete variable is used to code a qualitative
characteristic, it is called a dummy variable. A dummy variable takes binary or dichotomous
values (0 or 1).
For instance;
Using dummy variables
When the dummy variables are exhaustive, they sum up to one. You cannot put all of them in the
model because they cause perfect multicollinearity. We therefore leave out one of the dummy
variables to be used as the base category (for comparison).
Including dummy variables for both genders is the simplest example of the so-called dummy
variable trap, which arises when too many dummy variables describe a given number of groups. If a
qualitative variable has m categories, we therefore introduce only m − 1 dummy variables.
Suppose wage is a function of education; there needs to be a separate dummy variable for each
education level: no education, primary education, secondary education and tertiary education.

Ed0 = 1 if no education, 0 otherwise
Ed1 = 1 if primary education, 0 otherwise
Ed2 = 1 if secondary education, 0 otherwise
Ed3 = 1 if tertiary education, 0 otherwise
Suppose the wage model is Wage = 𝛽0 + 𝛿1 Ed1 + 𝛿2 Ed2 + 𝛿3 Ed3 + 𝛽1 Exp + 𝜀, where Exp is
experience. The expected wage for each group is then:

E(Wage) = 𝛽0 + 𝛽1 Exp             (no education)
E(Wage) = (𝛽0 + 𝛿1 ) + 𝛽1 Exp     (primary education)
E(Wage) = (𝛽0 + 𝛿2 ) + 𝛽1 Exp     (secondary education)
E(Wage) = (𝛽0 + 𝛿3 ) + 𝛽1 Exp     (tertiary education)
We leave out 𝐸𝑑0 to avoid perfect multicollinearity and 𝐸𝑑0 becomes the reference group or
base category. It does not matter which category is left out. However, care must be taken when
interpreting the coefficients.
For the base category (no education), the expected wage is E(Wage) = 𝛽0 + 𝛽1 Exp; each 𝛿
coefficient measures the wage difference of its education level relative to this group.
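The rule of leaving one category out can be sketched in code. The labels and the helper function below are hypothetical, purely for illustration:

```python
# Build education dummies, leaving out "none" as the base category to
# avoid the dummy variable trap. Category labels here are illustrative.
levels = ["primary", "secondary", "tertiary"]   # base category "none" omitted

def encode(education):
    """Return the dummy values (Ed1, Ed2, Ed3) for one observation."""
    return [1 if education == lvl else 0 for lvl in levels]

print(encode("none"))       # [0, 0, 0] -> the base category
print(encode("secondary"))  # [0, 1, 0]
```

An observation in the base category scores zero on every dummy, so its group mean is captured entirely by the intercept.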
Dummy variables can be categorized into intercept dummy variables and slope dummy variables,
depending on how they are used in the regression model. Let us consider a hedonic model. This
model says that the price of a commodity is determined by the characteristics of the item. In
many cases a dummy variable modifies the intercept parameter. In the hedonic model, let us
assume;
𝑃𝑖 = 𝛽0 + 𝛽1 𝑆𝑖 + 𝜀𝑖
where 𝑃𝑖 is the price of the house, 𝑆𝑖 is its size and 𝛽0 is land rent.
In real estate, location is very important. Let us introduce a dummy variable for location.
𝑃𝑖 = 𝛽0 + 𝛿𝐷𝑖 + 𝛽1 𝑆𝑖 + 𝜀𝑖
𝐷𝑖 = 1 if the property is in a desirable neighbourhood
𝐷𝑖 = 0 if the property is not in a desirable neighbourhood
The researcher can decide the group to take the value of zero. Notice that the coefficient of the
dummy variable 𝛿 affects the intercept of the regression when 𝐷𝑖 = 1. 𝛿 is location premium.
E(P) = (𝛽0 + 𝛿) + 𝛽1 𝑆   if 𝐷𝑖 = 1
E(P) = 𝛽0 + 𝛽1 𝑆          if 𝐷𝑖 = 0

Graphically, the two groups have regression lines with the same slope 𝛽1 but intercepts that
differ by 𝛿.
We may want to consider the interaction term for location and size of the house. Let us assume
that;
𝑃𝑖 = 𝛽0 + 𝛽1 𝑆𝑖 + 𝛾(𝑆𝑖 𝐷𝑖 ) + 𝜀𝑖
In this case, 𝛾 is the coefficient of the interaction term of size and the dummy variable for
location, 𝑆𝑖 𝐷𝑖 .
If 𝐷𝑖 = 1, the marginal effect of size on price is 𝛽1 + 𝛾, so the slopes of the two groups
differ by 𝛾:

E(𝑃𝑖 ) = 𝛽0 + 𝛽1 𝑆𝑖 + 𝛾(𝑆𝑖 𝐷𝑖 ) = 𝛽0 + (𝛽1 + 𝛾)𝑆𝑖   if 𝐷𝑖 = 1
E(𝑃𝑖 ) = 𝛽0 + 𝛽1 𝑆𝑖                                      if 𝐷𝑖 = 0
The slopes differ by 𝛾; therefore 𝑆𝑖 𝐷𝑖 acts as a slope dummy variable. 𝛽0 is deliberately
kept the same in both equations to simplify the exposition. However, land in a desirable location
is normally more expensive. We now introduce an intercept dummy variable as well, so that as we
move to another location the price of the house changes.
𝑃𝑖 = 𝛽0 + 𝛿𝐷𝑖 + 𝛽1 𝑆𝑖 + 𝛾(𝑆𝑖 𝐷𝑖 ) + 𝜀𝑖

E(P) = (𝛽0 + 𝛿) + (𝛽1 + 𝛾)𝑆   if 𝐷𝑖 = 1
E(P) = 𝛽0 + 𝛽1 𝑆                if 𝐷𝑖 = 0
4.7 JOINT TEST
In the hedonic model above, one of the key questions to be answered could be, does location
affect the price? Since there are two parameters to be tested, this is a joint test.
Procedure
a. Run the unrestricted model. This is the usual regression with all the independent variables
of interest.
b. Record the error sum of squares from the ANOVA table of the unrestricted model (𝑆𝑆𝐸𝑈𝑅 ).
c. Run the restricted model. This is a model with all the independent variables of interest
except the variables you want to test.
d. Record the error sum of squares from the ANOVA table of the restricted model (𝑆𝑆𝐸𝑅 ).
e. Compute the F-statistic.
f. Compare the F-statistic to the critical F-value. If the calculated F-statistic is greater
than the critical value, reject the null hypothesis of no effect and conclude that the
variables have an effect. Alternatively, compute the P-value of the calculated F-statistic
and reject the null hypothesis if the P-value is less than the level of significance.
Let m be the number of restrictions (parameters tested), n the sample size and k the number of
parameters in the unrestricted model.
Example
Using IHS3 data, you may want to estimate the effect of location on aggregate expenditure.
Run the unrestricted regression model.
Compute the F-statistic:

𝐹 = [(𝑆𝑆𝐸𝑅 − 𝑆𝑆𝐸𝑈𝑅 )/𝑚] / [𝑆𝑆𝐸𝑈𝑅 /(𝑛 − 𝑘)]
We reject the null hypothesis and conclude that location (being in urban) has a significant
positive effect on aggregate expenditure. Urban residents spend MK 166053.80 more than rural
residents on average holding all other factors constant.
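The F computation itself is mechanical once the two regressions have been run. In the sketch below the SSE values, m, n and k are hypothetical placeholders, not the actual IHS3 results:

```python
# Joint F-test sketch: compare restricted and unrestricted error sums of
# squares. All numbers here are illustrative placeholders.
sse_r = 520.0    # error sum of squares, restricted model (hypothetical)
sse_ur = 450.0   # error sum of squares, unrestricted model (hypothetical)
m, n, k = 2, 100, 4   # restrictions, observations, parameters (hypothetical)

F = ((sse_r - sse_ur) / m) / (sse_ur / (n - k))
print(round(F, 2))   # 7.47
```

If this F exceeds the critical value from the F(m, n − k) table, the excluded variables jointly matter.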
This unit has introduced you to the method of Ordinary Least Squares (OLS) for estimating the
parameters of a regression model. You learnt manual computation of estimators of population
parameters for a simple linear regression based on the OLS technique. In addition, you
interpreted parameters of a simple linear regression and a multiple linear regression. Then you
covered properties of OLS estimators and the two types of dummy variables.
1. The data in Table 4.1 show values of the dependent variable Y for each
value of the independent variable X.
𝒀𝒊 𝑿𝒊
70 80
65 100
90 120
95 140
110 160
115 180
120 200
140 220
155 240
150 260
a. Calculate
i. ∑ 𝑋𝑖 iii. ∑ 𝑋𝑖 𝑌𝑖
ii. ∑ 𝑌𝑖 iv. ∑ 𝑋𝑖2
b. Calculate
i. The slope parameter 𝛽1
5 GOODNESS OF FIT AND ANALYSIS OF VARIANCE
UNIT INTRODUCTION
Unit 4 introduced you to the simple linear regression and the multiple linear regression. Not
every model fits the data very well. It is very important for economists to know how well the
model fits the data. This unit provides useful techniques for testing goodness of fit. You will
learn more about the coefficient of determination and the correlation coefficient. You will also
review some commonly used distributions and the analysis of variance.
UNIT OBJECTIVES
The total variation or Total Sum of Squares (TSS) in the dependent variable, Y, is equal to the
sum of explained variation and residual variation.
Explained variation or Regression Sum of Squares (RSS) is the variation in the dependent
variable due to the effect of the independent variables in the model.
Residual variation or Error Sum of Squares (ESS) refers to variation in the dependent variable
due to the error.
The coefficient of determination is the proportion of the variation in the dependent variable
that is explained by the independent variables in the regression model.
The total variation can be decomposed as TSS = RSS + ESS. If we divide both sides of this
equation by TSS, we can calculate the coefficient of determination:

𝑅² = RSS/TSS = ∑(𝑦̂𝑖 − 𝑦̅)² / ∑(𝑦𝑖 − 𝑦̅)²

Equivalently, since RSS = TSS − ESS,

𝑅² = (TSS − ESS)/TSS = 1 − ESS/TSS = 1 − ∑(𝑦𝑖 − 𝑦̂𝑖 )² / ∑(𝑦𝑖 − 𝑦̅)²
The coefficient of determination can also be expressed as the adjusted R-squared (𝑅²𝐴𝑑𝑗 ). Some
of the reasons for using 𝑅²𝐴𝑑𝑗 are;
a. To allow for comparison of several regressions fitted to the data in order to choose the best
model.
b. To correct for degrees of freedom. R-squared tends to increase as more and more
regressors enter the model. There is a need for a measure of goodness of fit that takes
into account the number of variables in the regression. The term adjusted means adjusted
for the degrees of freedom (df) associated with the sums of squares. Where k is the
number of estimated parameters (including the intercept) and n is the sample size, the
adjusted R-squared can be computed from the formula below.
𝑅²𝐴𝑑𝑗 = 1 − (1 − 𝑅²)(𝑛 − 1)/(𝑛 − 𝑘)
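Using the simple regression fitted earlier in the module (Ŷ = −0.1 + 0.7x), the two measures can be computed as follows:

```python
# R-squared and adjusted R-squared for the simple regression example.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n, k = len(y), 2                 # k = number of estimated parameters

ybar = sum(y) / n
fitted = [-0.1 + 0.7 * xi for xi in x]

tss = sum((yi - ybar) ** 2 for yi in y)                  # total variation
ess = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual variation
r2 = 1 - ess / tss
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k)

print(round(r2, 4))       # 0.8167, i.e. about 81 percent explained
print(round(r2_adj, 4))   # 0.7556
```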
The coefficient of determination has limitations:
a. It does not measure the magnitude of the slope of the regression line.
b. It is not a complete measure of the overall fitness of the linear regression model.
c. It is not a verification of correct specification of a fitted model.
5.3 CORRELATION
A correlation exists between two variables when one of them is related to the other in some way.
The linear coefficient of correlation measures the strength of the linear relationship between
paired values of two variables in a sample. Since we use sample data, the coefficient of
correlation is a sample statistic, r. It is an estimate of a population parameter 𝜌. This
quantity is closely related to, but conceptually very different from, the coefficient of
determination for a regression, R². To develop methods of using the sample correlation
coefficient to make inferences about the population coefficient, we make the following
assumptions;
𝑟 = ±√𝑟²

𝑟 = ∑𝑥𝑖 𝑦𝑖 / √((∑𝑥𝑖²)(∑𝑦𝑖²))   (where 𝑥𝑖 and 𝑦𝑖 are deviations from their means)
𝑟 = (𝑛∑𝑋𝑖 𝑌𝑖 − ∑𝑋𝑖 ∑𝑌𝑖 ) / √([𝑛∑𝑋𝑖² − (∑𝑋𝑖 )²][𝑛∑𝑌𝑖² − (∑𝑌𝑖 )²])
i. It can be positive or negative, the sign depending on the sign of the term in the numerator
of definitional formula which measures the sample covariation of two variables.
ii. It lies between the limits of −1 and +1; that is, −1 ≤ r ≤ 1.
iii. It is symmetrical in nature; that is, the coefficient of correlation between X and Y (rXY )
is the same as that between Y and X(rY X).
iv. It is independent of the origin and scale
v. If X and Y are statistically independent, the correlation coefficient between them is zero;
but if r = 0, it does not mean that the two variables are independent. In other words, zero
correlation does not necessarily imply independence.
vi. It is a measure of linear association or linear dependence only; it has no meaning for
describing nonlinear relations. Y = X² is an exact relationship yet r is zero.
vii. Although it is a measure of linear association between two variables, it does not
necessarily imply any cause-and-effect relationship.
Example
The table below shows hypothetical data for sales of two commodities in a supermarket
for 10 weeks. Calculate the linear coefficient of correlation.
y x
1130 780
621 916
813 793
996 1188
1030 499
1257 1180
898 1229
743 1450
921 1071
1179 1153
Working
Y      X      X²         Y²         XY
1130 780 608400 1276900 881400
621 916 839056 385641 568836
813 793 628849 660969 644709
996 1188 1411344 992016 1183248
1030 499 249001 1060900 513970
1257 1180 1392400 1580049 1483260
898 1229 1510441 806404 1103642
743 1450 2102500 552049 1077350
921 1071 1147041 848241 986391
1179 1153 1329409 1390041 1359387
9588 10259 11218441 9553210 9802193
𝑟 = (𝑛∑𝑋𝑖 𝑌𝑖 − ∑𝑋𝑖 ∑𝑌𝑖 ) / √([𝑛∑𝑋𝑖² − (∑𝑋𝑖 )²][𝑛∑𝑌𝑖² − (∑𝑌𝑖 )²])

𝑟 = (10(9802193) − (10259)(9588)) / √([(10)(11218441) − 10259²][(10)(9553210) − 9588²])
𝑟 = −0.068
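A quick computation of r for the data above:

```python
# Linear correlation coefficient for the supermarket sales data.
y = [1130, 621, 813, 996, 1030, 1257, 898, 743, 921, 1179]
x = [780, 916, 793, 1188, 499, 1180, 1229, 1450, 1071, 1153]
n = len(x)

Sx, Sy = sum(x), sum(y)                       # 10259, 9588
Sxy = sum(xi * yi for xi, yi in zip(x, y))    # 9802193
Sxx = sum(xi ** 2 for xi in x)                # 11218441
Syy = sum(yi ** 2 for yi in y)                # 9553210

r = (n * Sxy - Sx * Sy) / ((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2)) ** 0.5
print(round(r, 3))   # -0.068, a very weak negative linear association
```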
NORMAL DISTRIBUTION
The best known of all the theoretical probability distributions is the normal distribution, whose
bell-shaped picture is familiar to anyone with statistical knowledge. A (continuous) random
variable X is said to be normally distributed if its PDF has the following form:
𝑓(𝑥) = (1/(𝜎√2𝜋)) exp(−(𝑥 − 𝜇)²/(2𝜎²)),   −∞ < 𝑥 < ∞
The curve of a normal distribution is bell shaped with 95 percent of the observations lying within
two standard deviations from the mean as shown in the figure below.
(Figure: the normal curve, with approximately 68% of observations within one standard deviation
of the mean, 95% within two, and 99.7% within three.)
Properties of a normal distribution
We use tables to find the probability that X lies within a certain interval. To use the table, we
convert the given normally distributed variable X with mean μ and variance σ² into a standardized
normal variable Z by the following transformation.
𝑋−𝜇
𝑍=
𝜎
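Probabilities for the standardized variable Z can also be computed without tables using the error function, via the standard identity Φ(z) = 0.5(1 + erf(z/√2)). The values μ = 50 and σ = 10 in the second part are purely illustrative:

```python
import math

# Standard normal CDF via the error function.
def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Probability that X lies within two standard deviations of the mean:
p_within_2sd = phi(2) - phi(-2)
print(round(p_within_2sd, 4))   # 0.9545

# Standardizing first: if X ~ N(mu=50, sigma=10), P(X < 65) = Phi(1.5).
z = (65 - 50) / 10
print(round(phi(z), 4))         # 0.9332
```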
T-DISTRIBUTION
If the sample size n is less than 30, the population standard deviation is unknown, and the
population distribution can be assumed to be normal, then the boundaries a and b of a confidence
interval a ≤ µ ≤ b are determined by means of:
𝑇 = (𝑋̅ − 𝜇)/𝑆𝑋̅ , where 𝑆𝑋̅ = 𝑠/√𝑛 is the estimated standard error of the sample mean.
If Z1 is a standardized normal variable [that is, Z1 ∼ N(0, 1)] and another variable Z2 follows the
chi-square distribution with k df and is distributed independently of Z1, then the variable defined
as
𝑡 = 𝑍1 /√(𝑍2 /𝑘)
follows the t distribution with k degrees of freedom.
(Figure: t distributions for k = 5, k = 20 and k = 120; with k = 120 the curve is close to the
normal.)
Properties of the t distribution
a. Like the normal distribution, the t distribution is symmetrical, but it is flatter than the
normal distribution. As the df increase, the t distribution approaches the normal
distribution. One can think of the number of degrees of freedom associated with a statistic
as the number of unrestricted, free-to-vary values used in calculating the statistic.
b. The mean of the t distribution is zero, and its variance is k/(k − 2).
It can be proven mathematically that at infinite degrees of freedom, the t distribution and the
standard normal distribution are identical. As the degrees of freedom decrease from infinity, the t
curves remain bell-shaped and symmetric around a mean of zero, but they become progressively
flatter than the standard normal, with more area in the tails. Below about 30 degrees of freedom,
the t distribution is quite different from the standard normal distribution.
F DISTRIBUTION
Apart from the normal distribution, there are other distributions in statistics that are relevant
in econometrics. If we select randomly two independent samples from two normally distributed
populations with equal variances, the distribution of the ratio of the sample variances
(𝑠1²/𝑠2²) is called the F-distribution.
(Figure: the F-distribution takes non-negative values only, with the rejection region α in the
right tail.)
5.5 ANALYSIS OF VARIANCE (ANOVA)
If you select two samples from two populations with equal variance and give a specific treatment
to one of the samples, the observed variances may be due to the treatment or due to errors. The
analysis of variance is based on a comparison of two different estimates of population variance.
a. The variances between samples. This is the variance due to the treatment.
b. The variance within samples. This is the variance due to the error.
The method is called one-way ANOVA because we use one property or characteristic to
categorize the populations. This characteristic is called a treatment or factor. One-way
ANOVA helps us to test the null hypothesis that three or more population means are equal.
𝑯𝟎 ; 𝝁 𝟏 = 𝝁 𝟐 = 𝝁 𝟑
If the variation among the samples (due to treatment) is equal to the variation within samples
(due to error), it means that the treatment did not have any effect and the F statistic will be
close to one. An F statistic close to one shows that there is insufficient evidence for us to
reject the null hypothesis of equality of means. But if the F-statistic is very large, it shows
that the variation due to treatment is much larger than the variation due to error. In such a
case we reject the null hypothesis, specifically when the calculated F statistic is larger than
the critical value.
Below is the ANOVA table, where k is the number of population means being compared (equal to
the number of treatments) and n is the total sample size.

Source of variation       SS     df       MS                  F
Between samples           SSB    k − 1    MSB = SSB/(k − 1)   F = MSB/MSE
Within samples (error)    SSE    n − k    MSE = SSE/(n − k)
Total                     SST    n − 1
Example
A company sells identical soap in three different wrappings at the same price. Sales data are
normally distributed with equal variance. The sales for 5 months are given in the table below.
Test at the 5% level of significance whether the mean soap sales for each wrapping are equal or not.

Month  Wrapping 1  Wrapping 2  Wrapping 3
1      87          78          90
2      83          81          91
3      79          79          84
4      81          82          82
5      80          80          88
Mean   82          80          87
SSE = (87 − 82)² + (83 − 82)² + (79 − 82)² + (81 − 82)² + (80 − 82)²
    + (78 − 80)² + (81 − 80)² + (79 − 80)² + (82 − 80)² + (80 − 80)²
    + (90 − 87)² + (91 − 87)² + (84 − 87)² + (82 − 87)² + (88 − 87)²
    = 110
The calculated F value is 7.09, which is greater than the tabulated value. Therefore we reject the
null hypothesis of equal means and conclude that the three means are not equal. This means that
the difference in wrapping had an effect on the sales of the soap.
Data Analysis is one of the Add-ins for Excel. Add it through File, Options, Add-ins, select
Analysis ToolPak, click Go, check the Analysis ToolPak box, then click OK.
Enter the data in Excel with the treatments as columns. Then go to:
Data
Data Analysis
ANOVA: Single Factor
Select all cells with data as the input range. You may select the output range; the default is a
new sheet. Click OK.
SUMMARY
Groups      Count  Sum  Average  Variance
Wrapping 1  5      410  82       10
Wrapping 2  5      400  80       2.5
Wrapping 3  5      435  87       15

ANOVA
Source of Variation  SS   df  MS        F         P-value  F crit
Between Groups       130  2   65        7.090909  0.00927  3.885294
Within Groups        110  12  9.166667
Total                240  14
Notice that the ANOVA table from Excel has the same information as the one we got through
manual calculations. The interpretation is also the same. Notice that the Excel output also gives
the P-value. This is an alternative to the critical value approach for making the rejection
decision. In this case we reject the null hypothesis if the P-value associated with the F
statistic is smaller than the level of significance, such as 0.05.
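The manual and Excel calculations above can be reproduced in a few lines of code. Below is a minimal sketch in Python (standard library only) for the soap-wrapping data; the variable names are our own.

```python
# One-way ANOVA for the soap-wrapping example, computed from first principles.
wrap1 = [87, 83, 79, 81, 80]   # mean 82
wrap2 = [78, 81, 79, 82, 80]   # mean 80
wrap3 = [90, 91, 84, 82, 88]   # mean 87
groups = [wrap1, wrap2, wrap3]

k = len(groups)                          # number of treatments
n = sum(len(g) for g in groups)          # total sample size
grand_mean = sum(sum(g) for g in groups) / n
means = [sum(g) / len(g) for g in groups]

# Between-group sum of squares: variation due to the treatment
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
# Within-group sum of squares: variation due to error
sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

msb = ssb / (k - 1)       # mean square between, df = k - 1
mse = sse / (n - k)       # mean square within, df = n - k
f_stat = msb / mse

print(ssb, sse, round(f_stat, 2))   # → 130.0 110.0 7.09
```

The sums of squares and the F statistic match the manual and Excel results above.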
This unit has defined coefficient of determination as the proportion of variation in the dependent
variable that is explained by the regression model. For a multiple regression, we use adjusted R2.
You have also learnt that linear coefficient of correlation measures the strength of linear
relationship between paired values of two variables in a sample. Among the distributions that are
commonly used, you have studied the normal distribution, t-distribution and F-distribution. You
were also able to compute the F-statistic using ANOVA.
1. The table below gives the output for 8 years of an experimental farm that used each of 4
types of fertilizers. Assume that the outputs with each fertilizer are normally distributed
with equal variance.
a. Find the mean output for each fertilizer and the grand mean for all the years and for all
four fertilizers.
b. Estimate the population variance from the variance between the means or columns.
c. Estimate the population variance from the variance within the samples or columns.
d. Test the hypothesis that the population means are the same at the 5% level of
significance.
2. An experiment was conducted to compare three different computer keyboard designs with respect
to their effect on repetitive stress injuries (RSI). Fifteen businesses of comparable size participated
in a study to compare the three keyboard designs. Five of the fifteen businesses were randomly
selected and their computers were equipped with design 1 keyboards. Five of the remaining ten
were selected and equipped with design 2 keyboards, and the remaining five used design 3
keyboards. After one year, the number of RSIs was recorded for each company. The results are
shown in Table 1.
6 UNIT SIX: INFERENCE AND PREDICTION
UNIT INTRODUCTION
In unit 3, you learnt that the population parameters are usually unknown and you can estimate
them by using data collected from an adequate and representative sample. In this unit you will
learn about hypothesis testing, point estimation and interval estimation. Statisticians and
econometricians are very systematic in their approach to hypothesis testing. You will learn the
steps in hypothesis testing and types of error that are committed in the process of hypothesis
testing.
UNIT OBJECTIVES
6.1 POINT ESTIMATION
Very often we know or are willing to assume that a random variable X follows a particular
probability distribution but do not know the value(s) of the parameter(s) of the distribution. For
example, if a random variable X follows a normal distribution, we may want to know the value
of its two parameters, namely, the mean and the variance. To estimate the unknowns, we assume
that we have a random sample of size n from the known probability distribution and use the
sample data to estimate the unknown parameters. This is known as the problem of estimation.
The problem of estimation is categorized into point estimation and interval estimation.
A point estimate of a parameter θ is a single number that can be regarded as a sensible value
for θ. Usually we use θ̂ to denote the point estimator of a parameter θ. A point estimate is
obtained by selecting a suitable statistic and computing its value from given sample data. The
selected statistic is called the point estimator. For example, the sample mean and sample variance
are point estimators, and their computed sample values are the point estimates. Different
statistics can be used to
estimate the same parameter, i.e., a parameter may have multiple point estimators. For example,
the following are all point estimators of a population mean µ:
i. Sample mean: µ̂ = X̄
ii. Sample median: µ̂ = X̃
iii. Average of the extremes: µ̂ = (min xᵢ + max xᵢ)/2
A point estimator θ̂ is a random variable; its value varies from sample to sample, so there are
estimation errors:
θ̂ = θ + error of estimation
Accurate estimators are close to the true population parameter and have small errors of
estimation; an estimator with unbiasedness and minimum variance will often be accurate in this
sense.
The standard error of an estimator θ̂ is its standard deviation, σθ̂ = √Var(θ̂). If the
standard error itself involves unknown parameters whose values can be estimated, substitution of
these estimates into σθ̂ yields the estimated standard error of the estimator. The estimated
standard error can be denoted either by σ̂θ̂ or sθ̂.
A point estimate may be the researcher’s best guess at the population value, but, by its nature, it
provides no information about how close the estimate is “likely” to be to the population
parameter. It does not, by itself, provide enough information for testing economic theories or for
informing policy discussions.
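As a quick illustration, the sketch below computes the three point estimates listed above, plus the estimated standard error of the sample mean, for a small hypothetical sample (the data are invented for illustration).

```python
# Point estimates of a population mean from a hypothetical sample, plus the
# estimated standard error of the sample mean.
import statistics

sample = [12.1, 9.8, 11.4, 10.6, 12.9, 10.2, 11.7, 9.5]   # hypothetical data

mean_est = statistics.mean(sample)               # sample mean
median_est = statistics.median(sample)           # sample median
midrange_est = (min(sample) + max(sample)) / 2   # average of the extremes

s = statistics.stdev(sample)                     # sample standard deviation (n - 1)
se_mean = s / len(sample) ** 0.5                 # estimated standard error of the mean

print(round(mean_est, 3), round(median_est, 3), round(midrange_est, 3))  # → 11.025 11.0 11.2
print(round(se_mean, 3))
```

Note that the three estimators give three different numbers for the same parameter, which is exactly why an estimator's standard error matters.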
6.2 INTERVAL ESTIMATION
The reliability of a point estimator is measured by its standard error. Therefore, instead of relying
on the point estimate alone, we may construct an interval around the point estimator, say within
two or three standard errors on either side of the point estimator, such that this interval has, say,
95 percent probability of including the true parameter value. This is roughly the idea behind
interval estimation.
To be more specific, assume that we want to find out how “close,” say, β̂1 is to β1. For this
purpose we try to find two positive numbers δ and α, the latter lying between 0 and 1, such
that the probability that the random interval (β̂1 − δ, β̂1 + δ) contains the true β1 is 1 − α.
Symbolically,
Pr(β̂1 − δ ≤ β1 ≤ β̂1 + δ) = 1 − α
If the confidence interval contains zero, the parameter is not statistically significant, because
zero is among the plausible values of the parameter.
It is very important to know the following aspects of interval estimation:
The interval does not say that the probability of β1 lying between the given limits is 1 − α.
Since β1, although unknown, is assumed to be some fixed number, either it lies in the interval or
it does not. What the statement says is that the probability of constructing an interval that
contains β1 is 1 − α.
The confidence interval is a random interval; that is, it will vary from one sample to the
next because it is based on β̂1, which is random.
Since the confidence interval is random, the probability statements attached to it should
be understood in the long-run sense, that is, in repeated sampling. More specifically, it means:
if in repeated sampling confidence intervals like it are constructed a great many times on
the 1 − α probability basis, then, in the long run, on average, such intervals will
enclose the true value of the parameter in 1 − α of the cases.
Once we have a specific sample and obtain a specific numerical value of β̂1, the
interval is no longer random; it is fixed.
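The repeated-sampling interpretation can be checked with a small simulation. The sketch below (an entirely hypothetical setup, using the normal critical value 1.96 and a known σ for simplicity) draws many samples and counts how often the resulting interval covers the true mean.

```python
# Repeated-sampling interpretation of a 95% confidence interval: the interval
# changes from sample to sample, but in the long run about 95% of such
# intervals cover the fixed, unknown population mean.
import random

random.seed(1)
true_mu, sigma, n, reps = 50.0, 10.0, 40, 2000

covered = 0
for _ in range(reps):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    se = sigma / n ** 0.5              # standard error with sigma known
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
    covered += lo <= true_mu <= hi     # does this interval contain mu?

print(covered / reps)                  # close to 0.95 in the long run
```

Each individual interval either contains µ or it does not; the 95% refers to the long-run proportion of intervals that do.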
6.3 HYPOTHESIS TESTING
A hypothesis which says that a parameter has a specified value is called a point hypothesis. A
hypothesis which says that a parameter lies in a specified interval is called an interval hypothesis.
For instance, if β is the population mean, then H0: β = 4 is a point hypothesis, while
H0: 4 ≤ β ≤ 7 is an interval hypothesis.
A hypothesis test is a procedure that answers the question of whether the observed difference
between the sample value and the population value hypothesized is real or due to chance
variation. For instance, if the hypothesis says that the population mean µ= 6 and the sample
mean 𝑥̅ = 8, then we want to know whether this difference is real or due to chance variation.
The hypothesis we are testing is called the null hypothesis and is often denoted by H0. The
alternative hypothesis is denoted by H1. The probability of rejecting H0 when, in fact, it is true
is called the significance level. To test whether the observed difference between the data and
what is expected under the null hypothesis H0 is real or due to chance variation, we use a test
statistic.
A desirable criterion for the test statistic is that its sampling distribution be tractable, preferably
with tabulated probabilities. Tables are already available for the normal, t, χ2, and F distributions,
and hence the test statistics chosen are often those that have these distributions. The observed
significance level or P-value is the probability of getting a value of the test statistic that is at least
as extreme as the one actually observed. This probability is computed on the basis that the null
hypothesis is correct.
Null hypothesis, H0
The null hypothesis specifies the value you want to test. H0 is the belief that we maintain until
we test it with the data and the sample. The null hypothesis, denoted by H0, is typically a clear
statement of equality: the unknown population parameter θ is equal to some specific constant
value.
We begin with the assumption that H0 is true and that any difference between the sample statistic
and the true population parameter is due to chance and not a real (systematic) difference. The
null hypothesis refers to the status quo. It may or may not be rejected.
It is similar to the notion of “innocent until proven guilty”; that is, “innocence” is a null hypothesis.
The population mean monthly cell phone bill of Lilongwe city is not more than MK
25,000.00: μ ≤ MK 25,000.00.
The average number of TV decoders in Blantyre homes is at least three: μ ≥ 3.
Alternative hypothesis
It is the opposite of the null hypothesis. It challenges the status quo and never contains the signs
“=” , “≤” or “≥” .
It may or may not be proven. It is generally the hypothesis that the researcher is trying to prove.
Evidence is always examined with respect to H1, never with respect to H0. We never “accept”
H0; we either “reject” or “fail to reject” it.
The population mean monthly cell phone bill of Lilongwe city is greater than MK
25,000.00: μ > MK 25,000.00.
The average number of TV decoders in Blantyre homes is less than three; μ < 3
Test statistic
Select the appropriate test statistic and always state it. The following are test statistics
for one population.
When testing a hypothesis about a proportion p, we use the z-statistic (z-test), where
q = 1 − p:
Z = (p̂ − p) / √(pq/n)
When testing a hypothesis about a mean, we use the z-statistic or the t-statistic
according to the following conditions.
o If the population standard deviation, σ, is known and either the data are normally
distributed or the sample size n > 30, we use the normal distribution (z-statistic):
Z = (X̄ − µ) / (σ/√n)
o When the population standard deviation, σ, is unknown and either the data are
normally distributed or the sample size is greater than 30 (n > 30), we use the t-
distribution (t-statistic):
t = (X̄ − µ) / (s/√n)
For a regression coefficient, the same t-statistic takes the form
t = (β̂1 − β1) / se(β̂1)
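To make the t-statistic concrete, here is a short sketch with a made-up sample of n = 10 observations and a hypothesized mean of 6 (all numbers are ours, chosen so the arithmetic comes out cleanly).

```python
# One-sample t statistic with sigma unknown: t = (xbar - mu0) / (s / sqrt(n)).
import statistics

data = [8, 7, 6, 9, 8, 7, 5, 8, 9, 7]   # hypothetical sample
mu0 = 6                                  # hypothesized population mean

n = len(data)
xbar = statistics.mean(data)             # sample mean
s = statistics.stdev(data)               # sample standard deviation (n - 1)
t = (xbar - mu0) / (s / n ** 0.5)

print(xbar, round(t, 2))                 # → 7.4 3.5
```

The calculated t of 3.5 would then be compared with the critical t value for n − 1 = 9 degrees of freedom at the chosen significance level.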
Level of significance, α
The researcher decides on the level of significance for the hypothesis test. This is the
probability of rejecting a true null hypothesis. It is actually the lowest level at which the null
hypothesis can be rejected. There are three commonly used levels of significance: 0.01, 0.05
and 0.10.
A 0.01 level of significance reduces the chance of rejecting a true null hypothesis, while a 0.10
level of significance reduces the chance of failing to reject a false null hypothesis. Many researchers
use the 0.05 level of significance as it balances the possibility of making the two errors.
Rejection region
The level of significance is used to determine the critical value of the test statistic. The critical
value demarcates the distribution of the test statistic into rejection region (critical region) and
acceptance region.
For a two-tailed test, the rejection region is split into two halves, one on each tail of the
distribution.
Figure: two-tailed test of H0: µ = 0, with the region of acceptance between −t(α/2) and t(α/2).
We reject the null hypothesis if the calculated t-statistic is less than −t(α/2) or greater than
t(α/2). In absolute terms, we reject the null hypothesis if |t| is greater than t(α/2).
For a one-tailed test the rejection region is on one side. There are right-tailed tests and left-tailed
tests, and the critical value is ±tα rather than ±t(α/2).
Figure: left-tailed test of H0: µ ≥ K, with the rejection region below −tα.
We reject the null hypothesis H0: µ ≥ K if the calculated t-statistic is less than −tα.
Figure: right-tailed test of H0: µ ≤ K, with the rejection region above tα.
We reject the null hypothesis H0: µ ≤ K if the calculated t-statistic is greater than tα. In
absolute terms, for either one-tailed test we reject the null hypothesis if |t| is greater than tα
and the sample evidence lies in the direction of the alternative.
Conclusion
The calculated test statistic may fall within or outside the acceptance region. In the conclusion
of a hypothesis test, we may reject or fail to reject the null hypothesis depending upon where
the test statistic falls.
If the test statistic falls in the rejection region, we reject the null hypothesis. In such a
case, the sample evidence is strong enough to warrant rejection of the null hypothesis.
The probability that we have rejected a true null hypothesis is no larger than the level of
significance.
We fail to reject the null hypothesis when the test statistic falls within the acceptance
region. In this case, we mean that on the basis of the sample evidence we have no reason
to reject it. The sample evidence is not strong enough to warrant rejection of the null
hypothesis. However, we are not saying that the null hypothesis is true beyond any doubt.
Saying “we fail to reject the null hypothesis” states this more correctly than saying “we
accept the null hypothesis.”
6.4 TYPE I AND TYPE II ERRORS
When testing a hypothesis, we arrive at a conclusion of rejecting it or failing to reject it. Such
conclusions may be correct and sometimes wrong (even if we do everything correctly). There are
two different errors that we can make. We distinguish the two types of errors by calling
them Type I error and Type II error.
Type I error
This is the mistake of rejecting the null hypothesis when it is actually true. The symbol α (alpha) is
used to represent the probability of committing a Type I error, and it is called the level of
significance.
Type II error
This is the mistake of failing to reject the null hypothesis when it is actually false. The symbol β
(beta) is used to represent the probability of committing a Type II error. The quantity 1 − β is
called the power of the test; put simply, the power of the test is the probability of rejecting a
false null hypothesis.
The type I and type II errors are better understood by relating them to a court case. The table
below shows the two types of errors and their analogy in the legal system.
                               Actual situation
Decision                       Hypothesis testing                      Legal system
                               True null         False null           Innocent                   Not innocent
Do not reject null hypothesis  No error (1 − α)  Type II error (β)    No error (not guilty,      Type II error (guilty,
                                                                      not found guilty)          not found guilty)
Reject null hypothesis         Type I error (α)  No error (1 − β)     Type I error (not guilty,  No error (guilty,
                                                                      found guilty)              found guilty)
Controlling Type I and Type II errors
It is standard procedure for a researcher to select the significance level α [P(Type I error)], but
we do not select β [P(Type II error)]. It is desirable to have α = 0 and β = 0, but in reality
that is not possible. Therefore our main task is to manage the probabilities of Type I error and
Type II error.
There exists a mathematical relationship among α, β and the sample size n such that when any
two are chosen, the third is automatically determined. You may refer to your statistics module or
research methods notes for formulas for determining sample size. The usual practice in research
and industry is to select values of α and n, so that the value of β is determined.
For any fixed 𝜶 an increase in sample size, n will cause a decrease in 𝜷. That is, a large
sample size will decrease the chance that you make the error of not rejecting the null
hypothesis when it is actually false.
For any fixed sample size, n a decrease in 𝜶, will cause an increase in 𝜷.
To decrease both 𝛼 and 𝛽, increase the sample size.
6.5 THE P-VALUE
Earlier in this unit we noted that there are three commonly used levels of significance: 0.10, 0.05
and 0.01. Rather than testing at different significance levels, it is more informative to answer the
following question: given the observed value of the t statistic, what is the smallest significance
level at which the null hypothesis would be rejected? This level is known as the p-value for the
test. Since a p-value is a probability, its value is always between zero and one.
In order to compute p-values, we either need extremely detailed printed tables of the t
distribution, which is not very practical, or a computer program that computes areas under the
probability density function of the t distribution. Most modern regression packages have this
capability; some compute p-values routinely with each OLS regression.
The p-value is an alternative to the critical value of the test statistic as a way of concluding a
hypothesis test. The P-value is compared to the predetermined level of significance, and we reject
the null hypothesis if the p-value is less than the level of significance.
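For a z test, the p-value can be computed directly from the standard normal CDF, which the Python standard library exposes through statistics.NormalDist. The numbers below are hypothetical.

```python
# Two-tailed p-value for a z test, and the p-value decision rule.
from statistics import NormalDist

z_calc = 2.37                                       # hypothetical calculated z statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z_calc)))   # two-tailed p-value

alpha = 0.05
print(round(p_value, 4), p_value < alpha)           # p < 0.05, so reject H0
```

A t-distribution p-value needs a statistics package (e.g. a regression program), since the t CDF is not in the standard library, but the decision rule is identical.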
6.6 CONCLUSION
In this unit, we have discussed how to test hypotheses using the classical approach: after
stating the null and alternative hypotheses, we choose a significance level, which then determines a
critical value. Once the critical value has been identified, the value of the t statistic is compared
with the critical value, and the null is either rejected or not rejected at the given significance
level. You have also learnt that in rejecting or failing to reject the null hypothesis, we may
commit a Type I or Type II error. The unit concluded with the use of the P-value in hypothesis
testing.
7 UNIT 7: DATA CONFORMITY AND PROBLEMS OF
FITTING MODELS
UNIT INTRODUCTION
In the previous units, you learnt that population parameters are usually unknown and are
estimated using data collected from an adequate and representative sample. You have noticed that
the quality of the estimates is only as good as the quality of the data used in estimation. In this
unit, you will learn about problems associated with data. You will also be introduced to model
specification problems. In the last section, you will learn different functional forms that you can
use when analysing data.
OBJECTIVES
One of the assumptions of the classical linear regression model (CLRM), Assumption 9, is that
the regression model used in the analysis is “correctly” specified. If the model is not “correctly”
specified, we encounter the problem of model specification error or model specification bias.
Considering the diversity of the topic, we hope to bring out some of the essential issues involved
in model specification and model evaluation.
DATA PROBLEMS
Missing data
The missing data problem can arise in a variety of forms. Often, we collect a random sample of
people, schools, cities, and so on, and then discover later that information is missing on some key
variables for several units in the sample. For example, some of the respondents may provide no
information on some variables. In panel data also, over time some respondents drop out or do not
provide information on all the questions.
If the reasons for the missing data are independent of the available observations, what Darnell
calls the “ignorable case,” we can simply ignore those observations.
If the missing observations are systematically related to the available data, the case is more
serious, for it may be the result of self-selection bias; that is, the observed data are not truly
randomly collected. Such cases require more complicated solutions.
Non-random samples
A non-random sample is not representative of the population. The random sampling assumption
is violated, and we must worry about consequences for OLS estimation.
Provided there is enough variation in the independent variables in the sub-population, selection
on the basis of the independent variables is not a serious problem, other than that it results in
inefficient estimators.
Certain types of nonrandom sampling, like choosing a sample on the basis of the independent
variables, do not cause bias or inconsistency in OLS.
Outlying Observations
In some applications, especially, but not only, with small data sets, the OLS estimates are
influenced by one or several observations. Such observations are called outliers or influential
observations. In the regression context, an outlier may be defined as an observation with a “large
residual.” Loosely speaking, an observation is an outlier if dropping it from a regression analysis
makes the OLS estimates change by a practically “large” amount. OLS is susceptible to outlying
observations because it minimizes the sum of squared residuals.
Outliers can arise for two main reasons:
a) A mistake has been made in entering the data, such as adding extra zeros to a number or
misplacing a decimal point. It is always a good idea to compute summary statistics,
especially minimums and maximums, in order to catch mistakes in data entry.
b) Outliers can also arise when sampling from a small population if one or several
members of the population are very different in some relevant aspect from the rest of the
population.
OLS results should probably be reported with and without outlying observations.
Outlying observations can provide important information by increasing the variation in
the explanatory variables which reduces standard errors.
Use functional forms that are less sensitive to outliers if possible. Certain functional
forms are less sensitive to outlying observations. For most economic variables, the
logarithmic transformation significantly narrows the range of the data and also yields
functional forms that can explain a broader range of data.
We can use an estimation method that is less sensitive to outliers than OLS.
This removes the need to explicitly search for outliers before estimation. One such
method is called least absolute deviations, or LAD. The LAD estimator minimizes the
sum of the absolute deviation of the residuals, rather than the sum of squared residuals.
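A tiny sketch of why OLS is sensitive to outliers: with hypothetical data lying exactly on a line of slope 2, mis-entering a single y value pulls the OLS slope well away from 2.

```python
# Effect of one outlying observation on the OLS slope.
def ols_slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

x = [1, 2, 3, 4, 5, 6]
y_clean = [2, 4, 6, 8, 10, 12]      # exact line with slope 2
y_outlier = [2, 4, 6, 8, 10, 40]    # last value mis-entered (12 -> 40)

print(round(ols_slope(x, y_clean), 2))    # → 2.0
print(round(ols_slope(x, y_outlier), 2))  # → 6.0: one bad point triples the slope
```

Re-running the regression with and without the suspect observation, as suggested above, makes its influence obvious.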
MODEL FITTING
Errors in fitting models fall into two broad groups: specification errors, where we have a “true”
model in mind but fail to estimate it correctly, and misspecification errors, where we do not
know the true model at all. Below we consider some of the most common specification errors.
Omission of a Relevant Variable (Underfitting a Model)
Suppose the true model is
Yi = β0 + β1X1i + β2X2i + εi
but we instead estimate Yi = α0 + α1X1i + vi, omitting the relevant variable X2. The
consequences of this specification error are as follows:
i. If the left-out, or omitted, variable X2 is correlated with the included variable X1, that is,
r12, the correlation coefficient between the two variables, is nonzero, then α̂0 and α̂1 are
biased as well as inconsistent. The bias does not disappear as the sample size gets larger.
ii. Even if X1 and X2 are not correlated, α̂0 is biased, although α̂1 is now unbiased.
iii. The disturbance variance σ² is incorrectly estimated.
iv. The conventionally measured variance of α̂1 (= σ²/Σx1i²) is a biased estimator of the
variance of the true estimator β̂1.
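Consequence (i) can be illustrated by simulation. In this sketch (entirely hypothetical numbers), the true model includes X2, which is correlated with X1; regressing Y on X1 alone pushes the estimated slope toward β1 + 0.8β2 = 4.4 instead of the true β1 = 2.

```python
# Simulating omitted-variable bias in the short regression of y on x1 alone.
import random

random.seed(2)
n = 5000
b0, b1, b2 = 1.0, 2.0, 3.0              # true parameters

x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * v + random.gauss(0, 0.6) for v in x1]   # x2 correlated with x1
y = [b0 + b1 * a + b2 * c + random.gauss(0, 1) for a, c in zip(x1, x2)]

def ols_slope(x, y):
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (c - ybar) for a, c in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

b1_short = ols_slope(x1, y)   # omits x2; biased away from the true value 2.0
print(round(b1_short, 1))     # close to 4.4; the bias does not vanish as n grows
```

Increasing n tightens the estimate around the wrong value, illustrating inconsistency, not just bias.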
Inclusion of an Irrelevant Variable (Overfitting a Model)
Now suppose the true model is
Yi = β0 + β1X1i + εi
but we estimate Yi = α0 + α1X1i + α2X2i + vi, including the irrelevant variable X2. In this
case, we commit the specification error of including an unnecessary variable in the model.
The consequences of this specification error are as follows:
i. The OLS estimators of the parameters of the “incorrect” model are all unbiased and
consistent.
ii. The error variance σ² is correctly estimated.
iii. The usual confidence interval and hypothesis-testing procedures remain valid.
iv. However, the estimated α’s will generally be inefficient; that is, their variances will
generally be larger than those of the β̂’s of the true model.
Errors of measurement
If there are errors of measurement in the regressand only, the OLS estimators are unbiased as
well as consistent but they are less efficient. If there are errors of measurement in the regressors,
the OLS estimators are biased as well as inconsistent. Even if errors of measurement are detected
or suspected, the remedies are often not easy. The use of instrumental or proxy variables is
theoretically attractive but not always practical. Thus it is very important in practice that the
researcher be careful in stating the sources of his/her data, how they were collected, definitions
used, etc. Data collected by official agencies often come with several footnotes and the
researcher should bring those to the attention of the reader.
Researchers have used a variety of criteria to choose between competing models, including
goodness-of-fit measures such as the adjusted R² discussed earlier.
FUNCTIONAL FORMS OF REGRESSION MODELS
In this section we consider some commonly used regression models that may be nonlinear in the
variables but are linear in the parameters, or that can be made so by suitable transformations of
the variables. In particular, we discuss the log-log model, semilog models, reciprocal models and
the logarithmic reciprocal model.
The Log-log Model
Consider the model
Yi = β0 Xi^β1 e^εi
Taking natural logarithms (ln, i.e., log to the base e, where e ≈ 2.718) of both sides and letting
α = lnβ0, the model becomes
lnYi = α + β1 lnXi + εi
This model is linear in the parameters α and 𝛽1 and can be estimated by OLS regression.
Because of this linearity, such models are called log-log, double-log, or log-linear models. If the
assumptions of the classical linear regression model are fulfilled, the parameters can be estimated
by the OLS method by letting Yi* = lnYi and Xi* = lnXi, so that
Yi* = α + β1 Xi* + εi
The coefficient 𝛽1 measures elasticity of Y with respect to X. The model assumes that the
elasticity 𝛽1 is constant.
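The sketch below fits a log-log model by simple OLS on the transformed data; the sample is simulated with a true elasticity of 0.7 (all numbers are hypothetical), so the estimated slope should land near 0.7.

```python
# Estimating a constant elasticity by OLS on log-log transformed data.
import math
import random

random.seed(3)
x = [random.uniform(10, 100) for _ in range(200)]
# True model: Y = 5 * X^0.7 * exp(error), i.e. elasticity 0.7
y = [5.0 * xi ** 0.7 * math.exp(random.gauss(0, 0.05)) for xi in x]

lx = [math.log(v) for v in x]   # X* = ln X
ly = [math.log(v) for v in y]   # Y* = ln Y

xbar, ybar = sum(lx) / len(lx), sum(ly) / len(ly)
beta1 = (sum((a - xbar) * (b - ybar) for a, b in zip(lx, ly))
         / sum((a - xbar) ** 2 for a in lx))   # slope = elasticity estimate
alpha = ybar - beta1 * xbar                    # intercept = ln(beta0) estimate

print(round(beta1, 2))   # close to the true elasticity of 0.7
```

The slope of the regression in logs is read directly as the (assumed constant) elasticity of Y with respect to X.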
Economists, businesspeople, and governments are often interested in finding out the rate of
growth of certain economic variables, such as population, GNP, money supply, employment,
productivity, and trade deficit.
Let 𝑌𝑡 denote real expenditure on services at time t and 𝑌0 the initial value of the expenditure
on services (i.e., the value at the end of 2017). Let us start from the well-known compound
interest formula
𝑌𝑡 = 𝑌0 (1 + 𝑟)𝑡
where r is the compound (i.e., over time) rate of growth of Y. Taking the natural logarithm of the
formula we can write
lnYt = lnY0 + t ln(1 + r)
Letting
lnY0 = β0
ln(1 + r) = β1
and adding the disturbance term, we obtain
lnYt = β0 + β1t + εt
This model is like any other linear regression model in that the parameters β0 and β1 are
linear. The only difference is that the regressand is the logarithm of Y and the regressor is time, t.
Models like this are called semilog models because only one variable appears in logarithmic
form. Specifically, a model in which the regressand is logarithmic and the regressor is in level
form is called a log-lin model. The slope coefficient measures the constant proportional or
relative change in Y for a given absolute change in the value of the regressor; when multiplied
by 100, it gives the percentage change, or growth rate, in Y for an absolute change in X.
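The growth interpretation can be verified with an exact series. Below, a hypothetical series grows at exactly 5% per period; fitting lnY on t recovers β1 = ln(1.05), so the implied compound growth rate is exp(β1) − 1 = 5%.

```python
# Recovering the compound growth rate r from a log-lin (growth) model.
import math

y = [100 * 1.05 ** t for t in range(10)]   # grows at exactly 5% per period
t = list(range(10))

ly = [math.log(v) for v in y]
tbar, lybar = sum(t) / len(t), sum(ly) / len(ly)
b1 = (sum((a - tbar) * (b - lybar) for a, b in zip(t, ly))
      / sum((a - tbar) ** 2 for a in t))   # slope = ln(1 + r)

growth = math.exp(b1) - 1                  # implied compound growth rate
print(round(100 * growth, 1))              # → 5.0 (percent per period)
```

With real, noisy data the same transformation exp(β̂1) − 1 converts the fitted slope back into a growth rate.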
The Lin-log Model
Unlike the growth model just discussed, in which we were interested in finding the percent
growth in Y for an absolute change in X, suppose we now want to find the absolute change in Y
for a percent change in X. A model that can accomplish this purpose can be written as:
Yi = β0 + β1 lnXi + εi
For descriptive purposes we call such a model a lin-log model because the regressand is in level
form while the regressor is in logarithmic form. The slope coefficient measures the absolute change
in Y for a given constant proportional or relative change in X.
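A quick numerical check of this interpretation (with hypothetical coefficients): in Y = β0 + β1 lnX, a 1 percent increase in X changes Y by approximately β1/100 units.

```python
# Lin-log slope interpretation: absolute change in Y for a 1% change in X.
import math

b0, b1 = 10.0, 50.0             # hypothetical fitted coefficients

def y(x):
    return b0 + b1 * math.log(x)

x0 = 200.0
dy = y(x0 * 1.01) - y(x0)       # effect on Y of a 1% increase in X
print(round(dy, 3))             # approximately b1/100 = 0.5
```

The change is the same whatever the starting value x0, which is exactly the constant-relative-change property of the lin-log form.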
Reciprocal Models
Yi = β0 + β1(1/Xi) + εi
Although this model is nonlinear in the variable X because X enters inversely or reciprocally, the
model is linear in β0 and β1 and is therefore a linear regression model.
As X increases indefinitely, the term β1(1/Xi) approaches zero (note: β1 is a constant) and Y
approaches the limiting or asymptotic value β0.
One of the important applications of reciprocal models is the celebrated Phillips curve of
macroeconomics. Using data on the percent rate of change of money wages (Y) and the
unemployment rate (X) for the United Kingdom for the period 1861–1957, Phillips obtained a
curve whose general shape resembles the figure below.
Figure: the Phillips curve, with the annual rate of change of money wages (percent) on the
vertical axis and the unemployment rate on the horizontal axis.
Log Hyperbola or Logarithmic Reciprocal Model
We conclude our discussion of reciprocal models by considering the logarithmic reciprocal
model, which takes the following form:
lnYi = β0 − β1(1/Xi) + εi
Its shape is as depicted in Figure 6.10. As this figure shows, initially Y increases at an increasing
rate (i.e., the curve is initially convex) and then it increases at a decreasing rate (i.e., the curve
becomes concave). Such a model may therefore be appropriate for modelling a short-run production
function. Recall from microeconomics that if labor and capital are the inputs in a production
function and if we keep the capital input constant but increase the labor input, the short-run
output–labor relationship will resemble Figure 6.10.
In this unit you have seen that specification errors occur when we have in mind a “true” model
but somehow we do not estimate the correct model. In model mis-specification errors, we do not
know the true model. You have learnt that missing data, non-random sample and outliers are
some of the data problems. In the same unit you learnt some commonly used regression models
that may be nonlinear in the variables but are linear in the parameters. The regression models
include log-log model, semi-log models, reciprocal models and logarithmic reciprocal model.
REFERENCES
Gujarati, D.N. and Porter, D.C. (2008). Basic Econometrics (Fifth Edition). New York:
McGraw-Hill.
Bernstein, R. and Bernstein, S. (1999). Theory and Problems of Elements of Statistics II:
Inferential Statistics. New York: McGraw-Hill.
Triola, M.F. (2001). Elementary Statistics (Eighth Edition). Boston: Addison Wesley Longman, Inc.