ECONOMETRICS 1 Notes
LUANAR
Open and Distance Learning (ODL)
LILONGWE UNIVERSITY OF AGRICULTURE
AND NATURAL RESOURCES
Econometrics I
COURSE CODE: AAE 316
By
Elisa Nathan Ebiyamu
TABLE OF CONTENTS
Distribution of the dependent variable .................................................................................. 28
T-DISTRIBUTION ............................................................................................................... 65
F DISTRIBUTION ............................................................................................................... 66
6.2 INTERVAL ESTIMATION .......................................................................................... 75
REFERENCES ............................................................................................................................. 95
1. UNIT ONE: MEANING OF ECONOMETRICS
INTRODUCTION
Economics suggests important relationships, often with policy implications, but virtually never
suggests quantitative magnitudes of causal effects. What is the quantitative effect of increasing
amount of fertilizer applied to maize on yield? Econometrics provides methods for estimating
causal effects, as well as for forecasting, using observational data. You may be wondering what
econometrics means. This unit introduces you to the meaning, types and importance of
econometrics.
OBJECTIVES
By the end of this unit, you should be able to:
a. define econometrics
b. state the importance of econometrics
c. list types of econometrics
d. give examples of applications and use of econometrics in real world
There are several aspects of the quantitative approach to economics, and no single one of these
aspects, taken by itself, should be confounded with econometrics. Thus, econometrics is by no
means the same as economic statistics. Nor is it identical with what we call general economic
theory, although a considerable portion of this theory has a definitely quantitative character. Nor
should econometrics be taken as synonymous with the application of mathematics to economics.
Experience has shown that each of these three viewpoints, that of statistics, economic theory, and
mathematics, is a necessary, but not by itself a sufficient, condition for a real understanding of
the quantitative relations in modern economic life. It is the unification of all three that is
powerful. And it is this unification that constitutes econometrics.
Econometrics may be defined as the social science in which the tools of economic theory,
mathematics, and statistical inference are applied to the analysis of economic phenomena.
Social science is, in its broadest sense, the study of society and the manner in which people
behave and influence the world around us. It tells us about the world beyond our immediate
experience, and can help explain how our own society works, from the causes of unemployment
to what helps economies grow.
Economic phenomena are situations or problems that economists deal with, for example,
explaining changes in commodity prices. Econometrics is concerned with the empirical
determination of economic laws.
Econometrics can also be defined as the application of statistical and mathematical methods to
the analysis of economic data, with the purpose of giving empirical content to economic theories
and verifying or refuting them.
The art of the econometrician consists in finding the set of assumptions that are both sufficiently
specific and sufficiently realistic to allow him to take the best possible advantage of the data
available to him. Econometrics is seen as the vehicle by which economics can claim scientific
validity.
The three aims of econometrics are the formulation and specification of econometric models, the
estimation and testing of models, and the use of models.
The economic models are formulated in an empirically testable form. Several econometric
models can be derived from an economic model. Such models differ due to different choices of
functional form, specification of the stochastic structure of the variables, and so on.
The models are estimated on the basis of observed set of data and are tested for their suitability.
This is the part of statistical inference of the modeling. Various estimation procedures are used to
know the numerical values of the unknown parameters of the model. Based on various
formulations of statistical models, a suitable and appropriate model is selected.
Use of models:
The obtained models are used for forecasting and policy formulation, which is an essential part
of any policy decision. Such forecasts help the policy makers to judge the goodness of the fitted
model and take necessary measures in order to re-adjust the relevant economic variables.
CHECK POINT
1. Define econometrics.
________________________________________________________________________
________________________________________________________________________
2. List three aims of econometrics.
a. __________________________________________________________________
b. __________________________________________________________________
c. __________________________________________________________________
3. Discuss whether econometrics is superior over economic theory, mathematics and
statistics.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Many economics students think that econometrics challenges their brains for nothing. However,
econometrics is a vital course in economics. Some of the reasons why econometrics is important
are outlined below.
Economics as a science involves knowing the theory and establishing a set of hypotheses, which
is understood by studying economics. Once the theory is known, the theory or hypotheses are
tested using various techniques. Testing of theories or hypotheses is achieved through studying
econometrics.
Econometrics contains statistical tools to help you defend or test assertions in economic theory.
For example, you may think that production in the economy follows a Cobb-Douglas form. But do data
support your hypothesis? Econometrics can help you in this case. The econometrician uses the
mathematical equations proposed by the mathematical economist but puts these equations in
such a form that they lend themselves to empirical testing.
Econometrics can be classified into two groups, theoretical econometrics and applied
econometrics, as shown in Figure 1.6.1 below.
[Figure 1.6.1: Classification of econometrics into theoretical and applied branches]
Theoretical Econometrics
Applied econometrics
In applied econometrics we use the tools of theoretical econometrics to study some special
field(s) of economics and business, such as the production function, investment function,
demand and supply functions, and portfolio theory.
Estimating the impact of immigration on native workers: Immigration increases the supply of
workers, so standard economic theory predicts that equilibrium wages will decrease for all
workers. However, since immigration can also have positive demand effects, econometric
estimates are necessary to determine the net impact of immigration in the labor market.
Identifying the factors that affect a firm’s entry into and exit from a market: The microeconomic field
of industrial organization, among many issues of interest, is concerned with firm concentration
and market power. Theory suggests that many factors, including existing profit levels, fixed costs
associated with entry/exit, and government regulations can influence market structure.
Econometric estimation helps determine which factors are the most important for firm entry and
exit.
Determining the influence of minimum-wage laws on employment levels: The minimum wage is
an example of a price floor, so higher minimum wages are supposed to create a surplus of labor
(higher levels of unemployment). However, the impact of price floors like the minimum wage
depends on the shapes of the demand and supply curves. Therefore, labor economists use
econometric techniques to estimate the actual effect of such policies.
Finding the relationship between management techniques and worker productivity: The use of
high-performance work practices (such as worker autonomy, flexible work schedules, and other
policies designed to keep workers happy) has become more popular among managers. At some
point, however, the cost of implementing these policies can exceed the productivity benefits.
Econometric models can be used to determine which policies lead to the highest returns and
improve managerial efficiency.
Measuring the association between insurance coverage and individual health outcomes: One of
the arguments for increasing the availability (and affordability) of medical insurance coverage is
that it should improve health outcomes and reduce overall medical expenditures. Health
economists may use econometric models with aggregate data (from countries) on medical
coverage rates and health outcomes or use individual-level data with qualitative measures of
insurance coverage and health status.
Deriving the effect of dividend announcements on stock market prices and investor behavior:
Dividends represent the distribution of company profits to its shareholders. Sometimes the
announcement of a dividend payment can be viewed as good news when shareholders seek
investment income, but sometimes it can be viewed as bad news when shareholders prefer
reinvestment of firm profits through retained earnings. The net effect of dividend announcements
can be estimated using econometric models and data of investor behavior.
Predicting revenue increases in response to a marketing campaign: The field of marketing has
become increasingly dependent on empirical methods. A marketing or sales manager may want
to determine the relationship between marketing efforts and sales. How much additional revenue
is generated from an additional dollar spent on advertising? Which type of advertising (radio,
TV, newspaper, and so on) yields the largest impact on sales? These types of questions can be
addressed with econometric techniques.
Calculating the impact of a firm’s tax credits on R&D expenditure: Tax credits for research and
development (R&D) are designed to provide an incentive for firms to engage in activities related
to product innovation and quality improvement. Econometric estimates can be used to determine
how changes in the tax credits influence R&D expenditure and how distributional effects may
produce tax-credit effects that vary by firm size.
This unit has defined Econometrics as the social science in which the tools of economic theory,
mathematics, and statistical inference are applied to the analysis of economic phenomena. You
have also learnt the aims and applications of econometrics.
e. __________________________________________________________________
__________________________________________________________________
3. Describe any two applications of econometrics in the real world.
a. __________________________________________________________________
__________________________________________________________________
b. __________________________________________________________________
__________________________________________________________________
2 UNIT TWO: VARIABLES
INTRODUCTION
In unit 1, you defined econometrics as the application of statistical and mathematical methods to
the analysis of economic data, with the purpose of giving empirical content to economic theories
and verifying or refuting them. Data are collected on certain characteristics, attributes or
variables from sampled respondents. This unit discusses the meaning, types and measurement levels of
from sampled respondents. This unit discusses the meaning, types and measurement levels of
variables.
OBJECTIVES
In microeconomics, you learnt about the theory of demand. This theory states that there is an
inverse relationship between price of a good or service and quantity demanded for the same
product. You also learnt that the demand function can be shifted due to changes in income, price
of substitutes or complements among other demand shifters. Just like quantity demanded, price
and income are random variables. A variable is an entity that can take different values.
The term random is a synonym for the term stochastic. The random or stochastic variable is a
variable that can take on any set of values, positive or negative, with a given probability. We
can also say that a random variable is a variable whose value is not known until it is observed.
The value of a random variable may result from an experiment and it is not perfectly predictable.
A variable X is said to be a random variable if for every real number a there exists a
probability P(X ≤ a) that X takes on a value less than or equal to a. We shall denote random
variables by capital letters X, Y, Z, and so on. We shall use lowercase letters, x, y, z, and so on,
to denote particular values of the random variables. A fixed variable, by contrast, is predictable.
If the random variable can assume only a particular finite set of values, it is said to be a discrete
random variable. The value of a discrete random variable is observed by counting.
A random variable is said to be continuous if it can assume any value in a certain range. The
value of a continuous random variable is usually obtained by measurement.
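The distinction between counting and measuring can be sketched in a few lines of Python; the variables and their ranges here are hypothetical, chosen only for illustration:

```python
import random

random.seed(42)  # make the draws reproducible

# Discrete random variable: number of children in a sampled household.
# Its value is observed by COUNTING, so it can only be a whole number.
children = random.randint(0, 6)

# Continuous random variable: maize yield (tonnes per hectare) of a plot.
# Its value is obtained by MEASUREMENT and can fall anywhere in a range.
yield_t_per_ha = random.uniform(0.5, 4.0)

print(children, round(yield_t_per_ha, 2))
```

Run this repeatedly without the seed: `children` only ever takes whole-number values, while `yield_t_per_ha` can land anywhere in its range.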
Examples of continuous random variables are income, weight and crop yield.
Not all random variables describe a characteristic using numerical values. Suppose you want to
report the sex of a respondent; the response is going to be either male or female. Neither of these
responses is a numerical value. Such variables are called qualitative variables. Examples include:
a. gender
b. marital status
c. location
d. participation in a project
Dummy Variable
In economics, qualitative variables are coded using discrete variables. When a discrete variable
is used to recode a qualitative characteristic, it is called a dummy variable. A dummy variable is
also called a design variable or indicator variable.
Example
Let us create a dummy variable for sex of household head that takes a value of 1 if the household
head is male and 0 if the household head is female.
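This recoding can be sketched in plain Python; the household data below are hypothetical:

```python
# Recode the qualitative variable "sex of household head" into a dummy
# variable: 1 if the head is male, 0 if the head is female.
sex_of_head = ["male", "female", "female", "male", "male"]

male_dummy = [1 if s == "male" else 0 for s in sex_of_head]

print(male_dummy)  # [1, 0, 0, 1, 1]
```

The dummy variable carries the same information as the qualitative variable, but in a numerical form that a regression model can use.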
Independent and dependent variables
The nature of the dependent variable is one of the factors that determine the choice of an
econometric model to be used in data analysis. For example, you may use a regression model if
the dependent variable is a continuous variable and you can use a probit model if the dependent
variable is dichotomous.
The variables that we will generally encounter fall into four broad levels or scales of
measurement. These are ratio, interval, ordinal and nominal.
Ratio Scale
For a variable X, taking two values 𝑥1 and 𝑥2, the ratio 𝑥1/𝑥2 and the distance (𝑥1 − 𝑥2)
are meaningful quantities. Also, there is a natural ordering (ascending or descending) of the
values along the scale. Therefore, comparisons such as 𝑥2 ≤ 𝑥1 or 𝑥2 ≥ 𝑥1 are meaningful. Most
economic variables belong to this category. Thus, it is meaningful to ask how big this year’s
GDP is compared with the previous year’s GDP. Personal income, measured in dollars, is a ratio
variable; someone earning $100,000 is making twice as much as another person earning $50,000.
Interval Scale
An interval scale variable satisfies the last two properties of the ratio scale variable but not the
first. Thus, the distance between two time periods, say (2000–1995) is meaningful, but not the
ratio of two time periods (2000/1995). Without a true zero, it is impossible to compute
ratios. With interval data, we can add and subtract, but cannot multiply or divide.
Examples are temperature and time. You may have heard from a weather report that at 11:00
a.m. on 25th December, 2017, Lilongwe reported a temperature of 20 degrees Celsius while
Ngabu reached 40 degrees Celsius. Temperature is not measured on a ratio scale since it does not
make sense to claim that Ngabu was 100 percent warmer than Lilongwe. This is mainly due to
the fact that the Celsius scale does not use 0 degrees as a natural base. Thus 0 degrees Celsius is
arbitrary and does not mean absence of heat energy or internal energy.
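A quick arithmetic sketch makes the point. Converting the temperatures above to the Kelvin scale, which does have a true zero, shows that the real ratio is nowhere near 2:

```python
# Why ratios mislead on an interval scale: 40 degrees C is not "twice as
# hot" as 20 degrees C, because 0 degrees C is an arbitrary zero point.
ngabu_c, lilongwe_c = 40.0, 20.0

celsius_ratio = ngabu_c / lilongwe_c                       # misleading: 2.0
kelvin_ratio = (ngabu_c + 273.15) / (lilongwe_c + 273.15)  # true zero: ~1.07

print(round(celsius_ratio, 2), round(kelvin_ratio, 2))  # 2.0 1.07
```

The distance (40 − 20 = 20 degrees) is meaningful on either scale; only the ratio changes, which is exactly what distinguishes interval from ratio measurement.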
Ordinal Scale
A variable belongs to this category only if it satisfies the third property of the ratio scale (i.e.,
natural ordering). For these variables the ordering exists but the distances between the categories
cannot be quantified. Students of economics will recall the indifference curves between two
goods. Each higher indifference curve indicates a higher level of utility, but one cannot quantify
by how much one indifference curve is higher than the others.
Examples are grading systems (A, B, C grades) and income classes (upper, middle, lower).
Nominal Scale
Variables in this category have none of the features of the ratio scale variables. Variables such as
gender (male, female) and marital status (married, unmarried, divorced, separated) simply denote
categories.
As we shall see, econometric techniques that may be suitable for ratio scale variables may not be
suitable for nominal scale variables. It is therefore important to know how the variables were
measured.
Data are observations that have been collected on variables. Data are sometimes used to calculate
statistics. The success of any econometric analysis ultimately depends on the availability of the
appropriate data. Data collection is a very important stage in research. If you are carrying out
research, make sure all necessary steps are followed to ensure that the data collected are of good
quality. The results of research are only as good as the quality of the data.
Types of Data
Four types of data may be available for empirical analysis: time series, cross-section, pooled data
and panel data.
A time series is a set of observations on the values that a variable takes at different times. Such
data may be collected at regular time intervals.
Examples include daily exchange rates, monthly inflation figures and annual GDP estimates.
Cross-Section Data
Cross-section data are data collected on one or more variables collected at the same point in
time.
Examples
a. Population Census data collected by the National Statistical Office (NSO) every 10 years
b. Integrated Household Survey data collected by NSO every 5 years
c. Base line survey data for a project
Pooled data
These are data with combined elements of both time series and cross-section data. In pooled data
we have a “time series of cross sections,” but the observations in each cross section do not
necessarily refer to the same unit.
Panel data
The U.S. Department of Commerce carries out a census of housing at periodic intervals. At each
periodic survey the same household (or the people living at the same address) is interviewed to
find out if there has been any change in the housing and financial conditions of that household
since the last survey. By interviewing the same households periodically, the data become panel
data, which provide very useful information on the dynamics of household behavior. Panel data
are data from samples of the same cross-sectional units observed at multiple points in time.
Each type of data is analysed using specific econometric models. There are some models that
can be used to analyse cross-section data but cannot be used to analyse time series data. For
example, you can use autoregressive (AR) models to analyse time series data, but you cannot
use these models to analyse cross-section data. Therefore, before choosing a model for analysing
data, a researcher needs to understand the type of data that is available for analysis.
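The three shapes can be sketched with hypothetical records of the form (unit, period, value):

```python
# Time series: ONE unit observed over MANY periods.
time_series = [("Malawi", 2015, 4.1), ("Malawi", 2016, 4.5), ("Malawi", 2017, 4.0)]

# Cross-section: MANY units observed in ONE period.
cross_section = [("HH_1", 2017, 120.0), ("HH_2", 2017, 95.5), ("HH_3", 2017, 140.2)]

# Panel: the SAME units observed in SEVERAL periods.
panel = [("HH_1", 2016, 110.0), ("HH_1", 2017, 120.0),
         ("HH_2", 2016, 90.0), ("HH_2", 2017, 95.5)]

units_in_panel = {unit for unit, _, _ in panel}
print(sorted(units_in_panel))  # ['HH_1', 'HH_2']
```

In pooled data, by contrast, the cross-sectional units in each period need not be the same households.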
We started this unit by defining a random variable as a variable that can take on any set of
values, positive or negative, with a given probability. We also classified variables into
quantitative variables and qualitative variables. It has been observed that quantitative variables
can be continuous or discrete. A dummy variable was defined as a numerical variable used to
represent a qualitative variable (subgroups). Variables can be measured on different levels
including ratio scales, interval scale, ordinal scale and nominal scale. Data are observations that
have been collected on variables. Data can be categorized as cross-section data, time series data,
pooled data or panel data.
2.6 UNIT EXERCISE
3. UNIT THREE: THE LINEAR REGRESSION MODEL
INTRODUCTION
In unit 2, you learnt about variables and data. It was noted that econometric analysis of data
partly depends on the type of data and nature of dependent variable. Economics students often
ask themselves questions like: what model can I use to analyse my data? How can I interpret the
results? In this unit we introduce to you one of the most widely used econometric models, the
Linear Regression Model (LRM). You will find this unit very important in your career and research.
OBJECTIVES
Figure 3.1 summarises the procedure in econometric analysis. Broadly speaking, traditional
econometric methodology proceeds along the following lines.
Figure 3.1: Econometric procedure
Statement of Theory or Hypothesis
This is where the researcher states what economic theory says about the inter-dependency of
economic variables of interest. Economists include variables in econometric models based on
theory. For example, the theory of demand states that quantity demanded for a product for a
given time period is inversely related to its own price. Economic theory also indicates the
direction of relationship between variables. However, the theory does not quantify the
dependency.
𝑌 = 𝛽0 + 𝛽1 𝑋 ……………………………………………………………...……… 3.1
In equation 3.1, Y is the dependent variable (consumption expenditure), X is the independent
variable (income), and 𝛽0 and 𝛽1 are the parameters of the model.
The mathematical model is deterministic or exact. It does not take into account the possibility of
error.
The purely mathematical model of the consumption function given in Eq. (3.1) is of limited
interest to the econometrician, for it assumes that there is an exact or deterministic relationship
between the two variables. But relationships between economic variables are generally inexact.
Thus, if we were to obtain data on consumption expenditure and disposable (i.e., after tax)
income of a sample of, say, 500 Malawian families and plot these data on a graph paper with
consumption expenditure on the vertical axis and disposable income on the horizontal axis, we
would not expect all 500 observations to lie exactly on the straight line of Eq. (3.1) because, in
addition to income, other variables affect consumption expenditure.
To allow for the inexact relationships between economic variables, the econometrician would
modify the deterministic consumption function in Eq. (3.1) as follows:
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀 ……………………………….…………………………………… 3.2
Where ε, known as the disturbance, or error, term, is a random (stochastic) variable that has well-
defined probabilistic properties. The disturbance term ε may well represent all those factors that
affect consumption but are not taken into account explicitly. The error term makes the
econometric model stochastic.
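A small simulation sketch (with hypothetical parameter values) shows what the disturbance term does to the 500-family example above: the simulated observations scatter around the deterministic line instead of lying exactly on it.

```python
import random

# Simulate Eq. (3.2): Y = b0 + b1*X + e. The parameter values, income
# range and error spread are hypothetical, chosen only for illustration.
random.seed(0)
b0_true, b1_true = -299.59, 0.72

incomes = [random.uniform(5000, 15000) for _ in range(500)]
consumption = [b0_true + b1_true * x + random.gauss(0, 200) for x in incomes]

# Unlike the deterministic model of Eq. (3.1), the observations do not
# lie exactly on the straight line: the deviations are the e's.
deviations = [y - (b0_true + b1_true * x) for x, y in zip(incomes, consumption)]
print(len(consumption), max(abs(d) for d in deviations) > 0)
```

Plotting `consumption` against `incomes` would reproduce the scatter described in the text: a cloud of points around, not on, the consumption line.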
Data collection
To estimate the econometric model given in Eq. (3.2), that is, to obtain the numerical values of
𝛽0 and 𝛽1, we need data. Data are collected on both the dependent and independent variables. As
noted in section 2.4, the quality of the estimates is only as good as the quality of the data. There
is need for selecting an adequate sample using an appropriate sampling technique. Data collection
officers (enumerators) should be trustworthy, well trained and preferably experienced.
Estimation of parameters
Now that we have the data, our next task is to estimate the parameters of the model. The
numerical estimates of the parameters give empirical content to the consumption function. The
actual mechanics of estimating the parameters will be discussed later in this module. For now,
note that the statistical technique of regression analysis is the main tool used to obtain the
estimates.
You may obtain estimates of 𝛽0 and 𝛽1 of −299.5913 and 0.7218 respectively. Thus, the
estimated model could be:
𝑌̂ = −299.5913 + 0.7218𝑋 ………………………………………………………...…… 3.3
6. Hypothesis Testing
Assuming that the fitted model is a reasonably good approximation of reality, we have to
develop suitable criteria of finding out whether the estimates obtained in, say, Equation 3.3 are in
accord with the expectations of the theory that is being tested. A theory or hypothesis that is not
verifiable by appeal to empirical evidence may not be acceptable as a part of scientific enquiry.
7. Forecasting or Prediction
If the chosen model does not refute the hypothesis or theory under consideration, we may use it
to predict the future value(s) of the dependent, or forecast, variable Y on the basis of the known
or expected future value(s) of the explanatory, or predictor, variable X.
Suppose we have the estimated consumption function given in Eq. (3.3). Suppose further the
government believes that consumer expenditure of about 8750 (billions of 2000 dollars) will
keep the unemployment rate at its current level of about 4.2 percent (early 2006). What level of
income will guarantee the target amount of consumption expenditure? If the regression results
given in Eq. (3.3) seem reasonable, simple arithmetic will show that
8750 = −299.5913 + 0.7218𝑋
which gives X = 12537, approximately. That is, an income level of about 12537 (billion) dollars,
given an MPC of about 0.72, will produce an expenditure of about 8750 billion dollars.
As these calculations suggest, an estimated model may be used for control, or policy, purposes.
By appropriate fiscal and monetary policy mix, the government can manipulate the control
variable X to produce the desired level of the target variable Y.
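The policy arithmetic above can be written out explicitly; the coefficient values are those reported for Eq. (3.3):

```python
# Find the income X that delivers a target consumption Y of 8750
# (billions of 2000 dollars), given Y-hat = -299.5913 + 0.7218*X.
b0_hat, b1_hat = -299.5913, 0.7218
target_consumption = 8750.0

# Solve 8750 = b0_hat + b1_hat * X for X.
required_income = (target_consumption - b0_hat) / b1_hat

print(round(required_income, 1))  # 12537.5, i.e. about 12537 billion dollars
```

The same one-line rearrangement works for any target value of Y, which is what makes an estimated model useful for control or policy purposes.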
CHECK POINT
1. In the econometric procedure, state four things that you would do before testing the
hypothesis.
a. __________________________________________________________________
b. __________________________________________________________________
c. __________________________________________________________________
d. __________________________________________________________________
2. What is the difference between a mathematical model and an econometric model?
________________________________________________________________________
________________________________________________________________________
3. Given that 𝑌̂𝑖 = −299.5913 + 0.7218Xi, what does the hat above the variable Y mean?
________________________________________________________________________
4. Suppose you are carrying out research. How would you ensure that good quality data are
collected?
a. __________________________________________________________________
b. __________________________________________________________________
c. __________________________________________________________________
d. __________________________________________________________________
A cock crows early in the morning and this is followed by the rising of the sun. Does it mean that
the crowing of a cock causes the sun to rise? Obviously not. Many times economists want to know
the effect of one variable on another variable. For example, what is the effect of one additional
year of education on earnings? How would tobacco output be affected by using an additional unit
of fertilizer? In econometrics, this is no longer a big challenge. We use regression
analysis to come up with the marginal effects of independent variables on the dependent variable.
Regression is a technique for determining the statistical relationship between two or more
variables where a change in a dependent variable is associated with, and depends on, a change in
one or more independent variables.
A regression model can also be defined as a mathematical equation that helps to predict or
forecast the value of the dependent variable based on the known values of independent variables.
A simple linear regression is a statistical equation that characterizes the relationship between a
dependent variable and only one independent variable.
For example, tobacco yield is a function of the quantity of fertilizer applied. Such a study is
known as simple, or two-variable, regression analysis.
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀 ………………………………………………………………… 3.5
However, if we are studying the dependence of one variable on more than one explanatory
variable, as in the crop-yield, rainfall, temperature, sunshine, and fertilizer example, it is known
as multiple regression analysis. Equation 3.6 is a multiple linear regression.
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜀 …………………………………… 3.6
A multivariate regression has more than one dependent variable and more than one
independent variable.
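As a preview of the estimation mechanics discussed later in this module, the textbook ordinary least squares (OLS) formulas for a simple linear regression can be sketched in plain Python; the fertilizer-yield data are hypothetical:

```python
# OLS estimates for Y = b0 + b1*X + e, using the textbook formulas:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
def ols_simple(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

fertilizer = [10, 20, 30, 40, 50]   # kg applied (hypothetical)
yield_kg = [55, 63, 72, 78, 90]     # tobacco yield (hypothetical)

b0, b1 = ols_simple(fertilizer, yield_kg)
print(round(b0, 2), round(b1, 2))  # 46.1 0.85
```

Here the slope estimate of 0.85 would be read as the marginal effect of one extra kilogram of fertilizer on yield.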
3.3 BASICS OF A SIMPLE LINEAR REGRESSION
As used in statistics, the term population refers to a complete list of possible measurements or
outcomes that are of interest.
For instance;
a. To study the effect of study hour on GPA among undergraduate students at LUANAR,
the population is a list of all undergraduate students at LUANAR.
Let us say we want to determine the effect of study hours on GPA. Then the population linear
regression can be written as in equation 3.7.
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀 ………………………………………………………………… 3.7
Where 𝑌 is GPA, 𝑋 is study duration in hours, 𝛽0 and 𝛽1 are population parameters, while ε is
the stochastic term.
If possible, the population parameters could be determined by collecting data on study duration
and GPA from all the students at LUANAR. However, it is very difficult and costly to study the
whole population. Therefore, we estimate the population parameters using data collected from a
sample of students selected from the population.
Sample
The term “sample” refers to a proportion of the population that is representative of the population
from which it was selected. You may refer to the Statistics for Economists and Research
Methods modules for details on sampling techniques that ensure that the sample is both adequate
and representative.
A sample regression function can be written as in equation 3.8.
𝑌̂ = 𝛽̂0 + 𝛽̂1 𝑋 ………………………………………………………………… 3.8
Where 𝑌̂ is the estimated (mean) value of Y, and 𝛽̂0 and 𝛽̂1 are the sample estimators of the
population parameters 𝛽0 and 𝛽1.
We can make the sample regression function stochastic by including the error term as in equation
3.9.
𝑌 = 𝛽̂0 + 𝛽̂1 𝑋 + 𝑒 ………………………………………………………………… 3.9
Where 𝑒 is the residual, the sample counterpart of the population disturbance ε. Thus the
population regression function 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖 is estimated by the corresponding sample
regression function.
Importance of the error term
1. Vagueness of theory
The theory, if any, determining the behavior of Y may be, and often is, incomplete. We might
know for certain that weekly income X influences weekly consumption expenditure Y, but we
might be ignorant or unsure about the other variables affecting Y. Therefore, 𝜀𝑖 may be used as a
substitute for all the excluded or omitted variables from the model.
2. Unavailability of data
Even if we know what some of the excluded variables are and therefore consider a multiple
regression rather than a simple regression, we may not have quantitative information about these
variables. It is a common experience in empirical analysis that the data we would ideally like to
have often are not available.
3. Core variables versus peripheral variables
Assume that we want to study the consumption-income relationship and economic theory guides us
that explanatory variables include income, number of children per household, religion, education
and geographical location. It is quite possible that the joint influence of all or some of these
variables may be so small and at best nonsystematic or random to the extent that it does not pay
to introduce them into the model explicitly. Their combined effect can be treated as a random
variable 𝜀𝑖 .
4. Intrinsic randomness in human behavior
Even if we succeed in introducing all the relevant variables into the model, there is bound to be
some “intrinsic” randomness in individual Y’s that cannot be explained no matter how hard we
try. The disturbances, the ε’s, may very well reflect this intrinsic randomness.
5. Poor proxy variables
The classical regression model assumes that the variables Y and X are measured accurately. In
practice, however, the data may be plagued by errors of measurement, and data on some variables
are not directly observable, so we use proxy variables. For example, expenditure may be used as
a proxy for income. Obviously expenditure may not always be equal to income, as some people
may be saving or donating to others. Therefore, the use of expenditure as a proxy for income
comes with the problem of errors of measurement. The disturbance term 𝜀𝑖 may in this case then
also represent the errors of measurement.
6. Wrong functional form
Even if we have theoretically correct variables explaining a phenomenon and even if we can
obtain data on these variables, very often we do not know the form of the functional relationship
between the regressand and the regressors. Is consumption expenditure a linear (in variables)
function of income or a nonlinear (in variables) function?
In two-variable models the functional form of the relationship can often be judged from the
scatter gram. But in a multiple regression model, it is not easy to determine the appropriate
functional form, for graphically we cannot visualize scatter grams in multiple dimensions.
Linearity in variables
A model may be linear in variables or linear in parameters. The first and perhaps more "natural"
meaning of linearity is that the conditional expectation of Y is a linear function of Xi as in
equation 3.10. Geometrically, the regression curve in this case is a straight line.
In this interpretation, a regression function such as 3.11 is not a linear function because the
variable X appears with a power or index of 2.
Linearity in parameters
The second interpretation of linearity is that the conditional expectation of Y, E(Y | Xi), is a
linear function of the parameters, the β’s; it may or may not be linear in the variable X. In this
interpretation equation 3.11 is a linear (in the parameter) regression model.
Of the two interpretations of linearity, linearity in the parameters is relevant for the development
of the regression theory to be presented shortly. Therefore, from now on, the term “linear”
regression will always mean a regression that is linear in the parameters the β’s (that is, the
parameters) are raised to the first power only. It may or may not be linear in the explanatory
variables, the X’s.
When 10 farmers apply 10 kilograms of fertilizer per 70 square metre plot, do you expect them to
have the same yield? Obviously, the output will be different. This means that for each value of
the independent variable, the observed values of the dependent variable are many and different.
They form a distribution that has a mean and variance. Figure 3.2 is an illustration of this
observation.
The regression line is a schedule that connects all the mean values of the dependent variable for
each level of the independent variable. Not all observed values will be along the regression line.
Some of the observed values will be above the line while others will be below it. The difference
between the observed value of the dependent variable and the fitted value for the same level of
independent variable is the error.
This unit has introduced you to the econometric procedure. You have learnt the eight steps of
econometric analysis. We have differentiated between a mathematical model and an econometric
model. In econometrics the error term is very important.
The unit also defined a regression. A linear regression is linear in parameters; it may be linear
or non-linear in variables. Another point is that the dependent variable has a distribution for
each value of the independent variable and that the mean value of the distribution lies on the
regression line. You may have noticed that we usually do not know the population parameters but
we use a sample to estimate them.
UNIT EXERCISE
4 Unit 4: ESTIMATION OF A SIMPLE LINEAR
REGRESSION
INTRODUCTION
In unit 3, you were introduced to linear regression. We said that it is one of the most commonly
used models in estimating population parameters using data collected from a sample. In this
unit, you will learn how to estimate a linear regression model. You will also learn the assumptions
of the classical linear regression. Furthermore, the unit provides a guide to estimation of a simple
linear regression as well as a multiple linear regression in STATA. We shall end the unit by
discussing more about dummy variables and carrying out a joint test.
OBJECTIVES
We can estimate the population parameters using sample statistics in various ways, and these
are;
By and large, it is the method of OLS that is used extensively in regression analysis primarily
because it is intuitively appealing and mathematically much simpler than the method of
maximum likelihood. Besides, as we will show later, in the linear regression context the
Ordinary Least Squares (OLS) and Maximum likelihood methods generally give similar results.
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖 ………………………………………………….………………….4.1
The estimated simple regression based on data collected from a sample is as follows;
Ŷᵢ = β̂0 + β̂1Xᵢ ………………………………………………………………………4.2
where β̂0 and β̂1 are the sample estimates of β0 and β1, so that
Yᵢ = β̂0 + β̂1Xᵢ + εᵢ ……………………………………………………………………4.3
Then,
Yᵢ = Ŷᵢ + εᵢ ………………………………………………………………………………4.4
This shows that the error term (also called the residual or disturbance) is just the difference
between the observed value and the fitted value.
εᵢ = Yᵢ − Ŷᵢ ………………………………………………………………………….4.5
Figure 4.1: Error as the vertical distance.
To fit the simple regression line to the data, the sum of squares of the error (vertical distances),
as illustrated in Figure 4.1, must be as small as possible.
You may recall from your introductory statistics class that the procedure is as follows;
Example
Compute the following
Working
The mean is 𝑌̅ = 1220/10 = 122.

Y      𝑌̅      (Y − 𝑌̅)    (Y − 𝑌̅)²
70     122    −52         2704
65     122    −57         3249
90     122    −32         1024
95     122    −27         729
110    122    −12         144
115    122    −7          49
120    122    −2          4
140    122    18          324
155    122    33          1089
260    122    138         19044

∑(Y − 𝑌̅)² = 28360

c. 𝜎 = √(∑(Y − 𝑌̅)²/N)
𝜎 = √(28360/10) = √2836
𝜎 = 53.25
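A quick check of the arithmetic above, using the data from the working and the population formula (dividing by N):

```python
# Verify the mean, population variance and standard deviation of Y.
y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 260]

n = len(y)
mean = sum(y) / n                        # 1220 / 10 = 122
sq_dev = [(yi - mean) ** 2 for yi in y]  # squared deviations from the mean
variance = sum(sq_dev) / n               # population variance, dividing by N
sigma = variance ** 0.5

print(mean)             # 122.0
print(variance)         # 2836.0
print(round(sigma, 2))  # 53.25
```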
The OLS selects estimates that minimize the error sum of squares over all sample data points.
Methods of calculus are used to derive the formulas for the parameters of a simple linear
regression model. Below is the step by step derivation of the estimators.

εᵢ = Yᵢ − Ŷᵢ
εᵢ = Yᵢ − β̂0 − β̂1Xᵢ
∑εᵢ² = ∑(Yᵢ − Ŷᵢ)²
∑εᵢ² = ∑(Yᵢ − β̂0 − β̂1Xᵢ)² ……………………………………………………………..4.7

To minimize the residual sum of squares, we set the first derivative of the sum of squares
with respect to each parameter equal to zero.

∂∑εᵢ²/∂β̂0 = 0
∂∑(Yᵢ − β̂0 − β̂1Xᵢ)²/∂β̂0 = 2∑(Yᵢ − β̂0 − β̂1Xᵢ)(−1) = 0
0 = ∑(−Yᵢ + β̂0 + β̂1Xᵢ)
0 = −∑Yᵢ + nβ̂0 + β̂1∑Xᵢ
∑Yᵢ = nβ̂0 + β̂1∑Xᵢ …………………………………….4.8

∂∑εᵢ²/∂β̂1 = 0
0 = 2∑(Yᵢ − β̂0 − β̂1Xᵢ)(−Xᵢ)
0 = −∑YᵢXᵢ + β̂0∑Xᵢ + β̂1∑Xᵢ²
∑YᵢXᵢ = β̂0∑Xᵢ + β̂1∑Xᵢ² ……………………………………..….4.9
Equations 4.8 and 4.9 are called the normal equations. Let us now solve the two equations
simultaneously to find β̂0 and β̂1. Multiply equation 4.8 by ∑Xᵢ, multiply equation 4.9 by n,
and subtract the first from the second:

n∑YᵢXᵢ − ∑Xᵢ∑Yᵢ = β̂1(n∑Xᵢ² − (∑Xᵢ)²)

β̂1 = (n∑YᵢXᵢ − ∑Xᵢ∑Yᵢ) / (n∑Xᵢ² − (∑Xᵢ)²) ……………………………….4.12

If we divide the numerator and the denominator of the right hand side of equation 4.12 by n²,

β̂1 = (∑YᵢXᵢ/n − X̄Ȳ) / (∑Xᵢ²/n − X̄²) ……………………..............4.13

From equation 4.8,
∑Yᵢ = nβ̂0 + β̂1∑Xᵢ
Therefore;
nβ̂0 = ∑Yᵢ − β̂1∑Xᵢ
β̂0 = ∑Yᵢ/n − β̂1(∑Xᵢ/n)
β̂0 = Ȳ − β̂1X̄ ………………………………4.14
Example
The data in Table 4.1 show values of the dependent variable Y for each value of
the independent variable X.
𝑿𝒊 𝒀𝒊
1 1
2 1
3 2
4 2
5 4
a. Calculate
i. ∑ 𝑋𝑖 iii. ∑ 𝑋𝑖 𝑌𝑖
ii. ∑ 𝑌𝑖 iv. ∑ 𝑋𝑖2
b. Calculate
i. The slope parameter 𝛽1
Working
xᵢ    yᵢ    xᵢ²    yᵢ²    xᵢyᵢ
1     1     1      1      1
2     1     4      1      2
3     2     9      4      6
4     2     16     4      8
5     4     25     16     20

∑xᵢ = 15   ∑yᵢ = 10   ∑xᵢ² = 55   ∑yᵢ² = 26   ∑xᵢyᵢ = 37
a. Sums
i. ∑ 𝑥𝑖 = 15
ii. ∑ 𝑦𝑖 = 10
iii. ∑ 𝑥𝑖2 = 55
iv. ∑ 𝑥𝑖 𝑦𝑖 = 37
b. Parameters
i. β̂1 = (n∑xᵢyᵢ − ∑xᵢ∑yᵢ) / (n∑xᵢ² − (∑xᵢ)²)
β̂1 = ((5)(37) − (15)(10)) / ((5)(55) − 15²)
β̂1 = 0.7

ȳ = ∑yᵢ/n = 10/5 = 2
x̄ = ∑xᵢ/n = 15/5 = 3

β̂0 = ȳ − β̂1x̄
β̂0 = 2 − (0.7)(3)
β̂0 = −0.1

c. The estimated simple linear regression is Ŷᵢ = −0.1 + 0.7xᵢ
d. The predicted value of Y when x = 6 is
Ŷᵢ = −0.1 + (0.7)(6)
Ŷᵢ = −0.1 + 4.2
Ŷᵢ = 4.1
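The formulas in equations 4.12 and 4.14 can be checked directly in code. This sketch reproduces the worked example's sums and estimates:

```python
# OLS estimates for the worked example, using the normal-equation formulas:
# beta1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2), beta0 = ybar - beta1*xbar.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

Sx = sum(x)                                  # 15
Sy = sum(y)                                  # 10
Sxy = sum(xi * yi for xi, yi in zip(x, y))   # 37
Sxx = sum(xi ** 2 for xi in x)               # 55

beta1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)  # slope
beta0 = Sy / n - beta1 * Sx / n                    # intercept

print(beta1)                         # 0.7
print(round(beta0, 1))               # -0.1
print(round(beta0 + beta1 * 6, 1))   # predicted Y at x = 6 -> 4.1
```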
Interpretation of parameters of a linear regression
Intercept parameter
This is the average value of the dependent variable when the value of the explanatory variable is
zero holding all other factors constant.
For example
The estimated simple linear regression in the example above is 𝑌̂i = −0.1 + 0.7𝑋𝑖 . The
intercept parameter −0.1 means that the average value of y is −𝟎. 𝟏 when the value of x = 0 .
Slope parameter
The slope parameter is interpreted as the marginal effect of the independent variable on the
dependent variable.
Example
The estimated simple linear regression in the example above is 𝑌̂i = −0.1 + 0.7𝑋𝑖 . The slope
parameter, = 0.7 means that a unit increase in the value of x will increase the value of y by 0.7
on average.
i. Linearity
The regression model is linear in the parameters, though it may or may not be linear in the
variables. This model can be extended to include more explanatory variables.
ii. Fixed regressors
Values taken by the regressor X are considered fixed in repeated samples. This assumption is
made to keep the regression simple for now; it is relaxed later.
iii. The error term is assumed to have a mean of zero.
iv. Homoskedasticity or equal variance of the error term
The variance of the error, or disturbance, term is the same regardless of the value of X. Put
simply, the variation around the regression line (which is the line of average relationship
between Y and X) is the same across the X values; it neither increases nor decreases as X varies.
Symbolically,
𝑽𝒂𝒓(𝜺) = 𝝈²
Violation of this assumption causes a problem of heteroskedasticity, or unequal spread or
unequal variance. In this case 𝑽𝒂𝒓(𝜺) ≠ 𝝈²
v. No autocorrelation between the disturbances
Given any two X values 𝒙𝒊 and 𝒙𝒋 (𝒊 ≠ 𝒋), the correlation between any two error terms 𝜺𝒊 and
𝜀𝑗 (𝑖 ≠ 𝑗) is zero. In short, the observations are sampled independently. Thus, 𝐶𝑜𝑣 (𝜀𝑖 , 𝜀𝑗 ) = 0.
But it should be added here that the justification of this assumption depends on the type of data
used in the analysis. If the data are cross-sectional and are obtained as a random sample from the
relevant population, this assumption can often be justified. However, if the data are time series,
the assumption of independence is difficult to maintain.
vi. The number of observations n must be greater than the number of parameters to be
estimated
Alternatively, the number of observations must be greater than the number of explanatory
variables. We need at least two pairs of observations to estimate the two unknowns. You may as
well recall from your high school mathematics that we need at least two equations to solve
simultaneously for values of two unknown variables.
vii. Variability of X
The X values in a given sample must not all be the same. Technically, the variance of X
must be a positive number. Furthermore, there can be no outliers in the values of the X
variable.
viii. No exact collinearity between the X variables
No exact linear relationship between X1 and X2: no X variable has a linear relationship with
another X variable. Informally, no collinearity means none of the regressors can be written as
an exact linear combination of the remaining regressors in the model. Formally, no collinearity
means that there exists no set of numbers, λ1 and λ2, not both zero, such that
λ1X1i + λ2X2i = 0.
ix. There is no specification bias. The model is correctly specified. We will discuss further
on model specification in unit 7 of this module.
Simple linear regression in STATA
Open the data editor in STATA and enter the data with a column for each variable.
Fertilizer Output
1 1
2 1
3 2
4 2
5 4
The top part of the table gives the analysis of variance. At the top right corner we find the F
statistic and its calculated P-value. A large calculated F value means that the variation in the
dependent variable due to error is very small. The P-value is compared to a predefined
significance level like 0.05. When the P-value is less than the level of significance, we
conclude that the model is significant.
We also have R-squared and Adjusted R-squared showing the proportion of variation in the
dependent variable explained by the independent variable(s) in the regression.
The lower part of the output shows the parameters together with their t-statistics and p-values.
The coefficient in the row of the constant is the intercept parameter that shows the value of the
dependent variable when the value of the independent variable is zero. The coefficient in the row
of fertilizer (the independent variable) is the slope parameter for fertilizer. It is the marginal
effect of an additional unit of fertilizer.
To test
𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0
Test statistic:
t = β̂1 / s(β̂1)
Reject 𝐻0 if;
t > t(α/2; n − k − 1) or t < −t(α/2; n − k − 1)
Alternatively, the P-value associated with t-statistic is compared to the level of significance like
0.05 such that if the P-value is less than the level of significance, we reject the null hypothesis.
The P-value associated with the F statistic is 0.0354, which is less than 0.05. This means the
model is significant at the 0.05 level of significance. Fertilizer explains 81 percent of the
variation in output. The coefficient (slope parameter) for fertilizer is 0.7 and the P-value for
fertilizer is 0.035, which is less than 0.05. This means that fertilizer has a significant
positive effect on the level of output. An additional unit of fertilizer will increase output
by 0.7 units on average.
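The decision rule above can be illustrated with the small example data used earlier in the unit. This is a sketch of the computation behind the reported t-statistic and p-value, not the actual Stata output:

```python
# t-test for H0: beta1 = 0 in the simple regression of y on x,
# using the five data points from the earlier worked example.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)                               # 10
beta1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx  # 0.7
beta0 = ybar - beta1 * xbar

resid = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in resid)       # error sum of squares
s2 = sse / (n - 2)                     # estimate of the error variance
se_beta1 = (s2 / Sxx) ** 0.5           # standard error of the slope

t = beta1 / se_beta1
t_crit = 3.182                         # tabulated t(0.025, df = 3)

print(round(t, 2))   # 3.66
print(t > t_crit)    # True -> reject H0 at the 5% level
```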
4.4 MULTIPLE REGRESSION MODEL
A simple linear regression for a demand function expresses quantity demanded as a function of
price only.
𝑄 = 𝛽0 + 𝛽1 𝑃𝑅𝐼𝐶𝐸 + 𝜀
However, demand for a commodity is likely to depend not only on its own price but also on the
prices of other competing or complementary goods, income of the consumer, social status, etc.
Therefore, we need to extend our simple two-variable regression model to cover models
involving more than two variables. Adding more independent variables leads us to the discussion
of multiple regression models, that is, models in which the dependent variable, or regressand, Y
depends on two or more explanatory variables, or regressors.
The simplest possible multiple regression model is the three-variable regression, with one
dependent variable and two explanatory variables:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + 𝜀𝑖
where Y is the dependent variable, the Xs are the explanatory variables (or regressors), and 𝜀𝑖
is the stochastic disturbance term for the ith observation.
8. No exact collinearity between the X variables, that is, no exact linear relationship
between X1 and X2.
9. There is no specification bias. The model is correctly specified.
Like in the simple linear regression, we use the F-Statistic to test the overall statistical
significance of the regression relation between the response variable Y and the set of variables.
Given the assumptions of the classical regression model, it follows that, on taking the conditional
expectation of Y on both sides of the three-variable multiple regression equation, we have
𝐸(𝑌𝑖 | 𝑋1𝑖 , 𝑋2𝑖 ) = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖
In words, the equation gives the conditional mean or expected value of Y conditional upon the given
or fixed values of X1 and X2. Therefore, as in the two-variable case, multiple regression analysis
is regression analysis conditional upon the fixed values of the regressors, and what we obtain is
the average or mean value of Y, or the mean response of Y, for the given values of the regressors.
The slope coefficients are partial coefficients. The meaning of a partial regression coefficient is
as follows: β1 measures the change in the mean value of Y, E(Y), per unit change in X1, holding the
value of X2 constant.
Let us consider a multiple regression model fitted to an extract from the data collected by the
National Statistical Office (NSO) from tobacco farmers during the third integrated survey. The
variables are tobacco output in kg, fertilizer (FERT) in kg, labour in man-days and the number of
schooling years, disregarding repetition, for the household head (edu).
The dependent variable in the model is output and the explanatory variables are labour, quantity
of fertilizer and number of schooling years (edu).
. reg output labor FERT edu
Interpretation
The P-value associated with the F statistic is 0.000 < 0.05. This means that the model as a whole
is significant at the 0.05 level of significance. However, the adjusted R-squared of 0.049 shows
that the independent variables in the model only explain about 5 per cent of the variation in the
dependent variable.
Labour has a significant positive effect on tobacco output. Increasing quantity of labour used in
the production of tobacco by one man-day increases tobacco output by 0.7 kilogram holding all
other factors fixed. The reader should interpret the remaining variables as part of practice.
a. Best
This means that the OLS estimators have the minimum (smallest) variance of all linear
unbiased estimators. This property is what makes the OLS estimators the best.
b. Linear
OLS estimators are linear functions of the dependent variable. The estimator is linear in
how it uses the values of the dependent variable, irrespective of how it uses the values of
the regressors.
c. Unbiased
In repeated sampling, the expected value of the estimator equals the true population
parameter. An unbiased estimator with minimum variance is efficient.
𝐸(𝛽̂0 ) = 𝛽0 and 𝐸(𝛽̂1 ) = 𝛽1
The Gauss-Markov theorem summarizes these properties by stating that OLS estimators are the best
linear unbiased estimators (BLUE).
This property extends to the entire 𝛽̂ vector; that is, 𝛽̂ is linear (each of its elements is a
linear function of Y, the dependent variable); E(𝛽̂ ) = 𝛽, that is, the expected value of each
element of 𝛽̂ is equal to the corresponding element of the true β; and in the class of all linear
unbiased estimators of β, the OLS estimator 𝛽̂ has minimum variance.
The OLS estimators are also consistent. This means that the estimators approach the real value of
the population parameter as sample size increases.
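Unbiasedness can be illustrated with a small simulation: average the OLS slope over many samples drawn from a known model. The model y = 1 + 2x + ε and all parameter values below are illustrative assumptions, not from the module:

```python
import random

# Monte Carlo sketch of unbiasedness: the average of the OLS slope over
# many simulated samples should be close to the true slope (2 here).
random.seed(0)

def ols_slope(x, y):
    """OLS slope of y on x, computed from deviations from the means."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

x = list(range(50))    # fixed regressor values, as in repeated sampling
slopes = []
for _ in range(2000):
    y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
    slopes.append(ols_slope(x, y))

avg = sum(slopes) / len(slopes)
print(round(avg, 2))   # close to the true slope 2
```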
CHECK POINT
4.6 DUMMY VARIABLES
In section 2.4 of this module, we said that when a discrete variable is used to code a qualitative
characteristic, it is called a dummy variable. A dummy variable takes binary or dichotomous
values (0 or 1).
For instance;
Using dummy variables
When the dummy variables are exhaustive, they sum up to one. You cannot put all of them in the
model because they cause perfect multicollinearity. We therefore leave out one of the dummy
variables to be used as the base category (for comparison).
Including dummy variables for both genders is the simplest example of the so-called dummy
variable trap, which arises when too many dummy variables describe a given number of groups. If a
qualitative variable has m categories, we therefore introduce only m − 1 dummy variables.
Suppose wage is a function of education; there needs to be a separate dummy variable for each
education level: no education, primary education, secondary education and tertiary education.

Ed0 = 1 if no education, 0 otherwise
Ed1 = 1 if primary education, 0 otherwise
Ed2 = 1 if secondary education, 0 otherwise
Ed3 = 1 if tertiary education, 0 otherwise
Suppose the wage model is Wage = 𝛽0 + 𝛿1 Ed1 + 𝛿2 Ed2 + 𝛿3 Ed3 + 𝛽1 Exp + 𝜀, where Exp is
experience. The expected wage for each group is then:

E(Wage) = 𝛽0 + 𝛽1 Exp             (no education)
E(Wage) = (𝛽0 + 𝛿1 ) + 𝛽1 Exp     (primary education)
E(Wage) = (𝛽0 + 𝛿2 ) + 𝛽1 Exp     (secondary education)
E(Wage) = (𝛽0 + 𝛿3 ) + 𝛽1 Exp     (tertiary education)
We leave out 𝐸𝑑0 to avoid perfect multicollinearity and 𝐸𝑑0 becomes the reference group or
base category. It does not matter which category is left out. However, care must be taken when
interpreting the coefficients.
For the base category (no education), the expected wage is E(Wage) = 𝛽0 + 𝛽1 Exp; each 𝛿
coefficient measures the wage difference of its education level relative to this group.
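The rule of leaving one category out can be sketched in code. The labels and the helper function below are hypothetical, purely for illustration:

```python
# Build education dummies, leaving out "none" as the base category to
# avoid the dummy variable trap. Category labels here are illustrative.
levels = ["primary", "secondary", "tertiary"]   # base category "none" omitted

def encode(education):
    """Return the dummy values (Ed1, Ed2, Ed3) for one observation."""
    return [1 if education == lvl else 0 for lvl in levels]

print(encode("none"))       # [0, 0, 0] -> the base category
print(encode("secondary"))  # [0, 1, 0]
```

An observation in the base category scores zero on every dummy, so its group mean is captured entirely by the intercept.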
Dummy variables can be categorized into intercept dummy variables and slope dummy variables,
depending on how they are used in the regression model. Let us consider a hedonic model. This
model says that the price of a commodity is determined by the characteristics of the item. In
many cases a dummy variable modifies the intercept parameter. In the hedonic model, let us
assume;
𝑃𝑖 = 𝛽0 + 𝛽1 𝑆𝑖 + 𝜀𝑖
where 𝑃𝑖 is the price of the house, 𝑆𝑖 is its size and 𝛽0 is land rent.
In real estate, location is very important. Let us introduce a dummy variable for location.
𝑃𝑖 = 𝛽0 + 𝛿𝐷𝑖 + 𝛽1 𝑆𝑖 + 𝜀𝑖
𝐷𝑖 = 1 if the property is in a desirable neighbourhood
𝐷𝑖 = 0 if the property is not in a desirable neighbourhood
The researcher can decide the group to take the value of zero. Notice that the coefficient of the
dummy variable 𝛿 affects the intercept of the regression when 𝐷𝑖 = 1. 𝛿 is location premium.
E(P) = (𝛽0 + 𝛿) + 𝛽1 𝑆   if 𝐷𝑖 = 1
E(P) = 𝛽0 + 𝛽1 𝑆          if 𝐷𝑖 = 0

Graphically, the two groups have regression lines with the same slope 𝛽1 but intercepts that
differ by 𝛿.
We may want to consider the interaction term for location and size of the house. Let us assume
that;
𝑃𝑖 = 𝛽0 + 𝛽1 𝑆𝑖 + 𝛾(𝑆𝑖 𝐷𝑖 ) + 𝜀𝑖
In this case, 𝛾 is the coefficient of the interaction term of size and the dummy variable for
location, 𝑆𝑖 𝐷𝑖 .
If 𝐷𝑖 = 1, the marginal effect of size on price is 𝛽1 + 𝛾, so the slopes of the two groups
differ by 𝛾:

E(𝑃𝑖 ) = 𝛽0 + 𝛽1 𝑆𝑖 + 𝛾(𝑆𝑖 𝐷𝑖 ) = 𝛽0 + (𝛽1 + 𝛾)𝑆𝑖   if 𝐷𝑖 = 1
E(𝑃𝑖 ) = 𝛽0 + 𝛽1 𝑆𝑖                                      if 𝐷𝑖 = 0
The slopes differ by 𝛾; therefore 𝑆𝑖 𝐷𝑖 acts as a slope dummy variable. 𝛽0 is deliberately
kept the same in both equations to simplify the exposition. However, land in a desirable location
is normally more expensive. We now introduce an intercept dummy variable as well, so that as we
move to another location the price of the house changes.
𝑃𝑖 = 𝛽0 + 𝛿𝐷𝑖 + 𝛽1 𝑆𝑖 + 𝛾(𝑆𝑖 𝐷𝑖 ) + 𝜀𝑖

E(P) = (𝛽0 + 𝛿) + (𝛽1 + 𝛾)𝑆   if 𝐷𝑖 = 1
E(P) = 𝛽0 + 𝛽1 𝑆                if 𝐷𝑖 = 0
4.7 JOINT TEST
In the hedonic model above, one of the key questions to be answered could be, does location
affect the price? Since there are two parameters to be tested, this is a joint test.
Procedure
a. Run the unrestricted model. This is the usual regression with all the independent variables
of interest.
b. Record the error sum of squares from the ANOVA table of the unrestricted model (𝑆𝑆𝐸𝑈𝑅 ).
c. Run the restricted model. This is a model with all the independent variables of interest
except the variables you want to test.
d. Record the error sum of squares from the ANOVA table of the restricted model (𝑆𝑆𝐸𝑅 ).
e. Compute the F-statistic.
f. Compare the F-statistic to the critical F-value. If the calculated F-statistic is greater
than the critical value, reject the null hypothesis of no effect and conclude that the
variables have an effect. Alternatively, compute the P-value of the calculated F-statistic
and reject the null hypothesis if the P-value is less than the level of significance.
Let m be the number of restrictions (parameters tested), n the sample size and k the number of
parameters in the unrestricted model.
Example
Using IHS3 data, you may want to estimate the effect of location on aggregate expenditure.
Run the unrestricted regression model.
Compute the F-statistic:

𝐹 = [(𝑆𝑆𝐸𝑅 − 𝑆𝑆𝐸𝑈𝑅 )/𝑚] / [𝑆𝑆𝐸𝑈𝑅 /(𝑛 − 𝑘)]
We reject the null hypothesis and conclude that location (being in urban) has a significant
positive effect on aggregate expenditure. Urban residents spend MK 166053.80 more than rural
residents on average holding all other factors constant.
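The F computation itself is mechanical once the two regressions have been run. In the sketch below the SSE values, m, n and k are hypothetical placeholders, not the actual IHS3 results:

```python
# Joint F-test sketch: compare restricted and unrestricted error sums of
# squares. All numbers here are illustrative placeholders.
sse_r = 520.0    # error sum of squares, restricted model (hypothetical)
sse_ur = 450.0   # error sum of squares, unrestricted model (hypothetical)
m, n, k = 2, 100, 4   # restrictions, observations, parameters (hypothetical)

F = ((sse_r - sse_ur) / m) / (sse_ur / (n - k))
print(round(F, 2))   # 7.47
```

If this F exceeds the critical value from the F(m, n − k) table, the excluded variables jointly matter.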
This unit has introduced you to the method of Ordinary Least Squares (OLS) for estimating the
parameters of a regression model. You learnt manual computation of estimators of population
parameters for a simple linear regression based on the OLS technique. In addition, you
interpreted parameters of a simple linear regression and a multiple linear regression. Then you
covered properties of OLS estimators and the two types of dummy variables.
1. The data in Table 4.1 show values of the dependent variable Y for each
value of the independent variable X.
𝒀𝒊 𝑿𝒊
70 80
65 100
90 120
95 140
110 160
115 180
120 200
140 220
155 240
150 260
a. Calculate
i. ∑ 𝑋𝑖 iii. ∑ 𝑋𝑖 𝑌𝑖
ii. ∑ 𝑌𝑖 iv. ∑ 𝑋𝑖2
b. Calculate
i. The slope parameter 𝛽1
5 GOODNESS OF FIT AND ANALYSIS OF VARIANCE
UNIT INTRODUCTION
Unit 4 introduced you to the simple linear regression and the multiple linear regression. Not
every model fits the data very well. It is very important for economists to know how well the
model fits the data. This unit provides useful techniques for testing goodness of fit. You will
learn more about the coefficient of determination and the correlation coefficient. You will also
review some commonly used distributions and the analysis of variance.
UNIT OBJECTIVES
The total variation or Total Sum of Squares (TSS) in the dependent variable, Y, is equal to the
sum of explained variation and residual variation.
Explained variation or Regression Sum of Squares (RSS) is the variation in the dependent
variable due to the effect of the independent variables in the model.
Residual variation or Error Sum of Squares (ESS) refers to variation in the dependent variable
due to the error.
The coefficient of determination is the proportion of the variation in the dependent variable
that is explained by the independent variables in the regression model.
The total variation can be decomposed as TSS = RSS + ESS. If we divide both sides of this
equation by TSS, we can calculate the coefficient of determination:

𝑅² = RSS/TSS = ∑(𝑦̂𝑖 − 𝑦̅)² / ∑(𝑦𝑖 − 𝑦̅)²

Equivalently, since RSS = TSS − ESS,

𝑅² = (TSS − ESS)/TSS = 1 − ESS/TSS = 1 − ∑(𝑦𝑖 − 𝑦̂𝑖 )² / ∑(𝑦𝑖 − 𝑦̅)²
The coefficient of determination can also be expressed as the adjusted R-squared (𝑅²𝐴𝑑𝑗 ). Some
of the reasons for using 𝑅²𝐴𝑑𝑗 are;
a. To allow for comparison of several regressions fitted to the data in order to choose the best
model.
b. To correct for degrees of freedom. R-squared tends to increase as more and more
regressors enter the model. There is a need for a measure of goodness of fit that takes
into account the number of variables in the regression. The term adjusted means adjusted
for the degrees of freedom (df) associated with the sums of squares. Where k is the
number of estimated parameters (including the intercept) and n is the sample size, the
adjusted R-squared can be computed from the formula below.
𝑅²𝐴𝑑𝑗 = 1 − (1 − 𝑅²)(𝑛 − 1)/(𝑛 − 𝑘)
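Using the simple regression fitted earlier in the module (Ŷ = −0.1 + 0.7x), the two measures can be computed as follows:

```python
# R-squared and adjusted R-squared for the simple regression example.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n, k = len(y), 2                 # k = number of estimated parameters

ybar = sum(y) / n
fitted = [-0.1 + 0.7 * xi for xi in x]

tss = sum((yi - ybar) ** 2 for yi in y)                  # total variation
ess = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual variation
r2 = 1 - ess / tss
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k)

print(round(r2, 4))       # 0.8167, i.e. about 81 percent explained
print(round(r2_adj, 4))   # 0.7556
```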
The coefficient of determination has limitations:
a. It does not measure the magnitude of the slope of the regression line.
b. It is not a complete measure of the overall fitness of the linear regression model.
c. It is not a verification of correct specification of a fitted model.
5.3 CORRELATION
A correlation exists between two variables when one of them is related to the other in some way.
The linear coefficient of correlation measures the strength of the linear relationship between
paired values of two variables in a sample. Since we use sample data, the coefficient of
correlation is a sample statistic, r. It is an estimate of a population parameter 𝜌. This
quantity is closely related to, but conceptually very different from, the coefficient of
determination for a regression, R². To develop methods of using the sample correlation
coefficient to make inferences about the population coefficient, we make the following
assumptions;
𝑟 = ±√𝑟²

𝑟 = ∑𝑥𝑖 𝑦𝑖 / √((∑𝑥𝑖²)(∑𝑦𝑖²))   (where 𝑥𝑖 and 𝑦𝑖 are deviations from their means)
𝑟 = (𝑛∑𝑋𝑖 𝑌𝑖 − ∑𝑋𝑖 ∑𝑌𝑖 ) / √([𝑛∑𝑋𝑖² − (∑𝑋𝑖 )²][𝑛∑𝑌𝑖² − (∑𝑌𝑖 )²])
i. It can be positive or negative, the sign depending on the sign of the term in the numerator
of definitional formula which measures the sample covariation of two variables.
ii. It lies between the limits of −1 and +1; that is, −1 ≤ r ≤ 1.
iii. It is symmetrical in nature; that is, the coefficient of correlation between X and Y (rXY )
is the same as that between Y and X(rY X).
iv. It is independent of the origin and scale
v. If X and Y are statistically independent, the correlation coefficient between them is zero;
but if r = 0, it does not mean that the two variables are independent. In other words, zero
correlation does not necessarily imply independence.
vi. It is a measure of linear association or linear dependence only; it has no meaning for
describing nonlinear relations. Y = X² is an exact relationship yet r is zero.
vii. Although it is a measure of linear association between two variables, it does not
necessarily imply any cause-and-effect relationship.
Example
The table below shows hypothetical data for sales of two commodities in a supermarket
for 10 weeks. Calculate the linear coefficient of correlation.
y x
1130 780
621 916
813 793
996 1188
1030 499
1257 1180
898 1229
743 1450
921 1071
1179 1153
Working
Y      X      X²         Y²         XY
1130 780 608400 1276900 881400
621 916 839056 385641 568836
813 793 628849 660969 644709
996 1188 1411344 992016 1183248
1030 499 249001 1060900 513970
1257 1180 1392400 1580049 1483260
898 1229 1510441 806404 1103642
743 1450 2102500 552049 1077350
921 1071 1147041 848241 986391
1179 1153 1329409 1390041 1359387
9588 10259 11218441 9553210 9802193
𝑟 = (𝑛∑𝑋𝑖 𝑌𝑖 − ∑𝑋𝑖 ∑𝑌𝑖 ) / √([𝑛∑𝑋𝑖² − (∑𝑋𝑖 )²][𝑛∑𝑌𝑖² − (∑𝑌𝑖 )²])

𝑟 = (10(9802193) − (10259)(9588)) / √([(10)(11218441) − 10259²][(10)(9553210) − 9588²])
𝑟 = −0.068
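A quick computation of r for the data above:

```python
# Linear correlation coefficient for the supermarket sales data.
y = [1130, 621, 813, 996, 1030, 1257, 898, 743, 921, 1179]
x = [780, 916, 793, 1188, 499, 1180, 1229, 1450, 1071, 1153]
n = len(x)

Sx, Sy = sum(x), sum(y)                       # 10259, 9588
Sxy = sum(xi * yi for xi, yi in zip(x, y))    # 9802193
Sxx = sum(xi ** 2 for xi in x)                # 11218441
Syy = sum(yi ** 2 for yi in y)                # 9553210

r = (n * Sxy - Sx * Sy) / ((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2)) ** 0.5
print(round(r, 3))   # -0.068, a very weak negative linear association
```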
NORMAL DISTRIBUTION
The best known of all the theoretical probability distributions is the normal distribution, whose
bell-shaped picture is familiar to anyone with statistical knowledge. A (continuous) random
variable X is said to be normally distributed if its PDF has the following form:
𝑓(𝑥) = (1/(𝜎√2𝜋)) exp(−(𝑥 − 𝜇)²/(2𝜎²)),   −∞ < 𝑥 < ∞
The curve of a normal distribution is bell shaped with 95 percent of the observations lying within
two standard deviations from the mean as shown in the figure below.
(Figure: the normal curve, with approximately 68% of observations within one standard deviation
of the mean, 95% within two, and 99.7% within three.)
Properties of a normal distribution
We use tables to find the probability that X lies within a certain interval. To use the table, we
convert the given normally distributed variable X with mean μ and variance σ² into a standardized
normal variable Z by the following transformation.
𝑋−𝜇
𝑍=
𝜎
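Probabilities for the standardized variable Z can also be computed without tables using the error function, via the standard identity Φ(z) = 0.5(1 + erf(z/√2)). The values μ = 50 and σ = 10 in the second part are purely illustrative:

```python
import math

# Standard normal CDF via the error function.
def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Probability that X lies within two standard deviations of the mean:
p_within_2sd = phi(2) - phi(-2)
print(round(p_within_2sd, 4))   # 0.9545

# Standardizing first: if X ~ N(mu=50, sigma=10), P(X < 65) = Phi(1.5).
z = (65 - 50) / 10
print(round(phi(z), 4))         # 0.9332
```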
T-DISTRIBUTION
If the sample size n is less than 30, the population standard deviation is unknown, and the
population distribution can be assumed to be normal, then the boundaries a and b of a confidence
interval a ≤ µ ≤ b are determined by means of:
𝑇 = (𝑋̅ − 𝜇)/𝑆𝑋̅ , where 𝑆𝑋̅ = 𝑠/√𝑛 is the estimated standard error of the sample mean.
If Z1 is a standardized normal variable [that is, Z1 ∼ N(0, 1)] and another variable Z2 follows the
chi-square distribution with k df and is distributed independently of Z1, then the variable defined
as
𝑡 = 𝑍1 /√(𝑍2 /𝑘)
follows the t distribution with k degrees of freedom.
(Figure: t distributions for k = 5, k = 20 and k = 120; with k = 120 the curve is close to the
normal.)
Properties of the t distribution
a. Like the normal distribution, the t distribution is symmetrical, but it is flatter than the
normal distribution. As the df increase, the t distribution approaches the normal
distribution. One can think of the number of degrees of freedom associated with a statistic
as the number of unrestricted, free-to-vary values used in calculating the statistic.
b. The mean of the t distribution is zero, and its variance is k/(k − 2).
It can be proven mathematically that at infinite degrees of freedom, the t distribution and the
standard normal distribution are identical. As the degrees of freedom decrease from infinity, the t
curves remain bell-shaped and symmetric around a mean of zero, but they become progressively
flatter than the standard normal, with more area in the tails. Below about 30 degrees of freedom,
the t distribution is quite different from the standard normal distribution.
F DISTRIBUTION
Apart from the normal distribution, there are other distributions in statistics that are relevant
in econometrics. If we select randomly two independent samples from two normally distributed
populations with equal variances, the distribution of the ratio of the sample variances
(𝑠1²/𝑠2²) is called the F-distribution.
(Figure: the F-distribution takes non-negative values only, with the rejection region α in the
right tail.)
5.5 ANALYSIS OF VARIANCE (ANOVA)
If you select two samples from two populations with equal variance and give a specific treatment
to one of the samples, the observed variances may be due to the treatment or due to errors. The
analysis of variance is based on a comparison of two different estimates of population variance.
a. The variances between samples. This is the variance due to the treatment.
b. The variance within samples. This is the variance due to the error.
The method is called one-way ANOVA because we use one property or characteristic to
categorize the populations. This characteristic is called a treatment or factor. One-way
ANOVA helps us to test the null hypothesis that three or more population means are equal.
𝑯𝟎 ; 𝝁 𝟏 = 𝝁 𝟐 = 𝝁 𝟑
If the variation among the samples (due to treatment) is equal to the variation within samples
(due to error), it means that the treatment did not have any effect and the F statistic will be
close to one. An F statistic close to one shows that there is insufficient evidence for us to
reject the null hypothesis of equality of means. But if the F-statistic is very large, it shows
that the variation due to treatment is much larger than the variation due to error. In such a
case we reject the null hypothesis, specifically when the calculated F statistic is larger than
the critical value.
Below is the ANOVA table, where k is the number of population means being compared (equal to
the number of treatments) and n is the total sample size.

Source of variation       SS     df       MS                  F
Between samples           SSB    k − 1    MSB = SSB/(k − 1)   F = MSB/MSE
Within samples (error)    SSE    n − k    MSE = SSE/(n − k)
Total                     SST    n − 1
Example
A company sells identical soap in three different wrappings at the same price. Sales data are
normally distributed with equal variance. The sales for 5 months are given in the table below.
Test at the 5% level of significance whether the mean soap sales for each wrapping are equal or not.

Month  Wrapping 1  Wrapping 2  Wrapping 3
1      87          78          90
2      83          81          91
3      79          79          84
4      81          82          82
5      80          80          88
Mean   82          80          87
SSE = (87 − 82)² + (83 − 82)² + (79 − 82)² + (81 − 82)² + (80 − 82)²
    + (78 − 80)² + (81 − 80)² + (79 − 80)² + (82 − 80)² + (80 − 80)²
    + (90 − 87)² + (91 − 87)² + (84 − 87)² + (82 − 87)² + (88 − 87)²
    = 110
The calculated F value is 7.09, which is greater than the tabulated value. Therefore we reject the
null hypothesis of equal means and conclude that the three means are not equal. This means that
the difference in wrapping had an effect on the sales of the soap.
Data Analysis is one of the Add-ins for Excel. Add it through File, Options, Add-ins, select
Analysis ToolPak, click Go, check the Analysis ToolPak box, then click OK.
Enter the data in Excel with the treatments as columns. Then go to:
Data
Data Analysis
ANOVA: Single Factor
Select all cells with data as the input range. You may select the output range; the default is a
new sheet. Click OK.
SUMMARY
Groups      Count  Sum  Average  Variance
Wrapping 1  5      410  82       10
Wrapping 2  5      400  80       2.5
Wrapping 3  5      435  87       15

ANOVA
Source of Variation  SS   df  MS        F         P-value  F crit
Between Groups       130  2   65        7.090909  0.00927  3.885294
Within Groups        110  12  9.166667
Total                240  14
Notice that the ANOVA table from Excel has the same information as the one we got through
manual calculations. The interpretation is also the same. Notice that the Excel output also gives
the P-value. This is an alternative to the critical value approach for making the rejection
decision. In this case we reject the null hypothesis if the P-value associated with the F
statistic is smaller than the level of significance, such as 0.05.
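The manual and Excel calculations above can be reproduced in a few lines of code. Below is a minimal sketch in Python (standard library only) for the soap-wrapping data; the variable names are our own.

```python
# One-way ANOVA for the soap-wrapping example, computed from first principles.
wrap1 = [87, 83, 79, 81, 80]   # mean 82
wrap2 = [78, 81, 79, 82, 80]   # mean 80
wrap3 = [90, 91, 84, 82, 88]   # mean 87
groups = [wrap1, wrap2, wrap3]

k = len(groups)                          # number of treatments
n = sum(len(g) for g in groups)          # total sample size
grand_mean = sum(sum(g) for g in groups) / n
means = [sum(g) / len(g) for g in groups]

# Between-group sum of squares: variation due to the treatment
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
# Within-group sum of squares: variation due to error
sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

msb = ssb / (k - 1)       # mean square between, df = k - 1
mse = sse / (n - k)       # mean square within, df = n - k
f_stat = msb / mse

print(ssb, sse, round(f_stat, 2))   # → 130.0 110.0 7.09
```

The sums of squares and the F statistic match the manual and Excel results above.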
This unit has defined coefficient of determination as the proportion of variation in the dependent
variable that is explained by the regression model. For a multiple regression, we use adjusted R2.
You have also learnt that linear coefficient of correlation measures the strength of linear
relationship between paired values of two variables in a sample. Among the distributions that are
commonly used, you have studied the normal distribution, t-distribution and F-distribution. You
were also able to compute the F-statistic using ANOVA.
1. The table below gives the output for 8 years of an experimental farm that used each of 4
types of fertilizers. Assume that the outputs with each fertilizer are normally distributed
with equal variance.
a. Find the mean output for each fertilizer and the grand mean for all the years and for all
four fertilizers.
b. Estimate the population variance from the variance between the means or columns.
c. Estimate the population variance from the variance within the samples or columns.
d. Test the hypothesis that the population means are the same at the 5% level of
significance.
2. An experiment was conducted to compare three different computer keyboard designs with respect
to their effect on repetitive stress injuries (RSI). Fifteen businesses of comparable size participated
in a study to compare the three keyboard designs. Five of the fifteen businesses were randomly
selected and their computers were equipped with design 1 keyboards. Five of the remaining ten
were selected and equipped with design 2 keyboards, and the remaining five used design 3
keyboards. After one year, the number of RSIs was recorded for each company. The results are
shown in Table 1.
6 UNIT SIX: INFERENCE AND PREDICTION
UNIT INTRODUCTION
In unit 3, you learnt that the population parameters are usually unknown and you can estimate
them by using data collected from an adequate and representative sample. In this unit you will
learn about hypothesis testing, point estimation and interval estimation. Statisticians and
econometricians are very systematic in their approach to hypothesis testing. You will learn the
steps in hypothesis testing and types of error that are committed in the process of hypothesis
testing.
UNIT OBJECTIVES
6.1 POINT ESTIMATION
Very often we know or are willing to assume that a random variable X follows a particular
probability distribution but do not know the value(s) of the parameter(s) of the distribution. For
example, if a random variable X follows a normal distribution, we may want to know the value
of its two parameters, namely, the mean and the variance. To estimate the unknowns, we assume
that we have a random sample of size n from the known probability distribution and use the
sample data to estimate the unknown parameters. This is known as the problem of estimation.
The problem of estimation is categorized into point estimation and interval estimation.
A point estimate of a parameter θ is a single number that can be regarded as a sensible value
for θ. Usually we use θ̂ to denote the point estimator of a parameter θ. A point estimate is
obtained by selecting a suitable statistic and computing its value from given sample data. The
selected statistic is called the point estimator. For example, the sample mean and sample variance
are point estimators, and their computed sample values are the point estimates. Different
statistics can be used to
estimate the same parameter, i.e., a parameter may have multiple point estimators. For example,
the following are all point estimators of a population mean µ:
i. Sample mean: µ̂ = X̄
ii. Sample median: µ̂ = X̃
iii. Average of the extremes: µ̂ = (min xᵢ + max xᵢ)/2
A point estimator θ̂ is a random variable; its value varies from sample to sample, so there are
estimation errors:
θ̂ = θ + error of estimation
Accurate estimators are close to the true population parameter and have small errors of
estimation; an estimator with unbiasedness and minimum variance will often be accurate in this
sense.
The standard error of an estimator θ̂ is its standard deviation, σθ̂ = √Var(θ̂). If the
standard error itself involves unknown parameters whose values can be estimated, substitution of
these estimates into σθ̂ yields the estimated standard error of the estimator. The estimated
standard error can be denoted either by σ̂θ̂ or sθ̂.
A point estimate may be the researcher’s best guess at the population value, but, by its nature, it
provides no information about how close the estimate is “likely” to be to the population
parameter. It does not, by itself, provide enough information for testing economic theories or for
informing policy discussions.
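As a quick illustration, the sketch below computes the three point estimates listed above, plus the estimated standard error of the sample mean, for a small hypothetical sample (the data are invented for illustration).

```python
# Point estimates of a population mean from a hypothetical sample, plus the
# estimated standard error of the sample mean.
import statistics

sample = [12.1, 9.8, 11.4, 10.6, 12.9, 10.2, 11.7, 9.5]   # hypothetical data

mean_est = statistics.mean(sample)               # sample mean
median_est = statistics.median(sample)           # sample median
midrange_est = (min(sample) + max(sample)) / 2   # average of the extremes

s = statistics.stdev(sample)                     # sample standard deviation (n - 1)
se_mean = s / len(sample) ** 0.5                 # estimated standard error of the mean

print(round(mean_est, 3), round(median_est, 3), round(midrange_est, 3))  # → 11.025 11.0 11.2
print(round(se_mean, 3))
```

Note that the three estimators give three different numbers for the same parameter, which is exactly why an estimator's standard error matters.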
6.2 INTERVAL ESTIMATION
The reliability of a point estimator is measured by its standard error. Therefore, instead of relying
on the point estimate alone, we may construct an interval around the point estimator, say within
two or three standard errors on either side of the point estimator, such that this interval has, say,
95 percent probability of including the true parameter value. This is roughly the idea behind
interval estimation.
To be more specific, assume that we want to find out how “close,” say, β̂1 is to β1. For this
purpose we try to find two positive numbers δ and α, the latter lying between 0 and 1, such
that the probability that the random interval (β̂1 − δ, β̂1 + δ) contains the true β1 is 1 − α.
Symbolically,
Pr(β̂1 − δ ≤ β1 ≤ β̂1 + δ) = 1 − α
If the confidence interval contains zero, the parameter is not statistically significant, because
zero is among the plausible values of the parameter.
It is very important to know the following aspects of interval estimation:
The interval does not say that the probability of β1 lying between the given limits is 1 − α.
Since β1, although unknown, is assumed to be some fixed number, either it lies in the interval or
it does not. What the statement says is that the probability of constructing an interval that
contains β1 is 1 − α.
The confidence interval is a random interval; that is, it will vary from one sample to the
next because it is based on β̂1, which is random.
Since the confidence interval is random, the probability statements attached to it should
be understood in the long-run sense, that is, in repeated sampling. More specifically, it means:
if in repeated sampling confidence intervals like it are constructed a great many times on
the 1 − α probability basis, then, in the long run, on average, such intervals will
enclose the true value of the parameter in 1 − α of the cases.
Once we have a specific sample and obtain a specific numerical value of β̂1, the
interval is no longer random; it is fixed.
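The repeated-sampling interpretation can be checked with a small simulation. The sketch below (an entirely hypothetical setup, using the normal critical value 1.96 and a known σ for simplicity) draws many samples and counts how often the resulting interval covers the true mean.

```python
# Repeated-sampling interpretation of a 95% confidence interval: the interval
# changes from sample to sample, but in the long run about 95% of such
# intervals cover the fixed, unknown population mean.
import random

random.seed(1)
true_mu, sigma, n, reps = 50.0, 10.0, 40, 2000

covered = 0
for _ in range(reps):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    se = sigma / n ** 0.5              # standard error with sigma known
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
    covered += lo <= true_mu <= hi     # does this interval contain mu?

print(covered / reps)                  # close to 0.95 in the long run
```

Each individual interval either contains µ or it does not; the 95% refers to the long-run proportion of intervals that do.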
6.3 HYPOTHESIS TESTING
A hypothesis which says that a parameter has a specified value is called a point hypothesis. A
hypothesis which says that a parameter lies in a specified interval is called an interval hypothesis.
For instance, if β is the population mean, then H0: β = 4 is a point hypothesis, while
H0: 4 ≤ β ≤ 7 is an interval hypothesis.
A hypothesis test is a procedure that answers the question of whether the observed difference
between the sample value and the population value hypothesized is real or due to chance
variation. For instance, if the hypothesis says that the population mean µ= 6 and the sample
mean 𝑥̅ = 8, then we want to know whether this difference is real or due to chance variation.
The hypothesis we are testing is called the null hypothesis and is often denoted by H0. The
alternative hypothesis is denoted by H1. The probability of rejecting H0 when, in fact, it is true
is called the significance level. To test whether the observed difference between the data and
what is expected under the null hypothesis H0 is real or due to chance variation, we use a test
statistic.
A desirable criterion for the test statistic is that its sampling distribution be tractable, preferably
with tabulated probabilities. Tables are already available for the normal, t, χ2, and F distributions,
and hence the test statistics chosen are often those that have these distributions. The observed
significance level or P-value is the probability of getting a value of the test statistic that is at least
as extreme as the one actually observed. This probability is computed on the basis that the null
hypothesis is correct.
Null hypothesis, H0
The null hypothesis specifies the value you want to test. H0 is the belief that we maintain until
we test it with the data and the sample. The null hypothesis, denoted by H0, is typically a clear
statement of equality: the unknown population parameter θ is equal to some specific constant
value.
We begin with the assumption that H0 is true and that any difference between the sample statistic
and the true population parameter is due to chance and not a real (systematic) difference. The
null hypothesis refers to the status quo. It may or may not be rejected.
It is similar to the notion of “innocent until proven guilty”; that is, “innocence” is a null hypothesis.
The population mean monthly cell phone bill of Lilongwe city is not more than MK
25,000.00: μ ≤ MK 25,000.00.
The average number of TV decoders in Blantyre homes is at least three: μ ≥ 3.
Alternative hypothesis
It is the opposite of the null hypothesis. It challenges the status quo and never contains the signs
“=” , “≤” or “≥” .
It may or may not be proven. It is generally the hypothesis that the researcher is trying to prove.
Evidence is always examined with respect to H1, never with respect to H0. We never “accept”
H0; we either “reject” or “fail to reject” it.
The population mean monthly cell phone bill of Lilongwe city is greater than MK
25,000.00: μ > MK 25,000.00.
The average number of TV decoders in Blantyre homes is less than three; μ < 3
Test statistic
Select the appropriate test statistic and always state it. The following are test statistics
for one population.
When testing a hypothesis about a proportion p, we use the z-statistic (z-test), where
q = 1 − p:
Z = (p̂ − p) / √(pq/n)
When testing a hypothesis about a mean, we use the z-statistic or the t-statistic
according to the following conditions.
o If the population standard deviation, σ, is known and either the data are normally
distributed or the sample size n > 30, we use the normal distribution (z-statistic):
Z = (X̄ − µ) / (σ/√n)
o When the population standard deviation, σ, is unknown and either the data are
normally distributed or the sample size is greater than 30 (n > 30), we use the t-
distribution (t-statistic):
t = (X̄ − µ) / (s/√n)
For a regression coefficient, the same t-statistic takes the form
t = (β̂1 − β1) / se(β̂1)
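To make the t-statistic concrete, here is a short sketch with a made-up sample of n = 10 observations and a hypothesized mean of 6 (all numbers are ours, chosen so the arithmetic comes out cleanly).

```python
# One-sample t statistic with sigma unknown: t = (xbar - mu0) / (s / sqrt(n)).
import statistics

data = [8, 7, 6, 9, 8, 7, 5, 8, 9, 7]   # hypothetical sample
mu0 = 6                                  # hypothesized population mean

n = len(data)
xbar = statistics.mean(data)             # sample mean
s = statistics.stdev(data)               # sample standard deviation (n - 1)
t = (xbar - mu0) / (s / n ** 0.5)

print(xbar, round(t, 2))                 # → 7.4 3.5
```

The calculated t of 3.5 would then be compared with the critical t value for n − 1 = 9 degrees of freedom at the chosen significance level.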
Level of significance, α
The researcher decides on the level of significance for the hypothesis test. This is the
probability of rejecting a true null hypothesis. It is actually the lowest level at which the null
hypothesis can be rejected. There are three commonly used levels of significance: 0.01, 0.05
and 0.10.
A 0.01 level of significance reduces the chance of rejecting a true null hypothesis, while a 0.10
level of significance reduces the chance of failing to reject a false null hypothesis. Many researchers
use the 0.05 level of significance as it balances the possibility of making the two errors.
Rejection region
The level of significance is used to determine the critical value of the test statistic. The critical
value demarcates the distribution of the test statistic into rejection region (critical region) and
acceptance region.
For a two-tailed test, the rejection region is split into two halves, one on each tail of the
distribution.
Figure: two-tailed test of H0: µ = 0, with the region of acceptance between −t(α/2) and t(α/2).
We reject the null hypothesis if the calculated t-statistic is less than −t(α/2) or greater than
t(α/2). In absolute terms, we reject the null hypothesis if |t| is greater than t(α/2).
For a one-tailed test the rejection region is on one side. There are right-tailed tests and left-tailed
tests, and the critical value is ±tα rather than ±t(α/2).
Figure: left-tailed test of H0: µ ≥ K, with the rejection region below −tα.
We reject the null hypothesis H0: µ ≥ K if the calculated t-statistic is less than −tα.
Figure: right-tailed test of H0: µ ≤ K, with the rejection region above tα.
We reject the null hypothesis H0: µ ≤ K if the calculated t-statistic is greater than tα. In
absolute terms, for either one-tailed test we reject the null hypothesis if |t| is greater than tα
and the sample evidence lies in the direction of the alternative.
Conclusion
The calculated test statistic may fall within or outside the acceptance region. In the conclusion
of a hypothesis test, we may reject or fail to reject the null hypothesis depending upon where
the test statistic falls.
If the test statistic falls in the rejection region, we reject the null hypothesis. In such a
case, the sample evidence is strong enough to warrant rejection of the null hypothesis.
The probability that we have rejected a true null hypothesis is no larger than the level of
significance.
We fail to reject the null hypothesis when the test statistic falls within the acceptance
region. In this case, we mean that on the basis of the sample evidence we have no reason
to reject it. The sample evidence is not strong enough to warrant rejection of the null
hypothesis. However, we are not saying that the null hypothesis is true beyond any doubt.
Saying “we fail to reject the null hypothesis” states this more correctly than saying “we
accept the null hypothesis.”
6.4 TYPE I AND TYPE II ERRORS
When testing a hypothesis, we arrive at a conclusion of rejecting it or failing to reject it. Such
conclusions may be correct and sometimes wrong (even if we do everything correctly). There are
two different errors that we can make. We distinguish the two types of errors by calling
them Type I error and Type II error.
Type I error
This is the mistake of rejecting the null hypothesis when it is actually true. The symbol α (alpha) is
used to represent the probability of committing a Type I error, and it is called the level of
significance.
Type II error
This is the mistake of failing to reject the null hypothesis when it is actually false. The symbol β
(beta) is used to represent the probability of committing a Type II error. The quantity 1 − β is
called the power of the test; put simply, the power of the test is the probability of rejecting a
false null hypothesis.
The type I and type II errors are better understood by relating them to a court case. The table
below shows the two types of errors and their analogy in the legal system.
                               Actual situation
Decision                       Hypothesis testing                      Legal system
                               True null         False null           Innocent                   Not innocent
Do not reject null hypothesis  No error (1 − α)  Type II error (β)    No error (not guilty,      Type II error (guilty,
                                                                      not found guilty)          not found guilty)
Reject null hypothesis         Type I error (α)  No error (1 − β)     Type I error (not guilty,  No error (guilty,
                                                                      found guilty)              found guilty)
Controlling Type I and Type II errors
It is standard procedure for a researcher to select the significance level α [P(Type I error)], but
we do not select β [P(Type II error)]. It is desirable to have α = 0 and β = 0, but in reality
that is not possible. Therefore our main task is to manage the probabilities of Type I error and
Type II error.
There exists a mathematical relationship among α, β and the sample size n such that when any
two are chosen, the third is automatically determined. You may refer to your statistics module or
research methods notes for formulas for determining sample size. The usual practice in research
and industry is to select values of α and n, so that the value of β is determined.
For any fixed 𝜶 an increase in sample size, n will cause a decrease in 𝜷. That is, a large
sample size will decrease the chance that you make the error of not rejecting the null
hypothesis when it is actually false.
For any fixed sample size, n a decrease in 𝜶, will cause an increase in 𝜷.
To decrease both 𝛼 and 𝛽, increase the sample size.
6.5 THE P-VALUE
Earlier in this unit we noted that there are three commonly used levels of significance: 0.10, 0.05
and 0.01. Rather than testing at different significance levels, it is more informative to answer the
following question: given the observed value of the t statistic, what is the smallest significance
level at which the null hypothesis would be rejected? This level is known as the p-value for the
test. Since a p-value is a probability, its value is always between zero and one.
In order to compute p-values, we either need extremely detailed printed tables of the t
distribution, which is not very practical, or a computer program that computes areas under the
probability density function of the t distribution. Most modern regression packages have this
capability; some compute p-values routinely with each OLS regression.
The p-value is an alternative to the critical value of the test statistic as a way of concluding a
hypothesis test. The P-value is compared to the predetermined level of significance, and we reject
the null hypothesis if the p-value is less than the level of significance.
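For a z test, the p-value can be computed directly from the standard normal CDF, which the Python standard library exposes through statistics.NormalDist. The numbers below are hypothetical.

```python
# Two-tailed p-value for a z test, and the p-value decision rule.
from statistics import NormalDist

z_calc = 2.37                                       # hypothetical calculated z statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z_calc)))   # two-tailed p-value

alpha = 0.05
print(round(p_value, 4), p_value < alpha)           # p < 0.05, so reject H0
```

A t-distribution p-value needs a statistics package (e.g. a regression program), since the t CDF is not in the standard library, but the decision rule is identical.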
6.6 CONCLUSION
In this unit, we have discussed how to test hypotheses using the classical approach: after
stating the null and alternative hypotheses, we choose a significance level, which then determines a
critical value. Once the critical value has been identified, the value of the t statistic is compared
with the critical value, and the null is either rejected or not rejected at the given significance
level. You have also learnt that in rejecting or failing to reject the null hypothesis, we may
commit a Type I or Type II error. The unit concluded with the use of the P-value in hypothesis
testing.
7 UNIT 7: DATA CONFORMITY AND PROBLEMS OF
FITTING MODELS
UNIT INTRODUCTION
In the previous units, you learnt that population parameters are usually unknown and are
estimated using data collected from an adequate and representative sample. You have noticed that
the quality of the estimates is only as good as the quality of the data used in estimation. In this
unit, you will learn about problems associated with data. You will also be introduced to model
specification problems. In the last section, you will learn different functional forms that you can
use when analysing data.
OBJECTIVES
One of the assumptions of the classical linear regression model (CLRM), Assumption 9, is that
the regression model used in the analysis is “correctly” specified. If the model is not “correctly”
specified, we encounter the problem of model specification error or model specification bias.
Considering the diversity of the topic, we hope to bring out some of the essential issues involved
in model specification and model evaluation.
DATA PROBLEMS
Missing data
The missing data problem can arise in a variety of forms. Often, we collect a random sample of
people, schools, cities, and so on, and then discover later that information is missing on some key
variables for several units in the sample. For example, some of the respondents may provide no
information on some variables. In panel data also, over time some respondents drop out or do not
provide information on all the questions.
If the reasons for the missing data are independent of the available observations, what Darnell
calls the “ignorable case,” we can simply ignore those observations.
If the missing observations are systematically related to the available data, the case is more
serious, for it may be the result of self-selection bias; that is, the observed data are not truly
randomly collected. Such cases require more complicated solutions.
Non-random samples
A non-random sample is not representative of the population. The random sampling assumption
is violated, and we must worry about consequences for OLS estimation.
Provided there is enough variation in the independent variables in the sub-population, selection
on the basis of the independent variables is not a serious problem, other than that it results in
inefficient estimators.
Certain types of nonrandom sampling, like choosing a sample on the basis of the independent
variables, do not cause bias or inconsistency in OLS.
Outlying Observations
In some applications, especially, but not only, with small data sets, the OLS estimates are
influenced by one or several observations. Such observations are called outliers or influential
observations. In the regression context, an outlier may be defined as an observation with a “large
residual.” Loosely speaking, an observation is an outlier if dropping it from a regression analysis
makes the OLS estimates change by a practically “large” amount. OLS is susceptible to outlying
observations because it minimizes the sum of squared residuals.
Outliers can arise for two main reasons:
a) A mistake has been made in entering the data, such as adding extra zeros to a number or
misplacing a decimal point. It is always a good idea to compute summary statistics,
especially minimums and maximums, in order to catch mistakes in data entry.
b) Outliers can also arise when sampling from a small population if one or several
members of the population are very different in some relevant aspect from the rest of the
population.
OLS results should probably be reported with and without outlying observations.
Outlying observations can provide important information by increasing the variation in
the explanatory variables which reduces standard errors.
Use functional forms that are less sensitive to outliers if possible. Certain functional
forms are less sensitive to outlying observations. For most economic variables, the
logarithmic transformation significantly narrows the range of the data and also yields
functional forms that can explain a broader range of data.
We can use an estimation method that is less sensitive to outliers than OLS.
This removes the need to explicitly search for outliers before estimation. One such
method is called least absolute deviations, or LAD. The LAD estimator minimizes the
sum of the absolute deviation of the residuals, rather than the sum of squared residuals.
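A tiny sketch of why OLS is sensitive to outliers: with hypothetical data lying exactly on a line of slope 2, mis-entering a single y value pulls the OLS slope well away from 2.

```python
# Effect of one outlying observation on the OLS slope.
def ols_slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

x = [1, 2, 3, 4, 5, 6]
y_clean = [2, 4, 6, 8, 10, 12]      # exact line with slope 2
y_outlier = [2, 4, 6, 8, 10, 40]    # last value mis-entered (12 -> 40)

print(round(ols_slope(x, y_clean), 2))    # → 2.0
print(round(ols_slope(x, y_outlier), 2))  # → 6.0: one bad point triples the slope
```

Re-running the regression with and without the suspect observation, as suggested above, makes its influence obvious.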
MODEL FITTING
Errors in fitting models fall into two broad groups: specification errors, where we have a “true”
model in mind but fail to estimate it correctly, and misspecification errors, where we do not
know the true model at all. Below we consider some of the most common specification errors.
Omission of a Relevant Variable (Underfitting a Model)
Suppose the true model is
Yi = β0 + β1X1i + β2X2i + εi
but we instead estimate Yi = α0 + α1X1i + vi, omitting the relevant variable X2. The
consequences of this specification error are as follows:
i. If the left-out, or omitted, variable X2 is correlated with the included variable X1, that is,
r12, the correlation coefficient between the two variables, is nonzero, then α̂0 and α̂1 are
biased as well as inconsistent. The bias does not disappear as the sample size gets larger.
ii. Even if X1 and X2 are not correlated, α̂0 is biased, although α̂1 is now unbiased.
iii. The disturbance variance σ² is incorrectly estimated.
iv. The conventionally measured variance of α̂1 (= σ²/Σx1i²) is a biased estimator of the
variance of the true estimator β̂1.
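Consequence (i) can be illustrated by simulation. In this sketch (entirely hypothetical numbers), the true model includes X2, which is correlated with X1; regressing Y on X1 alone pushes the estimated slope toward β1 + 0.8β2 = 4.4 instead of the true β1 = 2.

```python
# Simulating omitted-variable bias in the short regression of y on x1 alone.
import random

random.seed(2)
n = 5000
b0, b1, b2 = 1.0, 2.0, 3.0              # true parameters

x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * v + random.gauss(0, 0.6) for v in x1]   # x2 correlated with x1
y = [b0 + b1 * a + b2 * c + random.gauss(0, 1) for a, c in zip(x1, x2)]

def ols_slope(x, y):
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (c - ybar) for a, c in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

b1_short = ols_slope(x1, y)   # omits x2; biased away from the true value 2.0
print(round(b1_short, 1))     # close to 4.4; the bias does not vanish as n grows
```

Increasing n tightens the estimate around the wrong value, illustrating inconsistency, not just bias.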
Inclusion of an Irrelevant Variable (Overfitting a Model)
Now suppose the true model is
Yi = β0 + β1X1i + εi
but we estimate Yi = α0 + α1X1i + α2X2i + vi, including the irrelevant variable X2. In this
case, we commit the specification error of including an unnecessary variable in the model.
The consequences of this specification error are as follows:
i. The OLS estimators of the parameters of the “incorrect” model are all unbiased and
consistent.
ii. The error variance σ² is correctly estimated.
iii. The usual confidence interval and hypothesis-testing procedures remain valid.
iv. However, the estimated α’s will generally be inefficient; that is, their variances will
generally be larger than those of the β̂’s of the true model.
Errors of measurement
If there are errors of measurement in the regressand only, the OLS estimators are unbiased as
well as consistent but they are less efficient. If there are errors of measurement in the regressors,
the OLS estimators are biased as well as inconsistent. Even if errors of measurement are detected
or suspected, the remedies are often not easy. The use of instrumental or proxy variables is
theoretically attractive but not always practical. Thus it is very important in practice that the
researcher be careful in stating the sources of his/her data, how they were collected, definitions
used, etc. Data collected by official agencies often come with several footnotes and the
researcher should bring those to the attention of the reader.
Researchers have used a variety of criteria to choose between competing models, including
goodness-of-fit measures such as the adjusted R² discussed earlier.
FUNCTIONAL FORMS OF REGRESSION MODELS
In this section we consider some commonly used regression models that may be nonlinear in the
variables but are linear in the parameters, or that can be made so by suitable transformations of
the variables. In particular, we discuss the log-log model, semilog models, reciprocal models and
the logarithmic reciprocal model.
The Log-log Model
Consider the model
Yi = β0 Xi^β1 e^εi
Taking natural logarithms (ln, i.e., log to the base e, where e ≈ 2.718) of both sides and letting
α = lnβ0, the model becomes
lnYi = α + β1 lnXi + εi
This model is linear in the parameters α and 𝛽1 and can be estimated by OLS regression.
Because of this linearity, such models are called log-log, double-log, or log-linear models. If the
assumptions of the classical linear regression model are fulfilled, the parameters can be estimated
by the OLS method by letting Yi* = lnYi and Xi* = lnXi, so that
Yi* = α + β1 Xi* + εi
The coefficient 𝛽1 measures elasticity of Y with respect to X. The model assumes that the
elasticity 𝛽1 is constant.
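The sketch below fits a log-log model by simple OLS on the transformed data; the sample is simulated with a true elasticity of 0.7 (all numbers are hypothetical), so the estimated slope should land near 0.7.

```python
# Estimating a constant elasticity by OLS on log-log transformed data.
import math
import random

random.seed(3)
x = [random.uniform(10, 100) for _ in range(200)]
# True model: Y = 5 * X^0.7 * exp(error), i.e. elasticity 0.7
y = [5.0 * xi ** 0.7 * math.exp(random.gauss(0, 0.05)) for xi in x]

lx = [math.log(v) for v in x]   # X* = ln X
ly = [math.log(v) for v in y]   # Y* = ln Y

xbar, ybar = sum(lx) / len(lx), sum(ly) / len(ly)
beta1 = (sum((a - xbar) * (b - ybar) for a, b in zip(lx, ly))
         / sum((a - xbar) ** 2 for a in lx))   # slope = elasticity estimate
alpha = ybar - beta1 * xbar                    # intercept = ln(beta0) estimate

print(round(beta1, 2))   # close to the true elasticity of 0.7
```

The slope of the regression in logs is read directly as the (assumed constant) elasticity of Y with respect to X.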
Economists, businesspeople, and governments are often interested in finding out the rate of
growth of certain economic variables, such as population, GNP, money supply, employment,
productivity, and trade deficit.
Let 𝑌𝑡 denote real expenditure on services at time t and 𝑌0 the initial value of the expenditure
on services (i.e., the value at the end of 2017). Let us start from the well-known compound
interest formula
𝑌𝑡 = 𝑌0 (1 + 𝑟)𝑡
where r is the compound (i.e., over time) rate of growth of Y. Taking the natural logarithm of the
formula we can write
lnYt = lnY0 + t ln(1 + r)
Letting
lnY0 = β0
ln(1 + r) = β1
and adding the disturbance term, we obtain
lnYt = β0 + β1t + εt
This model is like any other linear regression model in that the parameters β0 and β1 are
linear. The only difference is that the regressand is the logarithm of Y and the regressor is time, t.
Models like this are called semilog models because only one variable appears in logarithmic
form. Specifically, a model in which the regressand is logarithmic and the regressor is in level
form is called a log-lin model. The slope coefficient measures the constant proportional or
relative change in Y for a given absolute change in the value of the regressor; when multiplied
by 100, it gives the percentage change, or growth rate, in Y for an absolute change in X.
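The growth interpretation can be verified with an exact series. Below, a hypothetical series grows at exactly 5% per period; fitting lnY on t recovers β1 = ln(1.05), so the implied compound growth rate is exp(β1) − 1 = 5%.

```python
# Recovering the compound growth rate r from a log-lin (growth) model.
import math

y = [100 * 1.05 ** t for t in range(10)]   # grows at exactly 5% per period
t = list(range(10))

ly = [math.log(v) for v in y]
tbar, lybar = sum(t) / len(t), sum(ly) / len(ly)
b1 = (sum((a - tbar) * (b - lybar) for a, b in zip(t, ly))
      / sum((a - tbar) ** 2 for a in t))   # slope = ln(1 + r)

growth = math.exp(b1) - 1                  # implied compound growth rate
print(round(100 * growth, 1))              # → 5.0 (percent per period)
```

With real, noisy data the same transformation exp(β̂1) − 1 converts the fitted slope back into a growth rate.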
The Lin-log Model
Unlike the growth model just discussed, in which we were interested in finding the percent
growth in Y for an absolute change in X, suppose we now want to find the absolute change in Y
for a percent change in X. A model that can accomplish this purpose can be written as:
Yi = β0 + β1 lnXi + εi
For descriptive purposes we call such a model a lin-log model because the regressand is in level
form while the regressor is in logarithmic form. The slope coefficient measures the absolute change
in Y for a given constant proportional or relative change in X.
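A quick numerical check of this interpretation (with hypothetical coefficients): in Y = β0 + β1 lnX, a 1 percent increase in X changes Y by approximately β1/100 units.

```python
# Lin-log slope interpretation: absolute change in Y for a 1% change in X.
import math

b0, b1 = 10.0, 50.0             # hypothetical fitted coefficients

def y(x):
    return b0 + b1 * math.log(x)

x0 = 200.0
dy = y(x0 * 1.01) - y(x0)       # effect on Y of a 1% increase in X
print(round(dy, 3))             # approximately b1/100 = 0.5
```

The change is the same whatever the starting value x0, which is exactly the constant-relative-change property of the lin-log form.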
Reciprocal Models
Yi = β0 + β1(1/Xi) + εi
Although this model is nonlinear in the variable X because X enters inversely or reciprocally, the
model is linear in β0 and β1 and is therefore a linear regression model.
As X increases indefinitely, the term β1(1/Xi) approaches zero (note: β1 is a constant) and Y
approaches the limiting or asymptotic value β0.
One of the important applications of reciprocal models is the celebrated Phillips curve of
macroeconomics. Using data on the percent rate of change of money wages (Y) and the
unemployment rate (X) for the United Kingdom for the period 1861–1957, Phillips obtained a
curve whose general shape resembles the figure below.
Figure: the Phillips curve, with the annual rate of change of money wages (percent) on the
vertical axis and the unemployment rate on the horizontal axis.
Log Hyperbola or Logarithmic Reciprocal Model
We conclude our discussion of reciprocal models by considering the logarithmic reciprocal
model, which takes the following form:
lnYi = β0 − β1(1/Xi) + εi
Its shape is as depicted in Figure 6.10. As this figure shows, initially Y increases at an increasing
rate (i.e., the curve is initially convex) and then it increases at a decreasing rate (i.e., the curve
becomes concave). Such a model may therefore be appropriate for modelling a short-run production
function. Recall from microeconomics that if labor and capital are the inputs in a production
function and if we keep the capital input constant but increase the labor input, the short-run
output–labor relationship will resemble Figure 6.10.
In this unit you have seen that specification errors occur when we have in mind a “true” model
but somehow we do not estimate the correct model. In model mis-specification errors, we do not
know the true model. You have learnt that missing data, non-random sample and outliers are
some of the data problems. In the same unit you learnt some commonly used regression models
that may be nonlinear in the variables but are linear in the parameters. The regression models
include log-log model, semi-log models, reciprocal models and logarithmic reciprocal model.
REFERENCES
Gujarati, D.N. and Porter, D.C. (2008). Basic Econometrics (Fifth Edition). New York:
McGraw-Hill.
Bernstein, R. and Bernstein, S. (1999). Theory and Problems of Elements of Statistics II:
Inferential Statistics. New York: McGraw-Hill.
Triola, M.F. (2001). Elementary Statistics (Eighth Edition). Boston: Addison Wesley Longman, Inc.