M2L2 CLRM & Simple Linear Regression Analysis
Module Overview
More often than not, what we do and how we do it is really due to some factors we take
into consideration. Right? This is exactly how many variables behave! In this module,
we try to measure how much some (dependent) variables are influenced by other
(independent) variables. This is the foundation for many other statistical
techniques, so make sure that you understand the concepts herein.
Module 2: Dependence Techniques
Time Frame:
Learning Outcomes
Introduction
Empirical analysis has a lot to do with explaining relationships between variables, and
it tries to do so using regression models. This technique attempts to quantify how much
one variable changes as another variable changes, and it also provides information that
allows prediction and hypothesis testing. The regression model, however, rests on
assumptions, and you need to know and understand these before you move on to its
many uses.
ABSTRACTION
Danao (2002) simplifies the concepts in CLRM as follows: With one explanatory
variable, the CLRM is formally given by the equation
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖
The error term 𝜀𝑖 is assumed to be a random variable that follows some probability
distribution. This term can arise for many reasons, including (1) non-inclusion of
essential variables in the model, (2) inclusion of variables that should not be in the
model, (3) errors in measurement, (4) randomness of events, and (5) model
misspecification. Since 𝜀𝑖 is a random variable, 𝑌𝑖 will also be a random variable based
on the equation above, and each value of 𝑋𝑖 will be associated with a probability
distribution of 𝑌.
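To see concretely why 𝑌𝑖 inherits randomness from the error term, here is a minimal simulation sketch in Python. The parameter values, sample size, and distributions are arbitrary choices for illustration and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Arbitrary illustrative parameters (not from the course pack)
beta0, beta1, sigma = 2.0, 0.5, 1.0
n = 100

x = rng.uniform(0, 10, size=n)        # explanatory variable values
eps = rng.normal(0, sigma, size=n)    # random error term
y = beta0 + beta1 * x + eps           # Y is random because eps is random

# For the same x, repeated draws of eps would give different y values,
# so Y has a probability distribution conditional on X.
print(y[:5])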
As may now seem clear, econometric analysis employing regression analysis attempts
to establish a causal relationship between two or more economic variables (Hill, Griffiths,
& Lim, 2018). Having an estimate of the model allows one to make predictions and to test
hypotheses about its parameters. We will expound on this further in a later section.
The assumptions of the CLRM and the brief discussions below were culled from Hill, Griffiths,
& Lim (2018) and Danao (2002). They do not provide an exhaustive exposition of
these assumptions; detailed discussions are in the ebooks provided with this course
pack. The only intention here is to give you direction as to what you may have to read
further.
2. The expected value of the error term is zero given 𝑋 = (𝑥1 , 𝑥2 , … , 𝑥𝑛 ), that is,
𝐸 (𝑒𝑖 |𝑋) = 0. If this is true, then the conditional variance of the error term equals the
expected value of its square; the additional assumption of homoskedasticity is that this
variance is constant, 𝑣𝑎𝑟(𝑒𝑖 |𝑋) = 𝜎 2
5. The explanatory variable 𝑥𝑖 must take at least two values. If this were not true, then
𝑥1 = 𝑥2 = ⋯ = 𝑥𝑛 and there would be no variation in 𝑋 with which to estimate its relationship with 𝑌.
7. There must be positive degrees of freedom. This means that the number of
observations must be greater than the number of parameters being estimated; here,
there have to be more than two data points. If there is only one data
point (𝑥1 , 𝑦1 ), then there is an infinite number of lines that go through it.
If, on the other hand, there are only two data points, (𝑥1 , 𝑦1 ) and (𝑥2 , 𝑦2 ), then
there is a unique line that connects the two, but the problem is no longer statistical.
Ordinary Least Squares (OLS) Estimation
OLS is commonly used in estimating the coefficients (𝛽 ), or parameters, of the
regression equation. The least squares rule asserts that the estimated regression line
that best fits the data points is the one that minimizes the sum of squared vertical distances
(residuals) from each data point to the line. That is, 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 and 𝑒̂𝑖 = 𝑦𝑖 − 𝑦̂𝑖 = 𝑦𝑖 − 𝑏0 − 𝑏1 𝑥𝑖 ,
and OLS chooses 𝑏0 and 𝑏1 to minimize the sum of the squared residuals.
The derived OLS estimators are computed using the following equations (obtained by
solving the so-called normal equations):

\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} = \frac{\sum y_i}{n} - b_1 \frac{\sum x_i}{n} \]
The method is referred to as “ordinary least squares” to distinguish it from other
methods such as generalized least squares, weighted least squares, and two-stage least
squares.
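As a quick check of the formulas above, here is a minimal Python sketch that computes 𝑏0 and 𝑏1 directly from their definitions. The small data set is made up purely for demonstration and is not from the course pack.

```python
import numpy as np

# Made-up illustrative data (not from the course pack)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")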
The coefficient of determination (𝑅2 ) is an overall measure of how well the estimated
regression line fits the data. Technically, econometricians define this as “how much
variation in the dependent variable is explained by the regression model”.
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{(n-2)\,\hat{\sigma}^2}{SST} \]
where
\( SSE = \sum (y_i - \hat{y}_i)^2 = \sum \hat{e}_i^2 \) : the error sum of squares; that part of the total variation in \( y \) about its sample mean that is not explained by the regression
\( SSR = \sum (\hat{y}_i - \bar{y})^2 \) : the explained sum of squares; that part of the total variation in \( y \) about its sample mean that is explained by the regression
\( SST = \sum (y_i - \bar{y})^2 \) : the total sum of squares; measures the total variation in \( y \) about its sample mean
Since 𝑅2 measures the proportion (or percentage) of the total variation in 𝑦 that is
explained by the regression model, the higher the 𝑅2 , the better; it also means
that the estimated regression has better predictive ability. There are some limitations
to this, though, as will be explained later. Also, 0 ≤ 𝑅2 ≤ 1. If 𝑅2 = 0, then 𝑆𝑆𝑅 = 0
and 𝑦 and 𝑥 are uncorrelated. Graphically, there is no linear association and the
fitted line is horizontal and identical to 𝑦̅. If 𝑅2 = 1, then all the sample data fall exactly
on the fitted line, so that 𝑆𝑆𝐸 = 0.
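Continuing the illustrative Python sketch from the OLS section, the sums of squares and 𝑅2 can be computed directly from their definitions; the data are again made up for demonstration only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x                     # fitted values
sse = np.sum((y - y_hat) ** 2)          # error sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)      # explained sum of squares
sst = np.sum((y - y_bar) ** 2)          # total sum of squares

r2 = ssr / sst                          # equivalently 1 - sse/sst
print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}, R^2 = {r2:.4f}")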
The decomposition of 𝑆𝑆𝑇 above is usually presented in the analysis of variance (ANOVA)
table of standard statistical software output.
The “simple” in simple linear regression model does not in any way mean that it is easy
to do. It simply refers to a regression model, like the one presented above, with only one
explanatory variable.
At this point, it might be good to review some of the statistical concepts you have
learned in a previous course. Some of the important ones in an introductory course in
econometrics are presented here.
• The 𝑆𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒 𝐹 (the 𝑝 − 𝑣𝑎𝑙𝑢𝑒 of the 𝐹 statistic) is used to evaluate whether the
regression as a whole is significant, that is, whether the coefficient of the explanatory
variable is significantly different from 0. A value of 0.05 or lower usually indicates that
the whole equation, 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 , is significant and may prove to have some utility.
• The 𝑝 − 𝑣𝑎𝑙𝑢𝑒𝑠 of the 𝑡 statistics are used to evaluate whether the individual
explanatory variables are significant or not. In a simple regression analysis, we are
only looking at one independent variable. A value of 0.05 or lower is also usually
indicative that the variable's coefficient is significantly different from 0, such that the
variable is useful in the regression model.
• Confidence intervals (𝐶𝐼 ) are useful because they allow sensitivity analysis.
Statistical software usually reports 95% confidence intervals by default.
Simply put, if different samples were drawn from the same population and a 95%
interval constructed from each, about 95% of those intervals would contain the true
value of the coefficient. A Python sketch showing where all three of these statistics
appear in a regression output follows this list.
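If you also have Python available alongside Excel or SPSS, a library such as statsmodels reports all three of these quantities (the 𝐹 test, the 𝑡 statistics with their 𝑝-values, and the 95% confidence intervals) in a single summary. This is only a sketch using made-up data, not the course data.

```python
import numpy as np
import statsmodels.api as sm

# Made-up illustrative data (not from the course pack)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(0, 1, 30)

X = sm.add_constant(x)               # adds the intercept column
results = sm.OLS(y, X).fit()

print(results.summary())             # F statistic, t stats, p-values, R^2
print(results.conf_int(alpha=0.05))  # 95% confidence intervals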
Let’s do a simple regression analysis using the consumption and income data found in
Danao (2002). The results presented in the book used EViews, but we will try to replicate
them using MS Excel, which should be readily available on your computer.
Philippine data for the real personal consumption expenditure (𝑃𝐶𝐸𝑅) and for the real
gross national product (𝐺𝑁𝑃𝑅) for the period 1975-1996 will be used in this example.
Initially, you may want to do a scatter diagram of 𝑃𝐶𝐸𝑅 against 𝐺𝑁𝑃𝑅 just to give you
some indication that you are indeed working on a linear regression problem here. To
do this, highlight the data then click on “Insert”. Then, choose “Scatter” (the type of
plot) under “Charts”. As you may well observe, the data points appear to line up
from the lower left to the upper right of the graph. The regression line has been
superimposed there to further support this point.
Let’s do it first by using the estimator formulas presented above (see MS Excel file:
M2L1 Simple Regression Example 1).

\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} = \frac{\sum y_i}{n} - b_1 \frac{\sum x_i}{n} \]
Try doing it by yourself first; just work out the formulas in MS Excel. Here, 𝑥𝑖 and 𝑦𝑖 refer
to the 𝑖 𝑡ℎ observation of the independent and dependent variables, respectively, while
\( \bar{x} = \frac{\sum x_i}{n} \) and \( \bar{y} = \frac{\sum y_i}{n} \) are the averages of the independent and dependent variables,
respectively. Do your calculations in MS Excel, then compare them with the calculations in
“Sheet1” of the same file.
If you are not yet familiar with formulas or equations in MS Excel, just point to a cell
and press F2 to see how that particular value is being computed. An example below is for
(𝑥𝑖 − 𝑥̅ ):
You should see there that the value of -191.126 is equal to the value in cell B10 less
the average value in B33. The $ signs in $B$33 are really just a way to fix that cell
when copying the formula to the rest of the data points.
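If you would like to cross-check your spreadsheet work in Python, the same column-by-column computation can be sketched with pandas as below. The file name, sheet name, and column labels are placeholders for however you have stored the 𝑃𝐶𝐸𝑅 and 𝐺𝑁𝑃𝑅 series, so adjust them to your own copy of the data.

```python
import pandas as pd

# Placeholder file, sheet, and column names; adjust to your own copy of the data
df = pd.read_excel("M2L1 Simple Regression Example 1.xlsx",
                   sheet_name="Sheet1")
x = df["GNPR"]   # real gross national product
y = df["PCER"]   # real personal consumption expenditure

x_bar, y_bar = x.mean(), y.mean()
b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")  # compare with the Excel results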
• The goodness-of-fit statistic, 𝑅2 , appears to be acceptable. It says that 93.5% of the
variation in 𝑃𝐶𝐸𝑅 is explained by the regression model with 𝐺𝑁𝑃𝑅 as the explanatory
variable. Another way of stating this is “93.5% of the variation in 𝑃𝐶𝐸𝑅 is
explained by variation in 𝐺𝑁𝑃𝑅.”
• When asked what the estimated equation is, simply pick up the 𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡𝑠
estimates. For this problem, it is 𝑃𝐶𝐸𝑅𝑖 = −96.675 + 0.869𝐺𝑁𝑃𝑅𝑖 . This
means that for every unit increase in 𝐺𝑁𝑃𝑅, there is an expected increase of about
0.869 in 𝑃𝐶𝐸𝑅. The positive slope of 0.869 confirms the sign put forward in
economic theory: an increase in 𝐺𝑁𝑃𝑅 will result in an increase
in 𝑃𝐶𝐸𝑅. Economists refer to this slope as the marginal propensity to consume.
So, assuming that you can project 𝐺𝑁𝑃𝑅 for the years 1997-2001 as shown
below, you will also be able to predict values of 𝑃𝐶𝐸𝑅 over the same period.
But just how did we arrive at these estimates for 𝑃𝐶𝐸𝑅? Simply substitute each
projected value of 𝐺𝑁𝑃𝑅 into the estimated equation, starting with 1997.
You will see some differences between the estimated values computed this way and the
values of “Estimated 𝑃𝐶𝐸𝑅” indicated in the table, because the former were based on
rounded-off values from the software output.
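For reference, here is a minimal Python sketch of that substitution. The intercept and slope are the estimates reported above, but the projected 𝐺𝑁𝑃𝑅 figures are placeholders only, not the actual projections used in the example.

```python
# Prediction by substituting projected GNPR into the estimated equation
# PCER_hat = -96.675 + 0.869 * GNPR   (estimates from the regression above)
b0, b1 = -96.675, 0.869

# Placeholder GNPR projections for 1997-2001 (not the actual figures)
gnpr_projections = {1997: 1000.0, 1998: 1050.0, 1999: 1100.0,
                    2000: 1160.0, 2001: 1220.0}

for year, gnpr in gnpr_projections.items():
    pcer_hat = b0 + b1 * gnpr
    print(year, round(pcer_hat, 2))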
• The 95% 𝐶𝐼 for 𝐺𝑁𝑃𝑅 tells us that, under repeated sampling (or repeated trials),
intervals constructed this way would contain the value of the coefficient about 95% of
the time; here the interval runs from 0.763 to 0.976. This is further evidence that 𝐺𝑁𝑃𝑅
is significant: because the range from 0.763 to 0.976 does not include 0, the coefficient
of 𝐺𝑁𝑃𝑅 cannot be zero and is therefore significant.
M2L2 Example 2: Simple Linear Regression Analysis. Food Expenditure
and Household Income (Hill, Griffiths, & Judge, 1997). The data show survey
results from 40 households regarding their weekly income and food
expenditure. We will try to find the relationship between the two variables.
1. Create a scatterplot that will allow you to see whether the data seem to be linearly
related.
To draw the line through the data points, double click on any of the data points
and the Chart Editor will appear as shown below. Click on Add Fit Line at Total
and a Properties window is shown. Click on Linear\Close. Close the Chart
Editor window and the graph will now be updated with the regression line.
2. Do a regression analysis with exp as the dependent variable and inc as the independent
variable using SPSS.
3. What is the regression equation? 𝑒𝑥𝑝 = 40.768 + 0.128𝑖𝑛𝑐 . What does this mean?
It means that a unit increase in Weekly Household Income will result in an
increase in Food Expenditure of 0.12829. Hence, if Weekly Household Income
increases by P1,000.00, Food Expenditure will increase by P128.29.
4. Is the regression equation significant? How do you know? Explain. Yes, because
the 𝑆𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒 𝐹 is 0.000.
5. Is Income a good explanatory variable for Expenditure? How do you know?
Explain. Yes, the 𝑝-value of the 𝑡-statistic for Income is also very low at 0.000.
6. How do you assess the goodness of fit? It would seem low considering an 𝑅2 =
0.317. This means that variation in Income is only able to explain about 32% of
the variation in Expenditure.
7. What does the 95% CI tell us? It says that the increase in Expenditure can be as
low as P66.47 or as high as P190.01 if Income increases by P1,000.00.
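If you would like to cross-check these SPSS results in Python, a sketch along the following lines would work. The file name is a placeholder, and the column names exp and inc follow the text; adjust the loading step to wherever you have saved the 40-household data.

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder path; point this at your own copy of the 40-household data set,
# with columns exp (food expenditure) and inc (weekly household income)
data = pd.read_csv("food_expenditure.csv")

X = sm.add_constant(data["inc"])
results = sm.OLS(data["exp"], X).fit()

print(results.params)      # compare with the intercept 40.768 and slope 0.128
print(results.pvalues)     # significance of the intercept and of inc
print(results.rsquared)    # compare with R^2 = 0.317
print(results.conf_int())  # 95% confidence intervals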
4. Is the independent variable significant? Why do you say so?
5. What is the 95% 𝐶𝐼 for the independent variable?
6. Let’s convert our variable salary into its natural logarithmic form (ln).
Transform\Compute Variable
Type your variable name (example: lnsalary) on Target Variable:
Click on All in Function Group: and select Ln on Functions and Special Variables:
Click on your variable to be transformed (example: salary) and move to Numeric
Expression using the move arrow then OK
You should see the newly created variable in your Data View
Do a scatter plot of roe against lnsalary. What can you observe? Do you see any
advantage in converting a variable into its ln form?
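The same transformation is a one-liner in Python, sketched below for comparison. The variable names salary, roe, and lnsalary follow the exercise, but the file name is a placeholder for wherever you have stored the data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder path; use your own copy of the salary/roe data
df = pd.read_csv("salary_roe.csv")       # columns assumed: salary, roe

df["lnsalary"] = np.log(df["salary"])    # natural log, like SPSS's Ln function

# Scatter plots before and after the transformation
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(df["roe"], df["salary"])
ax1.set_xlabel("roe")
ax1.set_ylabel("salary")
ax2.scatter(df["roe"], df["lnsalary"])
ax2.set_xlabel("roe")
ax2.set_ylabel("lnsalary")
plt.show()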
CLOSURE
Congratulations! You have just completed Lesson 2 of this module. The next lesson
should be even more interesting as we include more variables in the models being
estimated.