Lecture 2
Regression Analysis with Cross-Sectional Data
Prepared by Quanquan Liu
Fall 2024
Lecture 2. The Simple Regression Model
Definition
The simple linear regression model is $y = \beta_0 + \beta_1 x + u$. Under the zero conditional mean assumption $E(u|x) = 0$, taking expectations gives $E(y|x) = \beta_0 + \beta_1 x$. This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable.
Definition
For a random sample $\{(x_i, y_i): i = 1, \dots, n\}$, the model reads $y_i = \beta_0 + \beta_1 x_i + u_i$, where $u_i$ is the error term for observation i; it contains all factors affecting $y_i$ other than $x_i$.
Deriving the Ordinary Least Squares Estimates
Ordinary Least Squares (OLS): A method for estimating the parameters of a linear regression model. The ordinary least squares estimates are obtained by minimizing the sum of squared residuals.
For any $\hat\beta_0$ and $\hat\beta_1$, define a fitted value for y when $x = x_i$ as
$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$.
The residual for observation i is the difference between the actual $y_i$ and its fitted value:
$\hat u_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$.
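In Stata, fitted values and residuals are available through predict after any regression; a minimal sketch, using the CEOSAL1 regression discussed later in this lecture:
* fitted values and residuals after OLS
use CEOSAL1, clear
quietly reg salary roe
predict salaryhat              // fitted values
predict uhat, residuals        // residuals = actual minus fitted
list salary salaryhat uhat in 1/5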
Deriving the Ordinary Least Squares Estimates
Figure. Fitted values and residuals.
Deriving the Ordinary Least Squares Estimates
We choose $\hat\beta_0$ and $\hat\beta_1$ to minimize the sum of squared residuals,
$\sum_{i=1}^{n} \hat u_i^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$
Solution:
$\hat\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.$
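The slope formula is the sample covariance of x and y divided by the sample variance of x, so it can be checked by hand; a sketch using the CEOSAL1 data from the example below:
* compute the OLS slope and intercept by hand and compare with -regress-
use CEOSAL1, clear
quietly correlate salary roe, covariance
scalar b1 = r(cov_12) / r(Var_2)    // sample cov(x,y) / sample var(x)
quietly summarize roe
scalar xbar = r(mean)
quietly summarize salary
scalar ybar = r(mean)
scalar b0 = ybar - b1*xbar
display "b1 = " b1 "    b0 = " b0
reg salary roe                      // reports the same two estimates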
Deriving the Ordinary Least Squares Estimates
Once we have determined the OLS intercept and slope estimates, we form the OLS regression line:
$\hat y = \hat\beta_0 + \hat\beta_1 x.$
It is also called the sample regression function (SRF): it is the estimated version of the population regression function $E(y|x) = \beta_0 + \beta_1 x$.
In most cases, the slope estimate
$\hat\beta_1 = \Delta \hat y / \Delta x$
is of primary interest. It tells us the amount by which $\hat y$ changes when x increases by one unit.
Deriving the Ordinary Least Squares Estimates
Example. CEO Salary and Return on Equity
$salary = \beta_0 + \beta_1 roe + u,$
where salary is annual CEO salary (in thousands of dollars) and roe is the firm's return on equity (in percent).
Using CEOSAL1, a dataset that contains information on 209 CEOs for the year 1990, the OLS regression line relating salary to roe is
$\widehat{salary} = 963.191 + 18.501\, roe, \qquad n = 209.$
* load the CEO salary data and estimate the regression of salary on roe
use CEOSAL1, clear
reg salary roe
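After reg, the stored coefficients _b[_cons] and _b[roe] can be used for quick predictions; for instance, the predicted salary at roe = 30:
* predicted salary (in thousands of dollars) when roe = 30
display _b[_cons] + _b[roe]*30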
Properties of OLS
Property 1. The OLS residuals sum to zero: $\sum_{i=1}^{n} \hat u_i = 0$.
Property 2. The sample covariance between the regressors and the OLS residuals is zero: $\sum_{i=1}^{n} x_i \hat u_i = 0$.
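Both properties are easy to verify numerically; a minimal check using the CEOSAL1 regression above:
* verify the algebraic properties of the OLS residuals
use CEOSAL1, clear
quietly reg salary roe
predict uhat, residuals
summarize uhat          // mean is zero up to rounding error (Property 1)
correlate roe uhat      // correlation with the regressor is zero (Property 2)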
Goodness-of-Fit
The R-squared of the regression, or the coefficient of determination, is defined as
$R^2 \equiv SSE/SST = 1 - SSR/SST,$
where $SST = \sum_{i=1}^{n} (y_i - \bar y)^2$ is the total sum of squares, $SSE = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$ the explained sum of squares, and $SSR = \sum_{i=1}^{n} \hat u_i^2$ the residual sum of squares.
$R^2$ is the ratio of the explained variation to the total variation; it is interpreted as the fraction of the sample variation in y that is explained by x.
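Equivalently, $R^2$ is the squared sample correlation between $y_i$ and $\hat y_i$, which gives a quick numerical check (again using the CEOSAL1 example):
* R-squared equals the squared correlation between y and its fitted values
use CEOSAL1, clear
quietly reg salary roe
predict salaryhat
quietly correlate salary salaryhat
display "R-squared = " r(rho)^2 " = e(r2) = " e(r2)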
In particular, if $\log(wage) = \beta_0 + \beta_1 educ + u$, then, holding u fixed,
$\%\Delta wage \approx (100 \cdot \beta_1)\, \Delta educ.$
We multiply $\beta_1$ by 100 to get the percentage change in wage given one additional year of education.
Units of Measurement and Functional Form
Using the data in WAGE1 we obtain:
$\widehat{\log(wage)} = 0.584 + 0.083\, educ, \qquad n = 526, \; R^2 = 0.186.$
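A sketch of how this estimate is produced in Stata, assuming WAGE1 contains wage and educ:
* estimate the log wage equation
use WAGE1, clear
gen logwage = log(wage)    // illustrative name; the dataset may already include a log-wage variable
reg logwage educ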
The question is what the estimators estimate on average and how large their variability is in repeated samples:
Unbiasedness of OLS
Assumption SLR.3. Sample Variation in the Explanatory Variable
The sample values of the explanatory variable are not all the same (otherwise it would be impossible to study how different values of the explanatory variable lead to different values of the dependent variable).
Assumption SLR.4. Zero Conditional Mean
The error u has an expected value of zero given any value of the explanatory variable. In other words,
$E(u|x) = 0.$
The value of the explanatory variable must contain no information about the mean of the unobserved
factors.
For a random sample, this assumption implies that $E(u_i|x_i) = 0$ for all $i = 1, \dots, n$.
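Under the assumptions above, OLS is unbiased; a small simulation sketch (the data-generating process and parameter values here are illustrative, not from the lecture):
* simulate a model satisfying the zero conditional mean assumption;
* OLS should recover beta0 = 1 and beta1 = 2 up to sampling error
clear
set seed 2024
set obs 500
gen x = rnormal()
gen u = rnormal()      // drawn independently of x, so E(u|x) = 0
gen y = 1 + 2*x + u
reg y x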
Variances of the OLS Estimators
Heteroskedasticity: The variance of the error term, given the explanatory variable, is not constant; that is, $Var(u|x)$ depends on x, in contrast to the homoskedasticity assumption $Var(u|x) = \sigma^2$.
An example of heteroskedasticity: wage and education, where the variability of wages tends to increase with the level of education.
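A small simulation sketch of heteroskedastic data (all names and numbers here are illustrative):
* generate errors whose variance grows with x
clear
set seed 2024
set obs 500
gen x = runiform(0, 16)            // e.g., years of education
gen u = rnormal(0, 0.5 + 0.2*x)    // error standard deviation increases with x
gen y = 1 + 0.1*x + u
scatter y x                        // the spread of y widens as x grows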
Regression on a Binary Explanatory Variable
Consider $y = \beta_0 + \beta_1 x + u$, where now x is a binary variable. If we impose the zero conditional mean assumption SLR.4, then we obtain
$E(y|x) = \beta_0 + \beta_1 x.$
The difference now is that x can take on only two values. By plugging the values zero and one into the equation above, it is easily seen that
$E(y|x=0) = \beta_0$ and $E(y|x=1) = \beta_0 + \beta_1$.
It follows immediately that
$\beta_1 = E(y|x=1) - E(y|x=0)$
is the difference in the average value of y over the subpopulations with $x = 1$ and $x = 0$.
Regression on a Binary Explanatory Variable
Example. Wage and Race
$wage = \beta_0 + \beta_1 white + u,$
where $white = 1$ if a person is classified as white and zero otherwise. Then
$\beta_1 = E(wage|white = 1) - E(wage|white = 0)$
is the difference in average hourly wages between white and nonwhite workers.
The mechanics of OLS do not change just because x is binary.
The statistical properties of OLS are also unchanged when x is binary.
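A minimal sketch in Stata (assuming the dataset contains wage and a 0/1 variable named white; adjust the names to your data): the OLS slope on a binary regressor reproduces the difference in group means.
* OLS with a binary regressor vs. a direct comparison of group means
use WAGE1, clear                    // assumes WAGE1 includes a 0/1 variable white
reg wage white                      // slope = mean(wage | white=1) - mean(wage | white=0)
tabulate white, summarize(wage)     // group means for comparison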
Summary