Cross Sectional
Types of data
A cross-sectional data set consists of a sample of
individuals, households, firms, cities, states,
countries, or a variety of other units, taken at a given
point in time.
• Cross-sectional data are widely used in economics
and other social sciences.
• In economics, the analysis of cross-sectional data is
closely aligned with the applied microeconomics
fields, such as labor economics, state and local
public finance, industrial organization, urban
economics, demography, and health economics.
Types of data…
• A time series data set consists of observations
on a variable or several variables over time.
• Examples of time series data include stock prices,
money supply, the consumer price index, gross
domestic product, annual homicide rates, and
automobile sales figures.
• A pooled cross section has both cross-sectional
and time series features.
• For example, suppose that two cross-sectional
household surveys are taken in the United States,
one in 1985 and one in 1990.
Types of data…
• A panel data (or longitudinal data) set consists
of a time series for each cross-sectional
member in the data set.
• The key feature of panel data that distinguishes
them from a pooled cross section is that the
same cross-sectional units (individuals, firms, or
counties in the preceding examples) are followed
over a given time period.
Linear regression model
Linear regression estimates how much Y changes when
X changes by one unit.
1. The simple linear regression model
• It is also called the two-variable linear regression
model or bivariate linear regression model because it
relates the two variables x and y.
• Simple regression = regression with 2 variables.
• The variables y and x have several different names, used
interchangeably: y is called the dependent
variable, the explained variable, the response variable,
the predicted variable, or the regressand.
Linear regression model…
• x is called the independent variable, the
explanatory variable, the control variable, the
predictor variable, or the regressor.
[Figure: scatter of beo against bpop with fitted values]
Linear regression model…
• The relationship between variables Y and X is
described using the equation of the line of best
fit:
• 𝑦 = 𝛼 + 𝛽𝑥
• α is the value of Y when X is equal to
zero (also known as the intercept), and
• β is the slope of the line (also known as
the regression coefficient).
• The regression coefficient β describes the change
in Y that is associated with a unit change in X.
[Figure: scatter with fitted values; income (Y) plotted against education (X)]
Yᵢ = β₀ + β₁Xᵢ + εᵢ
How did we get that line?
• The residual εᵢ = Yᵢ − Ŷᵢ is the vertical distance between an
observed value Yᵢ and the corresponding fitted value Ŷᵢ.
[Figure: fitted line with a residual marked; MWTP of the respondent (Y) against education (X)]
[Figure: Black % in state legislatures (Y) against Black % in state population (X), with fitted values; β̂₀ = 1.31, β̂₁ = 0.359]
Yᵢ = β₀ + β₁Xᵢ + εᵢ
OLS in Stata
• In Stata, use the command regress:
• regress [dependent variable] [independent
variable(s)]
regress y x for simple regression
• In a multivariate setting we type:
regress y x1 x2 x3 …
For example, using the data rayu.dta, suppose we
need to assess the determinants of income.
OLS in Stata…
• Outcome (Y) variable – income of the respondents
• Predictor (X) variables – the gender, age, education,
family size and marital status of the respondents.
• If Income = f(edu):
Income = a + b1 edu
reg income edu
• If income = f(gender, age, edu, familysize, maritalstatus):
Income = a + b1 gender + b2 age + b3 education + b4
familysize + b5 maritalstatus
reg income gender age education familysize
maritalstatus
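Under the hood, regress solves the normal equations (X′X)b = X′y. A minimal pure-Python sketch of that computation (the `ols` helper and the data are illustrative, and standard errors are omitted):

```python
# Multiple regression via the normal equations (X'X) b = X'y.
def ols(X, y):
    # prepend an intercept column of ones
    X = [[1.0] + list(row) for row in X]
    n, k = len(X), len(X[0])
    # build X'X and X'y
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    # solve by Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(XtX[r][col]))
        XtX[col], XtX[piv] = XtX[piv], XtX[col]
        Xty[col], Xty[piv] = Xty[piv], Xty[col]
        for r in range(col + 1, k):
            f = XtX[r][col] / XtX[col][col]
            for c in range(col, k):
                XtX[r][c] -= f * XtX[col][c]
            Xty[r] -= f * Xty[col]
    # back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (Xty[r] - sum(XtX[r][c] * b[c] for c in range(r + 1, k))) / XtX[r][r]
    return b  # [intercept, b1, b2, ...]

# illustrative data generated from y = 1 + 2*x1 + 3*x2 exactly
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
y = [1, 3, 4, 6, 14]
b = ols(X, y)  # recovers approximately [1.0, 2.0, 3.0]
```

Because the data here fit the linear model exactly, the solver recovers the true coefficients; with real survey data like rayu.dta the fit is approximate and the residuals are nonzero.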
Simple linear regression using Stata
• Simple linear regression:
reg INC EDU
Multiple linear regression using Stata
• Multiple linear regression:
regress INC GEN age MRS FS EDU EMS
General interpretation
Basic assumptions in OLS
1. Linearity in parameters
2. Random sampling
3. No perfect collinearity among independent
variables
4. Zero conditional mean:
E(u | x₁, x₂, …, xₙ) = 0
If the above conditions are fulfilled, the OLS estimator
is unbiased, i.e., E(β̂) = β.
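Unbiasedness means the OLS estimate equals the true parameter on average across repeated samples. A simulation sketch in Python (the true values β₀ = 1 and β₁ = 2 are made up for illustration):

```python
import random

# Draw many samples from y = 1 + 2*x + u with E(u | x) = 0,
# estimate the slope by OLS each time, and average the estimates.
random.seed(0)
true_alpha, true_beta = 1.0, 2.0
slopes = []
for _ in range(2000):
    x = [random.uniform(0, 10) for _ in range(50)]
    # error term u is drawn independently of x, so E(u | x) = 0 holds
    y = [true_alpha + true_beta * xi + random.gauss(0, 1) for xi in x]
    mx, my = sum(x) / 50, sum(y) / 50
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    slopes.append(b)

avg = sum(slopes) / len(slopes)  # close to the true beta of 2.0
```

Each individual estimate misses β, but the average over the 2000 samples is very close to 2, illustrating E(β̂) = β under assumptions 1–4.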
Basic assumptions in OLS…
5. Homoskedasticity
• The error u has the same variance given any values of
the explanatory variables. In other words,
Var(u | x₁, …, xₙ) = σ².
6. Normality
• The population error u is independent of the
explanatory variables and is normally distributed with
zero mean and variance σ²: u ~ Normal(0, σ²)
Linearity
• Linear in the variables vs. linear in the
parameters:
– Y = a + bX + e (linear in both)
– Y = a + bX + cX² + e (linear in parameters only)
– Y = a + b²X + e (linear in variables, not in parameters)
• Regression must be linear in the parameters.
Multicollinearity
• An important assumption of the multiple regression
model is that the independent variables are not perfectly
multicollinear.
• One regressor should not be a linear function of
another.
• The presence of multicollinearity will not lead to biased
coefficients.
• If a variable that you think should be statistically
significant is not, consult the correlation coefficients.
• The Stata command to check for multicollinearity is vif
(variance inflation factor).
• A VIF > 10 or a 1/VIF < 0.10 indicates trouble.
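With only two regressors, the VIF has a closed form: regressing x₁ on x₂ gives R² = r², so VIF = 1 / (1 − r²), where r is their correlation. A pure-Python sketch (the `vif_two` helper and the data are illustrative):

```python
# VIF for a pair of regressors: VIF = 1 / (1 - r^2),
# where r is the sample correlation between x1 and x2.
def vif_two(x1, x2):
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    var1 = sum((a - m1) ** 2 for a in x1)
    var2 = sum((b - m2) ** 2 for b in x2)
    r2 = cov * cov / (var1 * var2)
    return 1.0 / (1.0 - r2)

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]  # nearly 2 * x1, so highly collinear
v = vif_two(x1, x2)              # well above the rule-of-thumb cutoff of 10
```

Because x₂ is almost an exact linear function of x₁, the VIF here is in the hundreds; Stata's vif command would flag the same pair as troublesome.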
Multicollinearity…