
Cross sectional

 Type of data
A cross-sectional data set consists of a sample of
individuals, households, firms, cities, states,
countries, or a variety of other units, taken at a given
point in time.
• Cross-sectional data are widely used in economics
and other social sciences.
• In economics, the analysis of cross-sectional data is
closely aligned with the applied microeconomics
fields, such as labor economics, state and local
public finance, industrial organization, urban
economics, demography, and health economics.
 Type of data….
• A time series data set consists of observations
on a variable or several variables over time.
• Examples of time series data include stock prices,
money supply, the consumer price index, gross
domestic product, annual homicide rates, and
automobile sales figures.
• A pooled cross section has both cross-sectional
and time series features.
• For example, suppose that two cross-sectional
household surveys are taken in the United States,
one in 1985 and one in 1990.
 Type of data………
• A panel data (or longitudinal data) set consists
of a time series for each cross-sectional
member in the data set.
• The key feature of panel data that distinguishes
them from a pooled cross section is that the
same cross-sectional units (individuals, firms, or
counties in the preceding examples) are followed
over a given time period.
 linear regression model
Linear regression estimates how much Y changes when
X changes by one unit.
1. The simple linear regression model
• It is also called the two-variable linear regression
model or bivariate linear regression model because it
relates the two variables x and y.
• Simple regression = regression with 2 variables.
• The variables y and x have several different names used
interchangeably: y is called the dependent
variable, the explained variable, the response variable,
the predicted variable, or the regressand.
 linear regression model….
• x is called the independent variable, the
explanatory variable, the control variable, the
predictor variable, or the regressor.
[Scatterplot: beo against bpop, with fitted values]
 linear regression model….
• The relationship between variables Y and X is
described using the equation of the line of best
fit.
• 𝑦 = 𝛼 + 𝛽𝑥
• α indicates the value of Y when X is equal to
zero (also known as the intercept), and
• β indicates the slope of the line (also known as
the regression coefficient).
• The regression coefficient β describes the change
in Y that is associated with a unit change in X.
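The intercept and slope of the line of best fit can be computed directly from the data with the closed-form OLS formulas. A minimal sketch in pure Python (the data here are made up for illustration):

```python
# Closed-form OLS for simple regression: beta = cov(x, y) / var(x),
# alpha = mean(y) - beta * mean(x). Hypothetical example data.
def ols_simple(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sxy / sxx          # slope: change in y per unit change in x
    alpha = my - beta * mx    # intercept: value of y when x = 0
    return alpha, beta

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]          # exactly y = 1 + 2x
alpha, beta = ols_simple(x, y)
print(alpha, beta)            # -> 1.0 2.0
```

With noiseless data the fitted line recovers the true intercept and slope exactly; with real data the estimates only approximate them.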
[Scatterplot: income (beo) against education (bpop), with fitted values]
Yi = β0 + β1Xi + εi
How did we get that line?
[Figure: for each observation, the residual εi = Yi − Ŷi is the vertical
distance between the observed Yi and the fitted value Ŷi on the line]
[Scatterplot: Black % in state legislatures (Y) against Black % in state
population (X), with fitted line]
β̂0 = 1.31, β̂1 = 0.359
Yi = β0 + β1Xi + εi
 OLS in Stata
• In Stata use the command regress:
• regress [dependent variable] [independent
variable(s)]
regress y x for simple regression
• In a multivariate setting we type:
regress y x1 x2 x3 …
For example, using the data rayu.dta, suppose we
need to assess the determinants of income.
OLS in Stata……
• Outcome (Y) variable – income of the respondents
• Predictor (X) variables – the gender, age, education,
family size, and marital status of the respondents.
• If Income = f(edu):
Income = a + b1 edu
reg income edu
• If income = f(gen, age, edu, fs, ms):
Income = a + b1 gender + b2 age + b3 education
+ b4 familysize + b5 maritalstatus
reg income gender age education familysize
maritalstatus

Simple linear regression using Stata
• Simple linear regression
reg INC EDU
Multiple linear regression using Stata
• Multiple linear regression
regress INC GEN age MRS FS EDU EMS
 General interpretation
 Basic assumptions in OLS
1. Linearity in parameters
2. Random sampling
3. No perfect collinearity among independent
variables
4. Zero conditional mean:
E(u|x1, x2, …, xn) = 0
If the above conditions are fulfilled, the estimator
is unbiased, i.e. E(β̂) = β
 Basic assumptions in OLS….
5. Homoskedasticity
• The error u has the same variance given any values of
the explanatory variables. In other words,
Var(u|x1, …, xn) = σ²
6. Normality
• The population error u is independent of the
explanatory variables x and is normally distributed with
zero mean and variance σ²: u ~ Normal(0, σ²)
 Linearity
• Linear in the variables vs. linear in the
parameters
– Y = a + bX + e (linear in both)
– Y = a + bX + cX² + e (linear in parameters)
– Y = a + b²X + e (linear in variables, not parameters)
• Regression must be linear in parameters
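"Linear in parameters" is what matters for OLS: a model like Y = a + bX + cX² can still be fit by least squares by treating X and X² as two separate regressors. A rough Python sketch of this idea, using hypothetical noiseless data and the normal equations:

```python
# Fit y = a + b*x + c*x^2 by OLS: build regressors (1, x, x^2) and solve
# the normal equations X'X beta = X'y with a small Gaussian elimination.
def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0] * 3
    for r in range(2, -1, -1):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def fit_quadratic(xs, ys):
    rows = [[1.0, x, x * x] for x in xs]       # design matrix rows
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve3(XtX, Xty)

xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x * x for x in xs]       # exact quadratic, no noise
a, b, c = fit_quadratic(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))   # -> 1.0 2.0 3.0
```

The coefficients enter linearly, so the quadratic model is estimated by the same machinery as any multiple regression.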
 Multicollinearity
• An important assumption for the multiple regression
model is that the independent variables are not perfectly
multicollinear.
• One regressor should not be a linear function of
another.
 The presence of multicollinearity will not lead to biased
coefficients.
• If a variable which you think should be statistically
significant is not, consult the correlation coefficients.
• The Stata command to check for multicollinearity is vif
(variance inflation factor).
• A VIF > 10, or a 1/VIF < 0.10, indicates trouble.
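In the two-regressor case the VIF has a simple closed form, VIF = 1 / (1 − r²), where r is the correlation between the two regressors. A minimal Python sketch of that case (the age and education data are hypothetical):

```python
# VIF for two regressors: VIF = 1 / (1 - r^2), where r is the Pearson
# correlation between them. Hypothetical example data.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two(x1, x2):
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)   # blows up as the regressors approach collinearity

age = [25, 30, 35, 40, 45, 50]
education = [12, 10, 14, 12, 16, 15]   # moderately correlated with age
v = vif_two(age, education)
print(round(v, 2))   # well below the trouble threshold of 10
```

With more regressors each VIF comes from regressing that variable on all the others, which is what Stata's vif reports.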
 Multicollinearity…….
[Stata vif output]
• Because 1.78 < 10, there is no multicollinearity problem.
 Multicollinearity…
• Another method to see whether multicollinearity
exists or not:
• use the correlate command in Stata.
• If every entry of the correlation matrix between the
explanatory variables is below 0.8, there is no
perfect collinearity between the explanatory
variables.
• You can also use pwcorr:
• pwcorr GEN age MRS FS EDU INC EMC,
obs sig
 Multicollinearity
• In our case, because all correlations are below
0.8, there is no multicollinearity.
• One method of solving the problem is to drop the
offending variable.
 Testing for homoskedasticity
• An important assumption is that the variance in
the residuals has to be homoskedastic, or
constant.
• Residuals should not vary with lower or higher values
of X (i.e., the fitted values of Y, since Ŷ = Xb). A
definition: "The error term [e] is homoskedastic if
the variance of the conditional distribution of [ei]
is constant for i = 1, …, n, and in particular does not
depend on x; otherwise, the error term is
heteroskedastic" (Stock and Watson, 2003).
 Testing for homoskedasticity…..
• If heteroscedastcity estimates becomes inefficient and
inconsistent.
 cross sectional by its nature, we are likely to face the
problem of heteroscedastcity
• To detect heteroskedasticiy we use the Breusch-Pagan
test.
• The null hypothesis is that residuals are homoskedastic.
• In the example below we reject the null at 95% and
concluded that residuals are homogeneous.
• The command in stata
• hettest
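The mechanics behind the test can be sketched in Python for the one-regressor case (the data below are hypothetical, with an error whose spread grows with x; Stata's hettest uses a variant of this idea based on the fitted values, so treat this only as a simplified illustration):

```python
# Breusch-Pagan idea for one regressor: regress y on x, then regress the
# squared residuals on x; the LM statistic n * R^2 is compared with a
# chi-squared critical value (3.84 at the 5% level with 1 df).
def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def breusch_pagan(x, y):
    a0, b = ols(x, y)
    resid = [yi - (a0 + b * xi) for xi, yi in zip(x, y)]
    u2 = [r * r for r in resid]          # squared residuals
    c0, c1 = ols(x, u2)                  # auxiliary regression of u^2 on x
    fitted = [c0 + c1 * xi for xi in x]
    mu = sum(u2) / len(u2)
    r2 = sum((f - mu) ** 2 for f in fitted) / sum((v - mu) ** 2 for v in u2)
    return len(x) * r2                   # LM statistic

x = list(range(1, 21))
# Hypothetical heteroskedastic data: error magnitude grows with x.
y = [2 + 3 * xi + 0.5 * xi * (-1) ** i for i, xi in enumerate(x, start=1)]
lm = breusch_pagan(x, y)
print(round(lm, 2), "reject H0" if lm > 3.84 else "fail to reject H0")
```

Because the squared residuals clearly trend with x here, the LM statistic exceeds the 5% critical value and the test flags heteroskedasticity.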
 Testing for homoskedasticity…..
[Stata hettest output]
• Because the p-value is 0.0019, we reject H0 and accept H1,
i.e., there is a heteroskedasticity problem.
• One way to deal with this problem is to use
heteroskedasticity-robust standard errors:
• reg INC EDU FS MRS age GEN, robust
 Solving heteroskedasticity…..
 omitted-variable test
• How do we know we have included all the
variables we need to explain Y?
• Testing for omitted-variable bias is
important for our model since it is related to
the assumption that the error term and the
independent variables in the model are not
correlated (E(e|X) = 0).
• In Stata we test for omitted-variable bias using
the ovtest command.
 omitted-variable test…..
[Stata ovtest output]
• The null hypothesis is that the model does not
have omitted-variable bias. The p-value is
lower than the usual threshold of 0.05 (5%
significance level), so we reject the null and
conclude that we do need more variables.
 specification error
• Another command to test model specification is
linktest. It checks whether the model is properly
specified by regressing the observed Y on the
prediction (_hat) and the prediction squared (_hatsq):
linktest
• The thing to look for here is the significance of _hatsq.
• The null hypothesis is that there is no specification
error.
• If the p-value of _hatsq is not significant, then we fail to
reject the null and conclude that our model is correctly
specified.
 specification error…..
[Stata linktest output]
• The p-value of _hatsq is significant at 5%, so we
reject the null and conclude that our
model is not correctly specified.
• Solution: add additional variables by reviewing
different literature.
 normality
• We need to test the residuals for normality:
e = Y – Ŷ.
• We can save the residuals in Stata by issuing
a command that creates them, after we have
run the regression command.
• The command to generate the residuals is:
predict e, resid
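Alongside the graphical checks on the next slides, a quick numeric check of normality is the Jarque-Bera statistic, built from the skewness and kurtosis of the residuals. A Python sketch (the residuals below are hypothetical):

```python
# Jarque-Bera-style normality check: a normal sample has skewness ~0 and
# kurtosis ~3; JB = n/6 * (S^2 + (K - 3)^2 / 4) is small for normal-looking
# data and is compared with a chi-squared(2) critical value (5.99 at 5%).
def jarque_bera(e):
    n = len(e)
    m = sum(e) / n
    m2 = sum((v - m) ** 2 for v in e) / n
    m3 = sum((v - m) ** 3 for v in e) / n
    m4 = sum((v - m) ** 4 for v in e) / n
    s = m3 / m2 ** 1.5          # skewness
    k = m4 / m2 ** 2            # kurtosis
    return n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)

# Hypothetical residuals: symmetric and roughly bell-shaped.
resid = [-3, -2, -2, -1, -1, -1, 0, 0, 1, 1, 1, 2, 2, 3]
jb = jarque_bera(resid)
print(round(jb, 3))   # small value: no evidence against normality
```

A large JB value, by contrast, would point to the same remedies discussed below: check omitted variables, specification, and functional form.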
 Normality…..
• kdensity e, normal
[Kernel density estimate of the residuals with an overlaid normal
density; kernel = epanechnikov, bandwidth = 652.5153]
 Normality…
• histogram e, normal
• histogram e, kdensity normal
[Histogram of the residuals with overlaid normal and kernel densities]
• In addition, the commands pnorm e and qnorm e are also
helpful to test normality.
 Normality….
• If the residuals do not follow a 'normal' pattern,
then you should check for omitted variables,
model specification, linearity, and functional
forms.
• In large samples this assumption is less
important because of the Central Limit Theorem.
• ladder INC
 logistic regression
• Binary logistic regression is a type of regression
analysis where the dependent variable is a
dummy variable:
• The outcome can be coded 1 or 0 (yes or no,
approved or denied, success or failure, voted or
did not vote).
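Instead of a line, logistic regression fits P(Y = 1 | x) = 1 / (1 + exp(−(a + bx))). A minimal Python sketch that estimates a and b by gradient descent on hypothetical 0/1 data (in Stata one would use the logit or logistic command instead):

```python
# Minimal binary logistic regression: fit P(Y=1|x) = sigmoid(a + b*x)
# by gradient descent on the negative log-likelihood. Hypothetical data.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(x, y, lr=0.1, steps=5000):
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        ga = sum(sigmoid(a + b * xi) - yi for xi, yi in zip(x, y)) / n
        gb = sum((sigmoid(a + b * xi) - yi) * xi for xi, yi in zip(x, y)) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 0, 1, 1]        # e.g. approved (1) vs. denied (0)
a, b = fit_logit(x, y)
p_low, p_high = sigmoid(a + b * 0), sigmoid(a + b * 5)
print(round(p_low, 3), round(p_high, 3))   # predicted probability rises with x
```

The fitted b is positive, so the predicted probability of the outcome increases with x, which is the logistic analogue of a positive regression slope.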
• To be continued………
Thanks for your attention
