
Debre Markos University
College of Business and Economics
Econometrics II
Abebe Mucheye Kassie (M.Sc.)
Aug 2022

Chapter One
Regression on Dummy Variables
The nature of dummy variables
• In regression analysis the dependent variable is frequently influenced not only by variables that can be readily quantified on some well-defined scale (e.g., income, output, prices, costs, height, and temperature), but also by variables that are essentially qualitative in nature (e.g., sex, race, color, religion, nationality, wars, earthquakes, strikes, political upheavals, and changes in government economic policy).

• One way of "quantifying" such qualitative attributes is to construct artificial variables that take on values of 1 or 0, 0 indicating the absence of an attribute and 1 indicating its presence.

Cont…
• Variables that assume such 0 and 1 values are
called dummy variables.
• Alternative names are indicator variables, binary
variables, categorical variables, and dichotomous
variables.
• Dummy variables can be used in regression models just as
easily as quantitative variables.
• As a matter of fact, a regression model may contain
explanatory variables that are exclusively dummy, or
qualitative, in nature.
Regression on one quantitative variable and one
qualitative variable with two classes, or categories

• Consider the model:

Yi = α1 + α2Di + βXi + ui          (5.03)

where Yi = annual salary of a teacher, Xi = years of teaching experience, and Di = 1 if male, 0 otherwise.

Cont…
• Model (5.03) contains one quantitative variable
(years of teaching experience) and one qualitative
variable (sex) that has two classes (or levels,
classifications, or categories), namely, male and
female.
• What is the meaning of this equation?
Assuming, as usual, that E(ui) = 0, we see that:

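Assuming the specification in (5.03), the implied mean salary functions are:

E(Yi | Xi, Di = 0) = α1 + βXi           (mean salary, female teachers)
E(Yi | Xi, Di = 1) = (α1 + α2) + βXi    (mean salary, male teachers)

• The two regression lines share the same slope β but have different intercepts: α2 tells by how much the mean salary of the category coded 1 differs from that of the base category, for a given number of years of experience.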
Dummy variable trap

• If a qualitative variable has m categories, introduce only m − 1 dummy variables (when the model contains an intercept). Introducing a dummy for every category makes the dummies sum to 1 for each observation, so they are perfectly collinear with the intercept and the model cannot be estimated. This situation is called the dummy variable trap.
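A minimal numerical sketch of the trap in Python (synthetic data; purely illustrative):

import numpy as np

# Sketch of the dummy variable trap with a tiny synthetic design matrix.
male = np.array([1, 0, 1, 1, 0, 0])        # 1 = male, 0 = female
female = 1 - male
ones = np.ones_like(male)

# Intercept + a dummy for EVERY category: perfectly collinear,
# since ones = male + female for every observation.
X_trap = np.column_stack([ones, male, female])
print(np.linalg.matrix_rank(X_trap))       # 2, not 3 -> X'X is singular

# Drop one dummy (the base category) and full rank is restored.
X_ok = np.column_stack([ones, male])
print(np.linalg.matrix_rank(X_ok))         # 2 = number of columns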
The base category
• The group, category, or classification that is assigned the value of 0 is often referred to as:
 The base,
 Benchmark,
 Control,
 Comparison,
 Reference, or
 Omitted category.
It is the base in the sense that comparisons are
made with that category.
Regression on one quantitative variable and
two qualitative variables

• The technique extends directly: each qualitative variable enters the model with its own set of dummies, one fewer than its number of categories, and each qualitative variable has its own base category against which comparisons are made.
Binary Choice Model

• In the models considered so far, dummy variables appeared only among the regressors. In binary choice models the dependent variable itself is qualitative: it takes the value 1 if some event occurs (for example, a household owns a house) and 0 otherwise.

• The object of interest is then the probability that the event occurs given the explanatory variables, P(Yi = 1 | Xi); the linear probability, logit, and probit models that follow are alternative ways of modeling this probability.
Linear probability model (LPM)

• The LPM applies ordinary least squares directly to a binary dependent variable:

Yi = β1 + β2Xi + ui,   where Yi = 1 or 0

• Since E(Yi | Xi) = P(Yi = 1 | Xi), the fitted value is interpreted as the probability that the event occurs given Xi; hence the name linear probability model.

• The LPM is simple to estimate and interpret, but it has well-known drawbacks: the disturbances are non-normal and heteroscedastic, the conventional R² is of limited value, and, most seriously, the fitted probabilities are not constrained to lie between 0 and 1.
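A minimal sketch of an LPM in Python (statsmodels on synthetic data; the variable names and numbers are illustrative, not from these slides):

import numpy as np
import statsmodels.api as sm

# Fit an LPM: OLS with a 0/1 dependent variable, on synthetic data.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (1.5 * x + rng.normal(size=200) > 0).astype(float)   # binary outcome

X = sm.add_constant(x)
lpm = sm.OLS(y, X).fit()
p_hat = lpm.predict(X)                  # fitted values = estimated P(y=1|x)

print(lpm.params)
# The main drawback in action: fitted "probabilities" are not
# constrained to [0, 1] and typically spill outside it here.
print("min:", p_hat.min(), "max:", p_hat.max())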
Logistic regression
► In Econometrics I we dealt with multiple regression with a continuous dependent variable, extending the methods of simple linear regression.

► In many studies, however, the outcome variable of interest is the presence or absence of some condition: whether or not the subject has a particular characteristic, such as a symptom of a certain disease.

► We cannot use ordinary multiple linear regression for such data; instead we can use a similar approach known as multiple linear logistic regression, or simply logistic regression.
Logistic regression:
♣ In general, there are two main uses of logistic regression.

♣ The first is the prediction (estimation) of the probability that an individual will have (develop) the characteristic. For example, logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing cancer after controlling for other associated risks.

♣ The second is that logistic regression provides knowledge of the relationships, and the strength of association, between an outcome variable (a dependent variable with only two categories) and explanatory (independent) variables that can be categorical or continuous.
Logistic regression
The Model:
► The basic principle of logistic regression is much the same as for ordinary multiple regression.

► The main difference is that instead of developing a model that uses a combination of the values of a group of explanatory variables to predict the value of the dependent variable directly, we predict a transformation of the dependent variable.

► The dependent variable in logistic regression is usually dichotomous; that is, it can take the value 1 with a probability of success p, or the value 0 with a probability of failure 1 − p. This type of variable is called a binomial (or binary) variable.
Logistic regression
 Although not discussed in this pack, logistic regression has also been extended to cases where the dependent variable has more than two categories, known as multinomial logistic regression.

 When the multiple classes of the dependent variable can be ranked, ordinal logistic regression is preferred to multinomial logistic regression.
♣ The logit transformation is written as logit(p), where p is the proportion of individuals with the characteristic.

♣ For example, if p is the probability of an individual having its own house, then 1 − p is the probability that they do not have one.

♣ The ratio p / (1 − p) is called the odds, and thus

logit(p) = ln( p / (1 − p) )

is the log odds. The logit can take any value from minus infinity to plus infinity.

♣ We can fit regression models to the logit which are very similar to the ordinary multiple regression models found for data from a normal distribution.

♣ We assume that relationships are linear on the logistic scale:

ln( p / (1 − p) ) = a + b1X1 + b2X2 + … + bnXn

where X1, …, Xn are the predictor variables and p is the proportion to be predicted. The calculation is computer intensive.

♣ Solving for p gives

p = e^(a + b1X1 + b2X2 + … + bnXn) / (1 + e^(a + b1X1 + b2X2 + … + bnXn))

If Z = a + b1X1 + b2X2 + … + bnXn, the above equation becomes:

p = e^Z / (1 + e^Z)
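To make the transformation concrete (illustrative numbers): if p = 0.8, the odds are 0.8 / 0.2 = 4 and logit(p) = ln 4 ≈ 1.39; if p = 0.2, the odds are 0.25 and logit(p) = ln 0.25 ≈ −1.39. Probabilities above 0.5 map to positive logits, probabilities below 0.5 to negative logits, and p = 0.5 to a logit of 0.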
Significance tests
► Coefficients are tested for significance, for inclusion in or elimination from the model, using several different techniques.

I) Z-test
The significance of each variable can be assessed by treating

Z = b / se(b)

as a standard normal variable, where b is the estimated coefficient and se(b) is its standard error.

► This Z value is then squared, yielding a Wald statistic with a chi-square distribution. However, there are problems with the use of the Wald statistic.

 The likelihood-ratio test is more reliable for small sample sizes than the Wald test.
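For instance (illustrative numbers): if b = 0.62 and se(b) = 0.21, then Z = 0.62 / 0.21 ≈ 2.95 and the Wald statistic is Z² ≈ 8.7, which exceeds the 5% critical value of 3.84 for a chi-square with 1 degree of freedom, so the coefficient would be judged significant at the 5% level.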
Significance tests
II) Likelihood-Ratio Test:

► Logistic regression uses maximum-likelihood estimation to compute the coefficients of the logistic regression equation.

N.B. Multiple regression uses the least-squares method to find the coefficients of the independent variables in the regression equation (it computes the coefficients that minimize the sum of squared residuals over all cases).

► Before proceeding to the likelihood-ratio test, we need to know about the deviance, which is analogous to the residual sum of squares from a linear model.
Deviance
 The deviance of a model is −2 times the log likelihood (−2LL) associated with that model.

 As a model's ability to predict outcomes improves, the deviance falls; poorly fitting models have higher deviance.

 If a model predicts the outcomes perfectly, the deviance is zero.

 This is analogous to the situation in linear regression, where the residual sum of squares falls to 0 if the model predicts the values of the dependent variable perfectly.
 Based on the deviance, it is possible to construct an analogue of r² for logistic regression, commonly referred to as the pseudo r².

 If G1² is the deviance of the model with the variables and G0² is the deviance of the null model (intercept only), the pseudo r² of the model is:

pseudo r² = 1 − G1² / G0²
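Illustrative numbers: if the null model has deviance G0² = 100 and the fitted model has deviance G1² = 80, then pseudo r² = 1 − 80/100 = 0.20.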
 The likelihood ratio test (LRT), which makes use of the deviance, is analogous to the F-test from linear regression.

 In its most basic form, it tests the hypothesis that all the coefficients in a model are equal to 0:

H0: β1 = β2 = … = βk = 0

 The test statistic, the difference in deviance G0² − G1², has a chi-square distribution with k degrees of freedom.
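A minimal sketch of the LRT for a logit model in Python (statsmodels on synthetic data; names and numbers are illustrative):

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Synthetic binary data generated from a known logit model.
rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
z = -0.5 + 1.0 * x1 + 0.5 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-z)))            # P(y=1) = e^z / (1 + e^z)

null = sm.Logit(y, np.ones(n)).fit(disp=0)           # intercept-only model
full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)

G0_sq = -2 * null.llf                                # deviance of null model
G1_sq = -2 * full.llf                                # deviance of fitted model
lrt = G0_sq - G1_sq                                  # tests H0: beta1 = beta2 = 0
p_value = stats.chi2.sf(lrt, df=2)                   # k = 2 restrictions
print(f"LRT = {lrt:.2f}, p = {p_value:.4g}, pseudo r2 = {1 - G1_sq/G0_sq:.3f}")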
Assumptions
► Logistic regression is popular in part because it enables the researcher to overcome many of the restrictive assumptions of OLS regression:

1. Logistic regression does not assume a linear relationship between the dependent and the independent variables.

2. The dependent variable need not be normally distributed.

3. The dependent variable need not be homoscedastic for each level of the independents; that is, there is no homogeneity-of-variance assumption.
However, other assumptions still apply:

1. Meaningful coding. Logistic coefficients will be difficult to interpret if the variables are not coded meaningfully. The convention for binomial logistic regression is to code the class of the dependent variable that is of greatest interest as 1 and the other class as 0.

2. Inclusion of all relevant variables in the regression model.

3. Exclusion of all irrelevant variables.

4. Error terms are assumed to be independent (independent sampling).
5. No multicollinearity:

 To the extent that one independent variable is a linear function of another, the problem of multicollinearity will occur in logistic regression, as it does in OLS regression.

 As the independent variables become more highly correlated with one another, the standard errors of the logit (effect) coefficients become inflated.
8. Large samples: Unlike OLS regression, logistic regression uses maximum-likelihood estimation (MLE) rather than ordinary least squares to derive the parameter estimates.

 MLE relies on large-sample (asymptotic) properties.

♦ In small samples one may get high standard errors.
Hosmer and Lemeshow Test
♣ The Hosmer-Lemeshow goodness-of-fit statistic is used to assess whether the necessary assumptions for the application of multiple logistic regression are fulfilled.

♣ The Hosmer and Lemeshow goodness-of-fit statistic is computed as the Pearson chi-square from the contingency table of observed frequencies and expected frequencies.

♣ A good fit as measured by the Hosmer-Lemeshow test yields a large p-value (much larger than 0.05).

♣ In SPSS, the Hosmer-Lemeshow goodness-of-fit test is obtained from the logistic regression menus:

Analyze → Regression → Binary Logistic → Options → Hosmer-Lemeshow goodness-of-fit
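As background (the standard construction of the test, summarized here for reference): observations are sorted by predicted probability and divided into about ten groups ("deciles of risk"); within each group the observed and expected numbers of successes and failures are compared via a Pearson chi-square, which is referred to a chi-square distribution with g − 2 degrees of freedom, where g is the number of groups.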
Probit model

• Like the logit model, the probit model is designed for a binary dependent variable; the basic difference between the logit and probit models lies in the distribution function used for the probability.

• The logit model uses the logistic CDF, P(Yi = 1) = e^Z / (1 + e^Z), whereas the probit model uses the standard normal CDF, P(Yi = 1) = Φ(Z), where Z = a + b1X1 + … + bnXn.

• The two distributions are very similar in shape, so the models usually lead to similar conclusions in practice; the logistic distribution has slightly fatter tails.
