Lecture 08 Dummy Variables

This lecture discusses the use of dummy variables in regression models to account for qualitative factors that cannot be numerically measured. It covers the definition, types, and applications of dummy variables, including intercept, slope, and interactive dummies, as well as the concept of the dummy variable trap and the importance of selecting a reference category. Additionally, it explains how to perform statistical tests to evaluate the significance of these variables in the model.


Lecture 8

Dummy variables
Frequently, some factors we would like to include in a regression model are qualitative in nature and therefore not numerically measurable. One possible approach is to divide the observations into several groups according to whether they possess a certain qualitative characteristic, and then analyse the differences between the regression coefficients across the respective groups. Alternatively, one can estimate a single regression for all observations and measure the influence of the qualitative factor by introducing a so-called dummy variable. This variable takes the value of either zero or one, depending on whether the given observation possesses the qualitative characteristic we want to account for. It allows us to test the significance of the effect of the corresponding qualitative factor; moreover, under certain assumptions the regression estimates become more efficient. This lecture analyses different ways of including dummy variables in a model, in accordance with the initial hypothesis about how differences in qualitative characteristics can affect the relationship. First, we illustrate where and how dummy variables are used with several examples. Second, the lecture describes different types of dummy variables, including intercept, slope, and interaction dummies. It then discusses the dummy variable trap and how estimation results depend on the choice of reference category. Finally, we examine the Chow test, which enables us to compare relationships across different subsamples.

How to define and use dummy variables:


Dummy variables are used to represent a set of categories that are qualitative in nature. Note
that dummy variables refer only to explanatory variables: the dependent variable is never a dummy
(in fact, if the dependent variable can take only the values 0 or 1, the model is called a binary
choice model).
Let’s consider a qualitative variable that has 𝑁 categories. The standard procedure is to
choose one category as the reference category and to define dummy variables for each of the
others. The reference category is used as the basis of comparison, and it is good practice to select
the most normal and basic category as the reference. The number of dummies is therefore equal to
𝑁 − 1. However, if we do not include the intercept (constant term) in the model, then 𝑁 dummies
are used. We will analyse this result in more detail in the section on the dummy variable trap.
Moreover, for models with several qualitative variables, each of which defines a corresponding
group, the result is the following: the number of dummy variables in a group = the number of
categories of the corresponding characteristic − 1.
Examples:
• Accounting for gender: If we believe that gender can be a significant factor in explaining the
dependent variable (for example, Earnings) then the following procedure and set of dummies
(depending on the chosen reference category) can be used:
1) Define a qualitative variable: gender;
2) Categories: male and female. Hence, 𝑁 = 2;
3) Define a reference category:

Reference category: Female => Dummy: 𝑀𝑎𝑙𝑒 = 1 for male, 0 for female
Reference category: Male => Dummy: 𝐹𝑒𝑚𝑎𝑙𝑒 = 1 for female, 0 for male
• Accounting for economic conditions: crisis
Dummy: 𝐶𝑟𝑖𝑠𝑖𝑠 = 1 for a period of crisis, 0 for other periods
• Accounting for seasonal effects: some variables differ significantly across the seasons of the
year. For example, consider expenditures on fuel black oil for boiler houses. If we want to
use the quarterly data to study the changes in expenditure across years, it is necessary to take
into account the seasonal factor. This can be done with the help of dummy variables:
There are 4 seasons (quarters) = 4 categories => 3 dummy variables are used.
Let’s define the first quarter as the reference category. Hence, the dummy variables are:
𝐷2 = 1 for the second quarter, 0 otherwise;
𝐷3 = 1 for the third quarter, 0 otherwise;
𝐷4 = 1 for the fourth quarter, 0 otherwise.
Therefore,
𝐷2 = 𝐷3 = 𝐷4 = 0, if the observation refers to the first quarter;
𝐷2 = 1; 𝐷3 = 𝐷4 = 0, if the observation refers to the second quarter;
𝐷3 = 1; 𝐷2 = 𝐷4 = 0, if the observation refers to the third quarter;
𝐷4 = 1; 𝐷2 = 𝐷3 = 0, if the observation refers to the fourth quarter.
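As a sketch, the seasonal dummies above can be built mechanically. The quarterly data here are hypothetical, and numpy is used purely for illustration:

```python
import numpy as np

# Hypothetical quarter labels for three years of quarterly data.
quarter = np.array([1, 2, 3, 4] * 3)

# Reference category: first quarter => define only D2, D3, D4.
D2 = (quarter == 2).astype(int)
D3 = (quarter == 3).astype(int)
D4 = (quarter == 4).astype(int)
dummies = np.column_stack([D2, D3, D4])

# A first-quarter observation has D2 = D3 = D4 = 0;
# every other observation activates exactly one dummy.
```

Each row of `dummies` therefore encodes the quarter of the observation exactly as in the four cases listed above.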

Intercept and slope dummy variables:


Dummy variables can be used to test for a change in intercept or a change in slope.

Intercept dummy
An intercept dummy assumes that the qualitative variables introduced into the regression are
responsible only for shifts in the constant term. The slope of the regression line is identical for
each category of the qualitative variables; in other words, marginal effects are not affected by the
inclusion of the qualitative characteristic. Consider a model with one regressor 𝑋2 and one dummy
variable 𝐷:
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 ⋅ 𝐷 + 𝑢
Interpretation (reference category is 𝐷 = 0):
If 𝐷 = 0, then 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢;
If 𝐷 = 1, then 𝑌 = (𝛽1 + 𝛽3 ) + 𝛽2 𝑋2 + 𝑢.

Slope dummy
The assumption that the categories of the qualitative variable do not influence the slope of the
regression line is not always plausible. Sometimes we might want to allow the slope coefficients
on other variables to vary between groups, thus accounting for different marginal effects. This can
be done by creating a slope dummy, equal to a dummy variable times another variable:
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 (𝐷 ⋅ 𝑋2 ) + 𝑢
Interpretation (reference category is 𝐷 = 0):
If 𝐷 = 0, then 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢;
If 𝐷 = 1, then 𝑌 = 𝛽1 + (𝛽2 + 𝛽3 )𝑋2 + 𝑢.
When we believe that differences in qualitative characteristics have influence on both the intercept
and marginal effects of other explanatory variables, then we introduce a complete set of all dummy
variables:
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 ⋅ 𝐷 + 𝛽4 (𝐷 ⋅ 𝑋2 ) + 𝑢

If 𝐷 = 0, then 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢;
If 𝐷 = 1, then 𝑌 = (𝛽1 + 𝛽3 ) + (𝛽2 + 𝛽4 )𝑋2 + 𝑢.
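A minimal numerical sketch of this combined specification, using simulated data and numpy's least-squares solver in place of a regression package (all numbers below are illustrative assumptions, not results from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X2 = rng.uniform(0, 10, n)
D = (rng.uniform(size=n) < 0.5).astype(float)

# Simulated "truth": D shifts the intercept by 2.0 and the slope by 0.5.
u = rng.normal(0, 0.1, n)
Y = 1.0 + 0.8 * X2 + 2.0 * D + 0.5 * D * X2 + u

# Design matrix: constant, X2, intercept dummy D, slope dummy D*X2.
Xmat = np.column_stack([np.ones(n), X2, D, D * X2])
b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)

# b[2] estimates the intercept shift (beta3), b[3] the slope shift (beta4).
```

The fitted `b` recovers the group-specific lines: for `D = 0` the line `b[0] + b[1]*X2`, for `D = 1` the line `(b[0] + b[2]) + (b[1] + b[3])*X2`, exactly as in the interpretation above.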

Dummy variables of interaction


A dummy variable of interaction is introduced when the presence of two qualitative
characteristics simultaneously brings about an additional effect on the dependent variable. Such a
dummy variable is defined as the product of the two initial dummy variables. For example, suppose
that ability (ASVABC), the number of years of schooling (HGC), and two dummies, gender (MALE)
and ethnicity (ETHWHITE), are factors determining earnings (EARN). Let’s introduce the
interaction dummy 𝑀𝐴𝐿𝐸𝑊𝐻𝐼𝑇𝐸 = 𝑀𝐴𝐿𝐸 ⋅ 𝐸𝑇𝐻𝑊𝐻𝐼𝑇𝐸, where

𝑀𝐴𝐿𝐸 = 1 for male, 0 for female; 𝐸𝑇𝐻𝑊𝐻𝐼𝑇𝐸 = 1 for White ethnicity, 0 otherwise.
LOG(EARN) = 𝛽0 + 𝛽1 ⋅ ASVABC + 𝛽2 ⋅ HGC + 𝛽3 ⋅ MALE + 𝛽4 ⋅ ETHWHITE + 𝛽5 ⋅ MALEWHITE + 𝑢

Setting the dummies to their values in each group gives the fitted equations:
White male (MALE = 1, ETHWHITE = 1, MALEWHITE = 1):
LOG(EARN) = (𝑏0 + 𝑏3 + 𝑏4 + 𝑏5 ) + 𝑏1 ⋅ ASVABC + 𝑏2 ⋅ HGC
Non-white male (MALE = 1): LOG(EARN) = (𝑏0 + 𝑏3 ) + 𝑏1 ⋅ ASVABC + 𝑏2 ⋅ HGC
White female (ETHWHITE = 1): LOG(EARN) = (𝑏0 + 𝑏4 ) + 𝑏1 ⋅ ASVABC + 𝑏2 ⋅ HGC
Non-white female (reference category): LOG(EARN) = 𝑏0 + 𝑏1 ⋅ ASVABC + 𝑏2 ⋅ HGC

So, the inclusion of this interaction variable allows us to answer the question: are there ethnic
variations in the effect of the gender of a respondent on earnings? Formally,
𝐻0 : 𝛽5 = 0
If 𝛽5 is significant, the estimated coefficient on MALEWHITE shows that there is an interaction
between gender and ethnicity: being both male and white raises earnings by approximately
100 ⋅ 𝑏5 % over and above the separate gender and ethnicity effects.
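A sketch of this interaction specification on simulated data (the variable names follow the lecture, but all numerical values are invented assumptions; numpy's least-squares solver stands in for OLS estimation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
# Hypothetical data for the regressors named in the lecture.
ASVABC = rng.normal(50, 10, n)
HGC = rng.integers(8, 18, n).astype(float)
MALE = (rng.uniform(size=n) < 0.5).astype(float)
ETHWHITE = (rng.uniform(size=n) < 0.5).astype(float)
MALEWHITE = MALE * ETHWHITE        # interaction dummy: 1 only for white males

# Simulated "truth" with a 0.07 extra log-earnings premium for white males.
logearn = (1.0 + 0.01 * ASVABC + 0.08 * HGC + 0.10 * MALE
           + 0.05 * ETHWHITE + 0.07 * MALEWHITE + rng.normal(0, 0.05, n))

Xmat = np.column_stack([np.ones(n), ASVABC, HGC, MALE, ETHWHITE, MALEWHITE])
b, *_ = np.linalg.lstsq(Xmat, logearn, rcond=None)
# b[5] estimates beta5, the white-male-specific premium.
```

Because `MALEWHITE` equals 1 only when both initial dummies equal 1, `b[5]` picks up exactly the additional effect described above.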
The dummy variable trap:
Consider the following model: 𝑌 = 𝛽1 + 𝛽2 𝑋2 +. . . +𝛽𝑘 𝑋𝑘 + 𝛿2 𝐷2 +. . . +𝛿𝑠 𝐷𝑠 + 𝑢 (1)
So, the qualitative variable has 𝑠 categories. As was discussed, the general procedure is to
include 𝑠 − 1 dummies into the model. The dummy variable trap occurs when the reference
category is also included together with the constant term and as a result it becomes impossible to
fit the model:
𝑌 = 𝛽1 + 𝛽2 𝑋2 +. . . +𝛽𝑘 𝑋𝑘 + 𝛿1 𝐷1 + 𝛿2 𝐷2 +. . . +𝛿𝑠 𝐷𝑠 + 𝑢 (2) – the dummy variable trap.
Reasons:
1) Intuitive: The intercept dummy variable shows the increase in the intercept relative to that
of the reference category but, as we have already included this basic category, there is no
room for comparison now. Hence, there is no logical interpretation.
2) Mathematically: an exact multicollinearity problem. Let’s denote the regressor attached to the
constant term 𝛽1 as 𝑋1 ; it is identically equal to one: 𝑋1 ≡ 1. Then ∑𝑠𝑖=1 𝐷𝑖 = 1 = 𝑋1 ,
because in every observation exactly one of the dummy variables equals 1 and all the others
equal 0. Therefore there is exact multicollinearity => no estimates can be obtained.
Solution:
1) Estimate as (1);
2) Drop the constant term: 𝑌 = 𝛽2 𝑋2 +. . . +𝛽𝑘 𝑋𝑘 + 𝛿1 𝐷1 + 𝛿2 𝐷2 +. . . +𝛿𝑠 𝐷𝑠 + 𝑢. Note that
the interpretation of the coefficients will change.
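The multicollinearity argument can be checked directly: with the constant and the full set of dummies the design matrix loses a rank. A small sketch using the seasonal example (the data layout is illustrative):

```python
import numpy as np

# Four quarters, as in the seasonal example above.
quarter = np.array([1, 2, 3, 4] * 3)
const = np.ones(quarter.size)
D = {q: (quarter == q).astype(float) for q in (1, 2, 3, 4)}

# Trap: constant plus the FULL set of dummies (5 columns).
trap = np.column_stack([const, D[1], D[2], D[3], D[4]])
# Correct: constant plus N - 1 dummies (reference = first quarter).
ok = np.column_stack([const, D[2], D[3], D[4]])

# The dummies sum to the constant column, so `trap` is rank deficient:
rank_trap = np.linalg.matrix_rank(trap)   # 4, although there are 5 columns
rank_ok = np.linalg.matrix_rank(ok)       # 4: full column rank
```

The rank deficiency of `trap` is exactly the statement 𝐷1 + 𝐷2 + 𝐷3 + 𝐷4 = 𝑋1: one column is a linear combination of the others, so OLS cannot identify all the coefficients.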

Change of reference category:


Consider a model which describes how the cost of running a school depends on the number
of students and the type of the school (qualitative variable). There are 4 categories: general,
technical, skilled workers’, and vocational schools. Accordingly, 4 dummy variables are defined:
GEN, TECH, WORKER, VOC. The table shows the estimation results depending on the chosen
reference category (it is general for the first case and skilled workers’ for the second).

Main results:
The choice of reference category does not affect the substance of the regression results. The
table below shows the effects on the model with intercept dummies.

DO NOT CHANGE:
1) The goodness of fit, measured by 𝑅2 , SSR, and the standard error of the regression (Root MSE);
2) The F-statistic for the whole equation;
3) The coefficients, standard errors, and t-statistics of the other (non-qualitative) variables;
4) The coefficients of the dummies corresponding to the other model’s reference category (in our
example, WORKER when the reference category is general (first regression) and GEN when the
reference category is skilled workers’ (second regression)) are the same in absolute magnitude
but have opposite signs. Moreover, the standard errors of such coefficients are the same, so their
t-statistics differ only in sign.

DO CHANGE:
1) The interpretation of the t-tests: the meaning of the null hypothesis that a dummy variable
coefficient equals 0 is different;
2) The coefficients of the dummy variables are different;
3) The standard errors of the dummy variables are different, hence there is no relation between
the t-statistics. Note that this does not apply to the situation described in point 4) above: the
coefficients of WORKER and GEN in the models with reference categories general and skilled
workers’, respectively.
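These invariance results can be verified numerically. A sketch with a single gender dummy on simulated data (numpy's least-squares solver stands in for a regression package):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(0, 10, n)
female = (rng.uniform(size=n) < 0.5).astype(float)
male = 1.0 - female
Y = 2.0 + 0.5 * X + 1.5 * male + rng.normal(0, 0.2, n)   # simulated data

# Reference category Female: include the MALE dummy.
b_f, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X, male]), Y, rcond=None)
# Reference category Male: include the FEMALE dummy.
b_m, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X, female]), Y, rcond=None)

# The slope on X is unchanged; the dummy coefficient flips sign; and the two
# intercepts differ by exactly the dummy coefficient.
```

This is an exact reparametrisation: the two designs span the same column space, so the fitted values, SSR, and goodness of fit are identical, while the dummy coefficient merely changes sign.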

Statistical tests:
The test for either a change in intercept or a change in slope can be performed by using standard
t-tests on the dummy variable parameters.
Consider a model with both slope and intercept dummies:
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 ⋅ 𝐷 + 𝛽4 (𝐷 ⋅ 𝑋2 ) + 𝑢
𝐻0 : 𝛽3 = 0
Standard t-test: d.f. = (number of observations) − (number of estimated parameters) = 𝑛 − 4 in
our case.
The same applies for the slope dummy:
𝐻0 : 𝛽4 = 0.
The test for the joint explanatory power of all dummy variables is carried out with the help of the
following F-statistics, calculated on the basis of the residual sums of squares in the models with
dummy variables and without them:
𝐻0 : the coefficients of all dummy variables are simultaneously equal to zero;
𝐻1 : the coefficient of at least one dummy variable is non-zero.
F = [(SSR_no dummies − SSR_dummies) / number of dummies] /
[SSR_dummies / (number of observations − total number of parameters estimated)]
Under 𝐻0 , this statistic follows the 𝐹(number of dummies, number of observations − total number
of parameters estimated) distribution.
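This F-statistic is straightforward to compute from the two residual sums of squares. A sketch on simulated data (the data-generating numbers are invented assumptions; `ssr` is a small helper defined here, not a library routine):

```python
import numpy as np

def ssr(Xmat, Y):
    """Residual sum of squares from an OLS fit of Y on the columns of Xmat."""
    b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
    e = Y - Xmat @ b
    return e @ e

rng = np.random.default_rng(2)
n = 120
X2 = rng.uniform(0, 5, n)
D = (np.arange(n) % 2).astype(float)
# Simulated data with genuine intercept and slope shifts for D = 1.
Y = 1.0 + 0.7 * X2 + 1.0 * D + 0.4 * D * X2 + rng.normal(0, 0.3, n)

const = np.ones(n)
X_no_dum = np.column_stack([const, X2])            # restricted: no dummies
X_dum = np.column_stack([const, X2, D, D * X2])    # unrestricted

q = 2                     # number of dummy-related regressors being tested
k = X_dum.shape[1]        # total parameters in the unrestricted model
F = ((ssr(X_no_dum, Y) - ssr(X_dum, Y)) / q) / (ssr(X_dum, Y) / (n - k))
# Compare F with the critical value of F(q, n - k) at the chosen level.
```

Since the simulated shifts are large relative to the noise, the statistic comes out far above any conventional critical value and the joint null is rejected.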

Chow test:
Sometimes a sample of observations consists of two or more subsamples and it is difficult to
decide whether to estimate one regression for the entire sample or separate regressions for the
subsamples. The Chow test is used to solve this problem. It tests the following hypothesis:
𝐻0 : the coefficients are the same for all subsamples
𝐻1 : at least one coefficient differs
Moreover, it will be shown that the Chow test is equivalent to an F test testing the explanatory
power of the dummy variables as a group (only if we include the full set of dummies).
I. Chow test for 2 subsamples each of which has 𝑘 parameters to estimate (𝑘 − 1
explanatory variables, and 1 intercept):
Subsample 1: 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 +. . . +𝛽𝑘 𝑋𝑘 + 𝑢1 , sample size 𝑛1 , residual sum of squares 𝑆𝑆𝑅1
Subsample 2: 𝑌 = 𝛽1′ + 𝛽2′ 𝑋2 + 𝛽3′ 𝑋3 +. . . +𝛽𝑘′ 𝑋𝑘 + 𝑢2 , sample size 𝑛2 , residual sum of squares 𝑆𝑆𝑅2
𝐻0 : 𝛽1 = 𝛽1′ , 𝛽2 = 𝛽2′ , . . . , 𝛽𝑘 = 𝛽𝑘′
Procedures:
1) Estimate the regression for the whole sample: 𝑛 = 𝑛1 + 𝑛2 and 𝑆𝑆𝑅0
2) F-statistic: 𝐹(𝑘, 𝑛 − 2𝑘) = [(𝑆𝑆𝑅0 − (𝑆𝑆𝑅1 + 𝑆𝑆𝑅2 )) ⁄ 𝑘] / [(𝑆𝑆𝑅1 + 𝑆𝑆𝑅2 ) ⁄ (𝑛 − 2𝑘)]
3) Perform the F-test: compare the statistic with the critical value 𝐹crit at the α% significance
level with (𝑘, 𝑛 − 2𝑘) degrees of freedom. If 𝐹 > 𝐹crit , we reject the null hypothesis that the
relationships in the two subsamples are the same.
Showing the equivalence:
Let’s show that this test is equivalent to the following F-test for linear restrictions:
Define the dummy variable: 𝐷 = 1 if the observation belongs to subsample 1, 0 if it belongs to
subsample 2.
Let’s include into the regression the full set of dummies:
𝑌 = 𝛽1 + 𝛽1′ 𝐷 + 𝛽2 𝑋2 + 𝛽2′ (𝐷 ⋅ 𝑋2 ) + 𝛽3 𝑋3 + 𝛽3′ (𝐷 ⋅ 𝑋3 )+. . . +𝛽𝑘 𝑋𝑘 + 𝛽𝑘′ (𝐷 ⋅ 𝑋𝑘 ) + 𝑢
Now the number of estimated parameters is equal to 2𝑘
So it becomes equivalent to test:
𝐻𝑜 : 𝛽1′ = 𝛽2′ =. . . = 𝛽𝑘′ = 0. There are 𝑘 restrictions
Unrestricted model with 𝑆𝑆𝑅𝑈𝑅 :
𝑌 = 𝛽1 + 𝛽1′ 𝐷 + 𝛽2 𝑋2 + 𝛽2′ (𝐷 ⋅ 𝑋2 ) + 𝛽3 𝑋3 + 𝛽3′ (𝐷 ⋅ 𝑋3 )+. . . +𝛽𝑘 𝑋𝑘 + 𝛽𝑘′ (𝐷 ⋅ 𝑋𝑘 ) + 𝑢
OLS will choose the intercept 𝑏1 and the coefficients on 𝑋2 , … , 𝑋𝑘 so as to optimise the fit for
the 𝐷 = 0 observations; these coefficients will be exactly the same as if the regression had been
run on the 𝐷 = 0 subsample alone. The same logic applies for 𝐷 = 1. So, 𝑆𝑆𝑅𝑈𝑅 = 𝑆𝑆𝑅1 + 𝑆𝑆𝑅2 .
Restricted model with 𝑆𝑆𝑅𝑅 :
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 ⋅ 𝑋3 +. . . +𝛽𝑘 ⋅ 𝑋𝑘 + 𝑢
Therefore, we get the same statistic:
𝐹(𝑘, 𝑛 − 2𝑘) = [(𝑆𝑆𝑅𝑅 − 𝑆𝑆𝑅𝑈𝑅 ) ⁄ 𝑘] / [𝑆𝑆𝑅𝑈𝑅 ⁄ (𝑛 − 2𝑘)] = [(𝑆𝑆𝑅0 − (𝑆𝑆𝑅1 + 𝑆𝑆𝑅2 )) ⁄ 𝑘] / [(𝑆𝑆𝑅1 + 𝑆𝑆𝑅2 ) ⁄ (𝑛 − 2𝑘)]
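Both the test and the equivalence 𝑆𝑆𝑅𝑈𝑅 = 𝑆𝑆𝑅1 + 𝑆𝑆𝑅2 can be demonstrated numerically. A sketch with two simulated subsamples that genuinely differ in intercept and slope (all numbers are invented; `ssr` is a small helper defined here):

```python
import numpy as np

def ssr(Xmat, Y):
    """Residual sum of squares from an OLS fit of Y on the columns of Xmat."""
    b, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
    e = Y - Xmat @ b
    return e @ e

rng = np.random.default_rng(3)
n1, n2 = 60, 80
X_a = rng.uniform(0, 10, n1); Y_a = 1.0 + 0.5 * X_a + rng.normal(0, 0.2, n1)
X_b = rng.uniform(0, 10, n2); Y_b = 2.0 + 0.9 * X_b + rng.normal(0, 0.2, n2)

X = np.concatenate([X_a, X_b]); Y = np.concatenate([Y_a, Y_b])
D = np.concatenate([np.ones(n1), np.zeros(n2)])   # 1 for subsample 1
n, k = n1 + n2, 2                                  # k = intercept + one slope

# Separate regressions and the pooled (restricted) regression.
SSR1 = ssr(np.column_stack([np.ones(n1), X_a]), Y_a)
SSR2 = ssr(np.column_stack([np.ones(n2), X_b]), Y_b)
SSR0 = ssr(np.column_stack([np.ones(n), X]), Y)
F_chow = ((SSR0 - (SSR1 + SSR2)) / k) / ((SSR1 + SSR2) / (n - 2 * k))

# Full-set-of-dummies regression: its SSR equals SSR1 + SSR2.
SSR_UR = ssr(np.column_stack([np.ones(n), D, X, D * X]), Y)
```

With the two subsamples generated from different lines, `F_chow` comes out large and the null of equal coefficients is rejected, while `SSR_UR` matches `SSR1 + SSR2` to machine precision, confirming the equivalence argument above.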
II. Generalizing for m subsamples:
Using the equivalence result, in this case there are m categories => we include 𝑚 − 1 dummies
=> the number of restrictions is 𝑘(𝑚 − 1) and the number of estimated parameters is 𝑚𝑘.
Therefore, the F-statistic is:
𝐹(𝑘(𝑚 − 1), 𝑛 − 𝑚𝑘) = [(𝑆𝑆𝑅0 − (𝑆𝑆𝑅1 +. . . +𝑆𝑆𝑅𝑚 )) / (𝑘(𝑚 − 1))] / [(𝑆𝑆𝑅1 +. . . +𝑆𝑆𝑅𝑚 ) ⁄ (𝑛 − 𝑚𝑘)]
The Chow test is easier to perform than the test of the joint explanatory power of the group of dummy
variables, but it is less informative in the sense that it does not distinguish the contributions of
the individual dummy variables to the difference between the regressions and does not test them for
significance. However, the test statistics and, accordingly, the conclusions of the two tests are identical.
