
Week 12 Notes

Uploaded by

Rama Bhushan

INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Case Study: Application of Logistic Regression
Introduction
• Multinomial Logit Model
• Tobit Model
• Case 1: Homeownership data
• Case 2: Grouped data
• Case 3: Debit card data
• Case 4: Grade point data
• Case 5: Smoker’s data
• Case 6: Education data
• Summary and concluding remarks

Multinomial Dependent Variables

• Consider a firm planning to list on the NYSE, NASDAQ, or AMEX

• Or investors choosing among AAA-, AA-, and A-rated securities

• This is also a discrete choice problem, now with more than two alternatives

• The agent chooses the alternative that maximizes her utility

• With two choices we needed one equation; with three choices we need two equations

• What if we model the three choices as pairwise combinations, estimating each pair separately?
Multinomial Dependent Variables

• Consider the choice among AAA-, AA-, and A-rated securities by a given set of investors. It is modelled as follows, where L is leverage and R is expected return:

• $A_i = \alpha_1 + \alpha_2 L_i + \alpha_3 R_i + u_i$

• $AA_i = \beta_1 + \beta_2 L_i + \beta_3 R_i + v_i$

• $AAA_i = \gamma_1 + \gamma_2 L_i + \gamma_3 R_i + e_i$

• A equals 1 for an investment in the A category and 0 otherwise; AA and AAA are defined similarly
Multinomial Dependent Variables

• If these are the only three options, we can take one of them (here A) as the reference point:

• $AA_i - A_i = (\beta_1 - \alpha_1) + (\beta_2 - \alpha_2) L_i + (\beta_3 - \alpha_3) R_i + v_i'$ (1)

• $AAA_i - A_i = (\gamma_1 - \alpha_1) + (\gamma_2 - \alpha_2) L_i + (\gamma_3 - \alpha_3) R_i + e_i'$ (2)

• We need not estimate a third equation: the three probabilities sum to one, so once two are known the third is determined


Multinomial Dependent Variables

• Suppose that for person i the utility from AA exceeds that from A; then the left-hand side of Eq. (1) is positive

• $AA_i - A_i = (\beta_1 - \alpha_1) + (\beta_2 - \alpha_2) L_i + (\beta_3 - \alpha_3) R_i + v_i'$ (1)

• Assume that the new error term $v_i'$ also follows the logistic distribution

• $P\left(\frac{AA_i}{A_i}\right) = \frac{1}{1 + e^{-Z_i^{AA}}}$, where $Z_i^{AA} = (\beta_1 - \alpha_1) + (\beta_2 - \alpha_2) L_i + (\beta_3 - \alpha_3) R_i$

• This is the probability that individual i prefers AA over A


Multinomial Dependent Variables

• Similarly for AAA

• $P\left(\frac{AAA_i}{A_i}\right) = \frac{1}{1 + e^{-Z_i^{AAA}}}$, where $Z_i^{AAA} = (\gamma_1 - \alpha_1) + (\gamma_2 - \alpha_2) L_i + (\gamma_3 - \alpha_3) R_i$

• These two equations are estimated simultaneously by maximum likelihood (ML)
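The step from a pairwise probability to a log-odds equation can be made explicit. As a sketch, write $P$ for either pairwise probability above and $Z$ for its linear index (generic symbols used only for this derivation):

```latex
\begin{aligned}
P &= \frac{1}{1 + e^{-Z}}, \qquad 1 - P = \frac{e^{-Z}}{1 + e^{-Z}} \\
\frac{P}{1 - P} &= e^{Z} \\
\ln\frac{P}{1 - P} &= Z
\end{aligned}
```

Since $P$ is the probability of choosing the category over the reference A within that pair, $P/(1-P)$ is the odds of the category against A, and its logarithm is linear in L and R.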


Multinomial Dependent Variables

• The resulting system of logit equations is written as follows

• $\ln\frac{P(AA_i)}{P(A_i)} = Z_i^{AA} = (\beta_1 - \alpha_1) + (\beta_2 - \alpha_2) L_i + (\beta_3 - \alpha_3) R_i$ (1)

• $\ln\frac{P(AAA_i)}{P(A_i)} = Z_i^{AAA} = (\gamma_1 - \alpha_1) + (\gamma_2 - \alpha_2) L_i + (\gamma_3 - \alpha_3) R_i$ (2)

• $P(AAA_i) + P(AA_i) + P(A_i) = 1$ (3)
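Equations (1) to (3) pin down all three probabilities once the two indices are known. The Python sketch below (Python is used only as a neutral illustration; the course exercises are in R) solves the system with A as the reference category. The function name `mnl_probabilities` and the index values passed to it are hypothetical, not estimates from any dataset.

```python
import math

def mnl_probabilities(z_aa, z_aaa):
    """Return (P(A), P(AA), P(AAA)) with A as the reference category.

    z_aa  is the right-hand side of Eq. (1), i.e. ln(P(AA) / P(A)).
    z_aaa is the right-hand side of Eq. (2), i.e. ln(P(AAA) / P(A)).
    """
    # Eq. (3): P(A) * (1 + exp(z_aa) + exp(z_aaa)) = 1
    denom = 1.0 + math.exp(z_aa) + math.exp(z_aaa)
    p_a = 1.0 / denom
    p_aa = math.exp(z_aa) / denom
    p_aaa = math.exp(z_aaa) / denom
    return p_a, p_aa, p_aaa

# Hypothetical index values, chosen only for illustration.
p_a, p_aa, p_aaa = mnl_probabilities(z_aa=0.4, z_aaa=-0.3)
print(p_a, p_aa, p_aaa)
```

The reference category always gets the numerator 1, which is why only two equations need to be estimated.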



Tobit model
Tobit model or censored regression

• Consider the relationship between dividends paid (Y) and market capitalization (X). What if a firm does not pay a dividend?

• The following Tobit model can be considered:

• $Y_i = \beta_1 + \beta_2 X_i + u_i$ if RHS $> 0$

• $Y_i = 0$ otherwise

• The censoring threshold can be an upper or a lower limit; to simplify the mathematics, the model can be shifted so that the threshold is at zero
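To see how censored observations enter estimation, here is a minimal Python sketch of the Tobit log-likelihood with censoring from below at zero. It assumes normally distributed errors, $u_i \sim N(0, \sigma^2)$, which is the usual Tobit assumption but is not stated on the slide; the function names, parameter values, and data are hypothetical.

```python
import math

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tobit_loglik(b1, b2, sigma, xs, ys):
    """Log-likelihood of Y = b1 + b2*X + u, censored from below at zero."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = b1 + b2 * x
        if y > 0:
            # Uncensored observation: normal density of the observed value.
            total += math.log(norm_pdf((y - mu) / sigma) / sigma)
        else:
            # Censored observation: probability that the latent value is <= 0.
            total += math.log(norm_cdf(-mu / sigma))
    return total

# Hypothetical data: zeros stand for firms that pay no dividend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.8, 1.9, 3.1]
print(tobit_loglik(-2.0, 1.0, 1.0, xs, ys))
```

Maximizing this function over `b1`, `b2`, and `sigma` gives the ML estimates; the zeros contribute through the distribution function rather than being dropped.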



Case Study: Homeownership


Case 1
Case 1: Read the "Homeowner" data in R. The dataset comprises observations
on a discrete variable denoting homeownership (Y = 1 if the family owns the
house, 0 if it does not) and a variable representing income (X) in INR Lakhs.
Begin by running a simple linear probability model with Ownership (Y) as the
dependent variable and conduct the following analysis.
Case 1 Contd.
❖ Summarize the model and interpret the coefficients to understand the
relationship between income and homeownership.
❖ Find the probability of homeownership when income is INR 12 Lakhs.
❖ What happens to the homeownership probability when income is more
than 19 Lakhs, e.g., at 20 Lakhs or more?
❖ What happens to the homeownership probability when income is less than
9 Lakhs, e.g., at 8 Lakhs?
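The questions above about low and high incomes can be explored with a small Python sketch of a linear probability model fitted by OLS. The data below are made up for illustration and are not the "Homeowner" dataset; `ols_fit` is a hypothetical helper.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of a simple OLS regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up incomes (INR Lakhs) and ownership indicators (1 = owns a house).
income = [8, 10, 12, 14, 16, 18, 20, 22]
owns = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = ols_fit(income, owns)

# Fitted "probabilities" at a few income levels; nothing keeps them in [0, 1].
for x in (8, 12, 20, 24):
    print(x, a + b * x)
```

With these toy numbers the fitted value is negative at the lowest income and above one at the highest, which is the defect of the linear probability model that the exercise is meant to expose.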
Case 1 Contd.
❖ Examine the fitted values of the probability to assess the model's
performance.
❖ Visualize the residuals to identify issues related to normality and
heteroscedasticity.
❖ Conduct formal tests to assess the presence of heteroscedasticity in the
model.

❖ Conduct weighted least squares (WLS) regression analysis using residuals.
Compare the results with the raw (OLS) results and with those based on
heteroscedasticity-robust standard errors.
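The heteroscedasticity examined in this case is built into the linear probability model. As a sketch of the standard argument, let $P_i = E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i$ (symbols introduced here for illustration). The error can take only two values, which gives:

```latex
\begin{aligned}
u_i &= \begin{cases} 1 - P_i & \text{with probability } P_i \\ -P_i & \text{with probability } 1 - P_i \end{cases} \\
\operatorname{Var}(u_i) &= P_i (1 - P_i)^2 + (1 - P_i) P_i^2 = P_i (1 - P_i)
\end{aligned}
```

Because the variance changes with $X_i$ through $P_i$, one common WLS choice is to divide each observation by $\sqrt{\hat{Y}_i (1 - \hat{Y}_i)}$, which is usable only for fitted values strictly between 0 and 1. Whether this matches the residual-based weighting intended in the exercise is an assumption.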

Case Study: Grouped Data


Case 2
Case 2: Consider the grouped data on "Income_Ownership". The data are
grouped by income level (X). The variables are Income (X): income in INR
Lakhs per annum; Total (F): total number of families; and Own (Y): number of
families that own a house. Read the data in R and conduct the following
analysis.
Case 2 Contd.
❖ Construct the Logit variable from the given data.
❖ Estimate the model in a linear fashion.
❖ Summarize the model and interpret the coefficients to understand the
relationship between all variables.
❖ Examine the fitted values of the probability to assess the model's
performance.
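The steps above can be sketched in Python: build the logit variable from the group proportions, regress it linearly on income, and map the fitted values back to probabilities. The grouped data here are made up and are not the "Income_Ownership" dataset; `logit`, `inv_logit`, and `ols_fit` are hypothetical helpers.

```python
import math

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def ols_fit(xs, ys):
    """Return (intercept, slope) of a simple OLS regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up grouped data: income (INR Lakhs), total families, families owning.
income = [10, 20, 30, 40]
total = [50, 60, 40, 30]
own = [10, 24, 24, 24]

# Step 1: group proportions and the logit variable.
p_hat = [y / f for y, f in zip(own, total)]
logit_var = [logit(p) for p in p_hat]

# Step 2: linear regression of the logit on income.
a, b = ols_fit(income, logit_var)

# Step 3: fitted probabilities, which stay inside (0, 1) by construction.
fitted = [inv_logit(a + b * x) for x in income]
print(fitted)
```

The logit variable is undefined for groups where no family or every family owns a house, so such groups would need separate handling.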
Case 2 Contd.
❖ Visualize the residuals to identify issues related to normality and
heteroscedasticity.
❖ Conduct formal tests to assess the presence of heteroscedasticity in the
model.
❖ Conduct weighted least squares (WLS) regression analysis using residuals.
Compare the results with the raw (OLS) results and with those based on
heteroscedasticity-robust standard errors.
❖ What happens to ownership probability when the income is 20 Lakhs and
40 Lakhs?
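For grouped data, a common textbook weighting for the linear logit regression can be sketched as follows, assuming each group's count Own (Y) is binomial with Total (F) trials. The symbols $\hat{P}_i$, $\hat{L}_i$, and $w_i$ are introduced here for illustration, and whether these are the weights intended in the exercise is an assumption.

```latex
\begin{aligned}
\hat{P}_i &= \frac{Y_i}{F_i}, \qquad \hat{L}_i = \ln\frac{\hat{P}_i}{1 - \hat{P}_i} \\
\operatorname{Var}(\hat{L}_i) &\approx \frac{1}{F_i \hat{P}_i (1 - \hat{P}_i)}, \qquad w_i = F_i \hat{P}_i (1 - \hat{P}_i) \\
\hat{P}(X) &= \frac{1}{1 + e^{-(\hat{\beta}_1 + \hat{\beta}_2 X)}}
\end{aligned}
```

Multiplying each observation by $\sqrt{w_i}$ makes the error variance approximately constant. The last line gives the ownership probability at any income, for example X = 20 or X = 40, and it always lies between 0 and 1.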

Case Study: Debit card data


Case 3
Case 3: Read the ATM data with the following attributes. Debit_Card (Y): 1
for an ATM card holder and 0 otherwise; Account_Balance (X2): average
account balance (in '00s); No_ATM_Transac (X3): number of ATM
transactions in a month; No_Other_Bank_Service (X4): indicator that no other
bank service is received; Receives_Interest (X5): 1 if interest is received on
the account, 0 otherwise; City_Banking (X6): code of the city where banking is
done. Read the data in R and conduct the following analysis.
Case 3 Contd.
❖ Run a simple linear probability model examining the relationship between
Y as the dependent variable and X2, X3, X5 as the independent variables.
Examine, summarize and interpret the results with and without robust
standard errors.
❖ Examine the fitted probability values.
❖ Estimate nonlinear logit and probit models.
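Unlike the linear probability model, a logit model has no closed-form solution and is fitted by iterating on the likelihood. The Python sketch below uses Newton-Raphson for a logit with an intercept and a single regressor, a simplification of the exercise, which uses X2, X3, and X5. The data and the function name `fit_logit` are hypothetical.

```python
import math

def fit_logit(xs, ys, iters=25):
    """Newton-Raphson ML fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0           # score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0   # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data with overlapping outcomes, so the ML estimate is finite.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logit(xs, ys)
print(b0, b1)
```

At the solution the score is zero, so the fitted probabilities sum to the number of observed ones. If the outcomes were perfectly separated by x, the iterations would not settle and a production routine would need safeguards.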
Case 3 Contd.
❖ Compute marginal effects, pseudo-R² (also called McFadden's R²), and the
likelihood ratio test, and examine the fitted values.
❖ Compare the goodness-of-fit measures for the logit and probit models.
❖ Assuming a threshold of 0.5, compute the percentage of observations
correctly classified (count R²) for both models.
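The three fit measures named above can be computed from the observed outcomes and the fitted probabilities of either model. The Python sketch below uses the usual definitions, with the intercept-only model as the baseline; the outcomes, probabilities, and function names are hypothetical.

```python
import math

def log_lik(ys, ps):
    """Bernoulli log-likelihood of outcomes ys given probabilities ps."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
               for y, p in zip(ys, ps))

def fit_measures(ys, ps, threshold=0.5):
    """Return (McFadden pseudo-R2, likelihood ratio statistic, count R2)."""
    n = len(ys)
    ll_model = log_lik(ys, ps)
    # Intercept-only model: every observation gets the sample share of ones.
    p_bar = sum(ys) / n
    ll_null = log_lik(ys, [p_bar] * n)
    mcfadden = 1.0 - ll_model / ll_null
    lr_stat = 2.0 * (ll_model - ll_null)
    # Share of observations whose predicted class matches the outcome.
    correct = sum((p >= threshold) == (y == 1) for y, p in zip(ys, ps))
    return mcfadden, lr_stat, correct / n

# Made-up outcomes and fitted probabilities, for illustration only.
ys = [1, 0, 1, 1, 0, 0]
ps = [0.8, 0.3, 0.6, 0.4, 0.2, 0.55]
mcfadden, lr_stat, count_r2 = fit_measures(ys, ps)
print(mcfadden, lr_stat, count_r2)
```

Under the null hypothesis that all slope coefficients are zero, the likelihood ratio statistic is compared with a chi-square distribution whose degrees of freedom equal the number of slope coefficients.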

Case Study: Grade Point Average data


Case 4
Case 4: Read the GPA data with the following attributes. Grade (Y): 1 if the
final grade is A, 0 otherwise (B or C); Grade Point Average (GPA):
numerical grade obtained previously; Test of Understanding in College
Economics (TUCE): score in the economics entrance test; Personalized
System of Instruction (PSI): 1 if the candidate was exposed to the new
teaching method, 0 otherwise; Letter Grade (Letter): letter grade (A, B, or C).
Read the data in R and conduct the following analysis.
Case 4 Contd.
❖ Run nonlinear logit and probit models with Y as the dependent variable
and GPA, TUCE, and PSI as the regressors.
❖ Summarize and interpret the results.
❖ Compute marginal effects, pseudo-R² (also called McFadden's R²), and the
likelihood ratio test, and examine the fitted values.
❖ Compare the goodness-of-fit measures for the logit and probit models.
❖ Assuming a threshold of 0.5, compute the percentage of observations
correctly classified (count R²) for both models.
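In logit and probit models a coefficient is not itself the change in probability; the marginal effect of a continuous regressor also depends on where it is evaluated. The Python sketch below uses the standard formulas: the coefficient times p(1 - p) for logit, and the coefficient times the standard normal density for probit. The coefficient and index values are hypothetical.

```python
import math

def logit_marginal_effect(beta_k, z):
    """dP/dx_k for a logit model evaluated at linear index z."""
    p = 1.0 / (1.0 + math.exp(-z))
    return beta_k * p * (1.0 - p)

def probit_marginal_effect(beta_k, z):
    """dP/dx_k for a probit model evaluated at linear index z."""
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return beta_k * density

# Hypothetical coefficient and index value, for illustration only.
print(logit_marginal_effect(0.8, 0.5))
print(probit_marginal_effect(0.8, 0.5))
```

For a dummy regressor such as PSI, the difference between the predicted probabilities with the dummy set to 1 and to 0 is often reported instead of a derivative.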

Case Study: Smoker’s data


Case 5
Case 5: Read the Smoke data with the following attributes. Smoke (Y): 1 if
the person is a smoker, 0 otherwise; Age: age of the person; Educ: formal
education of the person (in years); Income: income of the person; Pricing:
price of cigarettes. Read the data in R and conduct the following analysis.
Case 5 Contd.
❖ Run nonlinear logit and probit models with the dichotomous variable Smoke
as the dependent variable and Age, Educ, Income, and Pricing as the
regressors.
❖ Summarize and interpret the results.
❖ Compute marginal effects, pseudo-R² (also called McFadden's R²), and the
likelihood ratio test, and examine the fitted values.
❖ Compare the goodness-of-fit measures for the logit and probit models.
❖ Assuming a threshold of 0.5, compute the percentage of observations
correctly classified (count R²) for both models.

Case Study: Multinomial Logit


Education data
Case 6
Case 6: Read the Education data with the following attributes. Choice of
education (Undergrad): 1 if no undergrad, 2 if 3-year undergrad, and 3 if 4-year
undergrad; Board (X1): intermediate from state board or central board; grades
(X2): overall grades normalized on a 13-point scale; faminc (X3): gross
annual family income in Lakhs; famisiz (X4): family size (number of members);
paredu (X5): 1 if either parent has advanced education (Masters, Doctorate,
etc.); Gender (X6): 1 if female, 0 otherwise; Language (X7): 1 if Hindi
speaking, 0 otherwise. Read the data in R and conduct the following analysis.
Case 6 Contd.
❖ Run a multinomial logit model with Choice of education (Y) as the
categorical dependent variable and X1, X2, X3, X5, X6, and X7 as the
regressors.
❖ Summarize and interpret the results.
❖ Conduct statistical tests of significance.

Summary and Concluding Remarks


• In this lesson, we discussed how the logit model approach can also be
applied to a multichoice variable or multinomial variable.
• In this approach, out of N choices, one choice becomes the reference
point, and for the other choices, the probabilities are assessed with
respect to this reference point.
• Next, we also discussed the Tobit model or censored regression. We
noted that in some cases, if the dependent variable is not observed
beyond a threshold value, the model can still be estimated using the
MLE approach, assigning that particular threshold value to the missing
dependent variable, instead of excluding such observations.
Summary and Concluding Remarks

• Next, we implemented the Homeownership case, where we
employed the linear probability model.
• We noted that while the model is easier to implement and
interpret, it is afflicted by several issues related to estimation, and
fitted values are not probability-consistent as they move beyond
the 0 to 1 range.
• In the next case, we used grouped data. We noted that if the data are
grouped, the logit model can still be estimated in a linear
fashion, and the fitted values are probability-consistent.
Summary and Concluding Remarks
• In the subsequent cases, namely, Debit card, Grade point
average, and Smokers data, we implemented and compared the
logit and probit models.
• We computed marginal effects and goodness-of-fit measures (e.g.,
count R², pseudo-R², and the log-likelihood function) and
compared the results from the logit and probit models. We found
that if the data are fairly symmetric, the logit and probit models give
similar results.
• Lastly, we also estimated a multinomial logit case using
education data. With this, we conclude the tutorial on
classification case studies.
Thanks!
