How to assess the fit of choice models with Stata?

2024 German Stata Conference
at GESIS in Mannheim,
June 7, 2024

“There is no safety in numbers.” Howard S. Wainer

Dr. Wolfgang Langer
Martin-Luther-Universität Halle-Wittenberg, Institut für Soziologie
Associate Assistant Professor, University of Luxembourg

Outline
• What is the problem?
• What is the solution in Stata?
• Example of application
• Conclusions

What is the problem?
• In 1992, Stata 3 introduced the clogit command to estimate the conditional (fixed-effects) logistic regression model; it reports the McFadden Pseudo R².
• In 2007, Stata 10 introduced the asclogit command to estimate the alternative-specific conditional logit model.
• In 2019, Stata 16 introduced the choice models (cm) commands.
• But neither asclogit nor the cm commands report the likelihood-ratio chi² test statistic or any Pseudo R² to assess the fit of the model!

What is the solution in Stata?
• For McFadden’s conditional logit choice model, my fit_cmclogit.ado calculates the following test statistic and Pseudo R²s, which were tested in Monte Carlo simulation studies in the 1990s and 2000s:
  ◦ Likelihood-ratio chi² test statistic using a zero model with alternative-specific constants
  ◦ McFadden Pseudo R² (likelihood-ratio index) (1974)
  ◦ Adjusted McFadden Pseudo R² (1985)
  ◦ Maddala Pseudo R² (1983)
  ◦ Cragg & Uhler Pseudo R² (1970)
  ◦ Aldrich & Nelson Pseudo R² (1984)
  ◦ Aldrich & Nelson Pseudo R² with Veall & Zimmermann correction (1994)

Example of application
• North Rhine-Westphalia Election Study of 1995
  ◦ CATI survey with 504 respondents (eligible voters)
  ◦ Endogenous variable: voting intention for the German parties SPD, FDP or CDU: 1) yes 0) no
  ◦ Exogenous variables
    – Generic / alternative-specific: long-term preference for one of the three parties (gprefall): 1) yes 0) no
    – Case-specific variables:
      – Religious denomination (confession): 1) yes 0) no
      – Educational degree (education): 1) secondary modern 2) secondary modern+ 3) grammar school 4) college/university
  ◦ Balanced hierarchical data structure
    – Party alternatives are nested within respondents

Stata 18 output

Conditional logit choice model                 Number of obs      =      1,512
Case ID variable: probnr                       Number of cases    =        504

Alternatives variable: party                   Alts per case: min =          3
                                                              avg =        3.0
                                                              max =          3

                                                  Wald chi2(9)    =     263.97
Log likelihood = -259.67913                       Prob > chi2     =     0.0000
-------------------------------------------------------------------------------------
               vote | Coefficient  Std. err.        z   P>|z|    [95% conf. interval]
--------------------+----------------------------------------------------------------
party               |
           gprefall |
                yes |    2.193726   .1401447    15.65   0.000    1.919048    2.468405
--------------------+----------------------------------------------------------------
SPD                 |
         confession |
                yes |   -.9000949   .3084457    -2.92   0.004   -1.504637   -.2955524
                    |
          education |
        sec.modern+ |   -.1846034   .3412324    -0.54   0.589   -.8534067    .4841998
     grammar school |    -.645902   .5506053    -1.17   0.241   -1.725069    .4332646
 college/university |    -1.03819   .6887728    -1.51   0.132    -2.38816    .3117801
                    |
              _cons |    .4353825   .2737489     1.59   0.112   -.1011554    .9719205
--------------------+----------------------------------------------------------------
FDP                 |
         confession |
                yes |   -.6455168   .3947333    -1.64   0.102    -1.41918    .1281462
                    |
          education |
        sec.modern+ |    1.393966   .4604399     3.03   0.002    .4915205    2.296412
     grammar school |    2.076665   .6434303     3.23   0.001    .8155643    3.337765
 college/university |    3.160799   .5990928     5.28   0.000    1.986598    4.334999
                    |
              _cons |   -1.956077   .3772011    -5.19   0.000   -2.695377   -1.216776
--------------------+----------------------------------------------------------------
CDU                 |  (base alternative)
-------------------------------------------------------------------------------------

Output of my fit_cmclogit.ado
. fit_cmclogit

Likelihood-Ratio-chi2 test against zero model with ASCs

H0: all alternative-/case-specific-effects are zero in the population

LR chi2( 9) = 463.32     Prob > chi2 = 0.0000

Fit-Indices for the Alternative-Specific-Conditional-Logit model:

McFadden Pseudo R2 (compared with zero model with ASCs)       = 0.4715
McFadden Pseudo R2 with Ben-Akiva & Lerman correction         = 0.4532
Maddala ML Pseudo R2                                          = 0.6012
Cragg & Uhler Pseudo R2                                       = 0.7009
Aldrich & Nelson Pseudo R2                                    = 0.4790
Aldrich & Nelson Pseudo R2 with Veall & Zimmermann correction = 0.7246

Excellent fit!

My ado returns the following r() scalars
. return list

scalars:
          r(logl_m0) = -491.3376899127339
          r(logl_ma) = -259.6791267683948
        r(an_pr2_vz) = .7246287335654139
           r(an_pr2) = .4789712842842913
           r(cu_pr2) = .7009448689123344
           r(ml_pr2) = .6011939268610912
         r(rho2_bar) = .4531680913464735
             r(rho2) = .4714854323214728
             r(lr_p) = 0
            r(lr_df) = 9
          r(lr_chi2) = 463.3171262886782

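As a cross-check outside Stata, the likelihood-ratio test can be recomputed from the two returned log-likelihoods r(logl_m0) and r(logl_ma). The following is a minimal Python sketch and not part of fit_cmclogit.ado; the helper name chi2_sf_odd_df is hypothetical, and the closed-form tail probability it uses only holds for odd degrees of freedom (here df = 9).

```python
import math

def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) of a chi-squared variable with odd df."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form requires odd df >= 1")
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))              # 2 * (1 - Phi(sqrt(x)))
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # normal density at sqrt(x)
    series, denom = 0.0, 1.0
    for r in range(1, (df - 1) // 2 + 1):
        denom *= 2 * r - 1                               # 1 * 3 * 5 * ... * (2r - 1)
        series += x ** (r - 0.5) / denom
    return tail + 2.0 * phi * series

# Values copied from the r() scalars of the example
logl_m0 = -491.3376899127339   # zero model with alternative-specific constants
logl_ma = -259.6791267683948   # fitted model
lr_df = 9

lr_chi2 = 2.0 * (logl_ma - logl_m0)
lr_p = chi2_sf_odd_df(lr_chi2, lr_df)
print(f"LR chi2({lr_df}) = {lr_chi2:.2f}, Prob > chi2 = {lr_p:.4f}")
```

With these inputs the sketch reproduces LR chi2(9) = 463.32, and the p-value is far too small to show at four decimals, which matches the reported 0.0000.
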
Conclusions
• What have I shown?
  ◦ My fit_cmclogit.ado makes it possible to assess the fit of McFadden’s choice model in a user-friendly way.
  ◦ It provides all the information we need to evaluate the model fit.
• What’s in progress?
  ◦ The following extensions are in the pipeline:
    – Construction of McFadden’s prediction-success table
    – Calculation of a separate McKelvey & Zavoina Pseudo R² for each logit equation

Closing words
• Thank you for your attention
• Do you have any questions?

Contact
• Affiliation
  ◦ Dr. Wolfgang Langer
    University of Halle
    Institute of Sociology
    D 06099 Halle (Saale)
  ◦ Email: [email protected]
  ◦ URL: https://langer.soziologie.uni-halle.de

References
– Aldrich, J.H. & Nelson, F.D. (1984): Linear probability, logit, and probit models. Newbury Park: SAGE (Quantitative Applications in the Social Sciences, 45)
– Amemiya, T. (1981): Qualitative response models: a survey. Journal of Economic Literature, 19, pp. 1483-1536
– Ben-Akiva, M. & Lerman, S.R. (1991⁴ (1985)): Discrete choice analysis. Theory and application to travel demand. Cambridge, Mass.: MIT Press
– Cox, D.R. & Snell, E.J. (1989): The analysis of binary data. London: Chapman & Hall
– Cragg, J.G. & Uhler, R. (1970): The demand for automobiles. Canadian Journal of Economics, 3, pp. 386-406
– DeMaris, A. (2002): Explained variance in logistic regression. A Monte Carlo study of proposed measures. Sociological Methods & Research, 31, 1, pp. 27-74
– Domencich, T.A. & McFadden, D.L. (1975): Urban travel demand. A behavioral analysis. Amsterdam and Oxford: North Holland Publishing Company
– Efron, B. (1978): Regression and ANOVA with zero-one data. Measures of residual variation. Journal of the American Statistical Association, 73, pp. 113-121
– Hagle, T.M. & Mitchell II, G.E. (1992): Goodness of fit measures for probit and logit. American Journal of Political Science, 36, 3, pp. 762-784

References 2
– Hensher, D.A., Rose, J.M. & Greene, W.H. (2005): Applied choice analysis. A primer. Cambridge: Cambridge University Press
– Long, J.S. (1997): Regression models for categorical and limited dependent variables. Thousand Oaks, Ca.: Sage
– Long, J.S. & Freese, J. (2000): Scalar measures of fit for regression models. Bloomington: Indiana University
– Long, J.S. & Freese, J. (2003²): Regression models for categorical dependent variables using Stata. College Station, Tx.: Stata Press
– Maddala, G.S. (1983): Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press
– McFadden, D. (1974): Conditional logit analysis of qualitative choice behavior. In: P. Zarembka (ed.): Frontiers in Econometrics. New York: Academic Press, pp. 105-142
– McFadden, D. (1978): Quantitative methods for analysing travel behaviour of individuals: some recent developments. In: D.A. Hensher & P.R. Stopher (eds): Behavioural travel modelling. London: Croom Helm, pp. 279-318
– McKelvey, R. & Zavoina, W. (1975): A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4, pp. 103-120
– Nagelkerke, N.J.D. (1991): A note on a general definition of the coefficient of determination. Biometrika, 78, 3, pp. 691-693

References 3
– Veall, M.R. & Zimmermann, K.F. (1992): Pseudo-R2 in the ordinal probit model. Journal of Mathematical Sociology, 16, 4, pp. 333-342
– Veall, M.R. & Zimmermann, K.F. (1994): Evaluating Pseudo-R2's for binary probit models. Quality & Quantity, 28, pp. 151-164
– Windmeijer, F.A.G. (1995): Goodness-of-fit measures in binary choice models. Econometric Reviews, 14, 1, pp. 101-116
– Zimmermann, K.F. (1993): Goodness of fit in qualitative choice models: review and evaluation. In: H. Schneeweiß & K. Zimmermann (eds): Studies in applied econometrics. Heidelberg: Physica, pp. 25-74

Appendix

What is the solution?
• Short review of the Monte Carlo studies carried out by econometricians to systematically test the most common Pseudo R²s for binary and ordinal probit / logit models
  ◦ Hagle & Mitchell 1992
  ◦ Veall & Zimmermann 1992, 1993, 1994
  ◦ Windmeijer 1995
  ◦ DeMaris 2002
• My fit_cmclogit.ado to calculate the most important Pseudo R²s

Which Pseudo R²s were tested in the MC studies?
• Likelihood-based measures:
  ◦ Maddala / Cox & Snell Pseudo R² (1983 / 1989)
  ◦ Cragg & Uhler / Nagelkerke Pseudo R² (1970 / 1991)
• Log-likelihood-based measures:
  ◦ McFadden Pseudo R² (1974)
  ◦ Aldrich & Nelson Pseudo R² (1984)
  ◦ Aldrich & Nelson Pseudo R² with the Veall & Zimmermann correction (1992)
• Based on the estimated probabilities:
  ◦ Lave / Efron Pseudo R² (1970 / 1978)
• Based on the variance decomposition of the estimated probits / logits:
  ◦ McKelvey & Zavoina Pseudo R² (1975)

Results of the Monte Carlo studies for binary / ordinal logits or probits
• The McKelvey & Zavoina Pseudo R² is the best estimator for the “true R²” of the OLS regression
• The Aldrich & Nelson Pseudo R² with the Veall & Zimmermann correction is the best approximation of the McKelvey & Zavoina Pseudo R²
• The Lave / Efron, Aldrich & Nelson, McFadden and Cragg & Uhler Pseudo R²s underestimate the “true R²” of the OLS regression
• My personal advice: use the McKelvey & Zavoina Pseudo R² or the Aldrich & Nelson Pseudo R² with Veall & Zimmermann correction to assess the fit of binary and ordinal logit models

Log-likelihood-based measures 1
• McFadden Pseudo R² (1974), provided by Stata

$$\text{McFadden Pseudo } R^2 = \rho^2 = 1 - \frac{\log L_A}{\log L_0}$$

Theoretical range: 0 ≤ McFadden Pseudo R² ≤ 1, but ρ² does not reach its maximum of one!

Rule of thumb: 0.20 ≤ McFadden Pseudo R² ≤ 0.40 marks an excellent fit. It is equivalent to 0.40 ≤ R² ≤ 0.80 of a linear regression model (McFadden 1978: 307)

Legend:
  log L_A: log-likelihood of the alternative model
  log L_0: log-likelihood of the zero model

Relationship between McFadden’s ρ² and R² of the regression model
• Interpretation
“Those unfamiliar with the ρ² index should be forewarned that its values tend to be considerably lower than those of the R² index and should not be judged by the standards for a 'good fit' in ordinary regression analysis. For example, values of 0.2 to 0.4 for ρ² represent an excellent fit.”
(McFadden 1978: 307)

(Figure 5.5 in Domencich & McFadden 1975: 124)

Log-likelihood-based measures 2
• Adjusted McFadden Pseudo R² (1985)

$$\text{McFadden Pseudo } R^2_{adjusted} = \bar{\rho}^2 = 1 - \frac{\log L_A - K}{\log L_0}$$

Correction of the McFadden Pseudo R² by the total number of estimated logistic slopes (K), proposed by Ben-Akiva & Lerman (1985: 167)

Range: 0 ≤ adjusted McFadden Pseudo R² ≤ 1, but it does not reach its maximum of one!

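To see the two McFadden formulas at work, the following minimal Python sketch plugs in the log-likelihoods returned for the election-study example, r(logl_m0) and r(logl_ma). It assumes that K equals the 9 degrees of freedom of the likelihood-ratio test; the variable names are illustrative only and are not taken from the ado file.

```python
# Log-likelihoods copied from the r() scalars of the example
logl_m0 = -491.3376899127339   # log L0: zero model with alternative-specific constants
logl_ma = -259.6791267683948   # log LA: fitted model
k_slopes = 9                   # K: assumed equal to the LR test's degrees of freedom

rho2 = 1.0 - logl_ma / logl_m0                    # McFadden likelihood-ratio index
rho2_bar = 1.0 - (logl_ma - k_slopes) / logl_m0   # Ben-Akiva & Lerman adjustment

print(f"McFadden Pseudo R2          = {rho2:.4f}")
print(f"Adjusted McFadden Pseudo R2 = {rho2_bar:.4f}")
```

Under these assumptions the results agree with the reported 0.4715 and 0.4532; the adjustment lowers ρ² by K / |log L0|, so it matters most for small samples or models with many slopes.
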
Likelihood-based measures 1
• Maddala Pseudo R² (1983) or Cox & Snell Pseudo R² (1989):

$$\text{Maddala Pseudo } R^2 \,(R^2_{ML}) = 1 - \left(\frac{L_0}{L_A}\right)^{2/n} = 1 - \exp\left(-\frac{L.R.\chi^2}{n}\right) = 1 - \exp\left(-\frac{2\,(\log L_A - \log L_0)}{n}\right)$$

$$\text{Range: } 0 \le \text{Maddala Pseudo } R^2 \le 1 - L_0^{2/n}$$

Legend:
  L_0: likelihood of the zero model (constant only)
  L_A: likelihood of the alternative model
  n: number of cases

Likelihood-based measures 2
• Cragg & Uhler Pseudo R² (1970) or Nagelkerke Pseudo R² (1991)

$$\text{Cragg \& Uhler Pseudo } R^2 = \frac{R^2_{ML}}{\max R^2_{ML}} = \frac{1 - \left(\dfrac{L_0}{L_A}\right)^{2/n}}{1 - L_0^{2/n}} = \frac{1 - \exp\left(-\dfrac{L.R.\chi^2}{n}\right)}{1 - \exp\left(\dfrac{2}{n}\,\log L_0\right)}$$

Correction of the Maddala Pseudo R² by its own theoretical maximum → Range: 0 ≤ C&U Pseudo R² ≤ 1

Legend:
  log: natural logarithm
  exp: exponential function

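The Maddala index and its Cragg & Uhler rescaling can be checked numerically with the returned log-likelihoods of the election-study example. This minimal Python sketch assumes that n is the number of cases (504 respondents), not the 1,512 alternative-level rows; that assumption is not stated by the ado file, but it reproduces the reported values.

```python
import math

logl_m0 = -491.3376899127339   # log L0 from r(logl_m0)
logl_ma = -259.6791267683948   # log LA from r(logl_ma)
n_cases = 504                  # assumption: n counts respondents, not rows

lr_chi2 = 2.0 * (logl_ma - logl_m0)

ml_r2 = 1.0 - math.exp(-lr_chi2 / n_cases)            # Maddala / Cox & Snell
ml_r2_max = 1.0 - math.exp(2.0 * logl_m0 / n_cases)   # upper bound 1 - L0^(2/n)
cu_r2 = ml_r2 / ml_r2_max                             # Cragg & Uhler / Nagelkerke

print(f"Maddala Pseudo R2       = {ml_r2:.4f}")
print(f"Upper bound of Maddala  = {ml_r2_max:.4f}")
print(f"Cragg & Uhler Pseudo R2 = {cu_r2:.4f}")
```

Because the upper bound stays below one here, the unscaled Maddala value cannot be read on a 0 to 1 scale, which is the motivation for dividing by the bound.
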
Log-likelihood-based measures 3
• Aldrich & Nelson Pseudo R² (1984)

$$\text{Aldrich \& Nelson Pseudo } R^2 = \frac{L.R.\chi^2}{L.R.\chi^2 + n} = \frac{2\,(\log L_A - \log L_0)}{2\,(\log L_A - \log L_0) + n}$$

Veall & Zimmermann correction
• Veall & Zimmermann (1994) propose a correction of the Aldrich & Nelson Pseudo R² by its upper limit
  ◦ Range of the A&N Pseudo R²

$$0 \le \text{A\&N Pseudo } R^2 \le \frac{-2 \log L_0}{n - 2 \log L_0}$$

  ◦ Correction formula

$$\text{A\&N Pseudo } R^2_{V\&Z} = \frac{\dfrac{2\,(\log L_A - \log L_0)}{2\,(\log L_A - \log L_0) + n}}{\dfrac{-2 \log L_0}{n - 2 \log L_0}}$$

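The effect of the Veall & Zimmermann correction can be illustrated with the election-study example. The following Python lines are a sketch under the assumption that n is the number of cases (504); the variable names are made up for this illustration.

```python
logl_m0 = -491.3376899127339   # log L0 from r(logl_m0)
logl_ma = -259.6791267683948   # log LA from r(logl_ma)
n_cases = 504                  # assumption: n counts respondents

lr_chi2 = 2.0 * (logl_ma - logl_m0)

an_r2 = lr_chi2 / (lr_chi2 + n_cases)                    # Aldrich & Nelson
an_r2_max = -2.0 * logl_m0 / (n_cases - 2.0 * logl_m0)   # upper limit of A&N
an_r2_vz = an_r2 / an_r2_max                             # Veall & Zimmermann correction

print(f"Aldrich & Nelson Pseudo R2 = {an_r2:.4f}")
print(f"Upper limit of A&N         = {an_r2_max:.4f}")
print(f"A&N with V&Z correction    = {an_r2_vz:.4f}")
```

With these inputs the uncorrected and corrected values agree with the reported 0.4790 and 0.7246. The upper limit is roughly 0.66, so the correction scales the raw index up by about one half.
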
Based on the estimated probabilities
• Lave / Efron Pseudo R² (1970 / 1978)

$$\text{Lave / Efron Pseudo } R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{p}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \qquad \text{with } \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$

Legend:
  Y_i: value of the dependent variable for case i (1 or 0)
  p̂_i: estimated probability of Y = 1 for case i
  Ȳ: relative frequency of Y = 1

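A small numerical illustration of the Lave / Efron measure follows. The function name efron_r2 and the six observations are hypothetical and serve only to show the arithmetic; they are not data from the election study.

```python
def efron_r2(y, p_hat):
    """Lave / Efron Pseudo R2: 1 - SS(residual) / SS(total) for a 0/1 outcome."""
    n = len(y)
    y_bar = sum(y) / n                                        # relative frequency of Y = 1
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, p_hat))  # squared prediction errors
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # squared deviations from mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: observed choices and predicted probabilities
y = [1, 0, 1, 1, 0, 0]
p_hat = [0.8, 0.3, 0.6, 0.9, 0.2, 0.4]

print(f"Lave / Efron Pseudo R2 = {efron_r2(y, p_hat):.4f}")
```

Predicting the sample share of Y = 1 for every case gives a value of zero, which makes this measure the closest analogue of the OLS R² among the indices discussed.
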
Variance decomposition of estimated Y*
• McKelvey & Zavoina Pseudo R² (M&Z Pseudo R²)

$$\text{M\&Z Pseudo } R^2 = \frac{\operatorname{Var}(\hat{y}^*)}{\operatorname{Var}(\hat{y}^*) + \operatorname{Var}(\varepsilon)} = \frac{\dfrac{1}{n}\sum_{i=1}^{n} \left(\hat{y}^*_i - \bar{\hat{y}}^*\right)^2}{\dfrac{1}{n}\sum_{i=1}^{n} \left(\hat{y}^*_i - \bar{\hat{y}}^*\right)^2 + \dfrac{\pi^2}{3}}$$

Range: 0 ≤ M&Z Pseudo R² ≤ 1

Legend:
  Var(ŷ*): variance of the estimated logits (latent variable Y*)
  ŷ*_i: estimated logit of case i
  mean of ŷ*: mean of the estimated logits
  π²/3: variance of the logistic density function

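The variance decomposition can be sketched in a few lines of Python. The function name mckelvey_zavoina_r2 and the five logits are hypothetical; the sketch assumes a logit model, so the error variance is fixed at π²/3.

```python
import math

def mckelvey_zavoina_r2(logits):
    """M&Z Pseudo R2 for a logit model from the estimated logits (linear predictions)."""
    n = len(logits)
    mean = sum(logits) / n
    var_hat = sum((v - mean) ** 2 for v in logits) / n   # variance of estimated logits
    var_error = math.pi ** 2 / 3.0                       # variance of the logistic error
    return var_hat / (var_hat + var_error)

# Hypothetical estimated logits for five cases
logits = [-2.0, -1.0, 0.0, 1.0, 2.0]

print(f"M&Z Pseudo R2 = {mckelvey_zavoina_r2(logits):.4f}")
```

The measure approaches one only when the estimated logits vary much more than the fixed error variance, and it is zero when all cases receive the same logit.
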
Theoretical model

Dependent variable:
  Intention to vote for party SPD, FDP or CDU (vote): 1) Yes 0) No

Alternative-specific variable:
  Long-term preference for party (gprefall): 1) Yes 0) No

Case-specific variables:
  Religious affiliation (confession): 1) Yes 0) No
  Degree of education (education):
    1) Secondary modern school
    2) Secondary modern school +
    3) Grammar school
    4) College/University

• Reference group: respondents with a secondary modern degree, without religious affiliation and without party preference

McFadden’s choice model (cmclogit)
• Estimation equation

$$\ln\left(\frac{P_i(Y=j)}{P_i(Y=J)}\right) = \sum_{k=1}^{K} \gamma_k\,(z_{ijk} - z_{iJk}) + \alpha_j + \sum_{l=1}^{L} \beta_{jl}\,X_{il}, \qquad j = 1, \dots, J-1$$

Legend:
  First sum: rational-choice part with alternative-specific γ-logit slopes for the differences of Z_k
  Remaining terms: multinomial logit model to estimate the effects of the case-specific exogenous variables
  α_j: logistic constant for the comparison j vs. J
  β_jl: logistic slope of the effect of X_l on the comparison j vs. J

Estimated effects of exogenous variables

Reference group: P(CDU) = 0.3722, P(FDP) = 0.0526, P(SPD) = 0.5751

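The reference-group probabilities follow from the estimated constants alone: for respondents with a secondary modern degree, no religious affiliation and no party preference, all slope terms drop out of the estimation equation. A minimal Python sketch, using the _cons values from the cmclogit output and a constant of zero for the base alternative CDU (variable names are illustrative):

```python
import math

# Alternative-specific constants from the cmclogit output; CDU is the base alternative
constants = {"SPD": 0.4353825, "FDP": -1.956077, "CDU": 0.0}

exp_utils = {party: math.exp(value) for party, value in constants.items()}
total = sum(exp_utils.values())
probs = {party: e / total for party, e in exp_utils.items()}   # conditional logit shares

for party in ("CDU", "FDP", "SPD"):
    print(f"P({party}) = {probs[party]:.4f}")
```

The computed shares agree with the values on the slide up to rounding in the fourth decimal.
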
