Categorical Data Analysis (CDA) - 1
Stat3062
Workineh Muluken (MSc. in Biostatistics)
Email: [email protected]
Wachemo University
Department of Statistics
e.g. severity of an injury (none, mild, moderate, severe), grade levels (1, 2, …)
• This section presents the key distributions for categorical data: the
binomial and multinomial distributions.
Con…
Binomial Distribution
• A categorical variable has a binomial distribution when it counts the number of successes in n trials, each trial having two possible mutually exclusive outcomes (success, failure)
- Identical trials: means that the probability of success is the same for
each trial.
Con…
NB: The trials are often called Bernoulli trials.
P(Y = y) = C(n, y) π^y (1 − π)^(n−y),  y = 0, 1, 2, . . . , n,  where C(n, y) = n!/(y!(n − y)!)
Con…
a. P(no correct answer) = P(Y = 0) = C(10, 0)(0.2)^0(0.8)^10
= 1 × (0.8)^10
= 0.107
b. P(exactly one correct answer) = P(Y = 1) = C(10, 1)(0.2)^1(0.8)^9
= 10 × 0.2 × (0.8)^9
= 0.268
c. Do yourself
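These binomial probabilities can be checked in R with dbinom(); a quick sketch using n = 10 and π = 0.2 as in the example above:
> dbinom(0, size = 10, prob = 0.2)   # P(Y = 0), approximately 0.107
> dbinom(1, size = 10, prob = 0.2)   # P(Y = 1), approximately 0.268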
Con…
Exercise: From the above example, find the mean and variance of Y.
Multinomial Distribution
• When the trials are independent with the same category probabilities
for each trial, the distribution of counts in the various categories is
the multinomial
P(n1, n2, . . . , nc) = [n!/(n1! n2! · · · nc!)] π1^n1 π2^n2 · · · πc^nc,  where n1 + n2 + · · · + nc = n
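In R, multinomial probabilities of this form can be computed with dmultinom(); the sketch below uses hypothetical counts and category probabilities purely for illustration:
> # P(n1 = 3, n2 = 4, n3 = 3) in n = 10 trials with hypothetical probabilities 0.2, 0.5, 0.3
> dmultinom(c(3, 4, 3), size = 10, prob = c(0.2, 0.5, 0.3))   # approximately 0.057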
- Wald test
- Likelihood-Ratio test
- Score Inference
Wald test
• Let β denote an arbitrary parameter.
Con…
Z = (β̂ − β0)/S.E.(β̂) ~ N(0, 1), where β0 is the value of β under H0 and S.E.(β̂) is the standard error of the ML estimate β̂
Con…
• Equivalently, Z² = [(β̂ − β0)/S.E.(β̂)]² ~ χ²(1), a chi-squared distribution with df = 1
• This type of statistic, which uses the standard error evaluated at the
ML estimate, is called a Wald statistic.
• The Z or chi-squared test using this test statistic is called a Wald test
• The score test, by contrast, uses standard errors evaluated under the assumption that the null
hypothesis holds
NB: Wald, likelihood-ratio, and score tests are the three major ways of
constructing significance tests for parameters in statistical models
Con…
Example: Suppose we have 9 successes out of 10 trials in a clinical trial. Test H0: π = 0.5 against Ha: π ≠ 0.5.
Soln.
a. Wald test: π̂ = 9/10 = 0.9 and
Z = (π̂ − π0)/S.E.(π̂) = (0.9 − 0.5)/√(0.9 × 0.1/10) = 0.4/0.0949 = 4.21
b. Score test: the standard error evaluated under H0 is
S.E.(π̂) = √(π0(1 − π0)/n) = √(0.5 × 0.5/10) = 0.158
Z = (π̂ − π0)/S.E.(π̂) = (0.9 − 0.5)/0.158 = 2.53
Con…
Equivalently, Z² = (2.53)² = 6.4 > χ²1(0.05) = 3.84, so we reject H0: π = 0.5.
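A minimal R sketch of these calculations; prop.test() with correct = FALSE gives the score chi-squared statistic directly:
> pihat <- 9/10; pi0 <- 0.5; n <- 10
> (pihat - pi0)/sqrt(pihat*(1 - pihat)/n)       # Wald Z, approximately 4.21
> (pihat - pi0)/sqrt(pi0*(1 - pi0)/n)           # score Z, approximately 2.53
> prop.test(9, 10, p = 0.5, correct = FALSE)    # X-squared = 6.4 on 1 df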
• Let X and Y denote two categorical variables, X with I categories and Y with J categories.
• A table of this form, in which the cells contain frequency counts of the outcomes, is called a contingency table or cross-classification table.
Con…
          Y = 1     Y = 2     . . .     Y = J     Total
X = 1     n11       n12       . . .     n1J       n1+
X = 2     n21       n22       . . .     n2J       n2+
  .        .         .                   .         .
  .        .         .                   .         .
X = I     nI1       nI2       . . .     nIJ       nI+
Total     n+1       n+2       . . .     n+J       n
Con…
Joint probabilities
• πij = P(X = i, Y = j), where Σi Σj πij = 1
Marginal probabilities
• πi+ = Σj πij and π+j = Σi πij
Con…
• Where, Σi πi+ = Σj π+j = 1
Conditional probabilities
• πj|i = πij/πi+ and Σj πj|i = 1
Con…
Example
                        Belief in Afterlife
Gender          Yes              No               Total
Female          n11 = 509        n12 = 116        n1+ = 625
Male            n21 = 398        n22 = 104        n2+ = 502
Total           n+1 = 907        n+2 = 220        n = 1127
Compute:
a. Joint probabilities
b. Marginal probabilities
c. Conditional probabilities
Con…
Soln.
a. The joint probabilities are
π̂11 = 509/1127 = 0.452          π̂12 = 116/1127 = 0.103
π̂21 = 398/1127 = 0.353          π̂22 = 104/1127 = 0.092
b. The marginal probabilities of X (gender) are
π̂1+ = π̂11 + π̂12 = 0.452 + 0.103 = 0.555
π̂2+ = π̂21 + π̂22 = 0.353 + 0.092 = 0.445
The marginal probabilities of Y (belief in afterlife) are
π̂+1 = π̂11 + π̂21 = 0.452 + 0.353 = 0.805
π̂+2 = π̂12 + π̂22 = 0.103 + 0.092 = 0.195
Con…
c. The conditional probabilities of Y given X are
π̂1|1 = P(Yes | Female) = 509/625 = 0.814        π̂1|2 = P(Yes | Male) = 398/502 = 0.793
π̂2|1 = P(No | Female) = 116/625 = 0.186         π̂2|2 = P(No | Male) = 104/502 = 0.207
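These joint, marginal, and conditional proportions can be reproduced in R from the table above (a brief sketch):
> afterlife <- matrix(c(509, 398, 116, 104), nrow = 2)
> dimnames(afterlife) <- list(Gender = c("Female", "Male"), Belief = c("Yes", "No"))
> prop.table(afterlife)                # joint proportions
> addmargins(prop.table(afterlife))    # adds the marginal proportions
> prop.table(afterlife, 1)             # conditional proportions of belief given gender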
Con…
Independence
• X and Y are independent if πij = πi+ π+j for all i and j; equivalently, πj|1 = πj|2 = · · · = πj|I = π+j for all j
Example: Consider the above example and determine whether belief in afterlife is independent of gender.
Con…
Soln.
π̂1+ π̂+1 = 0.555 × 0.805 = 0.447 ≠ 0.452 = π̂11 (equivalently, π̂1|1 = 0.814 ≠ 0.793 = π̂1|2).
This implies that belief in afterlife is not independent of gender; that is, belief in afterlife depends on gender.
Comparing Proportions in two by two tables
Myocardial infarction (aspirin use data)
A 95% confidence interval for the difference in the proportions suffering myocardial infarction is
(π̂1 − π̂2) ± z(α/2) S.E.(π̂1 − π̂2) = 0.0077 ± 0.0030
= (0.0047, 0.0107)
Con…
• The odds ratio (OR) is the only one of the three association measures that is appropriate for
cross-sectional, prospective, and retrospective study designs.
• The odds of an event are the probability of the event occurring divided by the probability that the event does not occur:
Odds = π/(1 − π)
Con…
• A success is more likely than a failure when odds are greater than one
• A success is less likely than a failure when odds are less than one
• For a 2 x 2 contingency table, let odds1 and odds2 denote the odds of success in row 1 and
row 2, respectively. The odds ratio is
θ = odds1/odds2 = [π1/(1 − π1)]/[π2/(1 − π2)], or, in terms of cell counts, θ̂ = (n11 n22)/(n12 n21)
e.g. θ = 4 and θ = 0.25 represent equally strong associations between X and Y, but
in the opposite direction.
• If the order of rows or columns is reversed (but not both), the new
value of θ is the inverse of the original value.
Example: From the aspirin use data, if we exchange the rows, the odds
ratio becomes
θ̂ = 1/1.832 = 0.546
Inference for Odds Ratios and Log Odds Ratios
• For small samples, the distribution of the sample odds ratio is highly
skewed.
• The log odds ratio is symmetric about zero, in the sense that reversing
rows or reversing columns changes its sign
• The sample log odds ratio, log(θ̂), has a less skewed distribution. Its asymptotic standard error is
ASE(log θ̂) = √(1/n11 + 1/n12 + 1/n21 + 1/n22)
Example: Again consider the aspirin use data, find a 95% CI for θ
ASE(log θ̂) = √(1/n11 + 1/n12 + 1/n21 + 1/n22)
= √(1/189 + 1/10845 + 1/104 + 1/10933)
Con…
= 0.123
95% CI for log θ: log(1.832) ± 1.96(0.123) = (0.365, 0.846)
95% CI for θ: exp(0.365, 0.846) = (e^0.365, e^0.846)
= (1.44, 2.33)
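A sketch of the same interval in R, assuming the standard aspirin-use counts (189 and 10845 in the placebo row, 104 and 10933 in the aspirin row) that are consistent with the results above:
> n11 <- 189; n12 <- 10845; n21 <- 104; n22 <- 10933
> theta <- (n11*n22)/(n12*n21)                  # sample odds ratio, approximately 1.83
> se <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)     # ASE of log(theta), approximately 0.123
> exp(log(theta) + c(-1, 1)*1.96*se)            # 95% CI for theta, approximately (1.44, 2.33)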
Relationship between odds ratio and relative risk (RR)
• For some data sets, direct estimation of the relative risk is not possible.
• Thus, one can estimate the odds ratio to approximate the relative risk.
• That is, RR = θ × (1 − π1)/(1 − π2), since θ = RR × (1 − π2)/(1 − π1).
Con…
• When both π1 and π2 are close to zero, the odds ratio and the relative risk take similar values.
• In this case we can use the odds ratio as an estimate of relative risk.
Chi-squared tests of independence
- Cell counts: nij
- Under H0 (independence), the expected cell frequencies are μij = n πij = n πi+ π+j, estimated by μ̂ij = n (ni+/n)(n+j/n) = ni+ n+j / n
- The Pearson chi-squared statistic is
X² = Σ (nij − μ̂ij)²/μ̂ij
Con…
Likelihood-Ratio Statistic
G² = 2 Σ nij log(nij/μ̂ij)
Soln.
We know that μ̂ij = ni+ n+j / n, so the estimated expected frequencies are
μ̂11 = (1557 × 1246)/2757 = 703.67        μ̂21 = (1200 × 1246)/2757 = 542.33
μ̂12 = (1557 × 566)/2757 = 319.65         μ̂22 = (1200 × 566)/2757 = 246.35
μ̂13 = (1557 × 945)/2757 = 533.68         μ̂23 = (1200 × 945)/2757 = 411.32
Then the Pearson chi-squared statistic and the likelihood-ratio statistic are computed as follows.
Con…
X² = Σ (nij − μ̂ij)²/μ̂ij
= (762 − 703.67)²/703.67 + (327 − 319.65)²/319.65 + (468 − 533.68)²/533.68 + (484 − 542.33)²/542.33 + (239 − 246.35)²/246.35 + (477 − 411.32)²/411.32
= 4.835 + 0.169 + 8.083 + 6.274 + 0.219 + 10.488
= 30.068 > χ²2(0.05) = 5.99
G² = 2 Σ nij log(nij/μ̂ij)
= 2{762 log(762/703.67) + 327 log(327/319.65) + 468 log(468/533.68) + 484 log(484/542.33) + 239 log(239/246.35) + 477 log(477/411.32)}
= 2(60.683 + 7.434 − 61.462 − 55.074 − 7.239 + 70.665) = 2(15.007) = 30.014
Since 30.014 > 5.99 (and likewise 30.068 > 5.99), we reject the null hypothesis of independence and conclude that party identification depends on gender.
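The same results can be obtained in R with chisq.test(), using the observed counts above (rows = gender, columns = party identification):
> party <- matrix(c(762, 484, 327, 239, 468, 477), nrow = 2)
> chisq.test(party)                        # Pearson X-squared = 30.07 on 2 df
> expected <- chisq.test(party)$expected   # estimated expected frequencies
> 2*sum(party*log(party/expected))         # likelihood-ratio statistic G2 = 30.01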
Exact inference for small samples
• Both the Pearson chi-squared and the likelihood-ratio statistics work well when the contingency table has large cell counts (large sample size).
• When the sample is small, the distributions of X² and G² are not well approximated by the chi-squared distribution.
• However, the p-values based on exact tests tend to be conservative, that is, larger than the true ones.
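In R, Fisher's exact test for a contingency table is available through fisher.test(); a sketch with a hypothetical small 2 x 2 table:
> small <- matrix(c(3, 1, 1, 3), nrow = 2)    # hypothetical small table
> fisher.test(small)                          # exact p-value rather than a chi-squared approximation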
Associations in three way tables
Partial tables
• They display the XY relationship at fixed levels of Z, hence showing the effect of
X on Y while controlling for Z.
• The partial tables remove the effect of Z by holding its value constant
Marginal tables
• They are obtained by summing the cell counts of the partial tables over the levels of Z, thus ignoring Z rather than controlling for it.
• Typically, odds ratios are used to describe the marginal and conditional associations in a 3-way table.
• For a 2 x 2 x K table, the conditional odds ratio for the partial table at level k of Z is
θ_XY(k) = (μ11k μ22k)/(μ12k μ21k)
Con…
• The table below shows 2 x 2 partial tables relating defendant's race and death penalty verdict at each level of victim's race.
Con…
                                             Death penalty verdict
Victim's race     Defendant's race     Yes     No      Percentage Yes
White             White                53      414     11.3
                  Black                11      37      22.9
Black             White                0       16      0.0
                  Black                4       139     2.8
Total             White                53      430     11.0
                  Black                15      176     7.9
Con…
b. Find and interpret the sample conditional odds ratio, adding 0.5 to
each cell to reduce the impact of 0 cell counts.
                      Death penalty verdict
Defendant's race      Yes     No
White                 0       16
Black                 4       139
Con…
a). When the victim's race is white:
θ̂_XY(1) = (53 × 37)/(414 × 11) = 1961/4554 = 0.4306
Interpretation: Since the odds ratio for white victims is less than 1, a white defendant is less likely to receive the death penalty than a black defendant when the victim's race is white. Or:
The sample odds of receiving a death penalty verdict for white defendants were 43% of the sample odds for black defendants, controlling for victim's race.
Con…
b). When the victim's race is black (adding 0.5 to each cell):
θ̂_XY(2) = (0.5 × 139.5)/(16.5 × 4.5) = 69.75/74.25 = 0.94
c). To get the sample marginal odds ratio θ̂_XY, we sum the cell counts over the categories of victim's race and obtain the following data.
Death penalty verdict
Defendants race Yes No
White 53 430
Black 15 176
θ̂_XY = (53 × 176)/(430 × 15) = 9328/6450 = 1.45
Con…
Or: ignoring victim's race, the sample odds of receiving a death penalty verdict for white defendants were 45% higher than those for black defendants.
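A short R sketch of these conditional and marginal odds ratios, using the death-penalty counts in the table above:
> white_victim <- matrix(c(53, 11, 414, 37), nrow = 2)    # rows: white, black defendant
> black_victim <- matrix(c(0, 4, 16, 139), nrow = 2)
> (53*37)/(414*11)                                        # conditional OR, white victims, approx. 0.43
> ((0 + 0.5)*(139 + 0.5))/((16 + 0.5)*(4 + 0.5))          # conditional OR, black victims, approx. 0.94
> marginal <- white_victim + black_victim                 # collapse the partial tables over victim's race
> (marginal[1, 1]*marginal[2, 2])/(marginal[1, 2]*marginal[2, 1])   # marginal OR, approx. 1.45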
Con…
Conditional versus marginal independence
• X and Y are conditionally independent given Z when they are independent in every partial table, i.e. θ_XY(1) = θ_XY(2) = · · · = θ_XY(K) = 1
Example: Consider the death penalty data given above and determine whether defendant's race and death penalty verdict are
a. Conditionally independent, given victim's race
b. Marginally independent
Soln. Since θ̂_XY(1) = 0.43 ≠ 1 and θ̂_XY(2) = 0.94 ≠ 1, defendant's race (X) and death penalty verdict (Y) are not conditionally independent given victim's race; and since θ̂_XY = 1.45 ≠ 1, they are not marginally independent either.
Chi-squared test of homogeneity
• That is, H0: πj|1 = πj|2 = · · · = πj|I for each response category j.
- This test gets its name from the null hypothesis, under which the distributions of the responses are the same (homogeneous) across groups.
- H0: The distribution of the categorical variable is the same for all populations (or subgroups), or, equivalently, the proportion in each response category is the same across the sub-groups.
Con…
Ha: At least one proportion of the response variable is not the same across the groups; that is, the distribution of the response variable differs across the sub-groups.
Con…
- The expected frequency count for each cell of the table is at least 5.
Division I schools: large universities with large athletic budgets (revenue from the games); they must offer athletic scholarships.
Division II schools: smaller public universities and many private institutions; they have much smaller budgets (solely from the college).
Q. Does steroid use by student athletes differ for the three NCAA divisions?
Con…
Soln.
H0: The proportion of athletes using steroids is the same in each of the three NCAA divisions.
Ha: The proportion of athletes using steroids is not the same in each of the three NCAA divisions.
Con…
- For the response "Yes", the proportion of steroid use in the combined samples is
220/19377 = 0.01135
so the expected counts are
Division I: 0.01135 × 8543 = 96.96
Division II: 0.01135 × 4341 = 49.27
Division III: 0.01135 × 6493 = 73.70
Con…
- For the response "No", the proportion of non-steroid use in the combined samples is
19157/19377 = 0.98865
so the expected counts are
Division I: 0.98865 × 8543 = 8446.04
Division II: 0.98865 × 4341 = 4291.73
Division III: 0.98865 × 6493 = 6419.30
- We calculate the chi-square test statistic in the same way as in the test of independence:
⇨ X² = Σ (observed count − expected count)²/(expected count)
= 0.3763 + 0.1513 + 1.0270 + 0.0043 + 0.0017 + 0.0118 = 1.5724
Con…
- For chi-square tests based on two-way tables (both the test of independence and the test of homogeneity), the degrees of freedom are (r − 1)(c − 1), where r and c are the numbers of rows and columns.
- Here, d.f. = (3 − 1)(2 − 1) = 2.
Decision: Since the calculated value (1.5724) is less than the tabulated value χ²2(0.05) = 5.99, we fail to reject H0.
Conclusion: The data does not provide strong enough evidence to conclude
that steroid use differs in the three NCAA divisions
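In R the test of homogeneity is carried out with chisq.test(), exactly as for the test of independence. The observed counts used below are an assumption reconstructed from the expected frequencies and chi-squared contributions above (103, 52, and 65 steroid users in Divisions I-III):
> steroid <- matrix(c(103, 8440, 52, 4289, 65, 6428), nrow = 3, byrow = TRUE)  # assumed observed counts
> dimnames(steroid) <- list(Division = c("I", "II", "III"), Steroid = c("Yes", "No"))
> chisq.test(steroid)       # X-squared approx. 1.57 on 2 df, p approx. 0.46: fail to reject H0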
Chapter 3: Logistic regression model
Assumptions of logistic regression model
• Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares; in particular, it does not require normally distributed errors, constant error variance (homoscedasticity), or a linear relationship between the response and the predictors on the original scale.
• However, the logistic regression model still relies on some other assumptions:
• It assumes that the independent variables are linearly related to the log of
odds.
• LRM is used when the dependent variable is categorical and the independent
variables are of any type.
• Components of GLMs: a random component (the distribution of the response), a systematic component (the linear predictor), and a link function connecting the two.
• The LRM is often called the logit model, as the link in this GLM is the logit link.
• LRM calculates changes in the log odds of the dependent variable, not
changes in the dependent variable itself as OLS regression does
• Let us start with simple binary logistic regression model with one
independent variable,
logit(π(x)) = log(π(x)/(1 − π(x))) = α + βx
• From the above model, the probability of success, π(x), is given by
π(x) = exp(α + βx)/(1 + exp(α + βx))   (show it)
Soln.
π(x)/(1 − π(x)) = exp(α + βx)
⇨ π(x) = exp(α + βx) − π(x) exp(α + βx)
⇨ π(x)[1 + exp(α + βx)] = exp(α + βx)
⇨ π(x) = exp(α + βx)/(1 + exp(α + βx))
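In R this inverse-logit transformation is available directly as plogis(); a quick sketch with hypothetical values α = −1 and β = 0.5:
> alpha <- -1; beta <- 0.5; x <- 2                   # hypothetical parameter and covariate values
> exp(alpha + beta*x)/(1 + exp(alpha + beta*x))      # pi(x) from the formula above
> plogis(alpha + beta*x)                             # same value (0.5 here)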
Con…
θ = odds(x + 1)/odds(x)
= [π(x + 1)/(1 − π(x + 1))]/[π(x)/(1 − π(x))]
= exp(α + β(x + 1))/exp(α + βx)
= exp(β)
Con…
⇨ log θ = β, so the odds of success multiply by e^β for every one-unit increase in x.
Con…
• Let x and z each take values 0 and 1 to represent the two categories of each explanatory
variable
logit(π) = α + β1 x + β2 z
• The effect of one factor is the same at each category of the other factor.
• At z = 0:
logit(π) = α              when x = 0    (1)
logit(π) = α + β1         when x = 1    (2)
(2) − (1) gives
log(odds ratio) = (α + β1) − α = β1
⇨ θ = e^β1
Con…
• At z = 1:
logit(π) = α + β2         when x = 0    (3)
logit(π) = α + β1 + β2    when x = 1    (4)
(4) − (3) gives
log(odds ratio) = (α + β1 + β2) − (α + β2) = β1
⇨ θ = e^β1
Con…
• This difference between two logits equals the difference of log odds.
Equivalently, that difference equals the log of the odds ratio between
X and Y , at that category of Z.
> install.packages("aod")
> library(aod)
Con…
Example:
>model1=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toilet+Birth
_type+Mother_education_level,family="binomial")
>wald.test(b=coef(model1),Sigma=vcov(model1),Terms=6:7)
- In this case, terms 6 and 7 are the two dummy-variable terms for the levels of mother's education.
Con…
>predict(model1,type="response")   # predicted probabilities of contraceptive use
Example:
- Mother_age: continuous
- Family_size: discrete
>library(stats)
>model1=glm(Contraceptive_use~Mother_age+Family_size+Availability
_of_toilet+Birth_type+Mother_education_level,family="binomial")
>summary(model1)
Con…
• β̂_Family_size = −0.062374; this implies that as family size increases by one unit, the estimated odds of using a contraceptive method are multiplied by exp(−0.062374) = 0.9395, i.e., they decrease by about 6%.
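Odds ratios like this one can be obtained for all coefficients at once (a sketch, assuming model1 has been fitted as above):
> exp(coef(model1))              # odds ratios for each predictor
> exp(confint.default(model1))   # Wald-type 95% confidence intervals for the odds ratios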
• Mainly, we consider
- Variable selection
- Goodness-of-fit tests
- Model diagnostics
Strategies for model selection
i. Information criteria
AIC = −2(maximized log likelihood) + 2(number of model parameters)
BIC = −2(maximized log likelihood) + log(n) × (number of model parameters)
Con…
• The most commonly used criterion is AIC; BIC is motivated by a Bayesian argument and penalizes additional parameters more heavily.
Example: Compare the following models based on AIC.
>model1=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toi
let+Birth_type+Mother_education_level,family="binomial")
Con…
>model2=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toil
et,family="binomial")
> AIC(model1)
[1] 11620.27
> AIC(model2)
[1] 11617.51
- Since model2 has the smaller AIC, model2 is preferred.
Example: Consider the following output and compare the two models using the
LRT
Con…
>library(lmtest)
>lrtest(model2,model1)
Model 1: Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet
Model 2: Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet + Birth_type + Mother_education_level
  #Df  LogLik Df Chisq Pr(>Chisq)
1   4 -5804.8
2   7 -5803.1  3  3.24       0.36
• From the above output we observe that the p-value is not significant (p > 0.05), so we fail to reject H0 that the extra terms are zero and conclude that the simpler model2 is relatively better than model1.
Variable selection
• As the number of model predictors increases, it becomes more likely that some
ML model parameter estimates are infinite.
• A variable may seem to have little effect because it overlaps considerably with
other predictors in the model, itself being predicted well by the other predictors.
• The most common methods for variable selection are stepwise variable selection algorithms.
• Here we use the AIC value as the criterion to select the best subset of predictor variables.
Con…
Forward selection method
• This algorithm starts with the null/empty model and we add terms/variables
sequentially until further additions don’t improve the fit.
• That is, at each step the variable whose addition gives the smallest AIC (or the smallest p-value) is added to the model.
Backward elimination
• At a given stage, it eliminates the term in the model that is least significant (largest p-value) or, equivalently, whose removal gives the smallest AIC.
Con…
• The process stops when any further deletion leads to a significantly poorer fit.
Example: Consider the following variables and fit a model using variable selection
methods.
- Mother_age
-Family_size
-Availability_of_toilet
- Birth_type
-Mother_education_level
Con…
>library(stats)
>fit.all=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toil
et+Birth_type+Mother_education_level,family="binomial") # to fit full model
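The forward-selection trace below can be produced with calls along these lines (a sketch; fit.null is an assumed name for the intercept-only model):
>fit.null=glm(Contraceptive_use~1,family="binomial")           # intercept-only (null) model
>step(fit.null,scope=formula(fit.all),direction="forward")     # add terms one at a time by AIC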
Df Deviance AIC
+ Availability_of_toilet 1 11798 11802
+ Mother_age 1 12137 12141
+ Family_size 1 12149 12153
<none> 12249 12251
+ Birth_type 1 12249 12253
+ Mother_education_level 2 12247 12253
Con…
Step: AIC=11801.73
Contraceptive_use ~ Availability_of_toilet
Df Deviance AIC
+ Mother_age 1 11645 11651
+ Family_size 1 11729 11735
<none> 11798 11802
+ Mother_education_level 2 11795 11803
+ Birth_type 1 11797 11803
Con…
Step: AIC=11651.2
Contraceptive_use ~ Availability_of_toilet + Mother_age
Df Deviance AIC
+ Family_size 1 11610 11618
<none> 11645 11651
+ Mother_education_level 2 11642 11652
+ Birth_type 1 11645 11653
Con…
Step: AIC=11617.51
Contraceptive_use ~ Availability_of_toilet + Mother_age + Family_size
Df Deviance AIC
<none> 11610 11618
+ Mother_education_level 2 11606 11618
+ Birth_type 1 11609 11619
- The variables selected by forward selection are:
- Availability_of_toilet
- Mother_age
- Family_size
>fit.all=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toil
et+Birth_type+Mother_education_level,family="binomial")# to fit the full
model
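The backward-elimination trace below comes from a call of this form (a sketch; step() uses AIC by default):
>step(fit.all,direction="backward")    # repeatedly drop the term whose removal most reduces AIC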
Df Deviance AIC
- Birth_type 1 11606 11618
- Mother_education_level 2 11609 11619
<none> 11606 11620
- Family_size 1 11642 11654
- Mother_age 1 11725 11737
- Availability_of_toilet 1 12069 12081
Con…
Step: AIC=11618.41
Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet +
Mother_education_level
Df Deviance AIC
- Mother_education_level 2 11610 11618
<none> 11606 11618
- Family_size 1 11642 11652
- Mother_age 1 11726 11736
- Availability_of_toilet 1 12069 12079
Con…
Step: AIC=11617.51
Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet
Df Deviance AIC
<none> 11610 11618
- Family_size 1 11645 11651
- Mother_age 1 11729 11735
- Availability_of_toilet 1 12071 12077
Coefficients:
(Intercept) Mother_age Family_size Availability_of_toilet
-0.26979 -0.03177 -0.06230 1.03718
• Once the model is fitted the next important step is to check whether
the probabilities produced by the model accurately reflect the true
outcome experience in the data.
- Hosmer-Lemeshow test
- Classification tables
- Receiver operating characteristic (ROC) curve
Con…
i. Hosmer-Lemeshow test
• The subjects are grouped into g groups (typically deciles of risk based on the estimated probabilities), and the test statistic is
Ĉ = Σ (from k = 1 to g) (ok − nk π̄k)² / [nk π̄k (1 − π̄k)]
• Where ok = the observed number of successes in group k, nk = the number of subjects in group k, and π̄k = the average estimated probability in group k.
• Under the null hypothesis that the model adequately fits the data, the distribution of Ĉ is approximated by the chi-squared distribution with (g − 2) degrees of freedom.
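One way to carry out this test in R is via the ResourceSelection package (a sketch, assuming model1 is the fitted logistic regression model):
> library(ResourceSelection)
> hoslem.test(model1$y, fitted(model1), g = 10)    # Hosmer-Lemeshow test with g = 10 groups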
Con…
ii. Classification table
• This table is the result of cross-classifying the observed outcome variable, y, with a dichotomous variable whose values are derived from the estimated logistic regression probabilities (e.g., predicting ŷ = 1 when π̂ exceeds a chosen cut-point such as 0.5).
• If the model predicts group membership accurately according to some criterion, then
this is thought to provide evidence that the model fits the data well
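A classification table can be built directly from the fitted probabilities (a sketch using a 0.5 cut-point and the model1 fit above):
> predicted <- ifelse(fitted(model1) > 0.5, 1, 0)     # classify each subject using a 0.5 cut-point
> table(Observed = model1$y, Predicted = predicted)   # cross-classification of observed and predicted outcomes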
Con…
iii. Receiver Operating Characteristic (ROC) curve
• ROC curve is the plot of sensitivity versus (1-specificity) for an entire range
of possible cut points.
• The area under ROC curve, which ranges from 0.5 to 1 provides a measure
of the model’s ability to discriminate between those subjects who
experience the outcome of interest versus those who do not.
Con…
• As a rule of thumb, the area under the ROC curve (AUC) is interpreted using general guidelines; for example,
0.5 < AUC < 0.7 suggests poor discrimination.
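The ROC curve and the area under it can be obtained with the pROC package (a sketch based on the model1 fit above):
> library(pROC)
> roc_obj <- roc(model1$y, fitted(model1))   # sensitivity versus 1 - specificity over all cut points
> auc(roc_obj)                               # area under the ROC curve
> plot(roc_obj)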
Multicategory logit models
• These models are used for categorical responses (nominal and ordinal) with more than two categories.
• Logit models for nominal response variables pair each category with a baseline
category.
• When the last category (J ) is the baseline, the baseline-category logits are
• Given that the response falls in category j or category J , this is the log odds that
the response is j.
Con…
log(πj/πJ) = αj + βj x,   j = 1, . . . , J − 1
- This model has J − 1 equations, with separate parameters for each.
- For example, with J = 3 and baseline category 3:
log(π1/π3) = α1 + β1 x ⇨ the log odds that the outcome is 1 rather than 3
log(π2/π3) = α2 + β2 x ⇨ the log odds that the outcome is 2 rather than 3
Con…
• For an arbitrary pair of categories a and b, the log odds that the response is
in category a rather than in category b is given by
log(πa/πb) = log(πa/πJ) − log(πb/πJ)
= (αa + βa x) − (αb + βb x)
= (αa − αb) + (βa − βb) x
- So this equation has the form α + βx, with intercept parameter α = (αa − αb) and slope parameter β = (βa − βb).
Con…
Example: Alligators' food choice
The primary food type found in an alligator's stomach has three categories (Fish, Invertebrate, Other), and here we take the length of the alligator in meters (x) as the predictor variable. Using the baseline category "Other" as the reference group, the analysis gives the fitted equations
(1) log(π̂F/π̂O) = 1.618 − 0.110 x
(2) log(π̂I/π̂O) = 5.697 − 2.465 x
Find:
a). The estimated odds ratio for the above two equations and interpret them
b). The estimated log odds ratio that the response is fish rather than
“invertebrate” and interpret it.
Soln.
a). For equation (1): exp(β̂F) = exp(−0.110) = 0.896
For equation (2): exp(β̂I) = exp(−2.465) = 0.085
Con…
Interpretation:
• For equation (1): for every one-unit increase in the alligator's length, the estimated odds that the primary food type is "Fish" rather than "Other" are multiplied by 0.896, i.e., they decrease.
• For equation (2): for every one-unit increase in the alligator's length, the estimated odds that the primary food type is "Invertebrate" rather than "Other" are multiplied by 0.085, i.e., they decrease sharply.
b). The estimated log odds that the primary food type is “Fish” rather than
“invertebrate” equals
Con…
log(π̂F/π̂I) = (1.618 − 5.697) + [−0.110 − (−2.465)] x
= −4.08 + 2.355 x
Interpretation:
For every one-unit increase in the alligator's length, the estimated odds that the primary food type is "Fish" rather than "Invertebrate" are multiplied by e^2.355 = 10.5, i.e., they increase.
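Baseline-category logit models of this kind can be fitted in R with multinom() from the nnet package (a sketch; alligator is a hypothetical data frame with a factor column food and a numeric column length, and relevel() sets "Other" as the baseline):
> library(nnet)
> alligator$food <- relevel(alligator$food, ref = "Other")   # use "Other" as the baseline category
> fit <- multinom(food ~ length, data = alligator)
> summary(fit)                                               # one intercept and one slope per non-baseline category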
Con…
Estimated response probabilities
• This is
πj = exp(αj + βj x) / [1 + Σ (from h = 1 to J − 1) exp(αh + βh x)],   j = 1, 2, . . . , J
Example: The estimates from the above example of "alligators' food choice" contrast "Fish" and "Invertebrate" with "Other" as the baseline category. The estimated probabilities of the outcomes (Fish, Invertebrate, Other) for an alligator length of x = 0.015 m are
π̂F = exp(1.618 − 0.110 x) / [1 + exp(1.618 − 0.110 x) + exp(5.697 − 2.465 x)]
= exp(1.618 − 0.110(0.015)) / [1 + exp(1.618 − 0.110(0.015)) + exp(5.697 − 2.465(0.015))]
= 0.017
Con…
Similarly,
π̂I = exp(5.697 − 2.465(0.015)) / [1 + exp(1.618 − 0.110(0.015)) + exp(5.697 − 2.465(0.015))]
= 0.98
π̂O = 1 / [1 + exp(1.618 − 0.110(0.015)) + exp(5.697 − 2.465(0.015))]
= 0.0033
Con…
NB: The term 1 in each denominator, and in the numerator of π̂O, represents exp(αJ + βJ x) for the baseline category, for which αJ = βJ = 0.
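The three estimated probabilities can be checked with a few lines of R, using the coefficients and the length value from the example above:
> x <- 0.015
> num_fish <- exp(1.618 - 0.110*x)
> num_invert <- exp(5.697 - 2.465*x)
> denom <- 1 + num_fish + num_invert               # the 1 corresponds to the baseline category "Other"
> c(Fish = num_fish, Invertebrate = num_invert, Other = 1)/denom   # approximately 0.017, 0.98, 0.003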