
Categorical Data Analysis (CDA)

Stat3062
Workineh Muluken (MSc. in Biostatistics)
Email: [email protected]

Wachemo University
Department of Statistics

Address: Hossana, Ethiopia
Phone: 09 20 22 56 62 / 09 27 68 17 96

February 2021
Chapter 1: Introduction
Categorical response data/variable
• A categorical variable is a variable whose measurement scale
consists of a set of categories.

• A variable is categorical when it records discrete groups (nominal/ordinal).

- Nominal (unordered) variables: categories have no natural order

(e.g., gender (male, female), race (white, black), …)

- Ordinal (ordered) variables: categories have a natural order/ranking

Con…

e.g. severity of an injury (none, mild, moderate, severe), grade levels (1, 2, …)

Response/explanatory variable distinction


• Most statistical analyses distinguish between response variables and explanatory
variables.

• The response variable is sometimes called the dependent variable, outcome variable, or Y variable.

• The explanatory variables are sometimes called the independent variables, regressors, predictors, or X variables.
Variables and types of data for CDA

• The response variable is categorical

• Explanatory/predictor variables can be categorical or continuous

Nominal/Ordinal Scale Distinction


• Categorical variables have two main types of measurement scales

• Categorical variables having ordered scales are called ordinal variables.

• Categorical variables having unordered scales are called nominal variables


Probability Distributions for Categorical Data

• Inferential statistical analyses require assumptions about the probability distribution of the response variable.

- For regression and analysis of variance (ANOVA) models for continuous data, the normal distribution plays a central role.

• This section presents the key distributions for categorical data: the
binomial and multinomial distributions.
Con…
Binomial Distribution
• A categorical variable has a binomial distribution when each observation has two possible mutually exclusive outcomes (success, failure).

• Assumptions of the binomial distribution

- Independent trials: the response outcomes are independent random variables; that is, the outcome of one trial does not affect the outcome of another.

- Identical trials: the probability of success is the same for each trial.
Con…
NB: The trials are often called Bernoulli trials.

• Let π denote the probability of success for a given trial.

• Let Y denote the number of successes out of the n trials.

• Under the assumption of independent, identical trials, Y has the binomial distribution with index n and parameter π.

• Then, the probability of outcome y for Y is given by

P(Y = y) = (n choose y) π^y (1 − π)^(n−y),  y = 0, 1, 2, . . . , n
Con…

Example: Suppose a quiz has 10 multiple-choice questions, with five possible answers for each. A student who is completely unprepared randomly guesses the answer for each question. Let Y denote the number of correct responses. The probability of a correct response is 0.20 for a given question. Find the probability of getting

a. No correct answer

b. One correct answer

c. At most 9 correct answers, 5 correct answers, at least 1 correct answer


Con…

Soln. Given: n = 10, π = 0.2

a. P(Y = 0) = (10 choose 0)(0.2)^0(0.8)^10 = 1 × (0.8)^10 = 0.107

b. P(Y = 1) = (10 choose 1)(0.2)^1(0.8)^9 = 10 × 0.2 × (0.8)^9 = 0.268

c. Do yourself
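NB: These binomial probabilities, including the "do yourself" parts, can be checked with a short Python sketch (standard library only):

```python
# Quiz example: n = 10 questions, probability of a correct guess pi = 0.2.
from math import comb

def binom_pmf(y, n, pi):
    """P(Y = y) = C(n, y) * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 10, 0.2
p0 = binom_pmf(0, n, pi)               # a. no correct answer
p1 = binom_pmf(1, n, pi)               # b. one correct answer
p_at_most_9 = 1 - binom_pmf(10, n, pi) # c. at most 9 correct
p_exactly_5 = binom_pmf(5, n, pi)      # c. exactly 5 correct
p_at_least_1 = 1 - p0                  # c. at least 1 correct

print(round(p0, 3), round(p1, 3))      # 0.107 0.268
```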
Con…

• If Y has a binomial distribution with n trials and parameter π, then the mean and variance of Y are given by

E(Y) = nπ,  Var(Y) = nπ(1 − π)

Exercise: From the above example find the mean and variance of Y
Multinomial Distribution

• Some trials have more than two possible outcomes.

Example: The outcome for a driver in an auto accident might be recorded using the categories (uninjured, injury not requiring hospitalization, injury requiring hospitalization, fatality).

• When the trials are independent with the same category probabilities for each trial, the distribution of counts in the various categories is the multinomial.

• Let c denote the number of outcome categories.


Con…
• We denote their probabilities by {π1, π2, . . . , πc}, where Σj πj = 1

• For n independent observations, the multinomial probability that n1 fall in category 1, n2 fall in category 2, . . . , nc fall in category c, where Σj nj = n, is given by

P(n1, n2, …., nc) = [n!/(n1! n2! ……… nc!)] × π1^n1 × π2^n2 × ……… × πc^nc

NB: The binomial distribution is a special case of the multinomial distribution with c = 2 categories.
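The multinomial formula can be coded directly, as a sketch; the accident-category probabilities used below are made-up illustrative values, not estimates from data:

```python
# Multinomial probability, standard library only.
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(n1, ..., nc) = n!/(n1!...nc!) * prod(pi_j^n_j)."""
    n = sum(counts)
    coef = factorial(n) // prod(factorial(k) for k in counts)
    return coef * prod(p**k for p, k in zip(probs, counts))

# Assumed probabilities for (uninjured, injury w/o hospitalization,
# injury w/ hospitalization, fatality) -- illustrative only.
probs = [0.60, 0.25, 0.10, 0.05]
counts = [6, 2, 1, 1]   # one possible outcome for n = 10 drivers
print(multinomial_pmf(counts, probs))
```

With c = 2 this reduces to the binomial pmf, matching the NB above.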
Con…

• The marginal distribution of the count in any particular category is binomial.

• For category j, the count nj has mean nπj and variance nπj(1 − πj).

• Most methods for categorical data assume the binomial distribution for a count in a single category and the multinomial distribution for a set of counts in several categories.
Statistical inference for a proportion
Significance Test About a Binomial Proportion

• For the binomial distribution, we use the ML estimator in statistical inference for the parameter π.

• The ML estimator of π is the sample proportion, p = y/n.

• The sampling distribution of the sample proportion p has mean and standard error of

E(p) = π and S.E.(p) = sqrt(π(1 − π)/n)

• The sampling distribution of p is approximately normal for large n.
Con…

• Consider the null hypothesis H0: π = π0 that the parameter equals some fixed value, π0.

- The test statistic is given by

z = (p − π0)/sqrt(π0(1 − π0)/n) ~ N(0, 1)

Example: In 2002, 893 American adults were asked, “Do you believe that a pregnant woman should be able to obtain an abortion?” Then 400 adults replied “yes” to this question. Test H0: π = 0.5 vs Ha: π ≠ 0.5.
Con…
Soln.

Given: n = 893, y = 400

p = y/n = 400/893 = 0.448

S.E.(p) = sqrt(π0(1 − π0)/n) = sqrt(0.5 × 0.5/893) = 0.0167

The test statistic becomes

z = (p − π0)/S.E.(p) = (0.448 − 0.5)/0.0167 = −3.11

Decision: Since |−3.11| > 1.96, we reject H0.

Conclusion: The population proportion of American adults who believe that a pregnant woman should be able to obtain an abortion is different from 0.5.
Confidence Intervals for a Binomial Proportion
• A large-sample 100(1 − α)% confidence interval for π has the formula
p ± z(α/2) × S.E.(p), where S.E.(p) = sqrt(p(1 − p)/n)

Example: For the attitudes about abortion example discussed above, we have p = 0.448, n = 893. Find a 95% confidence interval for π.

Soln.
0.448 ± 1.96 × sqrt(0.448 × 0.552/893) = 0.448 ± 0.033, or (0.415, 0.481)

Conclusion: We can be 95% confident that the population proportion of


Americans in 2002 who favored legalized abortion for pregnant women
who do not want more children is between 0.415 and 0.481.
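Both the test and the confidence interval for this example can be reproduced in a standard-library Python sketch:

```python
# Abortion-attitude example: z test of H0: pi = 0.5 and 95% Wald CI for pi.
from math import sqrt

n, y, pi0 = 893, 400, 0.5
p = y / n                           # sample proportion

# Test statistic: the null standard error uses pi0
se0 = sqrt(pi0 * (1 - pi0) / n)
z = (p - pi0) / se0                 # about -3.11

# 95% CI: the standard error here uses p itself
se = sqrt(p * (1 - p) / n)
ci = (p - 1.96 * se, p + 1.96 * se) # about (0.415, 0.481)

print(round(z, 2), tuple(round(v, 3) for v in ci))
```

Note the two different standard errors: the test substitutes π0, while the interval substitutes p.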
More on statistical inference for discrete data
• There are three ways of using the likelihood function to conduct
inference (confidence intervals and significance tests) about
parameters.

- Wald test

- Likelihood-Ratio test

- Score Inference

Wald test
• Let β denote an arbitrary parameter.
Con…

• Consider a significance test of H0: β = β0 (such as H0: β = 0, for which β0 = 0).

• Let S.E. denote the standard error of β̂, evaluated by substituting the ML estimate for the unknown parameter.

• For the binomial parameter π, S.E. = sqrt(p(1 − p)/n)

• When H0 is true, the test statistic

z = (β̂ − β0)/S.E. ~ N(0, 1)
Con…

• Equivalently, z² = [(β̂ − β0)/S.E.]² ~ χ²(1)

• This type of statistic, which uses the standard error evaluated at the ML estimate, is called a Wald statistic.

• The z or chi-squared test using this test statistic is called a Wald test.

Likelihood-ratio test (LRT)


• The LRT is based on the ratio of two likelihood functions.

• Let ℓ0 be the likelihood function calculated at β0, and ℓ1 the likelihood function calculated at the ML estimate β̂.
Con…

• The likelihood-ratio test statistic equals

LRT = −2 log(ℓ0/ℓ1), where log is the natural logarithm

• Under H0, the LRT statistic has a large-sample chi-squared distribution with df = 1.
Score test

• This test uses standard errors computed under the assumption that the null hypothesis holds.

NB: Wald, likelihood-ratio, and score tests are the three major ways of
constructing significance tests for parameters in statistical models
Con…
Example: Suppose we have 9 successes out of 10 trials in a clinical trial.

Then test H0: π = 0.5 vs Ha: π ≠ 0.5 using

a. Wald test  b. LRT  c. Score test

Soln.

a.

The sample proportion p = 9/10 = 0.9 and n = 10

S.E.(p) = sqrt(p(1 − p)/n) = sqrt(0.9 × 0.1/10) = sqrt(0.009) = 0.095


Con…

z = (p − π0)/S.E.(p) = (0.9 − 0.5)/0.095 = 4.21

The corresponding chi-squared statistic is

z² = (4.21)² = 17.8 ~ χ²(1), where the critical value χ²(1, 0.05) = 3.84

Decision: Since z² = 17.8 is greater than 3.84, we reject H0: π = 0.5.

Conclusion: We can conclude that the population proportion in the clinical trial is different from 0.5.
Con…
b.

The likelihood function for the binomial distribution is given by

L(π) = (n choose y) π^y (1 − π)^(n−y), then

ℓ1 = L(0.9) = (10 choose 9)(0.9)^9(0.1)^1 = 10 × 0.3874 × 0.1 = 0.3874

ℓ0 = L(0.5) = (10 choose 9)(0.5)^9(0.5)^1 = 10 × 0.00195 × 0.5 = 0.00975

LRT = −2 log(ℓ0/ℓ1) = −2 log(0.00975/0.3874) = −2 log(0.0252) = 7.36
Con…

Decision: Since LRT = 7.36 > χ²(1, 0.05) = 3.84, we reject H0.

Conclusion: We can conclude that the population proportion in the clinical trial is different from 0.5.

c. The null standard error is given by

S.E.0(p) = sqrt(π0(1 − π0)/n) = sqrt(0.5 × 0.5/10) = sqrt(0.025) = 0.158

Then, the z statistic is given by

z = (p − π0)/S.E.0(p) = (0.9 − 0.5)/0.158 = 2.53
Con…

The corresponding chi-squared statistic is

z² = (2.53)² = 6.4 ~ χ²(1), where χ²(1, 0.05) = 3.84

Decision: Since 6.4 is greater than 3.84, we reject H0.

Conclusion: We can conclude that the population proportion in the clinical trial is different from 0.5.
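All three tests for this example can be computed in a few lines (a standard-library sketch):

```python
# Wald, likelihood-ratio, and score tests for H0: pi = 0.5
# with y = 9 successes in n = 10 trials.
from math import sqrt, log, comb

n, y, pi0 = 10, 9, 0.5
p = y / n

# Wald: standard error evaluated at the ML estimate p
z_wald = (p - pi0) / sqrt(p * (1 - p) / n)
chi2_wald = z_wald**2                            # about 17.8

# Likelihood ratio: -2 log(l0 / l1)
l0 = comb(n, y) * pi0**y * (1 - pi0)**(n - y)
l1 = comb(n, y) * p**y * (1 - p)**(n - y)
lrt = -2 * log(l0 / l1)                          # about 7.36

# Score: standard error evaluated under H0
z_score = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
chi2_score = z_score**2                          # about 6.4

print(round(chi2_wald, 1), round(lrt, 2), round(chi2_score, 1))
```

The three statistics differ noticeably here because n is small; all three agree asymptotically.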
Chapter 2: Contingency Table
Describing contingency table

• Let X and Y denote two categorical variables, X with I categories and Y with J categories.

• The IJ possible combinations of outcomes are displayed in a rectangular table having I rows for the categories of X and J columns for the categories of Y.

• A table of this form, in which the cells contain frequency counts of outcomes, is called a contingency table or cross-classification table.
Con…

• A contingency table in which n subjects are sampled and cross-classified according to their outcome (X, Y) is displayed below.

          Y = 1   Y = 2   ……..   Y = J   Total
X = 1     n11     n12     ……..   n1J     n1+
X = 2     n21     n22     ……..   n2J     n2+
.         .       .              .       .
X = I     nI1     nI2     ……..   nIJ     nI+
Total     n+1     n+2     ……..   n+J     n
Con…

• A contingency table that cross-classifies two variables is called a two-


way table.

• A contingency table that cross-classifies three variables is called a


three-way table
Probability Structure for Contingency Tables

• Probabilities for contingency tables can be of three types (joint, marginal and conditional).

Joint probabilities

• Let π_ij = P(X = i, Y = j) denote the probability that (X, Y) falls in the cell in row i and column j.

• The probability π_ij is called a joint probability

• Where Σi Σj π_ij = 1

• We estimate it by the sample proportion, p_ij = n_ij/n

Con…

Marginal probabilities

• The row and column totals obtained by summing the joint probabilities are called the marginal probabilities/distributions.

• The marginal distribution of X is denoted by π_i+ = Σj π_ij

• The marginal distribution of Y is denoted by π_+j = Σi π_ij

Con…

• Where Σi π_i+ = Σj π_+j = 1
Conditional probabilities

• In most contingency tables, one variable is a response variable (Y) and the other is an explanatory variable (X).

• A distribution consisting of conditional probabilities for Y given the levels of X is called a conditional distribution.

• For a fixed category i of X, this probability distribution is used to study how the distribution of Y changes as the category of X changes.

Con…

• Given that a subject is classified in row i of X, π_{j|i} denotes the probability of classification in column j of Y, j = 1, 2, 3, …., J

• Thus, the conditional probabilities are given by

π_{j|i} = π_ij/π_i+ and Σj π_{j|i} = 1
Con…

Example
                       Belief in Afterlife
Gender     Yes         No          Total
Female     n11 = 509   n12 = 116   n1+ = 625
Male       n21 = 398   n22 = 104   n2+ = 502
Total      n+1 = 907   n+2 = 220   n = 1127

Compute:
a. Joint probabilities
b. Marginal probabilities
c. Conditional probabilities
Con…
Soln.

a. p11 = 509/1127 = 0.452
   p12 = 116/1127 = 0.103
   p21 = 398/1127 = 0.353
   p22 = 104/1127 = 0.092

b. The marginal probabilities of X are
   p1+ = p11 + p12 = 0.452 + 0.103 = 0.555
   p2+ = p21 + p22 = 0.353 + 0.092 = 0.445
   The marginal probabilities of Y are
   p+1 = p11 + p21 = 0.452 + 0.353 = 0.805
   p+2 = p12 + p22 = 0.103 + 0.092 = 0.195
Con…

c. The conditional probabilities of Y given X are

p_{1|1} = n11/n1+ = 509/625 = 0.814

p_{2|1} = n12/n1+ = 116/625 = 0.186

p_{1|2} = n21/n2+ = 398/502 = 0.793

p_{2|2} = n22/n2+ = 104/502 = 0.207
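All three types of sample proportions for this table can be computed with a short Python sketch:

```python
# Belief-in-afterlife table (rows: female, male; columns: yes, no).
counts = [[509, 116],
          [398, 104]]
n = sum(sum(row) for row in counts)                   # 1127

joint = [[nij / n for nij in row] for row in counts]  # p_ij
row_marg = [sum(row) / n for row in counts]           # p_i+
col_marg = [sum(col) / n for col in zip(*counts)]     # p_+j

# Conditional distribution of Y within each row: p_{j|i} = n_ij / n_i+
cond = [[nij / sum(row) for nij in row] for row in counts]

print(round(cond[0][0], 3), round(cond[1][0], 3))     # 0.814 0.793
```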
Con…

Independence

• Two categorical variables are said to be statistically independent if the conditional distributions of Y are identical at each level of X.

• That is, π_{j|1} = π_{j|2} = …. = π_{j|I} = π_+j for j = 1, 2, …, J, or equivalently

π_ij = π_i+ × π_+j for all i and j

Example: Consider the above example and determine whether belief in afterlife is independent of gender or not.
Con…

Soln.

p_{1|1} ≠ p_{1|2} ≠ p+1 ⇨ 0.814 ≠ 0.793 ≠ 0.805 and

p_{2|1} ≠ p_{2|2} ≠ p+2 ⇨ 0.186 ≠ 0.207 ≠ 0.195

This implies that belief in afterlife is not independent of gender; that is, belief in afterlife is dependent on gender.
Comparing Proportions in two by two tables

• Suppose the two categories of Y are success and failure.

• Let π1 and π2 be the probabilities of success in row 1 and row 2 respectively.

• The difference in probabilities, π1 − π2, compares the success probabilities in the two groups.

• Let p1 and p2 be the sample proportions of successes in row 1 and row 2 respectively.

• The sample proportion difference, p1 − p2, estimates π1 − π2.

Con…

• The estimated standard error of p1 − p2 is

S.E.(p1 − p2) = sqrt(p1(1 − p1)/n1 + p2(1 − p2)/n2)

Example: To find out whether regular intake of aspirin reduces mortality from cardiovascular disease, one group of people was given aspirin and a second group was given a placebo, with the results given below.

                 Myocardial infarction
Group      Yes     No       Total
Placebo    189     10845    11034
Aspirin    104     10933    11037

Con…

• Then construct a 95% CI for π1 − π2 and interpret it.

Soln.

p1 = 189/11034 = 0.0171

p2 = 104/11037 = 0.0094

⇨ p1 − p2 = 0.0171 − 0.0094 = 0.0077

S.E.(p1 − p2) = sqrt(0.0171 × 0.9829/11034 + 0.0094 × 0.9906/11037) = 0.00153

A 95% CI is (p1 − p2) ± 1.96 × S.E.(p1 − p2)

= 0.0077 ± 1.96 × 0.00153

= (0.0077 ± 0.003)

= (0.0047, 0.0107)
Con…

Interpretation: Since the interval contains only positive values, we can conclude that π1 − π2 > 0, that is, π1 > π2. This indicates that taking aspirin appears to result in a diminished risk of heart attack.
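A quick standard-library check of this interval:

```python
# 95% CI for pi1 - pi2 in the aspirin study.
from math import sqrt

y1, n1 = 189, 11034      # placebo: MI yes, group total
y2, n2 = 104, 11037      # aspirin: MI yes, group total

p1, p2 = y1 / n1, y2 / n2
diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (diff - 1.96 * se, diff + 1.96 * se)    # about (0.0047, 0.0107)

print(round(diff, 4), tuple(round(v, 4) for v in ci))
```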
Odds Ratio (OR)

• The OR is the only one of the three association measures that is appropriate for cross-sectional, prospective, and retrospective study designs.

• The OR is the ratio of two odds.

• The odds of an event are the probability of the event occurring divided by the probability of the event not occurring.

• Odds are non-negative values.

• Let π be the probability of success; the odds of success are defined by

O = π/(1 − π)
Con…

• A success is more likely than a failure when the odds are greater than one.

• A success is less likely than a failure when the odds are less than one.

• For a 2 x 2 contingency table, let O1 and O2 be the odds of success in row 1 and row 2 respectively.

• Then, the odds ratio, denoted by θ, is given by

θ = O1/O2 = [π1/(1 − π1)]/[π2/(1 − π2)], estimated by θ̂ = (n11 × n22)/(n12 × n21)

• Because of this expression, θ is sometimes called the cross-product ratio.

Con…
• The odds ratio compares the chance of Y = 1 at two levels of X

- where chances are in terms of odds rather than probabilities/risks.

• θ is not symmetric about 1

e.g. θ = 4 and θ = 0.25 represent equally strong associations between X and Y, but in opposite directions.

• θ is invariant to interchange of the row variable and column variable

⇨ the explanatory/response variable distinction does not affect θ

Example: Find and interpret the OR from the aspirin example above.


Con…
Soln. θ̂ = (n11 × n22)/(n12 × n21) = (189 × 10933)/(10845 × 104) = 1.832, or

For row 1, we have

p1 = 189/11034 = 0.0171 and O1 = p1/(1 − p1) = 0.0171/0.9829 = 0.0174

For row 2, we have

p2 = 104/11037 = 0.0094 and O2 = p2/(1 − p2) = 0.0094/0.9906 = 0.0095

θ̂ = O1/O2 = 0.0174/0.0095 = 1.832

Interpretation: The odds of MI for subjects in the placebo group are 83% higher than for subjects in the aspirin group.
Properties of Odds Ratio

• Odds ratio of one indicates independence

• Values for θ far from 1 indicates stronger level of association

• If the order of rows or columns is reversed (but not both), the new
value of θ is the inverse of the original value.

Example: From the aspirin use data, if we exchange the rows, the odds ratio becomes

θ̂ = (104 × 10845)/(10933 × 189) = 0.546
Inference for Odds Ratios and Log Odds Ratios
• For small samples, the distribution of the sample odds ratio is highly
skewed.

• X and Y are independent if log (θ) = 0

Log odds ratio

• The log odds ratio is symmetric about zero, in the sense that reversing
rows or reversing columns changes its sign

Example: Consider the aspirin use data

θ̂ = 1.832 and 1/θ̂ = 0.546 ⇨ log(1.832) = 0.605 and log(0.546) = −0.605


Con…

• The sample log odds ratio, log(θ̂), has a less skewed distribution

- hence it can be approximated by a normal distribution.

• The asymptotic standard error of log(θ̂) is given by

ASE(log θ̂) = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)

• A 100(1 − α)% large-sample CI for log(θ) is given by

log(θ̂) ± z(α/2) × ASE(log θ̂)

Con…

• A 100(1 − α)% large-sample CI for θ is given by

exp{log(θ̂) ± z(α/2) × ASE(log θ̂)}

Example: Again consider the aspirin use data; find a 95% CI for θ.

Soln. θ̂ = 1.832 ⇨ log(θ̂) = log(1.832) = 0.605

ASE(log θ̂) = sqrt(1/189 + 1/10845 + 1/104 + 1/10933)

Con…

= 0.123

• A 95% CI for log(θ) equals

0.605 ± 1.96 × (0.123) = (0.605 ± 0.241)

= (0.365, 0.846)

• The corresponding 95% CI for θ becomes

exp(0.365, 0.846) = (e^0.365, e^0.846)

= (1.44, 2.33)
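The log-scale confidence interval can be verified with a short sketch:

```python
# Odds ratio and its 95% CI via the log scale, for the aspirin table.
from math import sqrt, log, exp

n11, n12 = 189, 10845    # placebo: MI yes / no
n21, n22 = 104, 10933    # aspirin: MI yes / no

theta = (n11 * n22) / (n12 * n21)             # about 1.832
ase = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)     # about 0.123
log_ci = (log(theta) - 1.96 * ase, log(theta) + 1.96 * ase)
ci = tuple(exp(v) for v in log_ci)            # about (1.44, 2.33)

print(round(theta, 3), tuple(round(v, 2) for v in ci))
```

Working on the log scale and exponentiating back is what keeps the interval inside (0, ∞) and accounts for the skewness of θ̂.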
Relationship between odds ratio and relative risk (RR)

• A relationship between odds ratio and relative risk is useful

• Because for some data sets direct estimation of relative risk is not
possible

• Thus, one can estimate the odds ratio to approximate the relative risk

• In 2 x 2 contingency tables, the relative risk is the ratio of the success


probabilities of the two rows

• That is, RR = π1/π2

Con…

• The relationship between the odds ratio and RR is given by

Odds ratio = Relative risk × [(1 − π2)/(1 − π1)]

• When both π1 and π2 are close to zero, the odds ratio and relative risk take similar values.

• In this case we can use the odds ratio as an estimate of the relative risk.
Chi-squared tests of independence

• This test is applied when we have two categorical variables from a


single population

• It is used to determine whether there is a significant association


between the two variables

• A test of independence tests the null hypothesis that there is no


association between the two variables in a contingency table where
the data is all drawn from one population.

• If we have two categorical variables, X and Y, the hypotheses are


Con…

H0: X and Y are independent versus Ha: X and Y are not independent.

When do we use the chi-square test of independence?

- Sample data are randomly selected / a simple random sampling method is used

- The expected frequency count for each cell of the table is at least 5

- Individual observations must be independent

- Only one population is required

- No distribution requirement


Con…

• In a contingency table we have

- Cell counts, n_ij

- Expected cell frequencies under independence, μ_ij = E(n_ij) = n × π_i+ × π_+j

- Estimated expected frequencies,

μ̂_ij = n × p_i+ × p_+j = n × (n_i+/n) × (n_+j/n) = (n_i+ × n_+j)/n

• The two statistics for the test of independence are the Pearson chi-squared statistic and the likelihood-ratio statistic.
Con…

Pearson Chi-Squared Statistic

• It was proposed in 1900 by Karl Pearson, the British statistician.

• This statistic takes the minimum value of zero when all n_ij = μ̂_ij.

• The Pearson chi-squared statistic has a chi-squared distribution with degrees of freedom (I − 1)(J − 1).

• This statistic is given by

X² = Σ (n_ij − μ̂_ij)²/μ̂_ij
Con…

Likelihood-Ratio Statistic

• The likelihood-ratio chi-squared statistic is given by

G² = 2 Σ n_ij log(n_ij/μ̂_ij)

• G² also takes a minimum value of zero when all n_ij = μ̂_ij

• Both X² and G² have the same large-sample chi-squared distribution

• X² and G² usually take similar (though not identical) numerical values in large samples

• We usually reach the same conclusion in both cases


Con…
Example: Consider the cross-classification of party identification by gender below.

                     Party identification
Gender     Democrat   Independent   Republican   Total
Females    762        327           468          1557
Males      484        239           477          1200
Total      1246       566           945          2757

Test whether party identification is independent of gender using X² and G² and make a conclusion.
Con…

Soln.

We know that μ̂_ij = (n_i+ × n_+j)/n, then

μ̂11 = 1557 × 1246/2757 = 703.67
μ̂12 = 1557 × 566/2757 = 319.65
μ̂13 = 1557 × 945/2757 = 533.68
μ̂21 = 1200 × 1246/2757 = 542.33
μ̂22 = 1200 × 566/2757 = 246.35
μ̂23 = 1200 × 945/2757 = 411.32

Then, the Pearson chi-squared statistic and likelihood-ratio statistic are computed as follows.
Con…

X² = (762 − 703.67)²/703.67 + (327 − 319.65)²/319.65 + (468 − 533.68)²/533.68 + (484 − 542.33)²/542.33 + (239 − 246.35)²/246.35 + (477 − 411.32)²/411.32

= 4.835 + 0.169 + 8.083 + 6.274 + 0.219 + 10.488

= 30.068 ~ χ²(2), where χ²(2, 0.05) = 5.99

Since 30.068 > 5.99, we reject the null hypothesis of independence.


Con…
Similarly,

G² = 2{762 log(762/703.67) + 327 log(327/319.65) + 468 log(468/533.68) + 484 log(484/542.33) + 239 log(239/246.35) + 477 log(477/411.32)}

= 2(60.683 + 7.434 − 61.462 − 55.074 − 7.239 + 70.665) = 2(15.007) = 30.014

Since 30.014 > 5.99, we reject the null hypothesis of independence and we can conclude that party identification is dependent on gender.
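Both statistics for this table can be computed directly (a standard-library sketch; tiny differences from the slide values are rounding):

```python
# Pearson X^2 and likelihood-ratio G^2 for the party-identification table.
from math import log

obs = [[762, 327, 468],
       [484, 239, 477]]
row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
n = sum(row_tot)

# Estimated expected frequencies under independence: (row total * col total)/n
expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]

x2 = sum((obs[i][j] - expected[i][j])**2 / expected[i][j]
         for i in range(2) for j in range(3))
g2 = 2 * sum(obs[i][j] * log(obs[i][j] / expected[i][j])
             for i in range(2) for j in range(3))
df = (2 - 1) * (3 - 1)

print(round(x2, 2), round(g2, 2), df)
```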
Exact inference for small samples

• Both the Pearson chi-squared and likelihood-ratio statistics work well when the contingency table has large observed counts (large sample size).

• When the sample is small, the distributions of X² and G² are not well approximated by the chi-squared distribution.

• In such situations we can perform inference using exact distributions/tests.

• The most common test is Fisher's exact test.

• But the p-values based on exact tests are conservative, that is, larger than they really are.
Associations in three way tables
Partial tables

• Two-way cross-sectional slices of the three-way table cross-classify X and Y at separate levels of Z.

• These cross-sections are called partial tables.

• They display the XY relationship at fixed levels of Z, hence showing the effect of X on Y while controlling for Z.

• The partial tables remove the effect of Z by holding its value constant.

• The associations in partial tables are called conditional associations, because they refer to the effect of X on Y conditional on fixing Z at some level.
Con…

Marginal tables

• The two-way contingency table that results from combining the


partial tables is called the XY marginal table

• The marginal table contains no information about Z, so rather than


controlling Z, it ignores Z.

• Conditional associations in partial tables can be quite different from


associations in marginal tables.
Con…
Conditional Versus Marginal Associations

• Typically, odds ratios are used to describe the marginal and conditional associations in a 3-way table.

Conditional odds ratio

• Consider a 2 x 2 x K table, where K denotes the number of levels of a categorical confounding variable Z.

• Then, we describe the conditional odds ratio for partial table k as

θ_XY(k) = (μ_11k × μ_22k)/(μ_12k × μ_21k), where μ_ijk denotes the expected cell frequency in cell (i, j, k)


Con…

• θ_XY(k) above describes the conditional X-Y association in partial table k (at the kth level of variable Z).

Marginal odds ratio

• The XY marginal odds ratio is defined by θ_XY = (μ_11+ × μ_22+)/(μ_12+ × μ_21+)

• The expected frequencies in the XY marginal table are given by

μ_ij+ = Σk μ_ijk
Con…

Example: Death penalty verdict by defendant's race and victim's race

• The variables are Y = death penalty verdict (yes, no), X = race of defendant (white, black), Z = race of victim (white, black).

• Here, we want to study the effect of defendant's race on the death penalty verdict, treating victim's race as a control variable.

• The table below shows a 2 x 2 partial table relating defendant's race and death penalty verdict at each level of victim's race.
Con…

                                   Death penalty
Victim's race    Defendant's race   Yes   No    Percentage yes
White            White              53    414   11.3
                 Black              11    37    22.9
Black            White              0     16    0.0
                 Black              4     139   2.8
Total            White              53    430   11.0
                 Black              15    176   7.9
Con…

a. Construct the partial tables needed to study the conditional association b/n defendant's race and death penalty verdict.

b. Find and interpret the sample conditional odds ratios, adding 0.5 to each cell to reduce the impact of 0 cell counts.

c. Compute and interpret the sample marginal odds ratio b/n defendant's race and the death penalty verdict.
Con…
Soln. a) When the victim's race is white, we have

Death penalty verdict
Defendant's race   Yes   No
White              53    414
Black              11    37

When the victim's race is black, we have

Death penalty verdict
Defendant's race   Yes   No
White              0     16
Black              4     139
Con…

b). When the victim's race is white

θ̂_XY(1) = (53 × 37)/(414 × 11) = 1961/4554 = 0.4306

Interpretation: Since the odds ratio for white victims is less than 1, a white defendant is less likely to get a death penalty than a black defendant when the victim's race is white. Or:

The sample odds of receiving a death penalty verdict for white defendants were 43% of the sample odds of getting a death penalty for black defendants, holding the victim's race constant.
Con…
b). When the victim's race is black (adding 0.5 to each cell because of the zero count)

θ̂_XY(2) = (0.5 × 139.5)/(16.5 × 4.5) = 69.75/74.25 = 0.94

• Interpretation: Controlling for the victim's race, the odds of getting a death penalty verdict were lower for white defendants than for black defendants.

• That is, the odds of receiving a death penalty are 6% lower for white defendants than for black defendants, keeping the victim's race constant.
Con…

c). To get the sample marginal odds ratio θ̂_XY, we add the cell counts over the categories of the victim's race and get the following data.

Death penalty verdict
Defendant's race   Yes   No
White              53    430
Black              15    176

Then, the sample marginal odds ratio becomes

θ̂_XY = (53 × 176)/(430 × 15) = 9328/6450 = 1.45
Con…

Interpretation: Ignoring the victim's race, the sample odds of a death penalty are 45% higher for white defendants than for black defendants.

Or: white defendants are 45% more likely (in terms of odds) to receive a death penalty than black defendants, ignoring the victim's race.
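The conditional and marginal odds ratios for this example can be computed together; note how the direction of the association reverses between the partial and marginal tables (an instance of Simpson's paradox):

```python
# Conditional vs marginal odds ratios for the death-penalty data.
# Each partial table is [[white def yes, no], [black def yes, no]].
white_victims = [[53, 414], [11, 37]]
black_victims = [[0, 16], [4, 139]]

def odds_ratio(t, add_half=False):
    """Cross-product ratio; optionally add 0.5 to every cell (zero counts)."""
    c = 0.5 if add_half else 0.0
    return ((t[0][0] + c) * (t[1][1] + c)) / ((t[0][1] + c) * (t[1][0] + c))

theta1 = odds_ratio(white_victims)                 # about 0.43
theta2 = odds_ratio(black_victims, add_half=True)  # about 0.94

# Marginal table: sum the partial tables over victim's race.
marginal = [[53 + 0, 414 + 16], [11 + 4, 37 + 139]]
theta_marg = odds_ratio(marginal)                  # about 1.45, sign flips

print(round(theta1, 2), round(theta2, 2), round(theta_marg, 2))
```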
Con…
Conditional versus marginal independence

• In a partial table, X and Y are conditionally independent if and only if the conditional odds ratio of X and Y is one at each level of Z.

• i.e. θ_XY(1) = θ_XY(2) = ………. = θ_XY(K) = 1

• Conditional independence of X and Y, given Z, does not imply marginal independence of X and Y.

• X and Y are said to be marginally independent if and only if the marginal odds ratio, θ_XY, is one.
Con…

Example: Consider the death penalty data given above; determine whether the death penalty and defendant's race are

a. Conditionally independent

b. Marginally independent

Soln. a) Previously we found that the conditional odds ratios are

θ̂_XY(1) = 0.43 and θ̂_XY(2) = 0.94

Since θ̂_XY(1) ≠ θ̂_XY(2) ≠ 1, the defendant's race (X) and death penalty (Y) are not conditionally independent at each level of the victim's race (Z).

b) The sample marginal odds ratio is θ̂_XY = 1.45 ≠ 1, so the death penalty and defendant's race are not marginally independent either.


Con…
Homogeneous association

• There is a homogeneous X-Y association in a 2 x 2 x K table if and only if the conditional odds ratios between X and Y are identical at each level of Z.

• That is, θ_XY(1) = θ_XY(2) = ………. = θ_XY(K)

• Conditional independence of X and Y is a special case of homogeneous association in which each conditional odds ratio equals one.
Con…
Chi-square test of homogeneity

• This test determines whether two or more populations (or subgroups of a population) have the same distribution of a single categorical variable.

• The test of homogeneity extends the test for a difference in two population proportions.

• We use the test of homogeneity if the response variable has two or more categories and we wish to compare two or more populations (or subgroups).
Con…

Note: Homogeneous means the same in structure or composition.

- This test gets its name from the null hypothesis, where we claim that the
distribution of the responses are the same (homogeneous) across groups.

• The null hypothesis for the test of homogeneity can be:

- Different populations have the same proportions of some characteristic (p1 = p2 = … = pk), or

- The distribution of the categorical variable is the same for all populations (or subgroups), or
Con…

- The proportion with a given response is the same in all of the populations, and this is true for all response categories.

• The alternative hypothesis can be

Ha: At least one proportion of the response variable is not the same across the groups / the distribution of the response variable differs across the subgroups.
Con…

Requirements for the test of homogeneity

- Multiple populations that the data are drawn from

- One categorical response variable

- Sample data are randomly selected

- The expected frequency count for each cell of the table is at least 5

- Individual observations must be independent

- No distribution requirement


Example: The National Collegiate Athletic Association (NCAA) published a report
called “Steroid Use” of College Student-Athletes. NCAA is divided into three
divisions.

Division I schools: large universities with large athletic budgets(revenue from the
games); they must offer athletic scholarships.

Division II schools: Smaller public universities and many private institutions; have
much smaller budgets(solely from the college).

Division III schools: Colleges and universities that treat athletics as an


extracurricular activity for students, instead of a source of revenue; these
institutions do not offer athletic scholarships.
Con…
• The data is given below
Steroid use
Yes No Totals
Division I 103 8440 8543
Division II 52 4289 4341
Division III 65 6428 6493
Totals 220 19157 19377

Q. Does steroid use by student athletes differ for the three NCAA divisions?
Con…
Soln.

- The categorical response variable is steroid use (yes or no).

- The populations are the three NCAA divisions.

- The null and alternative hypotheses are:

H0: The proportion of athletes using steroids is the same in each of the three NCAA divisions.

Ha: The proportion of athletes using steroids is not the same in each of the three NCAA divisions.
Con…

- For the response “yes”, the proportion of steroid use in combined samples is

= 220/19377 = 0.01135

- Expected count of steroid users for Division I is

= 0.01135*8543 = 96.96

- Expected count of steroid users for Division II is

= 0.01135*4341 = 49.27

- Expected count of steroid users for Division III is

= 0.01135*6493 = 73.70
Con…

- For the response “No”, the proportion of non-steroid use in combined samples is

= 19157/19377 = 0.98865

- Expected count of non-steroid users for Division I is

= 0.98865*8543 = 8446.04

- Expected count of non-steroid users for Division II is

= 0.98865*4341 = 4291.73

- Expected count of non-steroid users for Division III is

= 0.98865*6493 = 6419.30
- We calculate the chi-square test statistic similarly as in the test of independence

⇨ X² = (103 − 96.96)²/96.96 + (52 − 49.27)²/49.27 + (65 − 73.70)²/73.70 + (8440 − 8446.04)²/8446.04 + (4289 − 4291.73)²/4291.73 + (6428 − 6419.30)²/6419.30

= 0.3763 + 0.1513 + 1.0270 + 0.0043 + 0.0017 + 0.0118 = 1.5724
Con…
- For chi-square tests based on two-way tables (both the test of independence and the test of homogeneity), the degrees of freedom are (r − 1)(c − 1)

- Here, d.f. = (3 − 1)*(2 − 1) = 2

- The tabulated value from the chi-square distribution is 5.991

Decision: Since the calculated value (1.5724) is less than the tabulated value, we fail to reject H0

Conclusion: The data does not provide strong enough evidence to conclude
that steroid use differs in the three NCAA divisions
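The same computation as a standard-library sketch (small differences from the slide values come from rounding the expected counts):

```python
# Chi-square test of homogeneity for the NCAA steroid-use data.
obs = [[103, 8440],   # Division I:   steroid yes / no
       [52, 4289],    # Division II
       [65, 6428]]    # Division III
row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
n = sum(row_tot)

# Expected counts under homogeneity: (row total * column total)/n
expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]
x2 = sum((obs[i][j] - expected[i][j])**2 / expected[i][j]
         for i in range(3) for j in range(2))
df = (3 - 1) * (2 - 1)

print(round(x2, 2), df)   # small X^2 with df = 2: fail to reject H0
```

Mechanically this is the same statistic as the test of independence; only the sampling design and hypothesis wording differ.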
Chapter 3: Logistic regression model
Assumptions of logistic regression model

• Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms.

• In particular:

- Linearity: a linear relationship b/n the dependent and independent variables

- Normality: the error terms (residuals) are normally distributed

- Homoscedasticity: constant variance of residuals

- Measurement level: dependent variable measured on an interval or ratio scale


Con…

However, logistic regression still makes some other assumptions:

• It assumes that the independent variables are linearly related to the log odds.

• The dependent variable should have mutually exclusive and exhaustive categories.

• Binary logistic regression requires the dependent variable to be dichotomous.

• Ordered logistic regression requires the dependent variable to be ordered.

• It also assumes that there is no multicollinearity among the independent variables.


Con…
• Logistic regression typically requires a larger sample size than linear regression

⇨ because maximum likelihood estimation is less efficient than ordinary least squares in small samples.

- OLS needs 5 cases per independent variable in the analysis

- ML needs at least 10 cases per independent variable

- Some statisticians recommend at least 30 cases for each parameter to be estimated.
Con…
• Logistic regression model (LRM) is one type of the statistical models.

• LRM is used when the dependent variable is categorical and the independent
variables are of any type.

• LRM is a generalized linear model (GLM)

• Components of GLMs

- Random component: refers to the outcome

-Systematic component: design matrix multiplied by the parameter vector.


-Link function: it links the systematic component to the random component.
Con…

• Generalized linear models (GLMs) extend the linear modeling


framework to variables that are not normally distributed.

• LRM is often called logit model as the link in this GLM is the logit link.

• LRM calculates changes in the log odds of the dependent variable, not
changes in the dependent variable itself as OLS regression does

• A binary logistic regression model (BLRM) is used when the dependent variable is dichotomous (binary) and the independent variables are of any type.
con…

• Simple BLRM is used with one dichotomous dependent variable and


one independent variable

• LRM applies the maximum likelihood estimation method after
transforming the dependent variable into a logit variable (the natural
log of the odds of the event occurring).

• Let us start with the simple binary logistic regression model with one
independent variable, x.

• Let π(x) denote the probability of success, P(Y = 1 | x), as a function of x.

Con…
• Then the simple binary logistic regression model using logit link function is
given by

logit(π(x)) = log[π(x)/(1 − π(x))] = α + βx

- Where, α is an intercept and β a slope

• From the above model, the probability of success, π(x), can be given by

π(x) = e^(α + βx)/(1 + e^(α + βx)) (show it)

• If β > 0, π(x) increases from 0 to 1 as x increases, and it decreases if β < 0.

• If β = 0, π(x) is constant for all x values and Y is unrelated to X.


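The shape of the logistic response function described above can be checked numerically. The following is an illustrative Python sketch (not part of the original R material; α and β values are arbitrary):

```python
import math

def logistic(alpha, beta, x):
    """Probability of success pi(x) = exp(a + b*x) / (1 + exp(a + b*x))."""
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))

# With beta > 0 the curve rises from 0 toward 1 as x grows;
# with beta = 0 it is constant at exp(alpha)/(1 + exp(alpha)), unrelated to x.
print(logistic(0.0, 1.0, -5))   # close to 0
print(logistic(0.0, 1.0, 5))    # close to 1
print(logistic(0.5, 0.0, 123))  # constant, independent of x
```

Note that the output always lies strictly between 0 and 1, which is why the logit link is suitable for modeling probabilities.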
Con…
Con…
• Now we see how the odds of success changes when we increase x by 1 unit.

Soln.

⇨ x → x + 1

For x, logit(π(x)) = log(odds(x)) = α + βx ⇨ odds(x) = e^(α + βx)

For x + 1, logit(π(x + 1)) = log(odds(x + 1)) = α + βx + β

⇨ odds(x + 1) = e^(α + βx + β)
Con…

θ = odds(x + 1)/odds(x)

= e^(α + βx + β)/e^(α + βx)

= e^β
Con…

⇨ when we increase x by 1 unit, the odds of an event
occurring are multiplied by a factor of e^β, regardless of the value of x.
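As a numerical check of this result, here is a Python sketch (with made-up α and β values, purely illustrative) showing that the ratio odds(x + 1)/odds(x) equals e^β at every x:

```python
import math

def odds(alpha, beta, x):
    """odds(x) = pi(x) / (1 - pi(x)), which simplifies to exp(alpha + beta*x)."""
    p = math.exp(alpha + beta * x) / (1 + math.exp(alpha + beta * x))
    return p / (1 - p)

alpha, beta = -1.3, 0.42   # arbitrary illustrative values
for x in (0.0, 2.5, 10.0):
    ratio = odds(alpha, beta, x + 1) / odds(alpha, beta, x)
    print(round(ratio, 6), round(math.exp(beta), 6))  # identical at every x
```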
• Similarly, the multiple logistic regression model is given by

logit(π(x)) = log[π(x)/(1 − π(x))] = α + β1x1 + β2x2 + ……… + βpxp

• The probability of success is also given by

π(x) = e^(α + β1x1 + … + βpxp)/(1 + e^(α + β1x1 + … + βpxp))
Con…

Logistic regression model with categorical predictors

• Some or all of the predictors can be categorical, rather than quantitative.

• Categorical predictors also called factors

• Suppose a binary response Y has two binary predictors, X and Z.

• Let x and z each take values 0 and 1 to represent the two categories of each explanatory
variable

• The model for P(Y = 1 | x, z) = π(x, z) becomes

logit[π(x, z)] = α + β1x + β2z

• The variables x and z in the model are called indicator/dummy variables.


Con…
• Logits implied by indicator variables in the model are given in the table below

logit[π(x, z)] = α + β1x + β2z

x z logit
0 0 α
1 0 α + β1
0 1 α + β2
1 1 α + β1 + β2

• This model assumes an absence of interaction effect.

• The effect of one factor is the same at each category of the other factor.

• At a fixed category z of Z, the effect on the logit of changing from x = 0 to x = 1 is β1.


Con…

• At z = 0

x = 0: logit[π(0, 0)] = log(odds1) = α ⇨ odds1 = e^α ………….(1)

x = 1: logit[π(1, 0)] = log(odds2) = α + β1 ⇨ odds2 = e^(α + β1) ………(2)

(2) − (1) we get

log(odds2/odds1) = α + β1 − α = β1

⇨ θ = e^(β1)
Con…
• At z = 1

x = 0: logit[π(0, 1)] = log(odds3) = α + β2 ⇨ odds3 = e^(α + β2) ………….(3)

x = 1: logit[π(1, 1)] = log(odds4) = α + β1 + β2 ⇨ odds4 = e^(α + β1 + β2) ……(4)

(4) − (3) we get

log(odds4/odds3) = α + β1 + β2 − (α + β2) = β1

⇨ θ = e^(β1)
Con…

• This difference between two logits equals the difference of log odds.
Equivalently, that difference equals the log of the odds ratio between
X and Y , at that category of Z.

• Controlling for Z, the odds of “success” at x = 1 equal e^(β1) times the odds
of success at x = 0.

• This conditional odds ratio is the same at each category of Z


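A small Python sketch (with hypothetical coefficient values, for illustration only) confirming that in the no-interaction model the conditional X–Y odds ratio is e^(β1) at both categories of Z:

```python
import math

def odds(alpha, b1, b2, x, z):
    # logit(pi) = alpha + b1*x + b2*z  =>  odds = exp(alpha + b1*x + b2*z)
    return math.exp(alpha + b1 * x + b2 * z)

alpha, b1, b2 = -0.7, 1.1, 0.4   # hypothetical values
for z in (0, 1):
    theta = odds(alpha, b1, b2, 1, z) / odds(alpha, b1, b2, 0, z)
    print(z, round(theta, 6))  # same value, e^1.1, at z = 0 and at z = 1
```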
Con…
• R syntax for binary logistic regression model
• The stats package is part of base R, so it does not need to be installed separately.
• This package provides the glm() function to fit the logistic regression model.
• Let Y be a binary response with three explanatory variables x1, x2 and x3
> library(stats)
> model = glm(Y ~ x1 + x2 + x3, family = "binomial")
> summary(model)
Con…
• We can use the confint() function to obtain CIs for the coefficient estimates.

> confint(model) # to get confidence intervals

• We can also extract the odds ratios if needed

> exp(coef(model)) # to extract odds ratios only

> exp(cbind(OR = coef(model), confint(model))) # odds ratios with 95% CIs

• We can use the wald.test() function from the aod package to test for an overall effect of a categorical variable

> install.packages("aod")

> library(aod)
Con…
Example:
>model1=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toilet+Birth
_type+Mother_education_level,family="binomial")

• Here we can test the overall effect of Mother_education_level

> wald.test(b = coef(model1), Sigma = vcov(model1), Terms = 6:7)

- b supplies the coefficients

- Sigma supplies the variance-covariance matrix of the coefficient estimates

- Terms tells R which terms in the model are to be tested

- In this case, terms 6 and 7 are the two indicator terms for the three levels of education.
Con…

• We can predict the probabilities in R using the predict() function

> predict(model1, type = "response")

Example:

-Dependent variable: Contraceptive use (no, yes)

- Mother_age: continuous

- Family_size: discrete

- Availability_of_toilet (no, yes)

- Birth_type (single birth, multiple birth)

- Mother_education_level (no education, primary, secondary and above)


Con…

>library(stats)

>model1=glm(Contraceptive_use~Mother_age+Family_size+Availability
_of_toilet+Birth_type+Mother_education_level,family="binomial")

>summary(model1)
Con…

Coefficients: Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.278169 0.104990 -2.649 0.00806 **

Mother_age -0.031781 0.002951 -10.768 < 2e-16 ***

Family_size -0.062374 0.010539 -5.919 3.25e-09 ***

Availability_of_toilet Yes 1.038907 0.050464 20.587 < 2e-16 ***

Birth_type Multiple -0.098806 0.270028 -0.366 0.71443

Mother_education_level Primary -0.014731 0.056558 -0.260 0.79452

Mother_education_level Secondary and above 0.138277 0.081938 1.688 0.09149 .


Con….
Interpretation of model coefficients

• β̂(Mother_age) = -0.031781: the odds of using a contraceptive method
decrease by a factor of exp(-0.031781) = 0.9687 for each one-year increase
in a mother’s age.

• β̂(Family_size) = -0.062374: as family size increases by one unit, the
odds of using a contraceptive method decrease by a factor of 0.9395.

• β̂(Availability_of_toilet Yes) = 1.038907: mothers from a household having a
toilet facility are exp(1.038907) = 2.826 times more likely to use a
contraceptive method than mothers from a household having no toilet facility.
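The multiplicative factors quoted in these interpretations come from exponentiating the fitted coefficients. A quick Python check using the estimates from the summary output above:

```python
import math

# Fitted coefficients from the summary(model1) output
coefs = {
    "Mother_age": -0.031781,
    "Family_size": -0.062374,
    "Availability_of_toilet_Yes": 1.038907,
}
for name, b in coefs.items():
    print(name, round(math.exp(b), 4))
# Mother_age                  0.9687  (odds multiplied by 0.9687 per extra year)
# Family_size                 0.9395
# Availability_of_toilet_Yes  2.8261  (~2.826)
```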
Chapter 4: Building and Applying Logistic Regression Models

• Having learned the basics of logistic regression, we now study issues


relating to building a model with multiple predictors and checking its fit

• Mainly, we consider

- Strategies for model selection

- Variable selection

- Goodness-of-fit tests

- Model diagnostics
Strategies for model selection

i. Information criteria

• Information criterion tests are comparative in nature, with a lower value
indicating a better-fitting model.

• Two primary information criteria are:

- Akaike information criterion (AIC)

AIC = -2L + 2k

- Bayesian information criterion (BIC)

BIC = -2L + k log(n)
Con…

• Where, L is the maximized log-likelihood, k is the number of parameters, and n is
the number of observations.

• AIC is the most commonly used criterion; BIC, which is motivated by a Bayesian
argument, penalizes extra parameters more heavily as n grows.
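These formulas can be computed directly from a model's log-likelihood. An illustrative Python sketch: the numbers logLik ≈ -5804.8 and 4 parameters are taken from the lrtest output later in this chapter, and n = 10264 is inferred from the null degrees of freedom (10263 = n − 1) in the glm output:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2L + 2k."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2L + k*log(n)."""
    return -2 * loglik + k * math.log(n)

print(round(aic(-5804.8, 4), 1))        # 11617.6, close to AIC(model2) = 11617.51
print(round(bic(-5804.8, 4, 10264), 1)) # larger: BIC penalizes parameters more here
```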
Example: Compare the following models based on AIC

>model1=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toi
let+Birth_type+Mother_education_level,family="binomial")
Con…

>model2=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toil
et,family="binomial")

> AIC(model1)

[1] 11620.27

> AIC(model2)

[1] 11617.51

Since model2 has a smaller AIC value than model1, it is relatively the
better model.
Con…

ii. Likelihood ratio test (LRT)


• When one model is a special case of another, we can test the null hypothesis that
the simpler model is adequate against the alternative hypothesis that the more
complex model fits better.
• According to the alternative, at least one of the extra parameters in the more
complex model is nonzero
• Generally, the LRT is used to compare nested models

Example: Consider the following output and compare the two models using the
LRT
Con…

>library(lmtest)

>lrtest(model2,model1)

Model 1: Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet

Model 2: Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet +

Birth_type + Mother_education_level

#Df LogLik Df Chisq Pr(>Chisq)

1 4 -5804.8

2 7 -5803.1 3 3.2443 0.3555


Con…

• From the above output we observe that the p-value (0.3555) is not significant,
so we fail to reject H0 and conclude that the simpler model2 is adequate
relative to model1.
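The chi-square statistic and p-value in this output can be reproduced by hand. A Python sketch follows; the log-likelihoods are taken from the rounded lrtest printout, so the statistic only approximates the reported 3.2443, while the closed-form chi-square tail used below is exact for df = 3:

```python
import math

# LRT statistic: G^2 = 2 * (logLik of complex model - logLik of simple model)
ll_simple, ll_complex = -5804.8, -5803.1   # rounded values from the lrtest output
G2 = 2 * (ll_complex - ll_simple)
print(round(G2, 2))  # ~3.4, close to the reported 3.2443 (printed logLiks are rounded)

# For df = 3 the chi-square survival function has a closed form:
# P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
def chi2_sf_df3(x):
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

print(round(chi2_sf_df3(3.2443), 4))  # 0.3555, matching the reported Pr(>Chisq)
```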
Variable selection

• As the number of model predictors increases, it becomes more likely that some
ML model parameter estimates are infinite.

• Models with several predictors often suffer from multicollinearity – correlations


among predictors making it seem that no one variable is important when all the
others are in the model.

• A variable may seem to have little effect because it overlaps considerably with
other predictors in the model, itself being predicted well by the other predictors.

• Deleting such a redundant predictor can be helpful, for instance to reduce


standard errors of other estimated effects.
Con…

• The most common method for variable selection is the family of stepwise variable
selection algorithms.

• Such algorithms can select or delete predictors from a model in a stepwise


manner.

• To add or delete an explanatory variable into/from a model, we use different
criteria such as the p-value, the AIC value and simple correlation.

• Now, here we use the AIC value as the criterion to select the best subset of
predictor variables
Con…
Forward selection method

• This algorithm starts with the null/empty model and we add terms/variables
sequentially until further additions don’t improve the fit.

• That is, at each step the variable whose addition gives the smallest AIC (or
smallest p-value) is added to the model.

Backward elimination

• It begins with a complex model and sequentially removes terms

• At a given stage, it eliminates the term in the model whose removal gives the
largest p-value (equivalently, the smallest AIC).
Con…

• The process stops when any further deletion leads to a significantly poorer fit.

Example: Consider the following variables and fit a model using variable selection
methods.

- Contraceptive_use (dependent variable)

- Mother_age

-Family_size

-Availability_of_toilet

- Birth_type

-Mother_education_level
Con…

Soln. A). Forward selection method

>library(stats)

> fit.start=glm(Contraceptive_use~1,family="binomial")# to fit null model

>fit.all=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toil
et+Birth_type+Mother_education_level,family="binomial") # to fit full model

> step(fit.start,direction="forward",scope=formula(fit.all))# to apply forward


method
Con…
Start: AIC=12250.93
Contraceptive_use ~ 1

Df Deviance AIC
+ Availability_of_toilet 1 11798 11802
+ Mother_age 1 12137 12141
+ Family_size 1 12149 12153
<none> 12249 12251
+ Birth_type 1 12249 12253
+ Mother_education_level 2 12247 12253
Con…
Step: AIC=11801.73
Contraceptive_use ~ Availability_of_toilet

Df Deviance AIC
+ Mother_age 1 11645 11651
+ Family_size 1 11729 11735
<none> 11798 11802
+ Mother_education_level 2 11795 11803
+ Birth_type 1 11797 11803
Con…

Step: AIC=11651.2
Contraceptive_use ~ Availability_of_toilet + Mother_age

Df Deviance AIC
+ Family_size 1 11610 11618
<none> 11645 11651
+ Mother_education_level 2 11642 11652
+ Birth_type 1 11645 11653
Con…
Step: AIC=11617.51
Contraceptive_use ~ Availability_of_toilet + Mother_age + Family_size

Df Deviance AIC
<none> 11610 11618
+ Mother_education_level 2 11606 11618
+ Birth_type 1 11609 11619

Call: glm(formula = Contraceptive_use ~ Availability_of_toilet +


Mother_age +
Family_size, family = "binomial")
Con…

Coefficients:

(Intercept) Availability_of_toilet Yes (if any method) Mother_age Family_size

-0.26979 1.03718 -0.03177 -0.06230

Degrees of Freedom: 10263 Total (i.e. Null); 10260 Residual

Null Deviance: 12250

Residual Deviance: 11610 AIC: 11620


Con…
• Then, the best subset of variables are

- Availability_of_toilet

-Mother_age

-Family_size

• Then, the final model becomes

logit(π̂) = -0.26979 + 1.03718(Availability_of_toilet Yes) - 0.03177(Mother_age) - 0.06230(Family_size)

• Where, “Availability_of_toilet Yes” is a dummy variable equal to 1 if the
household has a toilet facility and 0 otherwise


Con…

B). Backward elimination

>fit.all=glm(Contraceptive_use~Mother_age+Family_size+Availability_of_toil
et+Birth_type+Mother_education_level,family="binomial")# to fit the full
model

> step(fit.all,direction="backward") # to apply backward elimination method


Con…
Start: AIC=11620.27
Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet +
Birth_type + Mother_education_level

Df Deviance AIC
- Birth_type 1 11606 11618
- Mother_education_level 2 11609 11619
<none> 11606 11620
- Family_size 1 11642 11654
- Mother_age 1 11725 11737
- Availability_of_toilet 1 12069 12081
Con…
Step: AIC=11618.41
Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet +
Mother_education_level

Df Deviance AIC
- Mother_education_level 2 11610 11618
<none> 11606 11618
- Family_size 1 11642 11652
- Mother_age 1 11726 11736
- Availability_of_toilet 1 12069 12079
Con…
Step: AIC=11617.51
Contraceptive_use ~ Mother_age + Family_size + Availability_of_toilet
Df Deviance AIC
<none> 11610 11618
- Family_size 1 11645 11651
- Mother_age 1 11729 11735
- Availability_of_toilet 1 12071 12077

• Here, we get the same subset of variables as in forward selection


method. Hence, the model is also similar.
Con…
Call: glm(formula = Contraceptive_use ~ Mother_age + Family_size +
Availability_of_toilet, family = "binomial")

Coefficients:
(Intercept) Mother_age Family_size Availability_of_toilet
-0.26979 -0.03177 -0.06230 1.03718

Degrees of Freedom: 10263 Total (i.e. Null); 10260 Residual


Null Deviance: 12250
Residual Deviance: 11610 AIC: 11620
Con…
Goodness-of-fit tests

• Once the model is fitted, the next important step is to check whether
the probabilities produced by the model accurately reflect the true
outcomes observed in the data.

• The most commonly used measures of goodness of fit for logistic


regression model are

- Hosmer-Lemeshow test

- Classification tables

- Receiver operating characteristic (ROC) curve
Con…

i. Hosmer-Lemeshow test

• The Hosmer-Lemeshow test is the commonly used test of the overall


fit of the fitted logistic regression model

• Hosmer-Lemeshow test divides subjects into deciles based on


predicted probabilities and constructs a goodness of fit statistic by
comparing the observed and predicted number of events in each
group
Con…

• The test statistic is

Ĉ = Σ (O_k − n_k·π̄_k)² / [n_k·π̄_k(1 − π̄_k)], summed over the g groups

• Where, g = the number of groups, n_k = the number of subjects in group k,
O_k = the observed number of events in group k, and π̄_k = the average
predicted probability in group k

• Under the null hypothesis that the model is adequate to fit the data, the
distribution of Ĉ is approximated by the Chi-squared distribution with (g − 2)
degrees of freedom
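A minimal Python sketch of the statistic (the group summaries below are made-up illustrative numbers; in R this test is available in packages such as ResourceSelection):

```python
def hosmer_lemeshow(groups):
    """groups: list of (n_k, O_k, pbar_k) = (group size, observed events,
    mean predicted probability). Returns the chi-square-type statistic C-hat."""
    C = 0.0
    for n_k, O_k, pbar_k in groups:
        expected = n_k * pbar_k
        C += (O_k - expected) ** 2 / (expected * (1 - pbar_k))
    return C

# Hypothetical deciles-of-risk summary (illustrative numbers only)
groups = [(100, 12, 0.10), (100, 22, 0.20), (100, 33, 0.30)]
print(round(hosmer_lemeshow(groups), 3))  # 1.123
# compare against a chi-square distribution with g - 2 degrees of freedom
```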
Con…
ii. Classification table

• It is an appealing way to summarize the results of a fitted logistic regression model.

• This table is the result of cross-classifying the outcome variable, , with a dichotomous

variable whose values are derived from the estimated logistic regression model

probabilities.

• In this approach, estimated probabilities are used to predict group membership.

• If the model predicts group membership accurately according to some criterion, then

this is thought to provide evidence that the model fits the data well
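An illustrative Python sketch of such a cross-classification at a 0.5 cut-point (the outcomes and probabilities below are fabricated for demonstration only):

```python
def classification_table(y_true, p_hat, cutoff=0.5):
    """Cross-classify observed outcomes with predicted group membership."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1: tp += 1
        elif pred == 1 and y == 0: fp += 1
        elif pred == 0 and y == 0: tn += 1
        else: fn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(y_true)
    return tp, fp, tn, fn, sensitivity, specificity, accuracy

y = [1, 1, 0, 0, 1, 0]            # hypothetical observed outcomes
p = [0.9, 0.4, 0.2, 0.6, 0.7, 0.1]  # hypothetical fitted probabilities
print(classification_table(y, p))
```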
Con…
iii. Receiver Operating Characteristic (ROC) Curve

• A better and more complete description of classification accuracy is the


area under the ROC curve.

• ROC curve is the plot of sensitivity versus (1-specificity) for an entire range
of possible cut points.

• The area under ROC curve, which ranges from 0.5 to 1 provides a measure
of the model’s ability to discriminate between those subjects who
experience the outcome of interest versus those who do not.
Con…

• According to a rule of thumb, the area under the ROC curve is interpreted using
the following general guidelines:

ROC = 0.5 suggests no discrimination

0.5 < ROC < 0.7 suggests poor discrimination

0.7 ≤ ROC < 0.8 suggests acceptable discrimination

0.8 ≤ ROC < 0.9 suggests excellent discrimination

ROC ≥ 0.9 suggests outstanding discrimination
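The area under the ROC curve equals the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event. An illustrative Python sketch using this pairwise-comparison identity (toy data, not from the contraceptive example):

```python
def auc(y_true, p_hat):
    """AUC as the proportion of (event, non-event) pairs in which the event
    has the higher predicted probability (ties count one half)."""
    events = [p for y, p in zip(y_true, p_hat) if y == 1]
    nonevents = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn: wins += 1
            elif pe == pn: wins += 0.5
    return wins / (len(events) * len(nonevents))

y = [1, 1, 1, 0, 0, 0]               # toy outcomes
p = [0.9, 0.8, 0.35, 0.7, 0.3, 0.1]  # toy fitted probabilities
print(auc(y, p))  # 8/9, i.e. ~0.889: 0.5 = no discrimination, 1.0 = perfect
```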


Chapter 5: Multi-category logit models

• Multi-category logit models are a generalization of binary logit models

• They are used to model categorical responses (nominal and ordinal) with
more than two categories

• As in ordinary logistic regression, explanatory variables can be categorical


and/or quantitative

• The multi-category models assume that the counts in the categories of Y


have a multinomial distribution
Con…

Logit models for nominal responses

• Let J denote the number of categories for Y

• Let π1, π2, ….., πJ denote the response probabilities, satisfying π1 + π2 + … + πJ = 1

• With n independent observations, the probability distribution for the
number of outcomes of the J types is the multinomial

• Multi-category logit models simultaneously use all pairs of categories by
specifying the odds of outcome in one category instead of another
Con…
• The order of listing the categories is irrelevant, because the model treats the
response scale as nominal (unordered categories).

Base-line category logits

• Logit models for nominal response variables pair each category with a baseline
category.

• When the last category (J) is the baseline, the baseline-category logits are

log(πj/πJ), where j = 1, 2, ….., J − 1

• Given that the response falls in category j or category J, this is the log odds
that the response is j.
Con…

• For J = 3, for instance, the model uses log(π1/π3) and log(π2/π3).

• The baseline-category logit model with a predictor x is

log(πj/πJ) = αj + βj·x, j = 1, ….., J − 1

- This model has J − 1 equations, with separate parameters for each.

- That is

log(π1/π3) = α1 + β1·x ⇨ the log odds that the outcome is 1 rather than 3

log(π2/π3) = α2 + β2·x ⇨ the log odds that the outcome is 2 rather than 3
Con…

• For an arbitrary pair of categories a and b, the log odds that the response is
in category a rather than in category b is given by

log(πa/πb) = log[(πa/πJ)/(πb/πJ)] = log(πa/πJ) − log(πb/πJ)

= (αa + βa·x) − (αb + βb·x)

= (αa − αb) + (βa − βb)·x

- So, this equation has the form α + βx with intercept parameter α = (αa − αb) and
slope parameter β = (βa − βb)
Con…
Example: Alligators’ food choice

The primary food type found in the alligators’ stomachs has three
categories (Fish, Invertebrate, Other), and here we take the length of
the alligators in meters (x) as a predictor variable. Then, after some
analysis using the baseline category “Other” as the reference group, we
get the following outputs:

log(πF/πO) = 1.618 − 0.110x (1)

log(πI/πO) = 5.697 − 2.465x (2)

Then answer the following questions.


Con…

a). Find the estimated odds ratios for the above two equations and interpret them.

b). Find the estimated log odds that the response is “Fish” rather than
“Invertebrate” and interpret it.

Soln.

a). For equation (1), the estimated odds ratio is

θ̂ = e^(−0.110) = 0.896

For equation (2), the estimated odds ratio is

θ̂ = e^(−2.465) = 0.085
Con…
Interpretation:

• For equation (1), for every one-unit increase in alligator length, the estimated
odds that the primary food type is “Fish” rather than “Other” decrease by a
factor of 0.896

• For equation (2), for every one-unit increase in alligator length, the estimated
odds that the primary food type is “Invertebrate” rather than “Other”
decrease by a factor of 0.085

b). The estimated log odds that the primary food type is “Fish” rather than
“invertebrate” equals
Con…

log(πF/πI) = (1.618 − 5.697) + [−0.110 − (−2.465)]x

= −4.08 + 2.355x

Interpretation:

For every one-unit increase in alligator length, the estimated odds that
the primary food type is “Fish” rather than “Invertebrate” increase
by a factor of e^(2.355) = 10.5
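This contrast can be verified in Python directly from the two fitted baseline-category equations:

```python
import math

# Coefficients from equations (1) and (2): Fish vs Other, Invertebrate vs Other
aF, bF = 1.618, -0.110
aI, bI = 5.697, -2.465

# Fish vs Invertebrate: difference of the two baseline-category equations
alpha = aF - aI   # intercept of the pairwise contrast
beta = bF - bI    # slope of the pairwise contrast
print(round(alpha, 3), round(beta, 3))  # -4.079 2.355
print(round(math.exp(beta), 1))         # 10.5: odds multiplier per extra meter
```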
Con…
Estimated response probabilities

• The multi-category logit model has an alternative expression in terms


of the response probabilities.

• This is

πj = e^(αj + βj·x) / Σh e^(αh + βh·x), where j = 1, 2, 3, ……., J

NB: the denominator is the same for each category


Con…

Example: The estimates from the above example of “alligators’ food choice”
contrast “Fish” and “Invertebrate” to “Other” as the baseline category. The
estimated probabilities of the outcomes (Fish, Invertebrate, Other) for an
alligator of length x = 0.015 m are

π̂F = e^(1.618 − 0.110(0.015)) / [1 + e^(1.618 − 0.110(0.015)) + e^(5.697 − 2.465(0.015))] = 0.017
Con…

π̂I = e^(5.697 − 2.465(0.015)) / [1 + e^(1.618 − 0.110(0.015)) + e^(5.697 − 2.465(0.015))] = 0.980

π̂O = 1 / [1 + e^(1.618 − 0.110(0.015)) + e^(5.697 − 2.465(0.015))] = 0.003
Con…
NB: the term 1 in each denominator and in the numerator of π̂O arises
because αJ = βJ = 0 for the baseline category.
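These estimated probabilities can be reproduced with a short Python sketch of the response-probability formula:

```python
import math

def baseline_probs(x, coefs):
    """coefs: list of (alpha_j, beta_j) for the non-baseline categories;
    the baseline has alpha = beta = 0, contributing exp(0) = 1."""
    terms = [math.exp(a + b * x) for a, b in coefs]
    denom = 1.0 + sum(terms)
    return [t / denom for t in terms] + [1.0 / denom]

# (alpha, beta) for Fish and Invertebrate versus the "Other" baseline
pF, pI, pO = baseline_probs(0.015, [(1.618, -0.110), (5.697, -2.465)])
print(round(pF, 3), round(pI, 2), round(pO, 3))  # 0.017 0.98 0.003
```

By construction the three probabilities sum to 1, since the same denominator normalizes every category.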
