SAHADEB - Categorical - Data - Lecture3
APDS
IIM Calcutta
Log-Linear Models for Contingency Tables
(Read: Agresti, pp. 314-326)
• We will develop models where the log of the cell
counts can be additively expressed as a function of
several parameters.
• We have already looked at Poisson Log-Linear
models.
• In cross-classified tables, the loglinear model
does not distinguish between response and
explanatory variables. It treats both jointly as
responses, modeling log(μij) for combinations of the
levels (i, j).
General Form of the r × c Table
(Observations)
               Column Level 1   Column Level 2   ...   Column Level c   Total
Row Level 1    n11              n12              ...   n1c              R1
Row Level 2    n21              n22              ...   n2c              R2
...
Row Level r    nr1              nr2              ...   nrc              Rr
Total          C1               C2               ...   Cc               T = n
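(Probabilities)
               Column Level 1   Column Level 2   ...   Column Level c   Total
Row Level 1    π11              π12              ...   π1c              π1+
Row Level 2    π21              π22              ...   π2c              π2+
...
Row Level r    πr1              πr2              ...   πrc              πr+
Total          π+1              π+2              ...   π+c              1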
Linking Poisson with Multinomial Dist,
(pp. 202-203, Text)
Result: Let Y1, …, Yk be independent Poisson random variables with Yi ~ Poi($\mu_i$). Then,
given Y1 + … + Yk = n, the conditional joint distribution of
(Y1, …, Yk) is multinomial with n trials and cell probabilities
$\pi_i = \mu_i / \sum_{j=1}^{k} \mu_j$, i = 1, …, k.
Model: $\log(\mu_i) = \alpha + \beta_i$, with $\beta_I = 0$. Let
$\pi_i = \mu_i / \sum_{k=1}^{I} \mu_k = \exp(\beta_i) / \sum_{k=1}^{I} \exp(\beta_k)$.
If our interest lies in estimating the $\pi_i$'s (or $\beta_i$'s), maximum
likelihood based on the Poisson approach produces the same
inference as the multinomial approach. The Poisson
approach has one extra parameter to estimate, the total mean
$\mu = \sum_{k=1}^{I} \mu_k = \sum_{k=1}^{I} \exp(\alpha + \beta_k)$, whose MLE is n.
$\beta_i = \log(\mu_i / \mu_I) = \log(\pi_i / \pi_I)$ is the log odds of the i-th category
relative to category I.
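The Poisson-multinomial link can be checked numerically. The sketch below uses hypothetical means (2, 3, 5) and a hypothetical outcome (1, 2, 3), neither of which comes from the lecture, and compares the conditional Poisson probability with the multinomial probability. It also confirms that log(μi/μI) and log(πi/πI) agree.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` with cell probabilities `probs`."""
    coef = math.factorial(sum(counts))
    for y in counts:
        coef //= math.factorial(y)      # exact integer division at each step
    p = float(coef)
    for y, pi in zip(counts, probs):
        p *= pi ** y
    return p

mu = [2.0, 3.0, 5.0]        # hypothetical Poisson means
counts = (1, 2, 3)          # hypothetical observed counts
n = sum(counts)
total_mu = sum(mu)
pi = [m / total_mu for m in mu]

# Joint probability of the independent Poisson counts.
joint = math.prod(poisson_pmf(y, m) for y, m in zip(counts, mu))
# The sum of independent Poissons is Poisson with mean sum(mu).
p_total = poisson_pmf(n, total_mu)
conditional = joint / p_total

multinomial = multinomial_pmf(counts, pi)

# beta_i = log(mu_i / mu_I) equals log(pi_i / pi_I), taking the last category as baseline.
beta_from_mu = [math.log(m / mu[-1]) for m in mu]
beta_from_pi = [math.log(p / pi[-1]) for p in pi]

print(conditional, multinomial)
```

Both printed numbers should agree up to floating-point error, which is the content of the result: conditioning on the total removes the extra Poisson parameter.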
Log-Linear Models:
Independence Model for 2-Way Table
𝜋𝑖𝑗 = 𝜋𝑖+ 𝜋+𝑗
Log-Linear Models: Independence Model
$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y$,
with the constraints $\lambda_r^X = 0$ and $\lambda_c^Y = 0$.
Interpretation of Parameters:
Log-Linear Independence Model for r×2 tables
Let X and Y represent the explanatory and response
variable respectively. Suppose variable Y has 2 levels.
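$\text{logit}[P(Y=1 \mid X=i)] = \log \frac{P(Y=1 \mid X=i)}{P(Y=2 \mid X=i)}$
$= \log(\mu_{i1}/\mu_{i2}) = \log \mu_{i1} - \log \mu_{i2}$
$= (\lambda + \lambda_i^X + \lambda_1^Y) - (\lambda + \lambda_i^X + \lambda_2^Y)$
$= \lambda_1^Y - \lambda_2^Y$, which is independent of i.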
Interpretation of Parameters:
Log-Linear Models for r×2 tables
Saturated Log-Linear Models
Statistically dependent variables satisfy a more complex
loglinear model.
Saturated Model: 2 × 2 Table
General Form of the r × c Table
(Expected Frequencies)
               Column Level 1   Column Level 2   ...   Column Level c   Total
Row Level 1    μ11              μ12              ...   μ1c              R1
Row Level 2    μ21              μ22              ...   μ2c              R2
...
Row Level r    μr1              μr2              ...   μrc              Rr
Total          C1               C2               ...   Cc               T (= n)
General Log-Linear Models, r×c Tables
The number of parameters in the model
Independence Model: 2 × 2 Contingency
Table
$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y$
Hierarchical Models
Consider the model:
$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
or
(ii) we can set:
$\sum \lambda_{ij}^{XY} = 0$
when summed over either i or j.
Alternative Parameter Constraints
(p.317, Agresti, 2nd Ed)
which determine the odds ratios (or log odds ratios) are
unique. For instance, suppose that the log odds ratio
equals 2 in a 2 x 2 table.
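To make the uniqueness claim concrete, the sketch below writes the interaction terms of a 2 × 2 table under two constraint systems and checks that both give a log odds ratio of 2. The function name and the two parameter arrays are illustrative choices, assuming the last-level-zero convention used earlier and the sum-to-zero convention.

```python
def log_odds_ratio(lam):
    """Log odds ratio implied by 2x2 interaction terms lam[i][j]."""
    return lam[0][0] + lam[1][1] - lam[0][1] - lam[1][0]

# Last-level-zero constraints: every term involving level 2 is set to 0,
# so the whole association sits in lam[0][0].
last_level_zero = [[2.0, 0.0],
                   [0.0, 0.0]]

# Sum-to-zero constraints: terms sum to zero over each row and each column.
sum_to_zero = [[0.5, -0.5],
               [-0.5, 0.5]]

print(log_odds_ratio(last_level_zero), log_odds_ratio(sum_to_zero))
```

The individual interaction terms differ between the two systems, but the contrast that determines the odds ratio is the same.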
Log-Linear Models: 3-Way I×J×K Tables
Change of notation to I, J and K.
$\log(\mu_{ijk})$,
where $\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \pi_{ijk} = 1$.
3-Way Tables (Agresti, 2nd ed, p.318, Ch 8)
Source: Wright State University, Dayton, Ohio, USA
Subjects: Students in the Final Year of High School
[Agresti]
Mutual Independence
X, Y, Z are said to be mutually independent if
for all i, j and k,
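πijk = πi++ π+j+ π++k
X and Y are conditionally independent, given Z, if
$\pi_{ijk} = \pi_{i+k}\,\pi_{+jk} / \pi_{++k}$, for all i, j, k.
For the expected frequencies this means
$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$.
This is weaker than joint independence.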
Conditionally Independent Models
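0. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ} + \lambda_{ik}^{XZ}$ (Homogeneous association)
1. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ}$ (Conditional independence of Z and X, given Y)
2. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$ (Joint independence of Z from (X, Y))
3. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$ (Mutual independence)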
Cochran-Mantel-Haenszel Test for no row by
column association in any of the 2×2 Tables
(conditional Independence) (pp. 94-101)
List of 3-Way Models
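(*). $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}$ ; (XYZ)
0. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$ ; (XY, XZ, YZ)
1. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ}$ ; (XY, YZ)
   $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ}$ ; (XY, XZ)
   $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$ ; (XZ, YZ)
2. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$ ; (XY, Z)
   $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ}$ ; (XZ, Y)
   $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{jk}^{YZ}$ ; (YZ, X)
3. $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$ ; (X, Y, Z)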
Probabilistic Forms of Conditionally
Independent Models
Calculation of Fitted Values
Mutual Independence (X, Y, Z): $\pi_{ijk} = \pi_{i++}\,\pi_{+j+}\,\pi_{++k}$
Joint Independence (XZ, Y): $\pi_{ijk} = \pi_{i+k}\,\pi_{+j+}$
Cond. Independence (XZ, YZ): $\pi_{ijk} = \pi_{i+k}\,\pi_{+jk} / \pi_{++k}$
Homogeneous Association (XY, XZ, YZ): $\pi_{ijk} = \psi_{ij}\,\phi_{jk}\,\omega_{ik}$
Fitted values:
Cond. Independence (XZ, YZ): $\hat{\mu}_{ijk} = n_{i+k}\,n_{+jk} / n_{++k}$
Homogeneous Association: Iterative Methods
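The closed-form cases above can be sketched directly from the observed margins. The 2 × 2 × 2 counts below are hypothetical and the helper names are illustrative. The mutual and joint independence formulas use the standard plug-in estimates (sample proportions in place of the π's), which is an assumption since the slide states the fitted-value formula only for conditional independence.

```python
# Hypothetical 2x2x2 counts n[i][j][k] (X level i, Y level j, Z level k).
n = [[[20, 10], [15, 5]],
     [[10, 20], [5, 15]]]
I, J, K = 2, 2, 2
cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
total = sum(n[i][j][k] for i, j, k in cells)

def n_x(i):
    return sum(n[i][j][k] for j in range(J) for k in range(K))

def n_y(j):
    return sum(n[i][j][k] for i in range(I) for k in range(K))

def n_z(k):
    return sum(n[i][j][k] for i in range(I) for j in range(J))

def n_xz(i, k):
    return sum(n[i][j][k] for j in range(J))

def n_yz(j, k):
    return sum(n[i][j][k] for i in range(I))

# Mutual independence (X, Y, Z): n * p_i++ * p_+j+ * p_++k
mutual = {(i, j, k): n_x(i) * n_y(j) * n_z(k) / total ** 2 for i, j, k in cells}

# Joint independence (XZ, Y): n * p_i+k * p_+j+
joint_xz_y = {(i, j, k): n_xz(i, k) * n_y(j) / total for i, j, k in cells}

# Conditional independence (XZ, YZ): n_i+k * n_+jk / n_++k
cond_xz_yz = {(i, j, k): n_xz(i, k) * n_yz(j, k) / n_z(k) for i, j, k in cells}

for cell in cells:
    print(cell, round(mutual[cell], 2), round(joint_xz_y[cell], 2),
          round(cond_xz_yz[cell], 2))
```

Each fitted table reproduces the margins named in its model symbol; for example, the (XZ, YZ) fit sums over j to the observed n_{i+k}. The homogeneous association model has no such closed form, which is why it needs an iterative method.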
(XY, XZ, YZ) Model: Interpreting Model Parameters
(Agresti, 2nd ed, p. 321)
$\theta_{ij(k)} = \frac{\pi_{ijk}\,\pi_{i+1,j+1,k}}{\pi_{i,j+1,k}\,\pi_{i+1,j,k}}$, $1 \le i \le I-1$, $1 \le j \le J-1$
Alcohol, Cigarette, Marijuana (ACM) use
(Agresti, p. 323)
Model Fits (Expected Cell Count)for ACM Data
(Agresti, p. 323)
ACM Data: Estimated Odds Ratios
Measuring Association (Agresti, 2nd ed, p. 323)
Best Model for A-C-M Data
Log-Likelihood-Ratio Statistic ($G^2$)
Output from Best Model (AC,AM,CM)
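A fit of the (AC, AM, CM) model can be sketched with iterative proportional fitting (IPF), one common iterative method for the homogeneous association model. The counts below are transcribed from the ACM table in Agresti, which is not reproduced in these notes, so they should be checked against the text. The function name, tolerance, and cycle limit are illustrative assumptions, and this is not necessarily how the software output shown in the lecture was produced.

```python
import math

# n[a][c][m]: alcohol (a), cigarette (c), marijuana (m) use; index 0 = yes, 1 = no.
# Transcribed from Agresti's ACM table (assumption: verify against the text).
n = [[[911, 538], [44, 456]],
     [[3, 43], [2, 279]]]

def ipf_all_two_way(n, tol=1e-8, max_cycles=1000):
    """Fit (XY, XZ, YZ) by cycling through the three two-way margins."""
    I, J, K = len(n), len(n[0]), len(n[0][0])
    mu = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(max_cycles):
        old = [[row[:] for row in plane] for plane in mu]
        for i in range(I):                      # match the XY margin
            for j in range(J):
                scale = sum(n[i][j]) / sum(mu[i][j])
                for k in range(K):
                    mu[i][j][k] *= scale
        for i in range(I):                      # match the XZ margin
            for k in range(K):
                scale = (sum(n[i][j][k] for j in range(J))
                         / sum(mu[i][j][k] for j in range(J)))
                for j in range(J):
                    mu[i][j][k] *= scale
        for j in range(J):                      # match the YZ margin
            for k in range(K):
                scale = (sum(n[i][j][k] for i in range(I))
                         / sum(mu[i][j][k] for i in range(I)))
                for i in range(I):
                    mu[i][j][k] *= scale
        change = max(abs(mu[i][j][k] - old[i][j][k])
                     for i in range(I) for j in range(J) for k in range(K))
        if change < tol:
            break
    return mu

mu = ipf_all_two_way(n)
cells = [(a, c, m) for a in range(2) for c in range(2) for m in range(2)]
g2 = 2 * sum(n[a][c][m] * math.log(n[a][c][m] / mu[a][c][m]) for a, c, m in cells)
df = 1  # (I-1)(J-1)(K-1) for a 2x2x2 table

def ac_odds_ratio(table, m):
    """Conditional A-C odds ratio at marijuana level m."""
    return (table[0][0][m] * table[1][1][m]) / (table[0][1][m] * table[1][0][m])

fitted_ac = [ac_odds_ratio(mu, m) for m in range(2)]
print(round(g2, 2), df, [round(t, 2) for t in fitted_ac])
```

Because the model has no three-factor term, the fitted A-C odds ratio is the same at both levels of M; a small G² on 1 degree of freedom indicates that the model reproduces the observed counts closely.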
Calculation of Fitted Values (Agresti, p. 333)
For simplicity, derivations use the Poisson sampling model,
which, unlike the multinomial, does not require a constraint on the
parameters. For three-way tables, the joint Poisson
probability that the cell counts {Yijk} equal {nijk} is
$\prod_i \prod_j \prod_k e^{-\mu_{ijk}} \mu_{ijk}^{n_{ijk}} / n_{ijk}!$
For general loglinear model (XYZ), likelihood simplifies to
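A sketch of the standard form, in the notation of the earlier slides (observed counts n_ijk, expected counts μ_ijk): taking logs of the product of independent Poisson probabilities and dropping the terms that do not involve μ gives the kernel of the log likelihood, and substituting the saturated model collects the data into marginal totals. This follows the usual textbook derivation rather than transcribing the slide.

```latex
% Kernel of the Poisson log likelihood
L(\mu) = \sum_i \sum_j \sum_k n_{ijk} \log \mu_{ijk} - \sum_i \sum_j \sum_k \mu_{ijk}

% Substituting the general model (XYZ)
L(\mu) = n\lambda
  + \sum_i n_{i++} \lambda_i^X + \sum_j n_{+j+} \lambda_j^Y + \sum_k n_{++k} \lambda_k^Z
  + \sum_i \sum_j n_{ij+} \lambda_{ij}^{XY}
  + \sum_i \sum_k n_{i+k} \lambda_{ik}^{XZ}
  + \sum_j \sum_k n_{+jk} \lambda_{jk}^{YZ}
  + \sum_i \sum_j \sum_k n_{ijk} \lambda_{ijk}^{XYZ}
  - \sum_i \sum_j \sum_k \exp\left(\lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z
      + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}\right)
```

Each parameter is multiplied by the marginal total with the same subscripts, so for a reduced model the sufficient statistics are the margins corresponding to the terms it retains.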
Chi-Squared Goodness of Fit Tests
(Agresti, p. 337-)
Loglinear-Logit Model Connection(Agresti, p. 330)
To understand implications of a loglinear model formula, form a
logit on one variable; e.g., consider the loglinear model (XY, XZ,
YZ). When Y is binary, its logit is
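The logit can be written out by differencing the (XY, XZ, YZ) model at the two levels of Y; this is a sketch in the notation used above, with α, β_i^X and β_k^Z introduced here as shorthand for the grouped terms.

```latex
\log \frac{P(Y=1 \mid X=i, Z=k)}{P(Y=2 \mid X=i, Z=k)}
  = \log \mu_{i1k} - \log \mu_{i2k}
  = (\lambda_1^Y - \lambda_2^Y)
    + (\lambda_{i1}^{XY} - \lambda_{i2}^{XY})
    + (\lambda_{1k}^{YZ} - \lambda_{2k}^{YZ})
  = \alpha + \beta_i^X + \beta_k^Z
```

The terms λ, λ_i^X, λ_k^Z and λ_ik^XZ cancel because they do not involve the level of Y, so the loglinear model corresponds to a logit model with additive main effects of X and Z and no interaction.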
The Likelihood Ratio Chi-Square
$G^2 = 2 \sum_i \sum_j n_{ij} \log(n_{ij} / \hat{\mu}_{ij})$
Political Affiliation Example
                   Political Affiliation
College          Rep    Dem    Indep    Total
Letters           34     61       16      111
Engineering       31     19       17       67
Agriculture       19     23       16       58
Education         23     39       12       74
Total            107    142       61      310
It is a 4 × 3 table.
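For this table, the independence model can be fitted in closed form and the likelihood-ratio and Pearson statistics computed directly. The sketch below uses the counts from the table; the variable names are illustrative.

```python
import math

# Rows: Letters, Engineering, Agriculture, Education; columns: Rep, Dem, Indep.
observed = [
    [34, 61, 16],
    [31, 19, 17],
    [19, 23, 16],
    [23, 39, 12],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Fitted values under independence: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

pairs = [(o, e) for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]

g2 = 2 * sum(o * math.log(o / e) for o, e in pairs)   # likelihood-ratio statistic
x2 = sum((o - e) ** 2 / e for o, e in pairs)          # Pearson statistic
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(round(g2, 2), round(x2, 2), df)
```

Both statistics are compared with a chi-squared distribution on (4 - 1)(3 - 1) = 6 degrees of freedom, whose 5% critical value is 12.59.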
Political Affiliation Example
• In this example, the interaction plot suggests
lack of independence.
• However it also suggests that it is the
Engineering school which is primarily the
deviant group.
• Removal of the Engineering school may lead
to insignificance among the other three
schools.
Political Affiliation Example (All 4 Schools)
Political Affiliation Example
(with Engineering School Removed)
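The comparison between the full table and the table without Engineering can be sketched as follows, using the Pearson statistic broken down by row. This is an illustrative check on the counts in the table, not a reproduction of the lecture's software output; the helper name is an assumption.

```python
def pearson_by_row(table):
    """Contribution of each row to the Pearson chi-squared statistic under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    contributions = []
    for row, r_tot in zip(table, row_totals):
        contributions.append(
            sum((o - r_tot * c / n) ** 2 / (r_tot * c / n)
                for o, c in zip(row, col_totals))
        )
    return contributions

colleges = ["Letters", "Engineering", "Agriculture", "Education"]
table = [
    [34, 61, 16],
    [31, 19, 17],
    [19, 23, 16],
    [23, 39, 12],
]

by_row = dict(zip(colleges, pearson_by_row(table)))
x2_all = sum(by_row.values())                       # df = 6

# Refit independence on the 3 x 3 table without Engineering.
reduced = [row for name, row in zip(colleges, table) if name != "Engineering"]
x2_reduced = sum(pearson_by_row(reduced))           # df = 4

print({name: round(v, 2) for name, v in by_row.items()})
print(round(x2_all, 2), round(x2_reduced, 2))
```

The 5% critical values are 12.59 on 6 degrees of freedom and 9.49 on 4, so the full-table statistic and the reduced-table statistic can be read against those thresholds.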
School Adversity Example
                                  Adversity of School Condition (k)
                                  Low         Medium      High
Risk (j)                          N     R     N     R     N     R     Total
Classroom       Non deviant      16     7    15    34     5     3       80
behavior (i)    Deviant           1     1     3     8     1     3       17
Total                            17     8    18    42     6     6       97

$n_{1++} = 80$, $n_{2++} = 17$, $n_{+11} = 17$, $n_{+21} = 8$,
$n_{+12} = 18$, $n_{+22} = 42$, $n_{+23} = 6$, $n_{+++} = 97$
Different Presentation of Data
Higher Dimensional Tables
• As the dimension of the table becomes larger, the
analysis becomes complicated.
• If we can collapse over some of the variables without
losing the information on significant interaction
terms, it can make the analysis much easier.
• Inappropriate collapsing can lead to incorrect
inference.
• The best known example of inappropriate collapsing
is demonstrated by Simpson’s paradox.
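A small numerical sketch of the collapsing problem: in the hypothetical counts below (invented for illustration, not from the lecture), the X-Y odds ratio is above 1 at each level of Z, yet the odds ratio in the table collapsed over Z is below 1.

```python
def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Hypothetical partial tables: rows are levels of X, columns are levels of Y.
partial = {
    "Z=1": [[90, 10], [800, 200]],
    "Z=2": [[300, 700], [10, 90]],
}

# Collapse over Z by adding the partial tables cell by cell.
marginal = [[sum(t[i][j] for t in partial.values()) for j in range(2)]
            for i in range(2)]

conditional_or = {z: odds_ratio(t) for z, t in partial.items()}
marginal_or = odds_ratio(marginal)

print(conditional_or, round(marginal_or, 3))
```

Here Z is strongly associated with both X and Y, which is the situation in which collapsing over Z changes the apparent X-Y association.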
ACM Data: Estimated Odds Ratios
Measuring Association (Agresti, p. 323)
Four-Way Contingency Tables (Agresti, p. 326)
Model (G, I, L, S) of mutual independence fits very poorly. Model (GI, GL, GS, IL,
IS, LS) fits much better but still has a lack of fit (P < 0.001).
Model (GIL, GIS, GLS, ILS) fits well ($G^2 = 1.3$, df = 1) but is complex and difficult to
interpret. This suggests studying models more complex than (GI, GL, GS, IL, IS, LS)
but simpler than (GIL, GIS, GLS, ILS).
Four-Way Contingency Tables (Agresti, p. 328)
For model (GLS, GI, IL, IS), each pair of variables is conditionally dependent,
and at each category of I, the association between any two of the others varies
across categories of the remaining variable. For this model, it is inappropriate
to interpret the GL, GS, and LS two-factor terms on their own. Since I does
not occur in a three-factor interaction, the conditional odds ratio between I and
each variable (see the top portion of Table 8.10) is the same at each combination
of categories of the other two variables.
When a model has a three-factor interaction term but no higher order term,
one can study the interaction by calculating fitted odds ratios between two
variables at each level of the third. One can do this at any levels of remaining
variables not involved in the interaction.
The bottom portion of Table 8.10 illustrates this for model (GLS, GI, IL, IS). For
instance, the fitted GS odds ratio of 0.66 for (L = urban) refers to the four fitted
values for urban accidents, both the four with (injury = no) and the four with
(injury = yes); for example, 0.66 = (7273.2 × 10,959.2)/(11,632.6 × 10,358.9).
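The arithmetic in the example above can be checked in a couple of lines. The four fitted values are the ones quoted in the text; the function and argument names are illustrative, and which G and S categories each value belongs to is not stated here.

```python
def fitted_odds_ratio(diag_a, diag_b, off_a, off_b):
    """Cross-product ratio of four fitted values: (diagonal product) / (off-diagonal product)."""
    return (diag_a * diag_b) / (off_a * off_b)

# Fitted values quoted in the text for urban accidents.
gs_urban = fitted_odds_ratio(7273.2, 10959.2, 11632.6, 10358.9)
print(round(gs_urban, 2))
```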
Contrasts
Let $q_{ij}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, be any set of numbers
with the property that $q_{i+} = q_{+j} = 0$. Then a contrast
of interactions may be expressed as
$\sum_{i=1}^{I} \sum_{j=1}^{J} q_{ij} \lambda_{ij}^{XY}$.
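As a worked instance in a 2 × 2 table (I = J = 2), the choice q11 = q22 = 1, q12 = q21 = -1 has zero row and column sums and recovers the log odds ratio. The particular q values are an illustrative choice, assuming the saturated model log μij = λ + λi^X + λj^Y + λij^XY.

```latex
\sum_{i=1}^{2} \sum_{j=1}^{2} q_{ij} \lambda_{ij}^{XY}
  = \lambda_{11}^{XY} + \lambda_{22}^{XY} - \lambda_{12}^{XY} - \lambda_{21}^{XY}
  = \log \frac{\mu_{11}\,\mu_{22}}{\mu_{12}\,\mu_{21}}
```

The main-effect terms cancel in the cross-product ratio, so the contrast takes the same value under any of the parameter constraint systems.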