Categorical Data
Anup Dewanji
03 May 2024
1 Introduction
This course focuses on the analysis of discrete and categorical data, which abound in practice. Discrete variables are those taking finitely or countably many values; for example, the number of children in a family, the number of daily accidents, etc., are discrete variables. This is in contrast with a continuous variable, which may take uncountably many real values in an interval of finite or infinite range (e.g., height of an individual, shelf time of a product, etc.). On the other hand, categorical variables are attributes representing a specific quality or feature having a finite number of levels or categories. For example, gender, marital status, race, etc., are categorical variables having a certain number of categories. Although, in many textbooks, a categorical variable is considered a special type of discrete variable, one distinguishing characteristic of a categorical variable is that it does not take numerical values but has a finite number of possible levels or categories.
A categorical variable with only two levels is called a binary variable. For example, a binary variable may be thought of as indicating the operating status of a product at the terminating time. In general, one can suitably dichotomize a discrete or continuous variable to create a binary variable.
A categorical variable with more than two levels is also called a polychotomous variable. There are two common types of polychotomous variables. If the levels of a polychotomous variable are ordered, then it is called an ordinal variable (e.g., education level, socio-economic status, etc.). In practice, natural numbers (e.g., 1 to 5) are often used to denote the ordinal levels. It is important to note that such a choice of numbers can be arbitrary; they are better treated as codes for the different levels. Ordinal variables often arise from discretizing a (latent) continuous variable, either because of failure to observe the original scale directly or for the purpose of simple modeling and interpretation. For example, age of an individual is a continuous variable, but it is often recorded (grouped) in intervals in many surveys, possibly with the interpretation of child, young, adult, senior, elderly, etc.
Discrete variables may have a finite or an infinite range. In most practical examples, a discrete variable records the number of occurrences of an event of interest, such as accidents, heart attacks, suicide attempts, abortions, etc., and thus has a theoretical range that includes all natural numbers. Such variables are called count variables. Examples with a finite range are not that common; one can think of the number of daily admissions in a hospital with a finite number of beds.
2 Review of Statistical Inference
In this section, we review some fundamental concepts and techniques for statistical inference based on the data at hand and the assumed underlying model. The main objective is to estimate the unknown model parameters, or population quantities of interest, and/or to carry out some relevant hypothesis tests involving the parameters. We also state some limiting results (as the sample size n → ∞) that are used to derive properties of the estimates. In the Statistics literature, these are called 'asymptotic results'. They are covered in standard courses on statistical inference. However, to recapitulate, we will describe the roles played by these concepts and results in the investigation of statistical properties of the estimated model parameters.
2.1 Maximum Likelihood Estimation
Suppose X1, · · · , Xn are IID with CDF F(x; θ) and PDF (or PMF) f(x; θ) (using a common notation for PDF and PMF). The corresponding observations are x1, · · · , xn. The likelihood function L(θ) = L(θ|x1, · · · , xn) is a function of the unknown parameter(s) θ which is proportional to the probability of observing the data x1, · · · , xn under the assumed model given by F(x; θ) or f(x; θ). Using some loose notation, we have
$$L(\theta) \propto \Pr\{X_1 = x_1, \cdots, X_n = x_n\} = \prod_{i=1}^{n} \Pr\{X_i = x_i\} \approx \prod_{i=1}^{n} f(x_i; \theta).$$
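As an illustration, the likelihood can also be maximized numerically and checked against a closed-form answer. Below is a minimal sketch in Python (not part of the course material), assuming hypothetical Poisson count data; scipy's bounded scalar minimizer is used.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical daily accident counts, assumed purely for illustration.
x = np.array([2, 1, 0, 3, 1, 2, 0, 1])

def neg_log_lik(lam):
    # Poisson log-likelihood up to the additive constant -sum(log x_i!).
    return -(np.sum(x) * np.log(lam) - len(x) * lam)

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
print(res.x, x.mean())  # numerical maximizer agrees with the closed form: lambda_hat = x_bar
```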
Example 2: Suppose X1, · · · , Xn are IID Poisson(λ) random variables recording, say, the number of daily accidents. The corresponding observations are x1, · · · , xn. Write down the likelihood and log-likelihood functions and obtain the MLE of λ after checking the second derivative.
H.W.: Consider the geometric distribution with PMF
$$p_x = (1-p)^{x-1}\, p, \quad x = 1, 2, \cdots.$$
First check that the sum of all the px's is 1 and then find E[X] and V[X]. Obtain the MLE of p based on observations x1, · · · , xn.
Maximum likelihood estimation handles such a problem through a property called 'invariance'. In general, if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g(·). Therefore, in Example 2, since x̄ is the MLE of λ, e^{−4x̄} is the MLE of e^{−4λ}; in particular, e^{−x̄} is the MLE of e^{−λ}, the probability of no accident on a single day.
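A minimal numerical check of the invariance property, with the same hypothetical counts as above:

```python
import numpy as np

x = np.array([2, 1, 0, 3, 1, 2, 0, 1])  # hypothetical counts, as before
lam_hat = x.mean()                      # MLE of lambda
print(np.exp(-lam_hat))                 # MLE of e^{-lambda}: P(no accident on a day)
print(np.exp(-4 * lam_hat))             # MLE of e^{-4*lambda}, by invariance
```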
2.2 Information
Consider a model given by PMF or PDF fθ(·), with the associated model parameter(s) θ. Then the 'information', denoted by i(θ), in a random observation on X following the distribution fθ(·) is defined as
$$i(\theta) = -E\left[\frac{\partial^2 \log f_\theta(X)}{\partial\theta\,\partial\theta^T}\right] = E\left[\frac{\partial \log f_\theta(X)}{\partial\theta}\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^T\right].$$
You will find that i(p) from a single Bin(n, p) random variable is the same as n/(p(1 − p)), the information from n IID Bernoulli(p) random variables. Note that, when X1, · · · , Xn are IID Bernoulli(p) random variables, the sum X1 + · · · + Xn has the Bin(n, p) distribution. So, the equality of the information from n IID Bernoulli(p) random variables with that from a single Bin(n, p) random variable tells us that there is equal information on p in the n IID Ber(p) random variables and in their sum. That is, there is 'sufficient' information on p in the sum X1 + · · · + Xn, which is equal to having X1, · · · , Xn individually.
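This equality can be verified numerically. The sketch below, assuming arbitrary illustrative values of n and p, compares the expected information computed directly from the Bin(n, p) log-likelihood with the formula n/(p(1 − p)):

```python
import numpy as np
from scipy.stats import binom

def info_binomial(n, p):
    # i(p) = -E[d^2/dp^2 log f(X; p)] for X ~ Bin(n, p), computed by
    # averaging the second derivative -X/p^2 - (n - X)/(1 - p)^2 over the PMF.
    xs = np.arange(n + 1)
    second_deriv = -xs / p**2 - (n - xs) / (1 - p)**2
    return -np.sum(binom.pmf(xs, n, p) * second_deriv)

n, p = 10, 0.3  # arbitrary illustrative values
print(info_binomial(n, p), n / (p * (1 - p)))  # both ~47.62
```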
Asymptotic results for the MLE θ̂: Let us write u(θ) = ∂ log L(θ)/∂θ = ∂l(θ)/∂θ, where l(θ) = log L(θ), called the 'score vector'. Let us also write
$$I(\theta) = -\frac{\partial^2 \log L(\theta)}{\partial\theta\,\partial\theta^T}.$$
Result-1:
$$u(\theta) \xrightarrow{d} N(0, I(\theta)),$$
under some standard conditions. When θ is a scalar, this is equivalent to the result
$$\frac{u(\theta)}{\sqrt{I(\theta)}} \xrightarrow{d} N(0, 1).$$
While using this result, we replace I(θ) by I(θ̂). So, the approximate 95% confidence interval (CI) for θ can be obtained from
$$\left\{\theta : -1.96 \le \frac{u(\theta)}{\sqrt{I(\hat\theta)}} \le 1.96\right\},$$
where 1.96 is the 97.5th percentile of the standard normal distribution N(0, 1), denoted by z0.975.
Result-2:
$$\hat\theta \xrightarrow{d} N\left(\theta, I^{-1}(\theta)\right),$$
under some standard conditions. This implies that, for a scalar θ, the approximate standard error of θ̂ is $\sqrt{I^{-1}(\hat\theta)} = I^{-1/2}(\hat\theta)$. For vector θ, the diagonal entries of I^{−1}(θ) give the asymptotic variances of the corresponding components of θ̂. The corresponding standard errors (SEs) are obtained by evaluating the diagonal entries of I^{−1}(θ̂) and taking positive square roots. As in the case of Result-1, the approximate 95% CI for scalar θ can be obtained from
$$\left\{\theta : -1.96 \le \frac{\hat\theta - \theta}{\sqrt{I^{-1}(\hat\theta)}} \le 1.96\right\},$$
which is equivalent to
$$\hat\theta - 1.96\sqrt{I^{-1}(\hat\theta)} \;\le\; \theta \;\le\; \hat\theta + 1.96\sqrt{I^{-1}(\hat\theta)},$$
or
$$\hat\theta \pm 1.96\sqrt{I^{-1}(\hat\theta)} \equiv \hat\theta \pm 1.96 \times SE(\hat\theta).$$
For the Poisson example, solving u(λ) = 0, we get λ̂ = Σxi/n = x̄. Then we have the observed information
$$I(\hat\lambda) = \frac{\sum x_i}{\hat\lambda^2} = \frac{n}{\bar{x}},$$
so that $I^{-1}(\hat\lambda) = \bar{x}/n$ and $SE(\hat\lambda) = \sqrt{\bar{x}/n}$.
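For example, the Wald-type 95% CI for λ can be computed as follows, again with hypothetical counts:

```python
import numpy as np

x = np.array([2, 1, 0, 3, 1, 2, 0, 1])  # hypothetical counts
lam_hat = x.mean()
se = np.sqrt(lam_hat / len(x))          # SE(lam_hat) = sqrt(x_bar / n)
print(lam_hat - 1.96 * se, lam_hat + 1.96 * se)
```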
Suppose we wish to obtain the approximate 95% CI for λ. Using Result-1, we have the approximate 95% CI as
$$\left\{\lambda : -1.96 \le \sqrt{\frac{\bar{x}}{n}}\left(-n + \frac{n\bar{x}}{\lambda}\right) \le 1.96\right\}.$$
Alternatively, one may consider the likelihood ratio
$$LR(\lambda) = \frac{e^{-n\lambda}\,\lambda^{n\bar{x}}}{e^{-n\bar{x}}\,\bar{x}^{n\bar{x}}},$$
which gives (H.W.) an approximate 95% CI for λ and hence, by invariance, one for e^{−λ}.
By the delta method, the MLE g(θ̂) of g(θ) is asymptotically normal with mean g(θ) and variance $V(\theta) = \left(\partial g(\theta)/\partial\theta\right)^T I^{-1}(\theta)\left(\partial g(\theta)/\partial\theta\right)$. For scalar θ, this V(θ) becomes (g′(θ))² I^{−1}(θ), which can be consistently estimated by (g′(θ̂))² I^{−1}(θ̂). So, the approximate SE of g(θ̂) is |g′(θ̂)| I^{−1/2}(θ̂), and the approximate 95% CI for g(θ) is g(θ̂) ± 1.96 × |g′(θ̂)| I^{−1/2}(θ̂).
The CRLB result extends more generally to estimators of g(θ) as well. In particular, the variance of any unbiased estimator of g(θ) has the CRLB given by (g′(θ))² I^{−1}(θ). Since the MLE g(θ̂) is asymptotically unbiased for g(θ) and has asymptotic variance (g′(θ))² I^{−1}(θ), it is asymptotically the most efficient estimator of g(θ), as before.
H.W.: Consider Example 2 with the Poisson distribution and g(λ) = e^{−λ}. Obtain the approximate 95% CI for g(λ) using the delta method and compare it with the one above obtained by using Result-2. They may not be algebraically the same; but, for a given data set, they will be quite similar for large n. Work this out with the number-of-siblings data from your class.
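A sketch of the delta-method CI for g(λ) = e^{−λ}, again on hypothetical counts (this only illustrates the mechanics; the H.W. comparison should still be worked out analytically):

```python
import numpy as np

x = np.array([2, 1, 0, 3, 1, 2, 0, 1])         # hypothetical counts
n, lam_hat = len(x), x.mean()

g_hat = np.exp(-lam_hat)                        # MLE of e^{-lambda}, by invariance
se_g = np.exp(-lam_hat) * np.sqrt(lam_hat / n)  # |g'(lam_hat)| * SE(lam_hat)
print(g_hat - 1.96 * se_g, g_hat + 1.96 * se_g)
```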
Suppose X1, · · · , Xn are IID Bernoulli(p) random variables with P[Xi = 1] = p = 1 − P[Xi = 0], for some 0 < p < 1. The sample observations x1, · · · , xn take values 1 or 0. Each Xi corresponds to a binary trial having two possible outcomes (Yes/No, Success/Failure, etc.). Note that E[Xi] = p, the population mean of the underlying random variable, which is the same as the proportion of 1's in the population. We are usually interested in the parameter p. Also, V[Xi] = p(1 − p).
where y = x1 + · · · + xn is the observed value of Y = X1 + · · · + Xn, so that p̂ = y/n = x̄. The variance of p̂, or x̄, under the Bernoulli(p) model is given by V(p̂) = p(1 − p)/n. By the CLT, we have
$$\hat{p} - p \xrightarrow{d} N\left(0, \frac{p(1-p)}{n}\right).$$
This variance V(p̂) can be estimated by p̂(1 − p̂)/n, so that the standard error of p̂ is $\sqrt{\hat p(1-\hat p)/n}$. Then, using the normal approximation of Result-2 on the MLE, an approximate 95% confidence interval for p is given by
$$\left[\hat p - 1.96\sqrt{\hat p(1-\hat p)/n},\;\; \hat p + 1.96\sqrt{\hat p(1-\hat p)/n}\right].$$
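A minimal sketch computing this interval, assuming (for illustration) y = 9 events in n = 464 trials, roughly matching the example that follows:

```python
import numpy as np

n, y = 464, 9                 # assumed counts: 9 events among 464 subjects
p_hat = y / n                 # ~0.019
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - 1.96 * se, p_hat + 1.96 * se)
```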
Now suppose the manufacturer of the vaccine claims that the chance of such an adverse effect is 1 in 100, that is, 0.01. In order to test this claim, the null hypothesis will be H0 : p = p0 = 0.01. The test statistic T takes the value
$$T = \frac{0.019 - 0.01}{\sqrt{0.01 \times 0.99/464}} = \frac{0.009}{0.0046} = 1.9565.$$
This is right on the margin of the two-sided cutoff 1.96, so one can take a decision either way. However, from the public health point of view, we are interested in the one-sided alternative H1 : p > p0 = 0.01. Since T > 1.645, we reject the null hypothesis H0 : p = p0 = 0.01.
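The same test in a few lines, using scipy only for the normal percentile and p-value:

```python
from math import sqrt
from scipy.stats import norm

p_hat, p0, n = 0.019, 0.01, 464
T = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
print(T)                   # ~1.95, right at the two-sided cutoff
print(T > norm.ppf(0.95))  # one-sided test at 5% level: True, so reject H0
print(1 - norm.cdf(T))     # one-sided p-value
```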
Suppose now X1, · · · , Xn are IID count variables following the Poisson(λ) distribution with PMF
$$P[X = x] = \frac{e^{-\lambda}\lambda^x}{x!}, \quad \text{for } x = 0, 1, 2, \cdots,$$
where X has the same distribution as the Xi's. The parameter λ > 0 is of interest; it represents the population mean of X. The population variance of X is also λ. That is, E[X] = V[X] = λ.
To test a null hypothesis H0 : λ = λ0, consider the test statistic
$$T = \frac{\hat\lambda - \lambda_0}{\sqrt{\lambda_0/n}}.$$
As before, using the CLT, this statistic T has approximately a standard normal distribution under H0. So, we reject the null hypothesis H0 : λ = λ0 against the two-sided alternative H1 : λ ≠ λ0 if |T| > 1.96. For the one-sided alternative H1 : λ > λ0, we reject H0 if T > 1.645. And, for the one-sided alternative H1 : λ < λ0, we reject H0 if T < −1.645.
By the delta method, the standard error of g(λ̂) is
$$\left|\frac{\partial g(\lambda)}{\partial \lambda}\right| \sqrt{\bar{x}/n},$$
evaluated at λ = λ̂. For g(λ) = e^{−λ}, this standard error comes out to be 0.142 × 0.312 = 0.044, with the corresponding approximate 95% confidence interval 0.142 ± 1.96 × 0.044 = [0.056, 0.228] for e^{−λ}.
Suppose now the city authority claims that there is on average one accident daily. That is, the null hypothesis is H0 : λ = λ0 = 1. To test this hypothesis, the value of the test statistic T is
$$T = \frac{1.95 - 1}{\sqrt{1/20}} = \sqrt{20} \times 0.95 = 4.25.$$
So, we clearly reject this H0 against the two-sided alternative, and also against the one-sided alternative H1 : λ > 1.
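A quick check of this computation:

```python
from math import sqrt

lam_hat, lam0, n = 1.95, 1.0, 20
T = (lam_hat - lam0) / sqrt(lam0 / n)
print(T)   # ~4.25 > 1.96, so H0: lambda = 1 is rejected
```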
Similarly, suppose we wish to test the null hypothesis that the probability of no accident on a day is 0.2. Note that, under H0 : λ = λ0 = 1, the variance of e^{−λ̂} is, using the delta method,
$$(e^{-\lambda_0})^2 \times \frac{\lambda_0}{n} = \frac{e^{-2}}{20} = 0.00677.$$
So, we consider the test statistic
$$T = \frac{0.142 - 0.2}{\sqrt{0.00677}} = \frac{-0.058}{0.0823} = -0.7047.$$
Since |T| = 0.7047 < 1.96, we cannot reject this null hypothesis against the two-sided alternative.
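The corresponding computation:

```python
from math import exp, sqrt

lam0, n = 1.0, 20
g_hat, g0 = 0.142, 0.2             # e^{-1.95} ~ 0.142, hypothesized value 0.2
var0 = exp(-lam0) ** 2 * lam0 / n  # delta-method variance under H0
print((g_hat - g0) / sqrt(var0))   # ~ -0.70, so H0 is not rejected
```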
For this model, we have E[fj] = npj and V[fj] = npj(1 − pj); also, for j ≠ j′, Cov[fj, fj′] = −npj pj′. Check that, if one is interested in only a particular frequency, say f1, then f1 has a Binomial(n, p1) distribution. Also, for L = 2, the multinomial model reduces to a binomial distribution.
One is usually interested in the cell probabilities pj. The maximum likelihood estimate of pj is p̂j = fj/n, with V(p̂j) = pj(1 − pj)/n, for j = 1, · · · , L. So, the standard error of p̂j is $\sqrt{\hat p_j(1-\hat p_j)/n}$. As before, one can obtain an approximate 95% confidence interval for pj as
$$\hat p_j \pm 1.96\sqrt{\hat p_j(1-\hat p_j)/n}.$$
Also, to test for a specific value pj0 of pj, one can consider the test statistic
$$T = \frac{\sqrt{n}\,(\hat p_j - p_{j0})}{\sqrt{p_{j0}(1-p_{j0})}}.$$
The rejection rules against one- or two-sided alternatives are similar to those
discussed in the previous sub-section.
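A sketch for a three-cell case with assumed frequencies, computing the CI for each pj and a test for a hypothesized value of p1:

```python
import numpy as np

f = np.array([30, 50, 20])  # assumed cell frequencies for L = 3
n = f.sum()
p_hat = f / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(np.c_[p_hat - 1.96 * se, p_hat + 1.96 * se])  # one CI per row

p10 = 0.25                  # hypothesized value of p_1
T = np.sqrt(n) * (p_hat[0] - p10) / np.sqrt(p10 * (1 - p10))
print(T)                    # compare with +/-1.96 for a two-sided test
```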
This particular result can be used, in general, to test the goodness of fit of a particular model for sampled data. Suppose we have observations x1, · · · , xn from a particular model given by the CDF F(·; θ) with the associated parameter(s) θ. The first step is to obtain a consistent estimate (say, the MLE) of θ, denoted by θ̂. The idea is to partition the range of the corresponding random variable X into, say, L mutually exclusive and exhaustive parts and count the frequencies f1, · · · , fL of the n observations falling into the L parts. At the same time, we derive the probabilities of an observation falling into the L parts, say p1, · · · , pL, under the assumed model F(·; θ). Supposing that the L parts are constructed as the intervals Ij = (τj−1, τj], for j = 1, · · · , L − 1, with τ0 = −∞, and IL = (τL−1, ∞), we have pj = P[X ∈ Ij] = F(τj; θ) − F(τj−1; θ). These probabilities are estimated, using the invariance property of the MLE, by p̂j = F(τj; θ̂) − F(τj−1; θ̂), for j = 1, · · · , L. It is an easy exercise to check that $\sum_{j=1}^{L} p_j = \sum_{j=1}^{L} \hat p_j = 1$. In particular, the vector of frequencies (f1, · · · , fL) follows a Multinomial(n; p1, · · · , pL) distribution. We first compute the expected frequencies ej = np̂j, for j = 1, · · · , L, under the assumed model and then, as before, the test statistic
$$T = \sum_{j=1}^{L} \frac{(f_j - e_j)^2}{e_j}.$$
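A sketch of this recipe for a Poisson model with L = 4 parts ({0}, {1}, {2}, {3 or more}) on assumed count data; as is standard, the statistic is referred to a chi-square distribution with L − 1 − (number of estimated parameters) degrees of freedom:

```python
import numpy as np
from scipy.stats import poisson, chi2

x = np.array([0]*6 + [1]*8 + [2]*4 + [3]*2)  # assumed count data, n = 20
n, lam_hat = len(x), x.mean()

# Observed frequencies in the parts {0}, {1}, {2}, {3,...}
f = np.array([(x == 0).sum(), (x == 1).sum(), (x == 2).sum(), (x >= 3).sum()])
# Estimated cell probabilities under Poisson(lam_hat)
p_hat = np.array([poisson.pmf(k, lam_hat) for k in (0, 1, 2)] + [poisson.sf(2, lam_hat)])
e = n * p_hat                                # expected frequencies
T = ((f - e) ** 2 / e).sum()
print(T, chi2.ppf(0.95, df=len(f) - 1 - 1))  # reject the Poisson model if T exceeds the cutoff
```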
Suppose each of n subjects is classified into one of three severity grades, with fj denoting the frequency of grade j, for j = 1, 2, 3, representing the grades Low, Medium and Severe, respectively. Note that f1 + f2 + f3 = n. Writing pj as the probability of being in grade j, for j = 1, 2, 3, with p1 + p2 + p3 = 1, we have the Multinomial(n; p1, p2, p3) model for the frequencies f1, f2, f3.
Although the Poisson distribution is commonly used for modeling count data, it is also very restrictive in the sense that, under this model, the mean and the variance must be the same. However, in practice, it often happens that the sample data exhibit evidence of the sample variance being larger than the sample mean. This phenomenon is known as over-dispersion. A commonly used model for such over-dispersed count data is the negative binomial distribution. This distribution is derived from a Poisson(λ) model for the count variable X, where the parameter λ itself is a random variable following a gamma distribution. More specifically, let λ follow a Gamma(α, γ) distribution with rate parameter α > 0 and shape parameter γ > 0 having density
$$g(\lambda; \alpha, \gamma) = \frac{\alpha^\gamma}{\Gamma(\gamma)}\, e^{-\alpha\lambda}\, \lambda^{\gamma-1},$$
for λ > 0. One can then show that the marginal distribution of X is given by the probability mass function (PMF)
$$
\begin{aligned}
P[X = x] &= \int_0^\infty e^{-\lambda}\,\frac{\lambda^x}{x!} \times \frac{\alpha^\gamma}{\Gamma(\gamma)}\, e^{-\alpha\lambda}\,\lambda^{\gamma-1}\, d\lambda \\
&= \frac{\alpha^\gamma}{\Gamma(\gamma)\, x!} \int_0^\infty e^{-\lambda(\alpha+1)}\, \lambda^{x+\gamma-1}\, d\lambda \\
&= \frac{\alpha^\gamma}{\Gamma(\gamma)\, x!} \cdot \frac{\Gamma(x+\gamma)}{(\alpha+1)^{x+\gamma}} \\
&= \frac{(x+\gamma-1)!}{x!\,(\gamma-1)!} \left(\frac{\alpha}{\alpha+1}\right)^{\gamma} \left(\frac{1}{\alpha+1}\right)^{x} \\
&= \frac{(x+\gamma-1)!}{x!\,(\gamma-1)!}\; p^\gamma (1-p)^x,
\end{aligned}
$$
for x = 0, 1, 2, · · ·, where p = α/(α + 1), Γ(·) is the gamma function, and the factorial form of the coefficient is written for integer γ. This is the PMF of the negative binomial distribution, denoted by NegBin(γ, p).
H.W.: Work out the mean and variance of this NegBin(γ, p) distribution.
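One can also check the mixture representation, and the standard NegBin moment formulas, by simulation; a minimal sketch with assumed values of α and γ (note numpy's gamma generator is parameterized by shape and scale = 1/α):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gam = 1.5, 2.0  # assumed rate and shape of the mixing Gamma
lam = rng.gamma(shape=gam, scale=1 / alpha, size=200_000)
x = rng.poisson(lam)   # X | lambda ~ Poisson(lambda)

p = alpha / (alpha + 1)
print(x.mean(), gam * (1 - p) / p)    # mean of NegBin(gam, p): gam/alpha
print(x.var(), gam * (1 - p) / p**2)  # variance exceeds the mean (over-dispersion)
```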
The Poisson model is thus rejected for this data.
Contingency tables are often used to summarize data on two categorical variables, say X and Y; one is designated as the row variable and the other as the column variable. If both variables are binary, there are four combinations of possible values of these two variables, and their occurrence in a sample can be displayed in a 2 × 2 contingency table. The general form of such a contingency table is given below with the binary levels 0 and 1 for the two categorical variables X and Y. Here fij denotes the frequency of subjects (or observations) belonging to the ith level of X and the jth level of Y, for i, j = 0, 1, with the + sign indicating a sum over the two levels of the corresponding variable.
                   y
   x         0       1      Total
   0        f00     f01      f0+
   1        f10     f11      f1+
 Total      f+0     f+1       n
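Such a table can be produced directly from raw binary data; a sketch with simulated values (pandas is assumed available):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=100)  # simulated binary row variable
y = rng.integers(0, 2, size=100)  # simulated binary column variable

# Cells f_ij with margins f_i+, f_+j and the grand total n
print(pd.crosstab(x, y, margins=True, margins_name="Total"))
```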
Such contingency tables arise from a variety of contexts and sampling schemes.
Some common examples are given below.
1. Consider a single random sample from a target population. The row and column variables may represent two different characteristics of the subjects, or they may be repeated measures of the same characteristic collected before and after an intervention, as in a pre-post study design.
2. A stratified random sample from two independent groups (say, male and female). In this case, one of the variables is the group indicator, and subjects are randomly sampled from each of the groups in the population. Randomized clinical trials with two treatment conditions and case-control studies are such examples.
3. Two judges assign ratings on a binary scale to each subject in a sample.
It is important to know the context of the observations in a contingency table for making valid inference and interpreting the resulting estimates.
Consider drawing n balls from a bag containing N balls, of which N1 are white and N2 = N − N1 are black. If the n balls are drawn simultaneously, or one by one without replacing each drawn ball in the bag before the next draw, the procedure is called sampling without replacement. In this case, we cannot compute the probability of observing X = x white balls from the binomial distribution, since the number of balls in the bag decreases as the sampling proceeds and the proportion of white balls changes from draw to draw. Thus, under this sampling procedure, the draws are dependent and the probability of getting a white ball does not remain constant. In most real studies, subjects are sampled without replacement. However, in data analysis, most methods are based on the IID assumption, which amounts to assuming sampling with replacement. If the target population is very large compared to the sample size n, the difference is small and this assumption is still reasonable. But, in this case, N may not be very large compared to n.
The number of white balls in the sample, X, cannot exceed either the number of balls sampled, n, or the total number of white balls in the bag, N1; that is, X ≤ min{n, N1}. Similarly, if the number of balls sampled, n, exceeds the number of black balls, N2, in the bag, there will be at least n − N2 white balls in the sample; hence, X ≥ max{0, n − N2}.
More generally, suppose the bag contains balls of L different colours, Nj of colour j, with N = N1 + · · · + NL, and n balls are drawn without replacement. Writing Xj for the number of balls of colour j in the sample, the counts (X1, · · · , XL) follow a multivariate hypergeometric distribution with the corresponding PMF given by
$$P[X_1 = x_1, \cdots, X_L = x_L] = \frac{\prod_{j=1}^{L} \binom{N_j}{x_j}}{\binom{N}{n}},$$
for all (x1, · · · , xL) satisfying $\max\{0,\, n - \sum_{i \neq j} N_i\} \le x_j \le \min\{n, N_j\}$, for j = 1, · · · , L, and also $\sum_{j=1}^{L} x_j = n$.
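scipy provides this distribution directly; a small sketch with an assumed composition of the bag:

```python
from scipy.stats import multivariate_hypergeom

N = [5, 10, 15]  # assumed numbers of balls of L = 3 colours; N = 30 in total
n = 6            # balls drawn without replacement
print(multivariate_hypergeom.pmf(x=[1, 2, 3], m=N, n=n))  # P[X1=1, X2=2, X3=3]
```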