
Chapter 1: Introduction to Discrete and Categorical Data
Anup Dewanji
03 May 2024

1 Introduction

This course focuses on the analysis of discrete and categorical data, which abound in practice. Discrete variables are those taking finitely or countably many values; for example, the number of children in a family and the number of daily accidents are discrete variables. This is in contrast with a continuous variable, which may take uncountably many real values in an interval of finite or infinite length (e.g., height of an individual, shelf life of a product). Categorical variables, on the other hand, are attributes representing a specific quality or feature having a finite number of levels or categories. For example, gender, marital status and race are categorical variables with a certain number of categories. Although many textbooks treat a categorical variable as a special type of discrete variable, one distinguishing characteristic of a categorical variable is that it does not take numerical values, but has a finite number of possible levels or categories.

A simple way of summarizing categorical data is through a one-way frequency table, listing the frequencies of observations in each category. So, if there are $L$ categories for a categorical variable, the frequency table will have the $L$ frequencies $f_1, \dots, f_L$, say, arranged in a row or column. Note that the sum of these frequencies $f_1 + \cdots + f_L$ is equal to $n$, the sample size. One may also present the data graphically through a bar diagram or pie chart.
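
As a rough illustration (not part of the original notes), the following Python sketch builds such a one-way frequency table and a bar diagram; the data and variable names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical observations on a categorical variable (marital status)
obs = ["single", "married", "married", "widowed", "single", "married"]

freq = pd.Series(obs).value_counts()   # one-way frequency table f_1, ..., f_L
print(freq)
assert freq.sum() == len(obs)          # the frequencies sum to the sample size n

freq.plot(kind="bar")                  # bar diagram of the frequencies
plt.ylabel("frequency")
plt.show()
```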

If a categorical variable has two possible categories, it is called binary. Binary outcomes are quite common in many disciplines. Gender is often a variable of interest in studies, with the categories 'male' and 'female'. In many questionnaires, 'yes' and 'no' are the only possible answers to an item. Even when no binary outcome is planned at the design stage, one may arise at later stages. For example, if an outcome of interest is subject to missingness, a binary variable may be created to indicate whether the outcome is missing (yes or no). In a life testing experiment with industrial products, one may terminate the testing after a certain time (say, one year), and a binary variable may then be defined to indicate the operating status of a product at the termination time. In general, one can suitably dichotomize a discrete or continuous variable to create a binary variable.

A categorical variable with more than two levels is also called a polychotomous variable. There are two common types of polychotomous variables. If the levels of a polychotomous variable are ordered, it is called an ordinal variable (e.g., education level, socio-economic status). In practice, natural numbers (e.g., 1 to 5) are often used to denote the ordinal levels. It is important to note that such a choice of numbers can be arbitrary; the numbers are rather treated as codes for the different levels. Ordinal variables often arise from discretizing a (latent) continuous variable, either because the original scale cannot be observed directly or for the purpose of simple modeling and interpretation. For example, age of an individual is a continuous variable, but it is often recorded (grouped) in intervals in many surveys, possibly with the interpretation of child, young, adult, senior, elderly, etc.

If there is no ordering in the levels, the categorical variable is called nominal; for example, gender, ethnicity, eye colour, blood group, etc., are all nominal variables. Sometimes whether a polychotomous variable is ordinal or nominal may depend on the study. For example, race is usually considered a nominal variable; but in a study of darkness of skin tone, race may become ordinal.

Discrete variables may have a finite or infinite range. In most practical examples, a discrete variable records the number of occurrences of an event of interest, such as accidents, heart attacks, suicide attempts, abortions, etc., and thus has a theoretical range that includes all natural numbers. Such variables are called count variables. Examples with a finite range are not that common; one can think of the number of daily admissions in a hospital with a finite number of beds.

2 Key Statistical Results

In this section, we review some fundamental concepts and techniques for statistical inference based on the data at hand and the assumed underlying model. The main objective is to estimate the unknown model parameters, or population quantities of interest, and/or carry out some relevant hypothesis testing involving the parameters. We also state some limiting results (as the sample size $n \to \infty$) that are used to derive properties of the estimates; in the Statistics literature, these are called 'asymptotic results'. They are covered in standard courses on statistical inference. However, to recapitulate, we describe the roles played by these concepts and results in the investigation of statistical properties of the estimated model parameters.

2.1 Maximum Likelihood Estimation
Suppose $X_1, \dots, X_n$ are IID with CDF $F(x;\theta)$ and PDF (or PMF) $f(x;\theta)$ (using a common notation for PDF and PMF). The corresponding observations are $x_1, \dots, x_n$. The likelihood function $L(\theta) = L(\theta \mid x_1, \dots, x_n)$ is a function of the unknown parameter(s) $\theta$ which is proportional to the probability of observing the data $x_1, \dots, x_n$ under the assumed model given by $F(x;\theta)$ or $f(x;\theta)$. Using some loose notation, we have
$$L(\theta) \propto \Pr\{X_1 = x_1, \dots, X_n = x_n\} = \prod_{i=1}^{n} \Pr\{X_i = x_i\} \approx \prod_{i=1}^{n} f(x_i;\theta).$$

The 'maximum likelihood estimate' (MLE) of $\theta$ is that value of $\theta$ which maximizes the likelihood function $L(\theta)$ w.r.t. $\theta$, assuming that there is a unique maximum. This is usually denoted by $\hat\theta$. It is common to work with the log-likelihood function $l(\theta) = \log_e L(\theta)$; maximizing $L(\theta)$ is equivalent to maximizing $l(\theta)$.

Example 1: $X_1, \dots, X_n$ are IID Bernoulli($p$) variables with $\theta = p$, so that $P[X_i = 1] = p$ and $P[X_i = 0] = 1 - p$. The corresponding observations are denoted by $x_1, \dots, x_n$. The likelihood function $L(p)$ can be written as
$$L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{\,n - \sum x_i},$$
so that the log-likelihood function is $l(p) = \left(\sum x_i\right)\log p + \left(n - \sum x_i\right)\log(1-p)$. Differentiating $l(p)$ w.r.t. $p$, we get
$$l'(p) = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = \frac{\sum x_i - np}{p(1-p)}.$$
Equating this to 0, we get the MLE of $p$, denoted by $\hat p$, as $\hat p = \sum x_i / n = \bar x$. Therefore, the maximum likelihood estimator is $\bar X$ and the MLE is $\bar x$. In order to ensure that this $\hat p$ maximizes the likelihood, we need to check the second derivative of $l(p)$, that is $l''(p)$, evaluated at $p = \hat p$. We see that
$$l''(p) = -\frac{\sum x_i}{p^2} - \frac{n - \sum x_i}{(1-p)^2} < 0,$$
so $\hat p$ is indeed the unique maximum.
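
As a quick numerical check (hypothetical 0/1 data, illustrative only), the sketch below maximizes the Bernoulli log-likelihood numerically and confirms that the maximizer agrees with $\bar x$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # hypothetical binary observations

def neg_loglik(p):
    # negative of l(p) = (sum x_i) log p + (n - sum x_i) log(1 - p)
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # the numerical maximizer agrees with p_hat = x_bar = 0.7
```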

H.W. (Example 2): $X_1, \dots, X_n$ are IID Poisson($\lambda$) variables with
$$P[X_i = x] = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \dots$$
The corresponding observations are $x_1, \dots, x_n$. Write down the likelihood and log-likelihood functions and obtain the MLE of $\lambda$ after checking the second derivative.

Example 3: $X_1, \dots, X_n$ are IID Normal($\mu, \sigma^2$) variables with corresponding observations $x_1, \dots, x_n$. Here $\theta = (\mu, \sigma^2)$ and the likelihood is
$$L(\mu, \sigma^2) \propto \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}} \propto \frac{1}{\sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2}.$$
The corresponding log-likelihood is given by
$$l(\mu, \sigma^2) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.$$
For the purpose of differentiating, it is convenient to write this as a function $l(\mu, \sigma)$ (that is, in terms of $\sigma$ instead of $\sigma^2$). Taking partial derivatives of $l(\mu, \sigma)$ w.r.t. $\mu$ and $\sigma$, we get
$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu), \qquad \frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\mu)^2.$$
Equating the above two equations (called 'likelihood equations') to 0, we get the MLEs of $\mu$ and $\sigma^2$ as $\hat\mu = \sum x_i / n = \bar x$ and $\hat\sigma^2 = \sum (x_i - \bar x)^2 / n$, respectively. We need to check that the corresponding $2 \times 2$ matrix of second order partial derivatives of $l(\mu, \sigma)$ w.r.t. $\mu$ and $\sigma$ is negative definite; we leave this as a H.W. since it is rather routine. Note that, while the MLE of $\mu$ is unbiased, that of $\sigma^2$ is not. An unbiased and commonly used estimate of $\sigma^2$ is $s^2 = \sum (x_i - \bar x)^2/(n-1)$.

H.W. (Example 4): Suppose $X_1, \dots, X_n$ are IID variables with the common Geometric distribution, denoted by Geom($p$), with the unknown parameter $p$ and the PMF given by
$$p_x = P[X = x] = (1-p)^{x-1}\, p, \quad x = 1, 2, \dots$$
First check that the sum of all the $p_x$'s is 1 and then find $E[X]$ and $V[X]$. Obtain the MLE of $p$ based on observations $x_1, \dots, x_n$.

Suppose, in Example 2 with the Poisson($\lambda$) model, $X_i$ represents the number of accidents in Kolkata on the $i$th day. Then the probability of no accident in Kolkata during the 4 puja days is $(e^{-\lambda})^4 = e^{-4\lambda}$. One may be interested in estimating this probability. Maximum likelihood estimation gives an easy solution to this problem through a property called 'invariance'. In general, if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$ for any differentiable function $g(\cdot)$. Therefore, in Example 2, since $\bar x$ is the MLE of $\lambda$, $e^{-4\bar x}$ is the MLE of $e^{-4\lambda}$; in particular, $e^{-\bar x}$ is the MLE of $e^{-\lambda}$, the probability of no accident on a single day.
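
A small numerical illustration of this invariance property, with hypothetical daily accident counts:

```python
import numpy as np

counts = np.array([2, 0, 1, 3, 1, 0, 2, 1])  # hypothetical daily accident counts
lam_hat = counts.mean()                       # MLE of lambda

print(np.exp(-4 * lam_hat))  # MLE of exp(-4*lambda): P(no accident over 4 days)
print(np.exp(-lam_hat))      # MLE of exp(-lambda): P(no accident on a single day)
```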

H.W.: Suppose $X_1, \dots, X_n$ are IID Geom($p$) variables. Obtain the MLE of $p$ and also that of the probability $P[X > 5]$. Apply these with the sibling data of your batch.

2.2 Information
Consider a model given by a PMF or PDF, $f_\theta(\cdot)$, with the associated model parameter(s) $\theta$. Then the 'information', denoted by $i(\theta)$, in a random observation on $X$ following the distribution $f_\theta(\cdot)$ is defined as
$$i(\theta) = -E\left[\frac{\partial^2 \log f_\theta(X)}{\partial\theta\,\partial\theta^T}\right] = E\left[\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)\left(\frac{\partial \log f_\theta(X)}{\partial\theta}\right)^{T}\right].$$
For scalar $\theta$, we have
$$i(\theta) = -E\left[\frac{d^2 \log f_\theta(X)}{d\theta^2}\right] = E\left[\left(\frac{d \log f_\theta(X)}{d\theta}\right)^{2}\right].$$
So the information content in $n$ IID random observations from $f_\theta(\cdot)$ is $I(\theta) = n \times i(\theta)$, which is equal to
$$-E\left[\frac{\partial^2 \log L(\theta)}{\partial\theta\,\partial\theta^T}\right] \text{ for vector } \theta, \qquad \text{and} \qquad -E\left[\frac{d^2 \log L(\theta)}{d\theta^2}\right] \text{ for scalar } \theta.$$
Suppose $X$ follows a Bernoulli($p$) distribution. Then the PMF is given by $f_\theta(x) = f_p(x) = p^x(1-p)^{1-x}$, for $x = 0, 1$. Check that
$$-\frac{d^2 \log f_p(X)}{dp^2} = \frac{X}{p^2} + \frac{1-X}{(1-p)^2}.$$
Noting that $E[X] = p$, the information in a single observation from Bernoulli($p$) is $i(p) = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}$, so that the information from $n$ such IID observations is $\frac{n}{p(1-p)}$.
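
For readers who want to verify this computation symbolically, here is a small sketch using sympy (one possible tool choice, not part of the course material); it differentiates the Bernoulli log-PMF twice and substitutes $E[X] = p$, which is valid because the expression is linear in $X$.

```python
import sympy as sp

p, x = sp.symbols("p x", positive=True)
logf = x * sp.log(p) + (1 - x) * sp.log(1 - p)   # log f_p(x) for Bernoulli(p)

d2 = sp.diff(logf, p, 2)                         # second derivative w.r.t. p
info = sp.simplify(-d2.subs(x, p))               # take expectation via E[X] = p
print(info)                                      # an expression equal to 1/(p*(1 - p))
```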

H.W.: Obtain i(p) from the Bin(n, p) and Geom(p) distributions.

You will find that $i(p)$ from a single Bin($n, p$) random variable is the same as $\frac{n}{p(1-p)}$, the information from $n$ IID Bernoulli($p$) random variables. Note that, when $X_1, \dots, X_n$ are IID Bernoulli($p$) random variables, the sum $X_1 + \cdots + X_n$ has the Bin($n, p$) distribution. So the equality of the information from $n$ IID Bernoulli($p$) random variables with that from a single Bin($n, p$) random variable tells us that there is equal information on $p$ in either the $n$ IID Bernoulli($p$) random variables or their sum. That is, there is 'sufficient' information on $p$ in the sum $X_1 + \cdots + X_n$, equal to that in having $X_1, \dots, X_n$ individually.

H.W.: Obtain $i(\lambda)$ for the Poisson($\lambda$) distribution.

H.W.: Obtain $i(\mu, \sigma)$ for the $N(\mu, \sigma^2)$ distribution.

2.3 Some Limiting Results


In this section, we state some limiting results (in the Statistics literature, 'asymptotic results') as the sample size $n$ tends to infinity. These asymptotic results can be used as approximations for a large enough sample size $n$, to carry out what is called approximate inference.

Convergence in probability: Let us write a statistic $T = T(X_1, \dots, X_n)$ based on $(X_1, \dots, X_n)$ as $T = T_n$ to indicate the dependence on $n$ explicitly. A statistic $T_n$ is said to 'converge to $g(\theta)$ in probability' if, for any $\epsilon > 0$,
$$P[\,|T_n - g(\theta)| > \epsilon\,] \to 0 \text{ as } n \to \infty,$$
denoted by $T_n \xrightarrow{P} g(\theta)$. In this case, we also say that the estimator $T_n$ is weakly consistent for $g(\theta)$. Consistency is a desirable property of an estimator.

Result: The MLE $\hat\theta = \hat\theta_n$ converges in probability to the true value of $\theta$, or is weakly consistent for the true value of $\theta$, under some standard regularity conditions. Note that, for a differentiable function $g(\theta)$ of $\theta$, the MLE $g(\hat\theta_n)$ is also weakly consistent for $g(\theta)$.

Convergence in distribution: Consider a sequence $\{T_n\}_{n=1}^{\infty}$ of statistics such that the sampling distribution of $T_n$ has CDF $F_n$. We say that $T_n$ converges in distribution to a random variable $T$ (denoted by $T_n \xrightarrow{d} T$) with CDF $F$ if $F_n(x) \to F(x)$ as $n \to \infty$ for all $x$ where $F$ is continuous. We then say that $T_n$ asymptotically (as $n \to \infty$) follows the distribution $F$.

Central Limit Theorem (CLT): Suppose $X_1, \dots, X_n$ are IID with finite mean $\mu$ and finite variance $\sigma^2$. Then
$$\frac{\sqrt{n}\,(\bar X_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1).$$
Recall that $E[\bar X_n] = \mu$ and $V[\bar X_n] = \sigma^2/n$. Therefore, $\frac{\bar X_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma}$ is the 'standardization' of a random variable. This, in turn, means
$$\bar X_n - \mu \xrightarrow{d} N\!\left(0, \frac{\sigma^2}{n}\right).$$

Asymptotic results for the MLE $\hat\theta$: Let us write $u(\theta) = \frac{\partial \log L(\theta)}{\partial\theta} = \frac{\partial l(\theta)}{\partial\theta}$, called the 'score vector'. Let us also write
$$I(\theta) = -\frac{\partial^2 \log L(\theta)}{\partial\theta\,\partial\theta^T}.$$
Evaluated at $\theta = \hat\theta$, $I(\hat\theta)$ is called the 'observed information matrix', as against the information matrix $I(\theta)$, which is often called the 'expected information matrix'.

Result: The observed information matrix $I(\hat\theta)$ converges in probability to the expected information matrix $I(\theta)$, or is a weakly consistent estimate of $I(\theta)$, under some standard conditions.

In the following, we state a set of three asymptotic results which are useful for making 'asymptotic' inference on $\theta$.

Result-1:
$$u(\theta) \xrightarrow{d} N(0, I(\theta)),$$
under some standard conditions. When $\theta$ is a scalar, this is equivalent to the result
$$\frac{u(\theta)}{\sqrt{I(\theta)}} \xrightarrow{d} N(0, 1).$$
While using this result, we replace $I(\theta)$ by $I(\hat\theta)$. So the approximate 95% confidence interval (CI) for $\theta$ can be obtained as
$$\left\{\theta : -1.96 \le \frac{u(\theta)}{\sqrt{I(\hat\theta)}} \le 1.96\right\},$$
where 1.96 is the 97.5th percentile of the standard normal distribution $N(0, 1)$, denoted by $z_{0.975}$.

Result-2:
$$\hat\theta \xrightarrow{d} N(\theta, I^{-1}(\theta)),$$
under some standard conditions. This implies that, for a scalar $\theta$, the approximate standard error of $\hat\theta$ is $\sqrt{I^{-1}(\hat\theta)} = I^{-1/2}(\hat\theta)$. For vector $\theta$, the diagonal entries of $I^{-1}(\theta)$ give the asymptotic variances of the corresponding components of $\hat\theta$; the corresponding standard errors (SEs) are obtained by evaluating the diagonal entries of $I^{-1}(\hat\theta)$ and then taking positive square roots. As in the case of Result-1, the approximate 95% CI for scalar $\theta$ can be obtained from
$$\left\{\theta : -1.96 \le \frac{\hat\theta - \theta}{\sqrt{I^{-1}(\hat\theta)}} \le 1.96\right\},$$
which is equivalent to
$$\hat\theta - 1.96\sqrt{I^{-1}(\hat\theta)} \;\le\; \theta \;\le\; \hat\theta + 1.96\sqrt{I^{-1}(\hat\theta)},$$
or
$$\hat\theta \pm 1.96\sqrt{I^{-1}(\hat\theta)} \equiv \hat\theta \pm 1.96 \times SE(\hat\theta).$$

Note that, for scalar $\theta$ and an unbiased estimator $T$ of $\theta$, $I^{-1}(\theta)$ gives a lower bound for the variance of $T$. This is known as the 'Cramér-Rao lower bound' (CRLB). So, if the variance of an unbiased estimator attains this lower bound, it is the most efficient among all unbiased estimators. Since the MLE $\hat\theta$ is consistent (or asymptotically unbiased) and its asymptotic variance is given by the CRLB $I^{-1}(\theta)$, it is asymptotically most efficient.

Result-3: Define the 'likelihood ratio' as
$$LR(\theta) = \frac{L(\theta)}{L(\hat\theta)},$$
so that $LR(\theta)$ lies between 0 and 1. Then
$$-2\log LR(\theta) \xrightarrow{d} \chi^2_{(p)},$$
under some standard conditions, where $p$ is the dimension of $\theta$. The approximate 95% CI for scalar $\theta$ can be obtained as
$$\left\{\theta : -2\log LR(\theta) \le \chi^2_{(1, 0.95)}\right\},$$
where $\chi^2_{(1, 0.95)} = 3.84$ is the 95th percentile of the $\chi^2_{(1)}$ distribution.

Let us consider Example 2 with IID $X_1, \dots, X_n$ from the Poisson($\lambda$) distribution and observations $x_1, \dots, x_n$. Check that
$$L(\lambda) \propto e^{-n\lambda}\lambda^{\sum x_i}, \qquad l(\lambda) = -n\lambda + \left(\sum x_i\right)\log\lambda,$$
$$u(\lambda) = -n + \frac{\sum x_i}{\lambda}, \qquad I(\lambda) = \frac{\sum x_i}{\lambda^2}.$$
Solving $u(\lambda) = 0$, we get $\hat\lambda = \frac{\sum x_i}{n} = \bar x$. Then we have the observed information $I(\hat\lambda) = \frac{\sum x_i}{\bar x^2} = \frac{n}{\bar x}$, so that $I^{-1}(\hat\lambda) = \frac{\bar x}{n}$ and $SE(\hat\lambda) = \sqrt{\frac{\bar x}{n}}$.

Suppose we wish to obtain the approximate 95% CI for $\lambda$. Using Result-1, we have the approximate 95% CI as
$$\left\{\lambda : -1.96 \le \sqrt{\frac{\bar x}{n}}\left(-n + \frac{n\bar x}{\lambda}\right) \le 1.96\right\}.$$

It is left as an exercise (H.W.) to convert this into an interval for $\lambda$. Based on Result-2, we have the approximate 95% CI as
$$\hat\lambda \pm 1.96 \times SE(\hat\lambda) \equiv \bar x \pm 1.96\sqrt{\frac{\bar x}{n}}.$$
In order to use Result-3, note that the likelihood ratio is given by
$$LR(\lambda) = \frac{e^{-n\lambda}\lambda^{n\bar x}}{e^{-n\bar x}\bar x^{n\bar x}},$$
which gives (H.W.)
$$-2\log LR(\lambda) = 2\left[n\bar x(\log\bar x - \log\lambda) - n(\bar x - \lambda)\right].$$
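
The three results can be turned into intervals numerically. The sketch below (hypothetical counts; the root-finding brackets are arbitrary choices) computes the Result-2 (Wald) interval in closed form and solves the Result-1 (score) and Result-3 (likelihood-ratio) inequalities for their endpoints.

```python
import numpy as np
from scipy.optimize import brentq

x = np.array([2, 0, 1, 3, 1, 0, 2, 4, 1, 2])   # hypothetical counts
n, xbar = len(x), x.mean()

# Result-2 (Wald): x_bar +/- 1.96 * sqrt(x_bar / n)
wald = (xbar - 1.96 * np.sqrt(xbar / n), xbar + 1.96 * np.sqrt(xbar / n))

# Result-1 (score): solve u(lam) / sqrt(I(lam_hat)) = +/- 1.96,
# with u(lam) = -n + n*xbar/lam and I(lam_hat) = n / xbar
score_stat = lambda lam: (-n + n * xbar / lam) / np.sqrt(n / xbar)
score = (brentq(lambda l: score_stat(l) - 1.96, 1e-8, xbar),
         brentq(lambda l: score_stat(l) + 1.96, xbar, 100.0))

# Result-3 (LR): solve -2 log LR(lam) = 3.84 on either side of x_bar
lr_stat = lambda lam: 2 * (n * xbar * (np.log(xbar) - np.log(lam))
                           - n * (xbar - lam)) - 3.84
lr = (brentq(lr_stat, 1e-8, xbar), brentq(lr_stat, xbar, 100.0))

print(wald, score, lr)
```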

It may be of interest to get a CI for $e^{-\lambda}$, the probability of zero events. One can easily convert the CI for $\lambda$ into the corresponding CI for $e^{-\lambda}$. For example, the approximate 95% CI $\bar x \pm 1.96\sqrt{\bar x/n}$ for $\lambda$ can be converted into the approximate 95% CI
$$\left(e^{-\left(\bar x + 1.96\sqrt{\bar x/n}\right)},\; e^{-\left(\bar x - 1.96\sqrt{\bar x/n}\right)}\right)$$
for $e^{-\lambda}$.

There is another useful result, called the 'delta method', for obtaining the asymptotic distribution of an estimator of a parametric function $g(\theta)$ of $\theta$, for any differentiable function $g(\cdot)$. Note that, by the invariance property of the MLE, the MLE of $g(\theta)$ is $g(\hat\theta)$. The delta method states that
$$g(\hat\theta_n) - g(\theta) \xrightarrow{d} N(0, V(\theta)),$$
under some standard conditions, where
$$V(\theta) = \left(\frac{\partial g(\theta)}{\partial\theta}\right)^{T} I^{-1}(\theta) \left(\frac{\partial g(\theta)}{\partial\theta}\right).$$
For scalar $\theta$, this $V(\theta)$ becomes $(g'(\theta))^2 I^{-1}(\theta)$, which can be consistently estimated by $(g'(\hat\theta))^2 I^{-1}(\hat\theta)$. So the approximate SE of $g(\hat\theta)$ is given by $|g'(\hat\theta)|\, I^{-1/2}(\hat\theta)$ and the approximate 95% CI for $g(\theta)$ is $g(\hat\theta) \pm 1.96 \times |g'(\hat\theta)|\, I^{-1/2}(\hat\theta)$.

The CRLB result extends to estimators of $g(\theta)$ as well. In particular, the variance of any unbiased estimator of $g(\theta)$ has the CRLB given by $(g'(\theta))^2 I^{-1}(\theta)$. Since the MLE $g(\hat\theta)$ is asymptotically unbiased for $g(\theta)$ and has asymptotic variance given by $(g'(\theta))^2 I^{-1}(\theta)$, it is asymptotically the most efficient estimator of $g(\theta)$, as before.

H.W.: Consider Example 2 with the Poisson distribution and $g(\lambda) = e^{-\lambda}$. Obtain the approximate 95% CI for $g(\lambda)$ using the delta method and compare it with the one obtained above using Result-2. They may not be algebraically the same; but, for a given data set, they will be quite similar for large $n$. Work this out with the number of siblings data from your class.

Slutsky's Theorem: Consider two sequences of random variables $\{X_n\}$ and $\{Y_n\}$ for $n \ge 1$ such that $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$, where $X$ is the limiting random variable of $\{X_n\}$ having some limiting distribution and $c$ is a constant. Then,
1. $X_n + Y_n \xrightarrow{d} X + c$.
2. $X_n Y_n \xrightarrow{d} cX$.
3. $X_n / Y_n \xrightarrow{d} X / c$, provided $c \ne 0$.

3 Analysis of Binary Data: Bernoulli Distribution

Suppose $X_1, \dots, X_n$ are $n$ IID Bernoulli($p$) random variables with
$$P[X_i = 1] = p \quad \text{and} \quad P[X_i = 0] = 1 - p,$$
for some $0 < p < 1$. The sample observations $x_1, \dots, x_n$ take values 1 or 0. Each $X_i$ corresponds to a binary trial having two possible outcomes (Yes/No, Success/Failure, etc.). Note that $E[X_i] = p$, the population mean of the underlying random variable, which is the same as the proportion of 1's in the population. We are usually interested in the parameter $p$. Also, $V[X_i] = p(1-p)$.

Note that $Y = X_1 + \cdots + X_n$ gives the total number of 1's in the $n$ independent binary trials. This $Y$ is known to have a Binomial($n, p$) distribution with
$$P[Y = y] = \binom{n}{y} p^y (1-p)^{n-y}, \quad y = 0, 1, \dots, n.$$
The maximum likelihood estimate of $p$ is given by
$$\hat p = \frac{x_1 + \cdots + x_n}{n} = \bar x = \frac{y}{n},$$

where $y$ is the observed value of $Y$. The variance of $\hat p$, or $\bar x$, under the Bernoulli($p$) model, is given by $V(\hat p) = \frac{p(1-p)}{n}$. By the CLT, we have
$$\hat p - p \xrightarrow{d} N\!\left(0, \frac{p(1-p)}{n}\right).$$
This variance $V(\hat p)$ can be estimated by $\hat p(1-\hat p)/n$, so the standard error of $\hat p$ is $\sqrt{\hat p(1-\hat p)/n}$. Using the normal approximation of Result-2 for the MLE, an approximate 95% confidence interval for $p$ is given by
$$\left[\hat p - 1.96\sqrt{\hat p(1-\hat p)/n},\; \hat p + 1.96\sqrt{\hat p(1-\hat p)/n}\right].$$

Now suppose we wish to test the null hypothesis $H_0: p = p_0$, where $p_0$ is a known fixed value of $p$. The alternative hypothesis could be two-sided, $H_1: p \ne p_0$, or one-sided, $H_1: p > p_0$ or $H_1: p < p_0$. For an approximate test, we consider the test statistic
$$T = \frac{\hat p - p_0}{\sqrt{p_0(1-p_0)/n}}.$$
Using the CLT, under $H_0$, the distribution of $T$ has a standard normal approximation for large $n$. Note that this result can also be obtained from Result-2 for the MLE while replacing $I(p)$ by $I(p_0)$; it is then known as the Wald test.

For the two-sided alternative $H_1: p \ne p_0$, we reject the null hypothesis $H_0: p = p_0$ if $|T| > 1.96$. Similarly, for the one-sided alternative $H_1: p > p_0$, we reject $H_0$ if $T > 1.645$, where 1.645 is the 95th percentile $z_{0.95}$ of the $N(0, 1)$ distribution. Also, for the one-sided alternative $H_1: p < p_0$, we reject $H_0$ if $T < -1.645$.

As an example, consider the following data on adverse effects of a COVID-19 vaccine. On a particular day, 464 individuals were vaccinated in a health centre. Of these, 9 individuals showed some adverse effect within the next 48 hours. We intend to use these data to make inference on the probability of an adverse effect within 48 hours of vaccination. We write the outcomes of the 464 individuals with respect to showing the adverse effect as $n = 464$ Bernoulli($p$) random variables $X_1, \dots, X_n$, where $p$ denotes the probability of an adverse effect within 48 hours of vaccination. Clearly, the observed value of $Y$ is $y = 9$. So we have $\hat p = y/n = 9/464 = 0.0194$, or about 2 in 100.

The corresponding standard error is
$$\sqrt{\hat p(1-\hat p)/n} = \sqrt{0.0194(1 - 0.0194)/464} = 0.0064.$$
Therefore, an approximate 95% confidence interval for $p$ is given by
$$[0.0194 - 1.96 \times 0.0064,\; 0.0194 + 1.96 \times 0.0064] = [0.0068, 0.0319].$$

Now suppose the manufacturer of the vaccine claims that the chance of such an adverse effect is 1 in 100, that is, 0.01. In order to test this claim, the null hypothesis is $H_0: p = p_0 = 0.01$. The test statistic $T$ takes the value
$$T = \frac{0.019 - 0.01}{\sqrt{0.01 \times 0.99/464}} = \frac{0.009}{0.0046} = 1.9565.$$
This is right on the margin of the two-sided critical value 1.96, so one could take a decision either way. However, from the public health point of view, we are interested in the one-sided alternative $H_1: p > p_0 = 0.01$. Since $T > 1.645$, we reject the null hypothesis $H_0: p = p_0 = 0.01$.
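
The calculations in this example are easy to reproduce in code; the sketch below uses the same $n = 464$ and $y = 9$. Note that it works with the unrounded $\hat p$, so the test statistic comes out slightly larger than the rounded value 1.9565 above.

```python
import numpy as np
from scipy.stats import norm

n, y = 464, 9
p_hat = y / n                                   # MLE: about 0.0194
se = np.sqrt(p_hat * (1 - p_hat) / n)           # standard error: about 0.0064
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)     # approximate 95% CI for p

p0 = 0.01                                        # manufacturer's claimed value
T = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)    # about 2.03 with unrounded p_hat
p_one_sided = 1 - norm.cdf(T)                    # p-value for H1: p > p0
print(p_hat, se, ci, T, p_one_sided)
```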

4 Analysis of Count Data: Poisson Distribution

Here $X_1, \dots, X_n$ represent the IID random variables corresponding to the $n$ count observations $x_1, \dots, x_n$ in the sample. An appropriate model for count data is the Poisson($\lambda$) distribution, given by the probability mass function
$$P[X = x] = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \dots,$$
where $X$ has the same distribution as the $X_i$'s. The parameter $\lambda > 0$ is of interest and represents the population mean of $X$. The population variance of $X$ is also $\lambda$; that is, $E[X] = V[X] = \lambda$.

The maximum likelihood estimate of $\lambda$ is given by $\hat\lambda = \bar x = (x_1 + \cdots + x_n)/n$. The variance of this estimate, under the Poisson($\lambda$) model, is
$$V[\hat\lambda] = V[\bar x] = \frac{\lambda}{n},$$
which can be estimated by $\bar x/n$. So the standard error of $\hat\lambda$ is $\sqrt{\bar x/n}$. As before, an approximate 95% confidence interval for $\lambda$ is given by
$$\left[\hat\lambda - 1.96\sqrt{\bar x/n},\; \hat\lambda + 1.96\sqrt{\bar x/n}\right].$$
This can also be obtained using Result-2 for the MLE.

In order to test some null hypothesis $H_0: \lambda = \lambda_0$, for some fixed value $\lambda_0$ of $\lambda$, we consider the test statistic
$$T = \frac{\hat\lambda - \lambda_0}{\sqrt{\lambda_0/n}}.$$
As before, using the CLT, this statistic $T$ has approximately a standard normal distribution under $H_0$. So we reject the null hypothesis $H_0: \lambda = \lambda_0$ against the two-sided alternative $H_1: \lambda \ne \lambda_0$ if $|T| > 1.96$. For the one-sided alternative $H_1: \lambda > \lambda_0$, we reject $H_0$ if $T > 1.645$; and, for the one-sided alternative $H_1: \lambda < \lambda_0$, we reject $H_0$ if $T < -1.645$.

To illustrate, let us consider the following data on the number of accidents on 20 randomly selected days in a city:
1 1 0 2 2 2 3 3 4 1 0 3 3 1 3 1 0 4 3 2
So we have $n = 20$ and the MLE $\hat\lambda = \bar x = 1.95$. Based on these data, the estimated expected number of daily accidents in this city is 1.95. It may be of interest to estimate the probability of no accident on a typical day. Under the Poisson($\lambda$) model, this probability is given by $P[X = 0] = e^{-\lambda}$, which can be estimated by $e^{-1.95} = 0.142$, using the invariance property of the MLE.

The standard error of the estimate $\hat\lambda = 1.95$ of $\lambda$ is $\sqrt{\bar x/n} = \sqrt{1.95/20} = 0.312$. Therefore, an approximate 95% confidence interval for $\lambda$ is
$$[1.95 - 1.96 \times 0.312,\; 1.95 + 1.96 \times 0.312] = [1.338, 2.562].$$

One may be interested in the standard error of the estimate $e^{-1.95} = 0.142$ of the probability of no accident on a day. For this purpose, that is, to obtain the standard error of the estimate of a function $g(\lambda)$ of $\lambda$, the delta method gives that standard error as
$$\left|\frac{\partial g(\lambda)}{\partial\lambda}\right| \sqrt{\bar x/n},$$
evaluated at $\lambda = \hat\lambda$. For $g(\lambda) = e^{-\lambda}$, this standard error comes out to be $0.142 \times 0.312 = 0.044$, with the corresponding approximate 95% confidence interval
$$[0.142 - 1.96 \times 0.044,\; 0.142 + 1.96 \times 0.044] = [0.056, 0.228].$$
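
A short sketch reproducing the interval estimates of this example from the 20 accident counts, including the delta-method standard error for $e^{-\lambda}$.

```python
import numpy as np

x = np.array([1, 1, 0, 2, 2, 2, 3, 3, 4, 1, 0, 3, 3, 1, 3, 1, 0, 4, 3, 2])
n, lam_hat = len(x), x.mean()                  # n = 20, lambda_hat = 1.95

se_lam = np.sqrt(lam_hat / n)                  # about 0.312
ci_lam = (lam_hat - 1.96 * se_lam, lam_hat + 1.96 * se_lam)   # about [1.338, 2.562]

g_hat = np.exp(-lam_hat)                       # estimate of P[X = 0], about 0.142
se_g = g_hat * se_lam                          # delta method: |g'(lam_hat)| * SE(lam_hat)
ci_g = (g_hat - 1.96 * se_g, g_hat + 1.96 * se_g)   # close to [0.056, 0.228] above
print(ci_lam, g_hat, se_g, ci_g)
```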

Suppose now the city authority claims that there is, on average, one accident daily; that is, the null hypothesis is $H_0: \lambda = \lambda_0 = 1$. To test this hypothesis, the value of the test statistic $T$ is
$$T = \frac{1.95 - 1}{\sqrt{1/20}} = \sqrt{20} \times 0.95 = 4.25.$$
So we clearly reject this $H_0$ against the two-sided alternative, and also against the one-sided alternative $H_1: \lambda > 1$.

Similarly, suppose we wish to test the null hypothesis that the probability of no accident on a day is 0.2. Note that, under $H_0: \lambda = \lambda_0 = 1$, the variance of $e^{-\hat\lambda}$ is, using the delta method,
$$(e^{-\lambda_0})^2 \times \frac{\lambda_0}{n} = \frac{e^{-2}}{20} = 0.00677.$$
So we consider the test statistic
$$T = \frac{0.142 - 0.2}{\sqrt{0.00677}} = \frac{-0.058}{0.0823} = -0.7047.$$
Since $|T| = 0.7047 < 1.96$, we cannot reject this null hypothesis against the two-sided alternative.

5 Analysis of Frequency Data: Multinomial Distribution

Categorical data with $L \ge 2$ categories can be described by the frequencies of the observations in the different levels of the corresponding variable. Then, based on observations $x_1, \dots, x_n$ from $n$ independent individuals ($x_i$ denoting the category that the $i$th individual belongs to), we have the frequencies $f_1, \dots, f_L$, where $f_j$ is the frequency, or number of individuals (out of $n$), in the $j$th category, for $j = 1, \dots, L$. Note that $f_1 + \cdots + f_L = n$. Write $p_j$ for the probability of being in the $j$th category, so that $P[X_i = j] = p_j$, for $j = 1, \dots, L$. Clearly, $\sum_{j=1}^{L} p_j = 1$. Therefore, only $(L-1)$ of the $p_j$'s are independent parameters, since the $L$th one can be obtained by subtracting from 1. For the same reason, there are only $(L-1)$ independent frequencies $f_j$, since the $L$th one can be obtained by subtracting from $n$. Nevertheless, the vector $(f_1, \dots, f_L)$ is treated as the random vector of interest and is said to follow a multinomial distribution, denoted by Multinomial($n; p_1, \dots, p_L$), with
$$P[f_1, \dots, f_L] = \frac{n!}{\prod_{j=1}^{L} f_j!} \prod_{j=1}^{L} p_j^{f_j}.$$
For this model, we have $E[f_j] = np_j$ and $V[f_j] = np_j(1-p_j)$; also, for $j \ne j'$, $\mathrm{Cov}[f_j, f_{j'}] = -np_j p_{j'}$. Check that, if one is interested in only a particular frequency, say $f_1$, then $f_1$ has a Binomial($n, p_1$) distribution. Also, for $L = 2$, the multinomial distribution reduces to a binomial distribution.

One is usually interested in the cell probabilities $p_j$. The maximum likelihood estimate of $p_j$ is $\hat p_j = f_j/n$, with $V(\hat p_j) = p_j(1-p_j)/n$, for $j = 1, \dots, L$. So the standard error of $\hat p_j$ is $\sqrt{\hat p_j(1-\hat p_j)/n}$. As before, one can obtain an approximate 95% confidence interval for $p_j$ as
$$\hat p_j \pm 1.96\sqrt{\hat p_j(1-\hat p_j)/n}.$$
Also, to test a specific value $p_{j0}$ of $p_j$, one can consider the test statistic
$$T = \frac{\hat p_j - p_{j0}}{\sqrt{p_{j0}(1-p_{j0})/n}}.$$

The rejection rules against one- or two-sided alternatives are similar to those discussed in the previous sections.

Suppose one wishes to test $H_0: p_j = c_j$, for $j = 1, \dots, L$, where the $c_j$'s are known non-negative constants satisfying $\sum_{j=1}^{L} c_j = 1$. Under $H_0$, we have $E[f_j] = nc_j = e_j$, say, for $j = 1, \dots, L$. Then the statistic
$$T = \sum_{j=1}^{L} \frac{(f_j - e_j)^2}{e_j}$$
asymptotically follows a chi-square distribution with $(L-1)$ degrees of freedom, denoted by $\chi^2(L-1)$. We reject the null hypothesis $H_0$ if the observed value of $T$ is greater than the 95th percentile of the $\chi^2(L-1)$ distribution.

This particular result can be used, in general, to test the goodness of fit of a particular model for sampled data. Suppose we have observations $x_1, \dots, x_n$ from a particular model given by the CDF $F(\cdot;\theta)$ with associated parameter(s) $\theta$. The first step is to obtain a consistent estimate (say, the MLE) of $\theta$, denoted by $\hat\theta$. The idea is to partition the range of the corresponding random variable $X$ into, say, $L$ mutually exclusive and exhaustive parts and count the frequencies $f_1, \dots, f_L$ of the $n$ observations falling into the $L$ parts. At the same time, we derive the probabilities of an observation falling into the $L$ parts, say $p_1, \dots, p_L$, under the assumed model $F(\cdot;\theta)$. Supposing that the $L$ parts are constructed as the intervals $I_j = (\tau_{j-1}, \tau_j]$, for $j = 1, \dots, L-1$, with $\tau_0 = -\infty$, and $I_L = (\tau_{L-1}, \infty)$, we have $p_j = P[X \in I_j] = F(\tau_j;\theta) - F(\tau_{j-1};\theta)$. These probabilities are estimated, using the invariance property of the MLE, by $\hat p_j = F(\tau_j;\hat\theta) - F(\tau_{j-1};\hat\theta)$, for $j = 1, \dots, L$. It is an easy exercise to check that $\sum_{j=1}^{L} p_j = \sum_{j=1}^{L} \hat p_j = 1$. In particular, the vector of frequencies $(f_1, \dots, f_L)$ follows a Multinomial($n; p_1, \dots, p_L$) distribution. We first compute the expected frequencies $e_j = n\hat p_j$, for $j = 1, \dots, L$, under the assumed model and then, as before, the test statistic
$$T = \sum_{j=1}^{L} \frac{(f_j - e_j)^2}{e_j}.$$
This test statistic $T$ asymptotically follows a $\chi^2(L - p - 1)$ distribution under the assumed model $F(\cdot;\theta)$, where $p$ is the dimension of the parameter vector $\theta$. We reject the null hypothesis that $F(\cdot;\theta)$ is the true model if the observed value of $T$ is greater than the 95th percentile of the $\chi^2(L - p - 1)$ distribution.
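
A sketch of this goodness-of-fit recipe for a fitted Poisson model, using hypothetical counts and an arbitrary partition of the range into $L = 4$ parts $\{0\}, \{1\}, \{2\}, \{\ge 3\}$.

```python
import numpy as np
from scipy.stats import poisson, chi2

x = np.array([0, 1, 2, 1, 0, 3, 2, 1, 4, 0, 1, 2, 2, 1, 0, 5, 1, 2, 0, 1])  # hypothetical
n, lam_hat = len(x), x.mean()                    # step 1: MLE of the Poisson mean

f = np.array([np.sum(x == 0), np.sum(x == 1), np.sum(x == 2), np.sum(x >= 3)])
p_hat = np.array([poisson.pmf(0, lam_hat), poisson.pmf(1, lam_hat),
                  poisson.pmf(2, lam_hat), poisson.sf(2, lam_hat)])  # fitted cell probs
e = n * p_hat                                    # expected frequencies under the fit

T = ((f - e) ** 2 / e).sum()                     # chi-square goodness-of-fit statistic
crit = chi2.ppf(0.95, df=len(f) - 1 - 1)         # df = L - p - 1, with p = 1 here
print(T, crit)                                   # reject the Poisson model if T > crit
```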

For example, if we are interested in the severity of symptoms of a particular disease in individual patients, the symptoms can be broadly graded as Low, Medium and Severe. Then, based on observations $x_1, \dots, x_n$ from $n$ individuals ($x_i$ giving the symptom grade of the $i$th individual), we have the frequencies $f_1, f_2, f_3$, where $f_j$ is the number of individuals (out of $n$) with grade $j$, for $j = 1, 2, 3$, representing the grades Low, Medium and Severe, respectively. Note that $f_1 + f_2 + f_3 = n$. Writing $p_j$ for the probability of being in grade $j$, for $j = 1, 2, 3$, with $p_1 + p_2 + p_3 = 1$, we have the Multinomial($n; p_1, p_2, p_3$) model for the frequencies $f_1, f_2, f_3$.

Suppose, after collecting data from 100 patients, we find $f_1 = 32$, $f_2 = 45$ and $f_3 = 23$. Then we have $\hat p_1 = 0.32$, $\hat p_2 = 0.45$ and $\hat p_3 = 0.23$, with the corresponding standard errors $\sqrt{0.32 \times 0.68/100} = 0.047$, $0.050$ and $0.042$, respectively. It is left as an exercise to obtain the approximate 95% CIs for $p_1, p_2, p_3$; also, using a similar method, test $p_3 = 0.3$ against a two-sided alternative. Suppose we wish to test $H_0: p_1 = 0.3, p_2 = 0.5, p_3 = 0.2$. Then we first obtain $e_1 = 100 \times 0.3 = 30$, $e_2 = 50$, $e_3 = 20$, so that
$$T = \frac{(32-30)^2}{30} + \frac{(45-50)^2}{50} + \frac{(23-20)^2}{20} = \frac{4}{30} + \frac{25}{50} + \frac{9}{20} = 1.0833.$$
This is much less than 5.991, the 95th percentile of the $\chi^2_{(2)}$ distribution. So we cannot reject $H_0$.
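
The chi-square computation in this example can be reproduced as follows; scipy's chisquare call is just a convenient way of evaluating $T$ and the corresponding p-value.

```python
import numpy as np
from scipy.stats import chisquare, chi2

f = np.array([32, 45, 23])            # observed frequencies, n = 100
c = np.array([0.3, 0.5, 0.2])         # hypothesized cell probabilities under H0
e = f.sum() * c                       # expected frequencies: 30, 50, 20

T, pval = chisquare(f, e)             # T = sum (f_j - e_j)^2 / e_j = 1.0833
print(T, pval, chi2.ppf(0.95, df=2))  # compare T with the 95th percentile 5.991
```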

6 Analysis of Over-dispersed Count Data: Negative Binomial Distribution

Although the Poisson distribution is commonly used for modeling count data, it is also quite restrictive in the sense that, under this model, the mean and variance must be equal. In practice, however, it often happens that the sample data exhibit evidence of the sample variance being larger than the sample mean. This phenomenon is known as over-dispersion. A commonly used model for such over-dispersed count data is the negative binomial distribution. This distribution is derived from a Poisson($\lambda$) model for the count variable $X$, where the parameter $\lambda$ is itself a random variable following a gamma distribution. More specifically, let $\lambda$ follow a Gamma($\alpha, \gamma$) distribution with rate parameter $\alpha > 0$ and shape parameter $\gamma > 0$, having density
$$g(\lambda; \alpha, \gamma) = \frac{\alpha^\gamma}{\Gamma(\gamma)}\, e^{-\alpha\lambda}\, \lambda^{\gamma - 1},$$

for $\lambda > 0$. One can then show that the marginal distribution of $X$ is given by the probability mass function (PMF)
$$\begin{aligned}
P[X = x] &= \int_0^\infty e^{-\lambda}\frac{\lambda^x}{x!} \times \frac{\alpha^\gamma}{\Gamma(\gamma)}\, e^{-\alpha\lambda}\lambda^{\gamma-1}\, d\lambda \\
&= \frac{\alpha^\gamma}{\Gamma(\gamma)\, x!} \int_0^\infty e^{-\lambda(\alpha+1)}\, \lambda^{x+\gamma-1}\, d\lambda \\
&= \frac{\alpha^\gamma}{\Gamma(\gamma)\, x!}\, \frac{\Gamma(x+\gamma)}{(\alpha+1)^{x+\gamma}} \\
&= \frac{(x+\gamma-1)!}{x!\,(\gamma-1)!} \left(\frac{\alpha}{\alpha+1}\right)^{\gamma} \left(\frac{1}{\alpha+1}\right)^{x} \\
&= \frac{(x+\gamma-1)!}{x!\,(\gamma-1)!}\, p^{\gamma}\, (1-p)^{x},
\end{aligned}$$
for $x = 0, 1, 2, \dots$, where $p = \alpha/(\alpha+1)$ and $\Gamma(\cdot)$ is the gamma function (the factorial form of the coefficient applies when $\gamma$ is a positive integer). This is the PMF of the negative binomial distribution, denoted by NegBin($\gamma, p$).

H.W.: Work out the mean and variance of this NegBin($\gamma, p$) distribution.

This distribution can also be derived as that of the number of failures before the $\gamma$th success (for $\gamma$ a positive integer) in a sequence of independent binary trials, each with success probability $p$. It is for this reason that this distribution is called the negative binomial distribution. Also, because of the additional parameter, over the single parameter of the Poisson distribution, the NegBin($\gamma, p$) model is able to give a better fit to over-dispersed count data than the simple Poisson model.

Under the NegBin($\gamma, p$) model, the MLEs of $\gamma$ and $p$ based on observations $x_1, \dots, x_n$ are not available in closed form. One needs to use some numerical method (e.g., the Newton-Raphson method) to obtain the MLEs $\hat\gamma$ and $\hat p$.
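
A sketch of such a numerical fit, using hypothetical over-dispersed counts and a general-purpose simplex optimizer in place of Newton-Raphson; the log-likelihood is written directly from the NegBin($\gamma, p$) PMF above.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

x = np.array([0, 0, 1, 3, 0, 7, 2, 0, 12, 5, 0, 9])   # hypothetical over-dispersed counts

def neg_loglik(par):
    g, p = par
    if g <= 0 or not (0 < p < 1):
        return np.inf
    # log of the NegBin(gamma, p) PMF, written with gamma functions
    ll = (gammaln(x + g) - gammaln(g) - gammaln(x + 1)
          + g * np.log(p) + x * np.log(1 - p)).sum()
    return -ll

res = minimize(neg_loglik, x0=[1.0, 0.5], method="Nelder-Mead")
gamma_hat, p_hat = res.x
print(gamma_hat, p_hat)
```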

As an example, consider the following data on the frequency of protected sex during the past three months. We do not have access to the raw counts $x_1, \dots, x_n$, but the following frequency table is available. We also have the sample mean count $\bar x = 9.1$.

Counts:       0    1    2    3    4    5    >= 6
No. of obs.:  32   4    5    5    5    6    41

Assuming the Poisson($\lambda$) model, the MLE of $\lambda$ is $\hat\lambda = \bar x = 9.1$. Using this $\hat\lambda$, one can obtain the seven estimated cell probabilities (of the above frequency table) under the Poisson model as 0.0001, 0.0010, 0.0046, 0.0140, 0.0319, 0.0580 and 0.8904, respectively. For testing the goodness of fit of the Poisson model, one can then obtain the corresponding $e_j$'s, as in Section 5, and then the test statistic $T$, which comes out to be 93934.89. This value is much greater than 11.0705, the 95th percentile of the $\chi^2_{(7-1-1)}$ distribution. So the Poisson model is rejected for these data.

If the distribution of a count response is deemed not to follow the Poisson model, the NegBin($\gamma, p$) model or more complex models may be used to fit the data. In many applications, like the example above, there is an excessive number of zeros beyond what is expected under the Poisson model. By considering the data as a mixture of a degenerate distribution at 0 and a Poisson distribution, we may apply the zero-inflated Poisson (ZIP) model to fit the data.

7 Analysis of 2 × 2 Contingency Table: Hypergeometric Distribution

Contingency tables are often used to summarize data on two categorical variables, say $X$ and $Y$; one is designated as the row variable and the other as the column variable. If both variables are binary, there are four combinations of possible values of these two variables, and their occurrence in a sample can be displayed in a 2 × 2 contingency table. The general form of such a table is given below, with binary levels 0 and 1 for the two categorical variables $X$ and $Y$. Here $f_{ij}$ denotes the frequency of subjects (or observations) belonging to the $i$th level of $X$ and the $j$th level of $Y$, for $i, j = 0, 1$, and the $+$ sign indicates a sum over the two levels of the corresponding variable.

                  y
x         0        1        Total
0         f00      f01      f0+
1         f10      f11      f1+
Total     f+0      f+1      n
Such contingency tables arise from a variety of contexts and sampling schemes. Some common examples are given below.
1. A single random sample from a target population. The row and column variables may represent two different characteristics of the subjects, or they may be repeated measures of the same characteristic collected before and after an intervention, as in a pre-post study design.
2. A stratified random sample from two independent groups (say, male and female). In this case, one of the variables is the group indicator, and subjects are randomly sampled from each of the groups in the population. Randomized clinical trials with two treatment conditions and case-control studies are such examples.
3. Two judges provide ratings on a binary scale for each subject in a sample.

It is important to know the context of the observations in a contingency
table for making valid inference and interpretation of the resulting estimates.

For example, in a diagnostic test study, $X$ may represent the status of a disease and $Y$ the result of a test for detecting the presence of the disease. Commonly used indices of accuracy for binary tests include the true positive fraction (TPF), or sensitivity, $P[Y = 1 \mid X = 1]$; the true negative fraction (TNF), or specificity, $P[Y = 0 \mid X = 0]$; the positive predictive value (PPV), $P[X = 1 \mid Y = 1]$; and the negative predictive value (NPV), $P[X = 0 \mid Y = 0]$. If subjects are randomly selected from the target population as a single sample, then all these indices can be easily estimated using sample proportions. However, if they are sampled independently from the diseased ($X = 1$) and non-diseased ($X = 0$) populations, known as a case-control study, only TPF and TNF can be directly estimated. Similarly, if subjects are sampled based on the test results $Y = 1$ or $Y = 0$, only PPV and NPV can be estimated from the table. Nevertheless, we are usually interested in studying the relationship between the row ($X$) and column ($Y$) variables.

Before introducing the hypergeometric distribution, it is important to distinguish between two common sampling procedures: with and without replacement. Suppose a bag contains $N$ balls, $N_1$ of which are white and the remaining $N_2 = N - N_1$ black. Now draw $n$ balls from the bag, and let $X$ be the number of white balls among the $n$ balls sampled. If the balls are drawn one at a time, with the colour of each ball recorded and the ball then replaced in the bag before the next draw, the procedure is called sampling with replacement. Under this procedure, the draws can be considered independent and identically distributed, the probability of getting a white ball in each draw being $p = N_1/N$. Thus, we can calculate the probability of observing $X = x$ white balls in the sample using the Binomial($n, p$) distribution.

If the $n$ balls are drawn simultaneously, or one by one without replacing the drawn ball in the bag before the next draw, the procedure is called sampling without replacement. In this case, we cannot compute the probability of observing $X = x$ white balls from the binomial distribution, since the number of balls in the bag decreases as the sampling proceeds and the proportion of white balls changes from draw to draw. Thus, under this procedure, the draws are dependent and the probability of getting a white ball does not remain constant. In most real studies, subjects are sampled without replacement. However, in data analysis, most methods are based on the IID assumption, which means that sampling with replacement is assumed. If the target population is very large compared to the sample size $n$, the difference between the two procedures is small and this assumption is still reasonable. But, in the present context, $N$ may not be very large compared to $n$.

We are interested in the distribution of $X$, the number of white balls among the $n$ balls sampled without replacement from the $N$ balls in the bag. Clearly, $X$ cannot exceed the number of balls sampled, $n$, or the total number of white balls in the bag, $N_1$; that is, $X \le \min\{n, N_1\}$. Similarly, if the number of balls sampled, $n$, exceeds the number of black balls, $N_2$, in the bag, there will be at least $n - N_2$ white balls in the sample; hence, $X \ge \max\{0, n - N_2\}$.

Note that there are $\binom{N}{n}$ ways of choosing $n$ balls out of $N$ when the sampling is without replacement, and each of these has the same probability $\binom{N}{n}^{-1}$ of being sampled regardless of its colour combination. In order to obtain the probability of $X = x$, note that $\binom{N_1}{x} \times \binom{N_2}{n-x}$ of these $\binom{N}{n}$ possibilities have the colour combination of $x$ white balls and $(n-x)$ black balls in the sample of size $n$. Thus, the probability of having $X = x$ white balls in the sample is given by
$$P[X = x] = \frac{\binom{N_1}{x}\binom{N_2}{n-x}}{\binom{N}{n}},$$
for $\max\{0, n - N_2\} \le x \le \min\{n, N_1\}$. This is called a hypergeometric distribution with given $N_1$, $N_2$ and $n$, denoted by HG($N_1, N_2, n$). The mean and variance of this distribution are given by
$$E[X] = n\frac{N_1}{N} \qquad \text{and} \qquad V[X] = \frac{N_1 N_2\, n(N-n)}{N^2(N-1)}.$$
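
A sketch checking the HG($N_1, N_2, n$) formulas against scipy's hypergeometric implementation, for hypothetical values of $N_1$, $N_2$ and $n$ (note scipy's different parameter ordering).

```python
from scipy.stats import hypergeom

N1, N2, n = 7, 5, 4          # hypothetical: 7 white balls, 5 black balls, 4 draws
N = N1 + N2

# scipy's parameterization: hypergeom(M, n, N) with M = population size,
# second argument = number of white balls, third argument = number of draws
rv = hypergeom(N, N1, n)

print([rv.pmf(k) for k in range(max(0, n - N2), min(n, N1) + 1)])
print(rv.mean(), n * N1 / N)                                  # E[X] = n*N1/N
print(rv.var(), N1 * N2 * n * (N - n) / (N**2 * (N - 1)))     # V[X] formula above
```
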
Mapping this distribution to the general form of the 2 × 2 contingency table given at the beginning of this section: given the marginals $f_{0+}$, $f_{1+}$ and $f_{+0}$, the cell count $f_{00}$ has the HG($f_{0+}, f_{1+}, f_{+0}$) distribution. Here $x = 0$ indicates a white ball and $x = 1$ a black ball; also, $y = 0$ means 'in the sample' and $y = 1$ means 'not in the sample'. Note that the sampling of balls without replacement is such that, at each stage, every remaining ball in the bag has the same probability of being sampled regardless of its colour; that is, whether a ball is in the sample or not does not depend on the colour of the ball. Therefore, for this mapping to hold, we need the row and column variables to be not associated, or independent; that is, the distribution of $f_{00}$ is hypergeometric under the assumption that the row and column variables are not associated.

There is a generalization of the hypergeometric distribution to balls of more than two colours. Suppose there are balls of $L$ ($\ge 2$) different colours in the bag, with the corresponding numbers $N_1, \dots, N_L$ and $N = N_1 + \cdots + N_L$. Suppose, as before, a sample of $n$ balls is drawn without replacement from the bag. Let $X_1, \dots, X_L$ denote the numbers of balls of the $L$ different colours in the sample, so that $n = X_1 + \cdots + X_L$. Then the vector $(X_1, \dots, X_L)$ is said to follow a multivariate hypergeometric distribution with PMF given by
$$P[X_1 = x_1, \dots, X_L = x_L] = \frac{\prod_{j=1}^{L}\binom{N_j}{x_j}}{\binom{N}{n}},$$
for all $(x_1, \dots, x_L)$ satisfying $\max\{0, n - (N - N_j)\} \le x_j \le \min\{n, N_j\}$, for $j = 1, \dots, L$, and $\sum_{j=1}^{L} x_j = n$.

In this case, we have
$$E[X_j] = n\frac{N_j}{N}, \qquad V[X_j] = \frac{N_j(N - N_j)\, n(N-n)}{N^2(N-1)},$$
and
$$\mathrm{Cov}[X_j, X_{j'}] = -\frac{N_j N_{j'}\, n(N-n)}{N^2(N-1)},$$
for $j \ne j' = 1, \dots, L$.
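
A simulation sketch (hypothetical colour counts) checking these moment formulas with NumPy's multivariate hypergeometric sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
colors = np.array([5, 3, 4])      # hypothetical N_1, N_2, N_3 (balls of 3 colours)
n = 6                             # number of draws without replacement
N = colors.sum()

draws = rng.multivariate_hypergeometric(colors, n, size=100_000)

print(draws.mean(axis=0), n * colors / N)                          # E[X_j] = n N_j / N
print(draws.var(axis=0),
      colors * (N - colors) * n * (N - n) / (N**2 * (N - 1)))      # V[X_j] formula above
```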

H.W.: If the sampling of $n$ balls is with replacement in this case of balls of $L$ different colours, what is the PMF of $(X_1, \dots, X_L)$? Obtain the corresponding means, variances and covariances.
