Generalized Linear Models: Ariel Alonso Abad
Outline
Lecture 9: Project
Tutorial groups
Email Prof. Alonso a list with the members of each group (names and
student numbers)
Written: Project
Outline
Statistical Inference
Psychology: Synchronicity
Intelligence quotient
The intelligence quotient (IQ) of
[Figure: histogram of IQ scores; x-axis: Intelligence, y-axis: density]
Probability distributions
If the number of observations increases, a smooth curve emerges and a probability distribution arises. Important examples:
Normal distribution
Student's t-distribution
Chi-squared distribution
F-distribution
[Figure: histogram of IQ scores (x-axis: Intelligence, 0 to 200; y-axis: density) with a smooth density curve overlaid]
Normal distribution

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

[Figure: normal probability density with Mode = Median = Mean at $\mu$; 68% of the probability lies within $\mu \pm 1\sigma$ and 95% within $\mu \pm 2\sigma$, with successive bands of 13.6%, 2.1% and 0.1% per side out to $\mu \pm 3\sigma$]
Body length
Why?
[Figure: quantile plots for the groups High school, University, PhD and Polgar]
Birth weight
Inferential Statistics
Inferential statistics makes quantitative statements about the
characteristics of a population but...
To simplify notation we will use y for both the random variable and
the realized value
$p(\mathbf{y} \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$
Point estimation
Popular measure of closeness: the mean squared error (MSE) of $\hat{\theta}(\mathbf{y})$, defined as

$\mathrm{MSE}\,\hat{\theta}(\mathbf{y}) = E_{\mathbf{y}\mid\theta}\!\left[\left(\hat{\theta}(\mathbf{y}) - \theta\right)^2\right] = \mathrm{Var}_{\mathbf{y}\mid\theta}\,\hat{\theta}(\mathbf{y}) + \mathrm{bias}\!\left(\hat{\theta}(\mathbf{y})\right)^2$

An estimator is called unbiased when $\mathrm{bias}\,\hat{\theta}(\mathbf{y}) = 0$

Point estimators: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$

$E(\bar{y}) = \mu$ and $E(S^2) = \frac{n-1}{n}\,\sigma^2 < \sigma^2$
In practice all else is not equal: biased estimators are frequently used because they can have a smaller variance, and hence a smaller MSE
$y \sim P(y \mid \lambda)$ where $P(y \mid \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}$ and $y = 0, 1, 2, \ldots$

$E(y) = \mathrm{Var}(y) = \lambda$
Biased but consistent: the sample variance $S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$
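A quick simulation illustrates both the bias and the consistency. The sketch below is illustrative, not from the course materials; the Poisson rate and the sample sizes are arbitrary choices.

## Illustrative sketch (not from the slides): the divisor-n sample variance
## underestimates sigma^2, but the bias vanishes as n grows
set.seed(1)
lambda <- 4                       # Poisson rate, so sigma^2 = lambda = 4
for (n in c(5, 20, 100, 1000)) {
  s2 <- replicate(20000, {
    y <- rpois(n, lambda)
    mean((y - mean(y))^2)         # S^2 with divisor n
  })
  cat("n =", n, ": mean(S^2) =", round(mean(s2), 3),
      "vs E(S^2) = (n-1)/n * sigma^2 =", round((n - 1) / n * lambda, 3), "\n")
}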
To gain insight into the behavior of the sample mean we will study
how sample means from a known population behave
From a known population draw several samples of n units and look at the
averages of those samples
Sample means
[Figures: Simulations with 20,000 sample means $\bar{X} = (X_1 + \cdots + X_N)/N$ drawn from Normal, Gamma, Uniform and Beta populations, for samples of N = 4, N = 9 and N = 25 units; as N grows, the histogram of $\bar{X}$ looks increasingly normal for all four populations]
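The simulation behind these figures takes only a few lines of R. The sketch below is a reconstruction, not the course code; in particular, the Gamma and Beta shape parameters are illustrative guesses.

## CLT demo: 20,000 sample means of N units from four populations
set.seed(123)
draw <- list(Normal  = function(n) rnorm(n),
             Gamma   = function(n) rgamma(n, shape = 2, rate = 0.5),
             Uniform = function(n) runif(n),
             Beta    = function(n) rbeta(n, 2, 2))
N <- 4                             # units per sample; also try 9 and 25
par(mfrow = c(2, 2))
for (pop in names(draw)) {
  xbar <- replicate(20000, mean(draw[[pop]](N)))
  hist(xbar, breaks = 50, freq = FALSE, main = pop, xlab = "x", ylab = "Density")
}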
CLT: Univariate
If $y_1, y_2, \ldots, y_n$ are i.i.d. random variables with mean $\mu = E(y)$ and variance $\sigma^2$, then $\sqrt{n}\,(\bar{y}_n - \mu)$ converges in distribution to $N(0, \sigma^2)$, and we write

$\sqrt{n}\,(\bar{y}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$

or also

$\bar{y}_n \xrightarrow{d} N(\mu, \sigma^2/n)$ as $n \to \infty$
CLT: Multivariate
If $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ are i.i.d. random vectors with mean $\boldsymbol{\mu} = E(\mathbf{y})$ and covariance matrix $\boldsymbol{\Sigma}$, then $\sqrt{n}\,(\bar{\mathbf{y}}_n - \boldsymbol{\mu})$ converges in distribution to a multivariate normal distribution, and we write

$\sqrt{n}\,(\bar{\mathbf{y}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma})$
Body length
Genetic factors
Economic factors
...
Delta Method
Let $\mathbf{y}_n = (y_{1n}, \ldots, y_{pn})^T$ be a sequence of random vectors with $E(\mathbf{y}_n) = \boldsymbol{\mu}$ and $\sqrt{n}\,(\mathbf{y}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathbf{z}$. Then, for a differentiable function $g$,

$\sqrt{n}\,\bigl(g(\mathbf{y}_n) - g(\boldsymbol{\mu})\bigr) \xrightarrow{d} \nabla g(\boldsymbol{\mu})^T\, \mathbf{z}$
For each Bernoulli trial $y_k$: $E(y_k) = \pi$ and $\mathrm{Var}(y_k) = \pi(1-\pi)$

$y = \sum_{k=1}^{n} y_k \sim \mathrm{Bin}(n, \pi)$

$f(k, n, \pi) = P(y = k) = \binom{n}{k}\, \pi^{k} (1-\pi)^{n-k}$

$E(y) = n\pi$ and $\mathrm{Var}(y) = n\pi(1-\pi)$

For large $n$: $y \sim N\!\left(n\pi,\; n\pi(1-\pi)\right)$ approximately
Normal approximation
[Figure: binomial probabilities over y = 25, ..., 55 with the normal approximation curve overlaid and the bars x1 = 36:45 highlighted; the slide defines x1=36:45, x2=c(25:35, 46:55) and x1x2=seq(25, 55, by=.01); y-axis: Binomial Probability]
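The figure can be reconstructed from the code fragments left on the slide. In the sketch below only x1, x2 and x1x2 come from the slide; the binomial parameters (size = 100, prob = 0.4, which put the mean at 40) are illustrative guesses, not from the course.

## Reconstruction sketch: binomial probabilities with a normal overlay
x1   <- 36:45                     # highlighted region (from the slide)
x2   <- c(25:35, 46:55)           # remaining support shown (from the slide)
x1x2 <- seq(25, 55, by = .01)     # grid for the normal curve (from the slide)
n <- 100; p <- 0.4                # illustrative guesses, not from the slide
plot(x2, dbinom(x2, size = n, prob = p), type = "h",
     xlim = c(25, 55), ylim = c(0, 0.09),
     xlab = "", ylab = "Binomial Probability")
lines(x1, dbinom(x1, size = n, prob = p), type = "h", col = "red")
lines(x1x2, dnorm(x1x2, mean = n * p, sd = sqrt(n * p * (1 - p))))  # N(np, np(1-p))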
Delta Method Example: Titanic
On 10 April 1912 the largest passenger steamship in the world left Southampton, England, for New York City. At 23:40 on 14 April it struck an iceberg and sank at 2:20 the following morning, resulting in the deaths of 1,517 people in one of the deadliest peacetime maritime disasters in history.
Odds of Survival
Odds of surviving:

$\Theta_{\mathrm{survival}} = \frac{\pi}{1-\pi}$

$y = \sum_{k=1}^{n} y_k \sim \mathrm{Bin}(n, \pi)$

$\hat{\pi} \sim N\!\left(\pi,\; \frac{\pi(1-\pi)}{n}\right)$ (asymptotically)

$\hat{\Theta}_{\mathrm{survival}} = \frac{\hat{\pi}}{1-\hat{\pi}}$

With $g(x) = \ln\frac{x}{1-x}$, it is easy to show $g'(x) = \frac{1}{x(1-x)}$, so by the delta method

$\ln \hat{\Theta}_{\mathrm{survival}} \sim N\!\left(\ln(\Theta_{\mathrm{survival}}),\; g'(\pi)^2\, \frac{\pi(1-\pi)}{n}\right)$

Thus, asymptotically, $\mathrm{Var}\!\left(\ln \hat{\Theta}_{\mathrm{survival}}\right) = \frac{1}{\pi(1-\pi)\, n}$
> install.packages("msm")   # provides deltamethod()
> library(msm)
> titanic.path="C:/Equizo/Courses/KULeuven/titanicmissing.txt"
> titanic<- read.table(titanic.path, header=T, sep=",")
> head(titanic,5)
  survived pclass sex     age
1        1    1st   0 29.0000
2        0    1st   0  2.0000
3        0    1st   1 30.0000
4        0    1st   0 25.0000
5        1    1st   1  0.9167
> p_survival=mean(titanic$survived)                 # estimated probability of survival
> log_survival_odds=log(p_survival/(1-p_survival))  # estimated log odds of survival
> n=nrow(titanic)
> var_p_survival=(p_survival*(1-p_survival))/n      # asymptotic variance of p_survival
> se_odds_delta=deltamethod(g=~log(x1/(1-x1)), mean=p_survival, cov=var_p_survival)
> se_odds_delta^2                                   # delta-method variance of the log odds
[1] 0.003384579
> log_survival_odds
[1] -0.6545499
> survival_odds=exp(log_survival_odds)   # back-transform the log odds to the odds scale
> survival_odds
[1] 0.5196759
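From here a 95% confidence interval for the survival odds follows by exponentiating the endpoints on the log-odds scale; a short follow-up sketch using the quantities computed above (the approximate numbers are implied by the output shown):

> ci_log_odds=log_survival_odds + c(-1,1)*qnorm(0.975)*se_odds_delta
> exp(ci_log_odds)   # 95% CI for the survival odds, roughly (0.46, 0.58)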
Confidence intervals
Of all the realizations of the interval some will contain the true value
of the parameter, but others will not
The probability that the stochastic interval will contain the true value
of the parameter is called the confidence level of the interval
$p(\mathbf{y} \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$
Interval estimator: $[a(\mathbf{y}),\, b(\mathbf{y})]$ aims to include the true value with a pre-specified probability
Interval estimator
Many 95% CIs have only asymptotically the correct coverage. For
small samples their good behavior needs to be established via
simulations
$CI_{1-\alpha} = \left[\bar{X} - T_{1-\frac{\alpha}{2},\, n-1}\, \frac{S}{\sqrt{n}},\;\; \bar{X} + T_{1-\frac{\alpha}{2},\, n-1}\, \frac{S}{\sqrt{n}}\right]$
The probability that the stochastic interval will contain the true value
of the parameter is called the confidence level of the interval
$P\left(\mu \text{ lies in } CI_{1-\alpha}\right) = 1 - \alpha$
Confidence intervals
$CI_{0.95} = \left[\bar{X} - T_{0.975,\, n-1}\, \frac{S}{\sqrt{n}},\;\; \bar{X} + T_{0.975,\, n-1}\, \frac{S}{\sqrt{n}}\right]$
The probability that the stochastic interval will contain the true value
of the parameter is called the confidence level of the interval
95% of all realizations of the interval will contain the true value of
the population mean μ
[Figure: 100 realized intervals $[\bar{x} - T_{1-\alpha/2}\, S/\sqrt{n},\; \bar{x} + T_{1-\alpha/2}\, S/\sqrt{n}]$ plotted against the sample index, for $1-\alpha = 0.95$ and $1-\alpha = 0.99$; most realizations contain the true mean]
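A coverage picture like this one can also be checked by simulation. The sketch below is illustrative, not the course code; standard normal data and n = 20 per sample are arbitrary choices.

## Coverage check: how often does the t-based 95% CI contain the true mean?
set.seed(7)
mu <- 0; n <- 20
covered <- replicate(10000, {
  x  <- rnorm(n, mean = mu)
  hw <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)   # half-width T_{0.975,n-1} * S / sqrt(n)
  mean(x) - hw <= mu && mu <= mean(x) + hw
})
mean(covered)   # close to 0.95 by construction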
Likelihood
Bayesian paradigm
Frequentist paradigm
Likelihood paradigm
Likelihood
The probability that the sample $\mathbf{y} = (y_1, \ldots, y_n)$ has occurred under the $p$-dimensional parameter $\theta$ is given by $p(\mathbf{y} \mid \theta)$; viewed as a function of $\theta$ for fixed data, this is the likelihood $L(\theta \mid \mathbf{y})$
Maximum likelihood:
Given a data set $\mathbf{y}$, the value of $\theta$ that maximizes $L(\theta \mid \mathbf{y})$ is called the maximum likelihood estimator (MLE) and is denoted $\hat{\theta}$
To find $\hat{\theta}$ we rather maximize

$\ell(\theta \mid \mathbf{y}) \equiv \log L(\theta \mid \mathbf{y}) = \sum_{i=1}^{n} \log p(y_i \mid \theta)$
We look at the i.i.d. case, but the MLE properties can be extended to the non-i.i.d. case
Score function
The score function is defined as:
$\mathbf{s}(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \left(\frac{\partial \ell(\boldsymbol{\theta})}{\partial \theta_1}, \ldots, \frac{\partial \ell(\boldsymbol{\theta})}{\partial \theta_p}\right)^T$
Score function
If the data were generated by $p(y \mid \boldsymbol{\theta}_0)$ then

$E\left[\mathbf{s}(\boldsymbol{\theta}_0)\right] = \mathbf{0}$

$\mathrm{Var}\left[\mathbf{s}(\boldsymbol{\theta}_0)\right] = \sum_{i=1}^{n} \mathrm{Var}\left[\mathbf{s}_i(\boldsymbol{\theta}_0)\right] = \sum_{i=1}^{n} E\left[\mathbf{s}_i(\boldsymbol{\theta}_0)\, \mathbf{s}_i(\boldsymbol{\theta}_0)^T\right]$

In the i.i.d. data case $\mathrm{Var}\left[\mathbf{s}(\boldsymbol{\theta}_0)\right] = n\, E\left[\mathbf{s}_1(\boldsymbol{\theta}_0)\, \mathbf{s}_1(\boldsymbol{\theta}_0)^T\right]$, where $\mathbf{s}_1(\boldsymbol{\theta}_0) = \frac{\partial}{\partial \boldsymbol{\theta}} \log p(y \mid \boldsymbol{\theta}_0)$

Moreover, the so-called Information Matrix Equality (IME) holds:

$E\left[\mathbf{s}_1(\boldsymbol{\theta}_0)\, \mathbf{s}_1(\boldsymbol{\theta}_0)^T\right] = -E\left[\frac{\partial^2}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^T} \log p(y \mid \boldsymbol{\theta}_0)\right] = \mathbf{I}_1(\boldsymbol{\theta}_0)$

and, hence, $\mathrm{Var}\left[\mathbf{s}(\boldsymbol{\theta}_0)\right] = n\, \mathbf{I}_1(\boldsymbol{\theta}_0) = \mathbf{I}(\boldsymbol{\theta}_0)$

For large samples $\mathbf{s}(\boldsymbol{\theta}_0) \sim N\left(\mathbf{0},\; \mathbf{I}(\boldsymbol{\theta}_0)\right)$ with $\mathbf{I}(\boldsymbol{\theta}_0) = -E\left[\mathbf{H}(\boldsymbol{\theta}_0)\right]$
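These properties can be verified numerically. For the Poisson model seen earlier, the score of one observation is $s_1(\lambda) = y/\lambda - 1$ and $I_1(\lambda) = 1/\lambda$; the sketch below (an illustrative $\lambda_0 = 4$, not from the slides) checks the zero mean and the variance $n\, I_1(\lambda_0)$:

## Numerical check of E[s(theta0)] = 0 and Var[s(theta0)] = n * I_1(theta0)
set.seed(42)
lambda0 <- 4; n <- 50
S <- replicate(20000, sum(rpois(n, lambda0) / lambda0 - 1))  # score of one sample
c(mean_score = mean(S),        # approximately 0
  var_score  = var(S),         # approximately n / lambda0 = 12.5
  n_times_I1 = n / lambda0)    # since I_1(lambda) = 1 / lambda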
Cramér-Rao bound
In its simplest form, the bound states that the variance of any
unbiased estimator is at least as high as the inverse of the Fisher
information
More generally, for an unbiased estimator of $\boldsymbol{\psi}(\boldsymbol{\theta})$, $\mathrm{Var}(\hat{\boldsymbol{\psi}}) \geq \mathbf{J}(\boldsymbol{\theta})\, \mathbf{I}(\boldsymbol{\theta})^{-1}\, \mathbf{J}(\boldsymbol{\theta})^T$, where $\mathbf{J}(\boldsymbol{\theta}) = \left(\frac{\partial \psi_i}{\partial \theta_j}\right)$ is the Jacobian matrix
Properties MLE
Efficiency: It achieves the CRB as the sample size tends to infinity. This means that no unbiased estimator has a lower asymptotic mean squared error (the MLE is asymptotically unbiased)
Therefore, for large samples, $\hat{\boldsymbol{\theta}}_{\mathrm{mle}}(\mathbf{y}) \sim N\left(\boldsymbol{\theta}_0,\; \mathbf{I}(\boldsymbol{\theta}_0)^{-1}\right)$ with $\mathbf{I}(\boldsymbol{\theta}_0) = -E\left[\mathbf{H}(\boldsymbol{\theta}_0)\right]$
Regularity conditions
The number of parameters must not increase with the sample size, at
least not too quickly
Newton-Raphson algorithm: $\boldsymbol{\theta}^{(k+1)} \approx \boldsymbol{\theta}^{(k)} - \mathbf{H}(\boldsymbol{\theta}^{(k)})^{-1}\, \mathbf{s}(\boldsymbol{\theta}^{(k)})$
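As a concrete illustration of the iteration (a sketch, not course code), Newton-Raphson applied to the binomial log-likelihood of the surgery example that follows (y = 9, n = 12) converges to $\hat{\pi} = y/n$:

## Newton-Raphson for the binomial log-likelihood l(p) = y*log(p) + (n-y)*log(1-p)
y <- 9; n <- 12
score   <- function(p)  y / p - (n - y) / (1 - p)        # first derivative of l
hessian <- function(p) -y / p^2 - (n - y) / (1 - p)^2    # second derivative of l
p <- 0.5                                                 # starting value
for (k in 1:10) p <- p - score(p) / hessian(p)           # theta_{k+1} = theta_k - H^{-1} s
p                                                        # 0.75 = y/n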
$P(y \mid \pi) = \binom{n}{y}\, \pi^{y} (1-\pi)^{n-y} \qquad (y = 0, 1, \ldots, n)$

$L(\pi \mid 9) = \binom{12}{9}\, \pi^{9} (1-\pi)^{3}$

$\ell(\pi \mid 9) = \log\binom{12}{9} + 9\log(\pi) + 3\log(1-\pi)$
Surgery example: Binomial likelihood
MLE: maximize $L(\pi \mid y)$ or, better, $\ell(\pi \mid y)$

$\ell(\pi \mid y) = y \ln \pi + (n-y)\ln(1-\pi) + \text{constant}$

$\frac{d\ell}{d\pi}(\pi \mid y) = \frac{y}{\pi} - \frac{n-y}{1-\pi} = 0 \;\Rightarrow\; \hat{\pi} = y/n$

For $y = 9$ and $n = 12$: $\hat{\pi} = 0.75$
## Likelihood function
n.size=12   # number of operations
y.su=9      # number of successes
llik2 <- function(p) -sum(dbinom(y.su, prob=p, size=n.size, log=TRUE))  # negative log-likelihood
p_MLE=nlm(llik2, p=c(0.5), hessian=TRUE)  # minimize, starting from p=0.5
> p_MLE
$minimum
[1] 1.354394
$estimate
[1] 0.7499995
$gradient
[1] -1.190159e-07
$hessian
[,1]
[1,] 64.03399
[Figure: the negative log-likelihood computed by llik2 plotted against $\pi$, with its minimum at $\hat{\pi} = 0.75$]
$\hat{\pi} \sim N\!\left[\pi,\; \hat{\pi}(1-\hat{\pi})/n\right]$ (asymptotically)
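The Hessian returned by nlm is the observed information of the negative log-likelihood, so its inverse gives the squared standard error of $\hat{\pi}$; a short check, using the fit above, against the closed form $\sqrt{\hat{\pi}(1-\hat{\pi})/n}$:

> se_hessian=sqrt(1/p_MLE$hessian)    # from the Hessian returned by nlm
> se_formula=sqrt(0.75*(1-0.75)/12)   # sqrt(pi_hat*(1-pi_hat)/n)
> c(se_hessian, se_formula)           # both approximately 0.125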