Statistical-Models
Giorgio Mantero
15-02-2022
Contents
1 Introduction to Statistical models and some recalls
1.1 Some experiments
1.2 Sample spaces and σ-fields
1.3 Statistical models (non-parametric)
1.4 Statistical models (parametric)
1.5 Parametric simple regression
1.6 Statistical models (nonparametric) II
1.7 Nonparametric statistical models
1.8 Statistical models in mathematical statistics
1.8.1 Population, sampling and sampling schemes
1.8.1.1 Samples
1.8.1.2 Sampling schemes
1.8.1.3 Cluster sampling
1.9 Aims of statistical inference
1.10 Point estimation
1.11 Estimator and estimates
1.12 Sample mean
1.13 Unbias - Consistent
1.13.1 Unbiased estimators and consistency
1.13.2 Estimation of the mean
1.13.3 Confidence intervals
1.13.4 CI for the mean of a normal distribution
1.14 Sum up
1.15 Testing statistical hypotheses
1.15.1 Hypotheses
1.15.2 Role of the hypotheses
1.15.3 Level of the test
1.15.4 One-tailed and two-tailed tests
1.15.5 Test statistic
1.15.6 Rejection region
1.15.7 Large sample theory for the mean
1.15.8 p-value
2.5.1 Bernoulli scheme
2.6 Binomial random variables
2.7 Geometric random variables
2.8 Negative binomial random variables
2.9 Generalization of "N Bin(r, p)" to real "r"
2.10 Poisson random variables
2.10.1 Poisson is the limit of the binomial
2.10.2 Over-dispersed Poisson
2.11 Mixture of two random variables
2.12 Mixture of several random variables
3 Lab 1
3.1 Discrete random variables
5 Lab 2
5.1 Continuous random variables
6 Multivariate random variables
6.1 Bivariate random variables
6.1.1 Marginal distributions
6.1.2 Independence
6.1.3 Conditional distributions
6.1.4 Conditional expectation
6.1.5 Covariance and correlation
6.1.6 Properties of the covariance matrix
6.2 The multinomial distribution
6.2.1 Generate data from a multinomial distribution
6.3 Chi-square goodness-of-fit test
6.4 Bivariate continuous random variables
6.4.1 Conditional distributions and conditional expectation
6.4.2 Independence and correlation
6.5 Multivariate normal distribution
6.6 Multivariate standard normal distribution
6.6.1 Figures of multivariate normal distributions
6.6.2 Property - Linear combinations
6.6.3 Property - Subsets of variables
6.6.4 Property - Zero-correlation implies independence
6.7 Conditional distributions
6.7.1 Central limit theorem
6.7.2 Generate multivariate normal data
7 Lab 3
7.1 Multivariate random variables
8 Likelihood
8.1 Introduction
8.2 Definitions
8.2.1 Interpretation
8.3 Maximum likelihood estimation [MLE]
8.3.1 MLE: example for a Bernoulli distribution
8.4 Log-likelihood
8.4.1 Properties of the MLE
8.4.2 Sufficiency
8.5 Score function
8.6 Score and MLE
8.6.1 Role of score in MLE
8.6.2 Mean
8.6.3 Variance and information
8.6.4 Information
8.7 The Cramer-Rao lower bound
8.7.1 Information
8.7.2 Another information identity
8.7.3 Asymptotic distribution
8.7.4 Multiple parameters
8.8 Exponential families
8.8.1 Exponential families – in view of GLM
8.9 Properties of the score statistic
8.9.1 Score statistic for exponential families
8.9.2 Mean and variance for exponential families
8.10 Link functions
8.10.1 Benefits of canonical links
10 Lab 4
10.1 Simulation - The bootstrap
11.5 Properties of the LS estimator
11.6 The Gauss-Markov theorem
11.7 Questions
11.8 F-Test
11.9 The Student's "t" test
11.10 "R2" coefficient (coefficient of determination)
11.11 The plot of the residuals
11.12 Prediction
11.13 Working with qualitative predictors
11.14 Interactions
11.15 Nonparametric approach: the k-NN method
11.16 Distance
11.16.1 k-NN in Software R
11.17 The k-NN for classification
11.18 The effect of "k"
11.18.1 With Software R
11.19 Pros and cons of k-NN
12 Lab 5
12.1 Multiple Linear Regression - Boston Housing
13.21 Asymptotic distribution of the MLE
13.22 The Wald test
13.23 The likelihood ratio test
13.24 Wald test vs Likelihood ratio test
13.25 Deviance
13.26 Deviance residuals
13.27 Computational remarks
15 Lab 6
15.1 Logistic Regression
15.1.1 Exercise 1 - Credit card default
17 Lab 7
17.1 Regression for count data
17.1.1 Exercise 1 - AirBnB's in Nanjing (China)
18.2 Overfitting
18.3 Bias-variance trade-off
18.4 Drawbacks of the validation-set approach
18.5 K-fold cross validation
18.6 Special case: LOOCV
18.6.1 Recap
18.7 Cross-validation in Software R
18.8 Cross-validation for classification
18.9 Cross-validation for classification in Software R
18.10 Cp, AIC, BIC
20 Lab 8
20.1 Model selection
20.1.1 Exercise 1 - Credit card balance
20.1.2 Exercise 2 - Cancer remission
1 Introduction to Statistical models and some recalls
1.1 Some experiments
The first ingredient of a statistical model is the set of all possible observed outcomes of the random variables
involved in the problem. This is the sample space "Ω" of a random experiment.
Die roll
Roll a die:
Ω = {1, 2, 3, 4, 5, 6}
Coin toss
Toss a coin 10 times:
Ω = {H, T}^{10} or Ω = {0, 1}^{10}
In this case we are considering the set of all possible vectors of length "10" of "0" and "1".
Record of failures
Record the number of failures of an internet connection in a given time interval:
Ω = N
If instead we record 100 continuous (positive) measurements, the sample space is
Ω = (0, +∞)^{100}
Remember
Given a sample space, the probability is defined over the events, i.e. subsets of the sample space.
Given a sample space "Ω", a σ-field of "Ω" is a family "F" of subsets such that:
• ∅∈F [Event = element of the σ-field F]
• If "E ∈ F", then "Ē ∈ F" ["E" is each element-event and "Ē" is its complement]
• If "(Ei )∞
i=1 ∈ F", then "∪i=1 Ei ∈ F"
∞
9
Typical examples:
Discrete case
For discrete experiments the discrete σ-field is the field of all the possible subsets. All the subsets of Ω are
events:
F = ℘(Ω)
Continuous case
For discrete experiments: "F" is the Borel σ-field (the σ-field generated by intervals). It contains:
• All intervals (open, closed, . . . )
The construction of a σ-field is exactly the definition of the domain of a probability function.
Remember
P : F −→ R
such that:
• "0 ≤ P(E) ≤ 1" for all "E ∈ F"
• "P(Ω) = 1"
• If "(E_i)_{i=1}^∞ ⊆ F" are disjoint events, then "P(∪_{i=1}^∞ E_i) = Σ_{i=1}^∞ P(E_i)": for disjoint events the probability of the union is the sum of the probabilities
A parametric statistical model can be written as a triple "(Ω, F, (P_θ)_{θ∈Θ})", where the probability distributions are indexed by a parameter "θ" in a parameter space "Θ", or
(Ω, F, (F_θ)_{θ∈Θ})
where "F_θ" denotes the cumulative distribution function of a random variable.
The probability distribution has known shape with unknown parameters:
Gaussian model 1
For quantitative variables:
X ∼ N (µ, σ 2 )
known "σ 2 ", is a 1-parameter statistical model with "θ = µ", and "Θ = R".
Gaussian model 2
For quantitative variables:
X ∼ N (µ, σ 2 )
both "µ, σ 2 " unknown, is a 2-parameter statistical model with "θ = (µ, σ 2 )", and "Θ = R × [0, +∞)".
Remember
Remember that, given two quantitative variables "X" (regressor) and "Y" (response), the regression line is:
Y = b0 + b1 X
The model
Y = β0 + β1 X + ε
is a 3-parameter statistical model with "θ = (β0 , β1 , σ 2 )" where "σ 2 " is the variance of "ε".
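As a quick illustration (a sketch of ours, with made-up values β0 = 1, β1 = 2, σ = 0.5), the three parameters can be estimated in R with lm() on simulated data:
set.seed(1)
n = 100
x = runif(n)                      # regressor
y = 1 + 2*x + rnorm(n, sd=0.5)    # Y = beta0 + beta1*X + eps
fit = lm(y ~ x)
coef(fit)                         # estimates of beta0 and beta1
summary(fit)$sigma^2              # estimate of the error variance sigma^2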
In the nonparametric case the model is written as
(Ω, F, (F)_{F∈D})
Usually some restrictions on "F" are made so that "F" is not as general as possible, but belongs to a set "D" of probability distributions.
Density estimation based on 500 sampled data (temperature in 500 weather stations) [a very naive nonparametric
density estimation: the histogram]:
12.07 7.70 10.62 13.57 8.07 12.23 8.61 12.25 18.17 11.10. . .
x̄ = 11.253 s = 3.503
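A minimal R sketch of this naive density estimate, assuming the 500 sampled temperatures are stored in a vector called "temp" (a name of our choosing):
hist(temp, probability=TRUE)   # histogram as a naive nonparametric density estimate
mean(temp)                     # should give x-bar = 11.253
sd(temp)                       # should give s = 3.503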
1.8 Statistical models in mathematical statistics
Before analyzing situations where several variables (response and predictors) are involved we need to deepen our
mathematical knowledge about probability and statistics, and we start by considering only one variable at a time.
We start here with toy examples to fix the mathematical background. We will come back to the problems with
several variables in the second part of the lectures.
Population is the (theoretical) set of all individuals (or experimental units) with given properties. The population
is the target set of individuals of a statistical analysis. For instance, the following are examples of populations:
1.8.1.1 Samples
Sample
The analysis of the whole population can be too expensive, or too slow, or even impossible. Statistical tools allow
us to understand some features of a population through the inspection of a sample, and controlling the sampling
variability. This is the basic principle of inference: we analyze the sample in order to generalize the information
we have on the sample to the whole population, under the assumption that the sample is a good representation of the entire population.
1.8.1.2 Sampling schemes How to choose a sample? The sampling scheme affects the quality of the results
and therefore the choice of the sampling scheme must be considered with great care. Usually, one has to find a
tradeoff between two opposite requirements:
Among probability sampling schemes (techniques based on randomization):
• Simple random sampling (without replacement): the elements of the samples are selected like the
numbers of a lottery.
• Simple random sampling (with replacement): the elements of the samples are selected like the num-
bers of a lottery, but each experimental unit can be selected more than once. Although this seems to be
a poor sampling scheme, it leads to mathematically simple objects (densities, likelihoods, distributions of the
estimators. . . ), so that it is commonly used in the theory.
• Stratified sampling: the elements of the sample are chosen in order to reflect some major features of the
population (remember our discussion on "controlling for confounders"): the units are sampled proportionally
from each group:
• Systematic sampling: systematic sampling (also known as interval sampling) relies on arranging the study
population according to some ordering scheme and then selecting elements at regular intervals through that
ordered list.
• Cluster sampling: The sample is formed by clusters in order to speed up the data collection phase. In some
cases, cluster sampling has a two-stage procedure if a further sampling scheme is applied to each cluster.
1.8.1.3 Cluster sampling Among nonprobability sampling schemes:
• Accidental sampling (sometimes known as grab, convenience or opportunity sampling) is a type of non-
probability sampling which involves the sample being drawn from that part of the population which is close
to hand.
• Voluntary sampling: the sample is selected on a voluntary basis. This is quite common in social media
surveys. It is difficult to make inference from this sample because it may not represent the total population.
Often, volunteers have a strong interest in the main topic of the survey.
Remark: in our theory, only samples with independent random variables will be considered.
Estimator
An estimator of the parameter "θ" is a function of a known sample:
T = T (X1 , . . . , Xn )
Note that:
• The estimator "T" is a function of "X1 , . . . , Xn " and not (explicitly) of "θ"
• The estimator "T" is a random variable
When the data on the sample are available, i.e., when we know the actual values "x1 , . . . , xn ", we obtain an estimate
of the parameter "θ".
Estimate
The estimate is a number:
θ̂ = t = t(x1 , . . . , xn )
1.12 Sample mean
To estimate the mean "µ" of quantitative random variable "X" based on a sample "X1 , . . . , Xn " we use the sample
mean, defined as:
n
X1 + · · · + Xn 1X
X̄ = = Xi
n n i=1
X1 + · · · + Xn
V AR(X̄) = V AR =
n
1
= 2 (V AR(X1 ) + · · · + V AR(Xn )) =
n
1
= 2 nσ 2 =
n
σ2
=
n
which tells us that the variance of the sample mean goes to "0" when "n" goes to infinity: we have the consistency
property:
lim V AR(X̄) = 0
n→∞
Unbiased estimator
An estimator "T" is an unbiased estimator of a parameter "θ" if its expected value is equal to the parameter
itself:
E(T ) = θ ∀θ
The mean square error of "T" is defined as:
MSE(T) = E[(T − θ)²]
and it is equal to the variance "VAR(T)" for unbiased estimators. The rule for the MSE is "the lower the better".
Consistent estimator
An estimator is consistent if:
lim_{n→∞} VAR(T_n) = 0
The sample mean "X̄" is:
• Unbiased
• Consistent
for the population mean, for whatever underlying distribution, provided that the mean and variance of "X" exist.
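A small simulation sketch (using an exponential population with mean 2, an arbitrary choice) illustrating both properties of the sample mean:
set.seed(1)
mu = 2
for (n in c(10, 100, 1000)) {
  xbar = replicate(5000, mean(rexp(n, rate=1/mu)))
  cat("n =", n, " mean of X-bar =", round(mean(xbar), 3),
      " variance of X-bar =", round(var(xbar), 4), "\n")
}
# the mean of X-bar stays close to mu = 2 (unbiasedness),
# while its variance shrinks roughly as sigma^2/n (consistency)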
Confidence interval
A confidence interval (CI) for a parameter "θ" with a confidence level "1 − α ∈ (0, 1)" is a random interval
"(A, B)", computed from the sample, such that:
P(θ ∈ (A, B)) = 1 − α
Let "X1 , . . . , Xn " be a sample of Gaussian random variables with distribution "N (µ, σ 2 )" (both parameters are
unknown). In such a case, the variance is estimated with the sample variance:
S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²
and the statistic
T = (X̄ − µ) / (S/√n)
has a Student's t distribution with "n − 1" degrees of freedom, "t(n−1)".
It is easy to derive the expression of the CI for the mean: since P(−t_{α/2} ≤ T ≤ t_{α/2}) = 1 − α, we obtain
CI = ( X̄ − t_{α/2} S/√n , X̄ + t_{α/2} S/√n )
For instance, for a sample of "n = 12" exam scores with x̄ = 25.33 and s² = 21.70, the 95% confidence interval for the mean is:
CI = (22.37, 28.29)
> qt(0.975,11)
[1] 2.200985
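The same interval can be reproduced in R (a quick sketch, using the 12 exam scores from the t-test exercise below):
x = c(23, 30, 30, 29, 28, 18, 21, 22, 18, 27, 28, 30)
mean(x) + c(-1, 1) * qt(0.975, 11) * sd(x)/sqrt(12)   # approximately (22.37, 28.29)
t.test(x)$conf.int                                    # same interval via t.test()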
1.14 Sum up
Our main goal is the estimation of parameters: the basic idea is that we have a population "Ω" (a large set of
individual statistical units). Making inference about a population means that we pick a sample and on that sample
we perform data analysis. Then we generalize the data analysis we did on the sample to the whole population
(inference).
"Parameter estimation" means that we assume that our data has a given particular analytical shape (for instance
a Gaussian distribution), and that the only "unknown" is the parameter (or parameters). "Parametric statistics"
means that we fix "f " and that θ is unknown.
How to measure the performance of an estimator:
• Unbiasedness of the estimator: the expected value of the estimator is equal to the parameter. The estimator
is centered around the parameter:
E(T ) = θ
• Consistency of the estimator: the MSE (mean square error) is equal to:
MSE = E((T − θ)²)
we want to pick an estimator with the lowest MSE possible, and the MSE should go to "0" when the sample size goes to infinity: n → ∞
• Confidence interval: an easy way to put both these two pieces of information together is to use a CI for
the parameter: we give information about both the position and the precision. A "Confidence Interval" provides, at the same time:
– Point estimation
– Precision of the estimate
The definition for a general parameter: a CI with level "(1 − α)" is an interval "(A, B)" such that P(θ ∈ (A, B)) = 1 − α.
1.15 Testing statistical hypotheses
A statistical test is a decision rule. It is a completely different way of doing estimation w.r.t. confidence intervals: this means that in the end we have two opposite options, "accept" or "reject" a hypothesis.
1. In the first stage we state a hypothesis about the value of our parameter (or the distribution) under investigation
1.15.1 Hypotheses
We want to check if the mean score in an exam is higher than the historical value "23.4", after the imple-
mentation of new online teaching material.
H0 : µ = 23.4  H1 : µ > 23.4
or
H0 : µ = 23.4  H1 : µ ̸= 23.4
In this case the two alternatives don't have exactly the same meaning even if, from the mathematical point of view, both can be used. Since we want to check whether the online material has increased the mean score of the class, it is better to use the first statement.
Remember that when we set up a test our objective (what we hope for) is to obtain the result of "accepting" "H1".
A statistical test is conservative: when constructed for a given nominal significance level, the true probability of
incorrectly rejecting the null hypothesis is never greater than the nominal level. We take "H0 " unless the data are
strongly in support of "H1 ". The statement to be checked is usually placed as the alternative hypothesis.
Statistical tests are very useful when dealing with regression models (testing the significance of variables).
We want to check if the mean score in an exam is higher than the historical value "23.4", after the implementation of new online teaching material. The correct hypotheses here are:
H0 : µ = 23.4  H1 : µ > 23.4
1.15.3 Level of the test
The test is conservative, so we fix the probability of the Type I error (rejecting "H0" when it is true) [of course, if we fix the Type I error we cannot also fix the Type II error]:
α = P_{H0}(reject H0)
We fix the Type I error since we first set the value of "α" (we control the probability of a false positive) [in a regression context, a false positive means that we introduce into the model a non-significant regressor]. Remember that these two errors trade off against each other: if we reduce one type of error we increase the other one, as the two kinds of errors are related.
Remember
For composite "H0 " we can reduce it to a simple one by taking the value of "H0 " nearest to "H1 ".
• One-tailed (right) test
H0 : µ = µ0  H1 : µ > µ0
• One-tailed (left) test
H0 : µ = µ0  H1 : µ < µ0
• Two-tailed test
H0 : µ = µ0  H1 : µ ̸= µ0
The decision rule is given by a test statistic, which is a function of the sample:
Test statistic
A test statistic is a function "T" dependent on the sample "X1 , . . . , Xn " and the parameter "θ". The
distribution of "T" must be completely known "under H0 " (so when it’s true).
Thus:
T = T (X1 , . . . , Xn , θ)
Remark: note that "T" is not in general an estimator of the parameter "θ".
In the case of the mean of normal distributions we have a sample "X1 , . . . , Xn " from "N (µ, σ 2 )" with both "µ"
and "σ 2 " unknown.
H0 : µ = µ0 = 23.4 H1 : µ > 23.4
we could use the sample mean "X̄" but the distribution of "X̄" under "H0 " (so when it is true) is:
X̄ ∼ N(µ0 , σ²/n)
and "σ 2 " isn’t known (in general). But a good choice is:
T = (X̄ − µ0) / (S/√n) ∼ t(n−1)
This is why a statistical test is "easier" than a confidence interval. Here we have no "unknowns": we don't have to "come back" to "µ", as we can directly apply the formula to decide whether "H0" can be accepted or not.
The philosophy of the test statistic is as follows: if the observed value is "sufficiently far" from "H0" in the
direction of "H1 ", then we reject the null hypothesis; otherwise we don’t reject "H0 ". The possible values of T are
divided into two subsets:
• Rejection region [its form depends on the type of the alternative hp we set in our test]
• Acceptance region (or better, a non-rejection region)
For scalar parameters such as the mean of a normal distribution we have three possible types of rejection regions:
P_{H0}(T ∈ R) = α
For the Student’s t test the critical values can be found on the Student’s t tables (a = −tα/2 , b = +tα/2 ).
Exercise - t-test (test statistic)
23 30 30 29 28 18 21 22 18 27 28 30
We have "x̄ = 25.33", "s2 = 21.70" and for a one-tailed right test (level 5%):
R = (1.7959; +∞)
Notice that the sample mean records a large increase (almost 2 points) but it is still not enough to accept the alternative hypothesis. Probably the "problem" here is the small sample size, as we have very few observations: with more observations the result might have become significant and we might have rejected the null hypothesis.
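The computation can be checked by hand in R (a sketch of ours):
x = c(23, 30, 30, 29, 28, 18, 21, 22, 18, 27, 28, 30)
t_obs = (mean(x) - 23.4) / (sd(x)/sqrt(length(x)))
t_obs           # about 1.438
qt(0.95, 11)    # about 1.796: t_obs does not fall in R = (1.7959, +Inf)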
The sample mean has a special property established through the Central Limit Theorem:
Given a sample "X1 , . . . , Xn " i.i.d. from a distribution with finite mean "µ" and variance "σ 2 ", we have:
(X̄ − µ) / (σ/√n) −→ N(0, 1)
Thus, for large "n", the distribution of the sample mean is approximately normal. We will come back later on the
Central Limit Theorem and its usefulness for statistical models.
1.15.8 p-value
All statistical software don’t compute the rejection region, but the "p-value" instead: we take the distribution of the
test statistic, we take the p-value of the test inside the distribution and we compute the probability of obtaining,
under "H0 ", values of the test statistic as extreme as the results actually observed.
P-value
The "p-value" is the probability of obtaining under "H0 " test results at least as extreme as the results actually
observed
In the case of a one-tailed test the standard theory selects a critical value "tα" in order to determine the acceptance and rejection regions. The p-value works differently: it does not compute "tα", but instead computes the probability beyond the observed value of the test statistic, in the direction of "H1".
The practical rule is: reject "H0" when the p-value is smaller than the chosen level "α"; otherwise do not reject "H0".
There are two main advantages when working with the p-value instead of the rejection region and test-statistics:
• First of all there are many ways to compute the same test (we can use several different test statistics
with different rejection regions): we don’t need to define a priori what kind of test we are going to adopt as
we will get the same p-value.
• We don’t need to define a priori the "α": the p-value is computed without fixing any probability of error,
so we can compute it and then check if it is less or greater than the standard thresholds.
This means that we can compute the "p-value" without saying what kind of test we use and without defining the
level of the test: it is completely independent.
In our example we use the "t−test" [the test for the mean of a normal distribution in all possible combinations]
where we set:
• "X" which is the data we provide (in this case our 12 scores)
> x=c(23, 30, 30, 29, 28, 18, 21, 22, 18, 27, 28, 30)
> t.test(x, mu=23.4, alternative="greater")
we see we obtain:
• "t = 1.4378" which represents the test statistic (which is the same we previously computed)
• "df = 11" which represents the degrees of freedom ("n − 1")
• "p − value = 0.08916" which represents the p-value
Small p-values correspond to significant tests, while large p-values do not. As our p-value is greater than "α = 0.05", the test is not significant and so we do not reject the null hypothesis "H0". Another advantage of the p-value is that we can detect borderline situations (where different values of "α" would make the difference between accepting and rejecting): this happens when the p-value is very close to "α".
2 Discrete random variables
2.1 Why random variables
Consider a random experiment modeled by a sample space "Ω": remember that the sample space is the set of
all possible outcomes in some experiment. Random variables, like probabilities, were originated in gambling and
therefore, some terminology comes from gambling. The actual value of a random variable is determined by the
sample point "ω ∈ Ω" that prevails, when the underlying experiment is conducted. We cannot know a priori
the value of the random variable, because we do not know a priori which sample point "ω" will prevail when the
experiment is conducted. We try to understand the behavior of a random variable by analyzing the probability
structure of that underlying random experiment. The typical random experiment in statistics is sampling.
A random variable (r.v.) translates each possible outcome into a real number, so mathematically it is a
function:
X : Ω −→ R
When the image of "X" is finite or countably infinite, the r.v. is a discrete random variable. For example we
can think about:
• The expected value is the sum of all possible values multiplied by the density of probability:
E(X) = Σ_x x fX(x)
2.4 Cumulative distribution function
The cumulative distribution function takes any possible real number "x" and computes the probability of values lower than or equal to "x". The cdf has the following main properties:
Remark: notice that "FX" is defined for all "x ∈ R".
• 0 ≤ FX(x) ≤ 1
• "FX" is non-decreasing because, as long as we move towards "+∞", we constantly increase the value
• lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1
x 1 2 3 4 5 6 7
fX (X) 0.1 0.3 0.2 0.1 0.1 0.1 0.1
Exercise - barplot cumulative distribution function (cdf)
x=c(1,2,3,4,5,6,7)                 #values from the table above#
p=c(0.1,0.3,0.2,0.1,0.1,0.1,0.1)   #density#
Fx=cumsum(p)                       #cumulative distribution function#
barplot(p,names.arg=x)
plot(x,Fx,pch=19)
#OR ALTERNATIVELY#
plot(stepfun(x,c(0,Fx)))
Bernoulli distribution
A r.v. "X" follows the Bernoulli distribution with parameter "p ∈ (0, 1)" if its density is:
fX (0) = 1 − p fX (1) = p
and we write:
X ∼ Bern(p)
The expected value and variance are indeed:
E(X) = p  VAR(X) = p(1 − p)
Let us consider a set of sequence of binary random variables. So we take an experiment with two possible outcomes
and we repeat it:
X1 , X2 , . . . , Xn , . . .
of Bernoulli random variables i.i.d (independent [the experiments don’t influence each other] and identically
distributed [we repeat the exact same experiment]). Such a sequence of i.i.d. random variables is named a
Bernoulli scheme.
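A minimal sketch of a Bernoulli scheme simulated in R (p = 0.3 is an arbitrary choice):
set.seed(1)
p = 0.3
x = rbinom(10, size=1, prob=p)   # ten i.i.d. Bern(0.3) trials
x
sum(x)                           # number of successes in the first 10 trials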
Questions:
1. What is the probability of "x" successes in the first "n" trials? [Binomial distribution]
2. What is the probability that the first success appears after "x" failures? [Geometric distribution]
3. What is the probability that the r -th success appears after "x" failures? [Negative binomial distribution]
2.6 Binomial random variables
The first question is easy: consider a Bernoulli scheme
X1 , X2 , . . . , Xn , . . .
and let
X = X1 + · · · + Xn
be the number of successes in the first "n" trials. Then P(X = x) = C(n, x) p^x (1 − p)^{n−x}, where
• "p^x" is the probability that the first "x" trials are 1's
• "(1 − p)^{n−x}" is the probability that the remaining "n − x" trials are 0's
• "C(n, x) = n!/(x!(n − x)!)" is the number of combinations of "x" successes within the "n" trials.
Binomial distribution
A discrete r.v. "X" has binomial distribution with parameters "n" and "p" if its density is:
fX(x) = C(n, x) p^x (1 − p)^{n−x},  x ∈ {0, . . . , n}
and we write:
X ∼ Bin(n, p)
The two parameters are:
• the number of trials "n" (integer "n ≥ 1")
• the probability of success at each trial "p ∈ (0, 1)"
The expected value and the variance are indeed:
E(X) = np  VAR(X) = np(1 − p)
Remember that the binomial random variable "X" is simply the sum of "n" i.i.d. Bernoulli r.v.’s.
X ∼ Bin(10, 1/3)
We need "P(X ≥ 9)":
P(X = 9) + P(X = 10) = C(10, 9) (1/3)^9 (2/3)^1 + C(10, 10) (1/3)^{10} (2/3)^0 ≈ 0.0003556
Software R for the binomial distribution provides 4 functions:
• dbinom: density
• pbinom: cumulative distribution function
• qbinom: quantile function
• rbinom: random generation
The previous computation (with the value "x", the number of trials "n" and the probability "p" as arguments, respectively) is
dbinom(9,10,1/3) + dbinom(10,10,1/3)
#OR ALTERNATIVELY#
1 - pbinom(8,10,1/3)
Use the following Software R code to plot the densities of different binomial r.v.’s by changing "n" and/or
"p".
x = 0:20
y = dbinom(x,20,0.5)
plot(x,y,pch=15)
#OR#
barplot(y,names.arg=x)
2.7 Geometric random variables
We now answer the second question: let us consider the number of failures "X" until the first success in a Bernoulli
scheme with parameter "p" (this is the waiting time of the first success). For all "x ≥ 0", we have:
Geometric distribution
A discrete r.v. "X" has geometric distribution with parameter "p" if its density is:
and we write:
X ∼ Geom(p)
The expected value and the variance are indeed:
E(X) = (1 − p)/p  VAR(X) = (1 − p)/p²
The geometric distribution has the lack-of-memory (memory-less) property. It means that, in a sequence of independent trials, the information about the number of failures until a given time gives no information about the future: the classic example is the "lottery".
Let "X ∼ Geom(p)", we have:
P(X = x + h|X > h) = P(X = x)
We compute the probability given "X > h", which means that in the first "h" trials we know we had no
successes (all failures).
The memory-less property (also called the forgetfulness property) means that a given probability
distribution is independent of its history: any time may be marked down as time zero. If a probability
distribution has the memory-less property the likelihood of something happening in the future has no relation
to whether or not it has happened in the past. The history of the function is irrelevant to the future.
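A quick numerical check of the memory-less property (a sketch with arbitrary values p = 1/6, x = 3, h = 4):
p = 1/6; x = 3; h = 4
dgeom(x+h, p) / (1 - pgeom(h-1, p))   # P(X = x+h | X >= h)
dgeom(x, p)                           # P(X = x): the two values coincide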
Notice that in most probability textbooks a slightly different definition of the geometric density is used: the variable counts the number of trials (rather than failures) until the first success, i.e. "Y = X + 1", with density "fY(y) = (1 − p)^{y−1} p" for "y ≥ 1".
Exercise - Geometric density - Cumulative distribution function
x=0:20
y=dgeom(x,1/6) #density#
yc=pgeom(x,1/6) #cumulative distribution function#
par(mfrow=c(1,2))
plot(x,y,pch=19,main="Density")
plot(stepfun(x,c(0,yc)),lwd=2,main="CDF")
par(mfrow=c(1,1)) #to reset the plot window#
2 Plot the density of the "Geom(1/2)" (use "0 ≤ x ≤ 20") and the corresponding CDF’s
x=0:20
y=dgeom(x,1/2) #density#
yc=pgeom(x,1/2) #cumulative density distribution#
par(mfrow=c(1,2))
plot(x,y,pch=19,main="Density")
plot(stepfun(x,c(0,yc)),lwd=2,main="CDF")
par(mfrow=c(1,1)) #to reset the plot window#
Exercise - Geometric density - Cumulative distribution function
Toss a die until it gives "1" (this is a Bernoulli scheme with "p = 1/6").
• What is the probability that the game ends exactly in the sixth trial?
• What is the probability that the game ends by the sixth trial?
We need "P(X = 5)" (the game ends exactly at the sixth trial) and "P(X ≤ 5)" (the game ends by the sixth trial):
dgeom(5,1/6)
[1] 0.0669796
pgeom(5,1/6)
[1] 0.665102
2.8 Negative binomial random variables
A discrete r.v. "X" has the negative binomial distribution with parameters "r" and "p" if its density is:
fX(x) = P(X = x) = C(r + x − 1, r − 1) (1 − p)^x p^r,  x ≥ 0
and we write:
X ∼ N Bin(r, p)
So the construction of the density of this r.v. is basically the same as the one we performed for the geometric distribution, but now we wait for the "r-th" success instead of the first one.
The expected value and the variance are indeed:
E(X) = r(1 − p)/p  VAR(X) = r(1 − p)/p²
Waiting for the "r-th" success is just like waiting for the first one, then "lack-of-memory" and then starting over
with a new process and so on. The negative binomial distribution is simply the sum of "r" geometric distributions
and, based on the lack-of-memory, the sum is the sum of independent random variables.
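A simulation sketch (with arbitrary values r = 3 and p = 0.4) checking that the sum of "r" independent geometric r.v.'s behaves like a "N Bin(r, p)":
set.seed(1)
r = 3; p = 0.4; n = 10000
sums = rgeom(n, p) + rgeom(n, p) + rgeom(n, p)   # sum of 3 independent Geom(p)
c(mean(sums), r*(1-p)/p)                         # empirical vs theoretical mean
c(var(sums), r*(1-p)/p^2)                        # empirical vs theoretical variance
emp = table(factor(sums, levels=0:10))/n         # empirical frequencies for 0..10
round(rbind(empirical=as.numeric(emp), theoretical=dnbinom(0:10, r, p)), 3)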
2.9 Generalization of "N Bin(r, p)" to real "r"
It is useful to express the factorial using the "Γ" function, which is a function that interpolates the values of the
factorial (we can extend the concept of factorial to every real positive number):
Gamma function
The Gamma function is defined by
Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt
for any positive real number "s > 0".
With this notation, the negative binomial density can be written for any real "r > 0" as:
fX(x) = P(X = x) = [Γ(r + x) / (Γ(r) x!)] (1 − p)^x p^r,  x ≥ 0
Remark: since "r" will be regarded as a dispersion parameter in a statistical model, it is crucial to have this
generalized definition available.
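In R, dnbinom() accepts a non-integer "size" parameter, so the generalized density can be checked directly (a sketch with arbitrary values r = 2.5 and p = 0.4):
r = 2.5; p = 0.4; x = 0:5
f_manual = gamma(r+x)/(gamma(r)*factorial(x)) * (1-p)^x * p^r
f_dnbinom = dnbinom(x, size=r, prob=p)
round(f_manual - f_dnbinom, 12)   # the two computations coincide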
2.10 Poisson random variables
Poisson distribution
A discrete r.v. "X" has Poisson distribution with parameter "λ > 0" if its density is:
fX(x) = e^{−λ} λ^x / x!,  x ≥ 0
and we write:
X ∼ P(λ)
The expected value and the variance are indeed:
E(X) = λ V AR(X) = λ
It is a very simple distribution, as the expected value is equal to the variance (this can be either an advantage or a disadvantage): since there's only one parameter, if you fix the expected value you are also fixing the variance.
Exercise - Poisson density - Poisson distribution
The annual number of earthquakes in Italy is a Poisson r.v. with parameter "λ = 4". Compute the probability of no earthquakes in a year, "P(X = 0)", and the probability of more than "6" earthquakes, "P(X > 6)".
dpois(0,4)
[1] 0.01831564
1-ppois(6,4)
[1] 0.110674
1-sum(dpois(0:6,4))
[1] 0.110674
Mimic the exercise on the geometric distribution to plot the densities of some Poisson distribution. The
plots of "P(5)" and "P(2)":
• "n" is large
• "p" is small
34
Theorem - Sequence of binomial random variables
Let "(Xn )n≥1 " be a sequence of binomial r.v.’s "Xn ∼ Bin(n, pn )" with parameters "n" and "pn ". We know
that if "npn → λ > 0" when "n → ∞" then:
lim_{n→∞} P(Xn = x) = e^{−λ} λ^x / x!
n=30
p=0.1
x=0:n #vary if needed#
y1=dbinom(x,n,p)
lambda=n*p
y2=dpois(x,lambda)
plot(x,y1,col="red",pch=19)
points(x,y2,col="black",pch=22)
From the plot we can see that we obtained a good approximation even when "n" is not that large.
So the idea of the Poisson distribution is to have an approximation of the binomial distribution. This is valid as
"n → ∞" but it’s a reasonable approximation also for smaller "n".
2.10.2 Over-dispersed Poisson
While the Poisson is the first choice to model count data, it has the constraint:
VAR(X) = E(X) = λ
A common alternative for over-dispersed counts is the negative binomial distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are over-dispersed with respect to a Poisson distribution, for which the mean is equal to the variance. The two parameters "r" and "p" can be adjusted to reach the desired mean and variance. The data are "over-dispersed" when the variance of the sample is greater than the expected value.
For the opposite case (under-dispersion), when the variance is less than the expected value, we cannot use the negative binomial distribution: since "p ∈ (0, 1)", its variance "r(1 − p)/p²" is always larger than its expected value "r(1 − p)/p". This is a limitation, but in the majority of cases based on real data we only find over-dispersion (which we can model) and not under-dispersion.
Exercise 1 - Poisson and negative binomial distributions
1 Find the parameters of the negative binomial distribution with
E(X) = 3  VAR(X) = 6
The parameters are indeed:
r = 3  p = 0.5
In fact,
E(X) = r(1 − p)/p  VAR(X) = r(1 − p)/p²
so we can easily compute "p" as the ratio:
E(X)/VAR(X) = [r(1 − p)/p] / [r(1 − p)/p²] = p
which gives p = 3/6 = 0.5 and then r = E(X) · p/(1 − p) = 3.
2 Plot (in the same graph) the density of "P(3)" and of the negative binomial above, with the following
Software R code:
x=0:15
y1=dpois(x,3)
y2=dnbinom(x,3,0.5)
plot(x,y1)
points(x,y2,pch=16)
The plot of the density of "P(3)" and of the "N Bin(3, 0.5)" are:
The negative binomial (black dots) has a larger variance as we have a greater probability on the tail, while
the Poisson (white dots) is more concentrated around the mean.
Exercise 2 - Poisson and negative binomial distributions
1 Find the parameters of the negative binomial distribution with
E(X) = 10  VAR(X) = 14
As above,
E(X) = r(1 − p)/p  VAR(X) = r(1 − p)/p²
so we can easily compute "p" as the ratio:
E(X)/VAR(X) = [r(1 − p)/p] / [r(1 − p)/p²] = p = 10/14 = 0.714286
We then obtain:
p = 0.714286  r = E(X) · p/(1 − p) = 25
2 Plot (in the same graph) the density of "P(10)" and of the negative binomial above, with the R code
as above (use the x-range "0 ≤ x ≤ 20").
x=0:20
y1=dpois(x,10)
y2=dnbinom(x,25,0.714286)
plot(x,y1)
points(x,y2,pch=16)
3 Try with other values of mean and variance (for instance, try "E(X) = 10" and "V AR(X) = 100").
...
2.11 Mixture of two random variables
Mixtures are used in statistics when, inside our population, we have sub-populations with different behaviors. The best way to describe this kind of population is by using mixtures: this means that inside our population we have sub-classes with different distributions, or with the same distribution but with different parameters. Mixtures also model two-stage experiments.
We could for example take into consideration the mixture of two different Poisson distributions. In this case we
consider two populations with the same weight "α = 1/2":
Let "X ∼ P(5)" and "Y ∼ P(2)". The density of the mixture with "α = 1/2" is:
f(x) = (1/2) e^{−5} 5^x/x! + (1/2) e^{−2} 2^x/x!
1 Plot the mixture density above
x=0:20
y=1/2*exp(-5)*(5^x)/(factorial(x))+1/2*exp(-2)*(2^x)/(factorial(x))
plot(x,y)
...
Exercise - Zero-inflated Poisson
In some cases, one component of the mixture is deterministic. This is for instance the case of the Zero-
Inflated Poisson. Zero-inflated means that we observe a Poisson distribution but we have an excess of zero.
The ZIP distribution is defined as the mixture:
fX(x) = α δ0(x) + (1 − α) fY(x)
where:
• "δ0(x)" is a (non-random) point mass at zero
• "fY(x)" is a "P(λ)" distribution
Data of this kind are very common among count data. A real example of this distribution may be "car accidents": car accidents follow a Poisson distribution, plus an excess of zeros for those who don't drive cars. Remember that, for example, with "α = 0.2" the probability of observing "zero" is larger than "0.2", as it also contains the zero-probability of the Poisson component.
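A sketch of the ZIP density in R (with arbitrary values α = 0.2 and a "P(3)" Poisson component):
alpha = 0.2; lambda = 3
x = 0:15
f_zip = alpha*(x==0) + (1-alpha)*dpois(x, lambda)   # delta_0 part + Poisson part
plot(x, f_zip, pch=19)                              # black dots: ZIP density
points(x, dpois(x, lambda))                         # white dots: plain Poisson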
3 Lab 1
3.1 Discrete random variables
• Poisson approximation to the binomial distribution
1. Fix "n = 50" and "p = 0.05". Plot on the same graph the density of the binomial "Bin(n, p)" and the
relevant approximating Poisson density [take "0 : 20" as domain of the density]. How much probability
is lost with this choice?
n=50 #number of obs#
x=0:20 #domain of the density#
p=0.05 #probability#
y1=dbinom(x,n,p) #binomial density 1#
plot(x,y1,col="red") #plot of the binomial density 1#
lambda=n*p #definition of the parameter lambda#
y2=dpois(x,lambda) #Poisson density 1#
points(x,y2,col="green") #plot of the Poisson#
1-sum(y1) #the probability lost: 1-sum of probabilities#
As you can see from the code the lost probability is a very small number: this means that we practically
have no more probability left in the range [21,50]. We can then reduce our domain to [0,20] to have a
better representation of our density. From the graphical point of view notice that beyond "9" we only obtain
the green points as the two distributions coincide (remember that this is a graphical approximation).
2. Vary the parameters "n" and "p", choosing a small "p" in order to improve the approximation.
n=500 #increased number of obs (x10)#
x=0:20
p=0.005 #decreased probability (:10)#
y1=dbinom(x,n,p)
plot(x,y1,col="red")
lambda=n*p
y2=dpois(x,lambda)
points(x,y2,col="green")
1-sum(y1)
Since we want to obtain a better approximation we increase the number of observations and we decrease
the probability. Indeed the approximation "Binomial to Poisson" is good as long as "n → ∞" and "p → 0".
In this case we try to replicate the example by multiplying and dividing for a "10" factor. From the plots
we notice that there’s a smaller difference in the densities (almost all the points graphically coincide).
Since the graphical feedback isn’t a great solution from a statistical point of view, the best way to decree
the goodness of the approximation is again to define a vector of the difference between the two densities.
For the Poisson distribution we know that the mean is equal to the variance (this isn’t a good property
when modeling discrete data).
3. When the differences become too small, define a vector "d" of difference and plot it (with barplot).
Observe that, for fixed "λ" the maximum difference decreases when "n" increases (and thus "p" decreases).
difference=y2-y1 #vector of difference (Poisson-Binomial)#
barplot(difference, names.arg=x)
When the approximation is very good the two plots are basically the same: in order to check the difference
we have to set a "difference" variable between the two densities. If we look at the same example but with
the initial values of "n" and "p" we obtain the same plot (the same shape) but with higher error: we
obtain the same result but with a "x10" error (read the values on the "y" axis).
["names.arg": this parameter is a vector of names appearing under each bar in bar chart]
4. Do you find common behaviors in the plots obtained in item "3)" for different values of "n" and "p"? Can
you explain this? [Hint: compare the formulae of the two variances].
Notice from the plot that the binomial has a slightly lower variance with respect to the Poisson as red
points are a bit higher around the mean and lower over the tails. This happens because the two variances
are indeed:
Binomial: VAR(X) = np(1 − p)   Poisson: VAR(X) = λ = np
The approximating Poisson always overestimates the variance of the original binomial distribution, as it drops the "(1 − p)" factor.
COMPLETE EXERCISE
###1)
n=50 #number of obs#
x=0:20 #domain of the density#
p=0.05 #probability#
###2)
n=500 #increased number of obs#
x=0:20
p=0.005 #decreased probability#
y1=dbinom(x,n,p)
plot(x,y1,col="red")
lambda=n*p
y2=dpois(x,lambda)
points(x,y2,col="green")
1-sum(y1)
###3)
difference=y2-y1 #vector of difference (Poisson-Binomial)#
barplot(difference, names.arg=x)
• Poisson vs Negative binomial
1. In the file "discrete.txt" you find two variables "x" and "y" on a sample of 100 individuals. Import the
dataset in Software R.
In this case we can simply use the "import dataset" setting or we can also use the console command below
(of course remember to select the right directory):
data=read.table("E:/. . ./discrete.txt", header=T)
View(data)
2. Use "x". Use the barplot with the following code to have the distribution of the data in the range "0 : 20":
Then use the function "points" to overlay the density of the best Poisson approximation, and the best
Negative binomial approximation (use different "pch" and/or different colors).
In this case we use the "attach()" function. By doing this the database is attached to the R search path.
This means that the database is searched by R when evaluating a variable, so objects in the database can
be accessed by simply giving their names. This means that if for example we have a dataframe "df" which
contains different variables (x1 , x2 , . . . , xn ), if we apply this function "attach(df) we obtain:
so we don’t need to specify the name of the dataframe everytime we use a function. At the end of the
analysis we will of course use the "detach()" function. This workflow is useful just when operating with
one dataframe at the time as it becomes impossible to use when operating with several dataframes.
We then convert the variable "x" into a "factor": this allows us to perform graphical representations of
the data with labels also when we register "0" frequencies.
attach(data) #we attach the dataframe#
barplot(table(factor(x,levels=0:20))/100)
lambda=mean(x)
points(dpois(0:20,lambda),pch=17,col="red")
p=mean(x)/var(x)
r=mean(x)*p/(1-p)
points(dnbinom(0:20,r,p),pch=19,col="blue")
From the values of "mean(x)" and "var(x)" we notice that our data suffer from some overdispersion: the
variance is significantly larger than the mean: from our theory the Negative Binomial should be a better
approximation of our data than the Poisson distribution.
This is the best approximation of our data using the Poisson distribution but as we can see from the
graphical point of view the 2 maxima don’t really correspond to each other.
As we can see the Negative Binomial distribution has a larger variance (flatter than the Binomial).
3. Repeat for "y"
We then repeat the same path for the "y":
mean(y)
var(y)
lambda=mean(y)
p=mean(y)/var(y)
r=mean(y)*p/(1-p)
barplot(table(factor(y,levels=0:20))/100)
points(dpois(0:20,lambda),pch=17,col="red")
points(dnbinom(0:20,r,p),pch=19,col="blue")
detach(data)
COMPLETE EXERCISE
###1)
data=read.table("E:/. . ./discrete.txt", header=T)
View(data)
###2)
attach(data) #we attach the dataframe#
barplot(table(factor(x,levels=0:20))/100)
lambda=mean(x)
points(dpois(0:20,lambda),pch=17,col="red")
p=mean(x)/var(x)
r=mean(x)*p/(1-p)
points(dnbinom(0:20,r,p),pch=19,col="blue")
###3)
mean(y)
var(y)
lambda=mean(y)
p=mean(y)/var(y)
r=mean(y)*p/(1-p)
barplot(table(factor(y,levels=0:20))/100)
points(dpois(0:20,lambda),pch=17,col="red")
points(dnbinom(0:20,r,p),pch=19,col="blue")
detach(data)
4 Continuous random variables
4.1 Continuous random variables
Random variable in continuous case
The definition is exactly the same we previously saw: a random variable (r.v.) is a function:
X : Ω −→ R
The difference is in the image of "X": in the continuous case the image lies in a continuous set. When the
image of "X" is not countable (typically, the real line or a real interval) the r.v. is a continuous random
variable.
For example:
Remark: notice that (last example) continuous random variables are used to model counts when numbers are
large.
The density of a continuous r.v. "X" is a function
fX : R −→ R
such that the probability of each interval is exactly the integral of the density between its endpoints (the area under the curve), for all "a, b ∈ R":
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx
So of course when moving from the discrete to the continuous case we swap from "sums" to "integrals".
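A small check of this correspondence in R (a sketch, using the standard normal density on the interval [−1, 2]):
a = -1; b = 2
integrate(dnorm, a, b)$value   # integral of the density over [a, b]
pnorm(b) - pnorm(a)            # the same probability via the cdf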
4.2 Cumulative distribution function
In the continuous case the cumulative distribution function (cdf) plays a much more important role. It is defined by the probability of the left tail at each point:
FX(x) = P(X ≤ x) = ∫_{−∞}^x fX(t) dt,  x ∈ R
Remark: notice that "FX " is defined for all "x ∈ R".
• 0 ≤ FX (x) ≤ 1
An application of the well-celebrated fundamental theorem of integral calculus (here stated for a generic density, e.g. that of the standard Normal distribution):
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx = FX(b) − FX(a)
This intuition translates into a property which says that the probability of an interval is computed from
the density with an integral. If we know the cumulative distribution function we can compute it: it is enough to
take the difference of the cumulative distribution functions in the upper and lower bounds (FX (b) − FX (a)). This
means that if the cumulative distribution function is available we don’t have to deal with integrals when computing
probabilities.
Definition
The quantile function of r.v. "X" is:
47
If "FX " is continuous and strictly monotone, then the definition simplifies to:
QX(p) = FX^{-1}(p)
A continuous random variable "X" has uniform density distribution over the interval "[α, β]" if its density
is:
fX(x) = 1/(β − α)  for α ≤ x ≤ β,  and fX(x) = 0 otherwise
and we write:
X ∼ U[α, β]
The expected value and variance are indeed:
E(X) = (β + α)/2  VAR(X) = (β − α)²/12
Application
The uniform, and especially the "U[0, 1]" plays a central role in the framework of simulation. Let us suppose
we want to generate a random number with density "fX". If the cdf "FX" is known and "FX^{-1}" can be computed, then:
• Choose a random number "u" from a "U[0, 1]" distribution
• Compute "QX(u) = FX^{-1}(u)": the result is a random number with density "fX"
Definition
In some sense the continuous exponential distribution is the counter-part of the geometric random variable
in the discrete case, in the sense that it has the lack-of-memory property we previously defined for the
geometric distribution. A continuous random variable "X" has exponential distribution with parameter
"λ > 0" if its density is:
fX(x) = λ e^{−λx}
when "x > 0" (and "0" otherwise). We write:
X ∼ ε(λ)
Exercise - Exponential random variable
Let us consider a r.v. "X" representing the income distribution in a population. If "X" has exponential
distribution with mean 22 ("K$"), compute the probability that an income lies between "15" and "20".
The parameter of the exponential distribution is:
λ = 1/E(X) = 1/22
pexp(20,1/22)-pexp(15,1/22)
[1] 0.1028064
remember that the prefix "p" gives the cumulative distribution function "cdf". The exponential distribution
has a simple expression of the cdf for "x ≥ 0":
FX(x) = 1 − e^{−λx}
Plot in the same graph the densities of "ε(1)", "ε(1/2)" and "ε(1/5)". Use "[0, 10]" for the x-range.
x=seq(0,10,0.1)
y1=dexp(x,1)
y2=dexp(x,1/2)
y3=dexp(x,1/5)
plot(x,y1,type="l",lwd=3,col="cyan")
lines(x,y2,lwd=3,col="black")
lines(x,y3,lwd=3,col="magenta")
4.5 Exponential distribution
The exponential distribution is the continuous counterpart of the geometric distribution.
Remember
The exponential distribution has the lack-of-memory property: "P(X > x + h | X > h) = P(X > x)" for all "x, h ≥ 0".
Gamma distribution
A continuous r.v. "X" has Gamma distribution with parameters "α > 0" and "λ > 0" if its density is:
fX(x) = λ^α x^{α−1} e^{−λx} / Γ(α)
when "x > 0" (and "0" otherwise). Here our "Γ" is again the function:
Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt
We write:
X ∼ Gamma(α, λ)
The expected value and variance are indeed:
E(X) = α/λ  VAR(X) = α/λ²
We encounter a special case for "α = 1" as we have "Gamma(1, λ) = ε(λ)".
The sum of "n" independent exponential r.v.'s "ε(λ)" is a "Gamma(n, λ)". However, the definition of the Gamma distribution is given for a generic positive first parameter. The distribution "Gamma(n/2, 1/2)" is a well-known distribution, the Chi-square distribution "χ²(n)".
To see what happens with different values of "α" and "λ", and to learn about a different parametrization of the
Gamma distribution, look at
[https://en.wikipedia.org/wiki/Gamma_distribution]
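A quick simulation sketch of the first statement above (sums of independent exponentials behave like a Gamma); the values of "n" and "λ" are only illustrative:

set.seed(1)
n <- 5; lambda <- 2
sums <- replicate(10000, sum(rexp(n, rate = lambda)))   #each replicate is a sum of n exponentials#
hist(sums, freq = FALSE, breaks = 50)
curve(dgamma(x, shape = n, rate = lambda), add = TRUE, lwd = 3, col = "magenta")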
Exercise - Gamma random variables
Use (and adapt) the following R code to plot the densities of several gamma random variables:
1 For fixed "α = 2", varying "λ"
2 For fixed "λ = 1", varying "α"
3 Use the R help to learn about two different parametrizations of the Gamma r.v.’s
###Fixed shape###
x=seq(0,10,0.1)
y1=dgamma(x,shape=2,rate=1)
y2=dgamma(x,shape=2,rate=2)
y3=dgamma(x,shape=2,rate=0.4)
plot(x,y2,type="l",lwd=3,col="black",main="Fixed shape")
lines(x,y1,type="l",lwd=3,col="cyan")
lines(x,y3,type="l",lwd=3,col="magenta")
###Fixed rate###
y1=dgamma(x,shape=1,rate=1)
y2=dgamma(x,shape=2,rate=1)
y3=dgamma(x,shape=4,rate=1)
plot(x,y2,type="l",lwd=3,col="black",main="Fixed rate")
lines(x,y1,type="l",lwd=3,col="cyan")
lines(x,y3,type="l",lwd=3,col="magenta")
4.7 Gaussian (normal) random variables
Gaussian distribution
A continuous r.v. "X" has normal or Gaussian distribution with parameters "µ" and "σ 2 > 0" if its
density is:
fX(x) = 1/(√(2π) σ) e^(−(x−µ)²/(2σ²))
for all "x ∈ R". We write:
X ∼ N (µ, σ 2 )
The expected value and variance are:

E(X) = µ        VAR(X) = σ²

Since we have these results, the two parameters account for the first two moments of the distribution. It is important to notice that the graph of the density is symmetric around the expected value.
Moreover, the family is closed under sums of independent variables: if we know that two independent random variables are normally distributed, then we also know the distribution of their sum (it is again normal, with mean the sum of the means and variance the sum of the variances).
3. Central limit theorem:
The Gaussian distribution is the limiting distribution of the sample mean, essentially whatever the distribution of the data. The sample mean of a normal sample is normally distributed for each "n", but for large "n" the (standardized) sample mean is approximately normally distributed for whatever distribution of our data, even for distributions with properties far from the Gaussian. If "X1, X2, . . . , Xn, . . ." is a sequence of i.i.d. r.v.'s with "E(Xi) = µ" and "VAR(Xi) = σ²", not necessarily normally distributed, then the sample mean:

X̄n = (X1 + · · · + Xn)/n

satisfies:

(X̄n − µ)/(σ/√n) −→ N(0, 1)
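A small simulation sketch of this statement, with exponential (hence clearly non-normal) data; "n = 50" and "λ = 1" are only illustrative choices:

set.seed(1)
n <- 50; lambda <- 1          #for exp(1): mean 1/lambda, sd 1/lambda#
z <- replicate(10000, (mean(rexp(n, lambda)) - 1/lambda) / (1/(lambda*sqrt(n))))
hist(z, freq = FALSE, breaks = 50)               #standardized sample means#
curve(dnorm(x), add = TRUE, lwd = 3, col = "magenta")   #close to N(0,1)#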
Let us consider the construction of the boxplot under the point of view of probability theory. One of the main
features of the boxplot is the detection of (univariate) outliers.
Boxplots are the representation of continuous random variables based on the quartiles. The quartiles are
the quantiles which divide the statistical distribution into 4 parts, each of them having one fourth of the
distribution.
Graphically the outliers are the data points displayed with a special symbol: they are the data points lying more than "1.5" times the interquartile range away from the box. These limits are calibrated on a normal distribution: the boxplot has been defined having in mind the Gaussian distribution, so the definition of outliers is optimal if our data come from a normal distribution.
Exercise - Boxplot for continuous random variables - computation of quantiles
Here we need to compute the probability of the outlier region. We compute the interquartile range by considering "Q1" and "Q3" (the length of the box). For the normal "N(0, 1)" distribution we obtain "IQR = 1.34898". The lower and the upper limits are the points "1.5" times the interquartile range away from the borders of the box. The probability of the lower zone in this case is "0.003488302" (by symmetry we obtain the same value on the other tail), so the probability of the whole outlier zone is approximately "0.7%". This means that if we generate, for example, 1,000 data points we expect to find about 7 outliers.
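A short R computation reproducing the numbers quoted above for the N(0, 1) case:

q1 <- qnorm(0.25); q3 <- qnorm(0.75)
iqr <- q3 - q1                 #1.34898#
lower <- q1 - 1.5 * iqr        #lower whisker limit#
pnorm(lower)                   #0.003488302, probability of one tail#
2 * pnorm(lower)               #about 0.7% in total, i.e. roughly 7 points out of 1,000#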
The mixtures are used to model the presence of two (or more) sub-populations with different parameters. They
also model two-stage experiments.
4.10 Mixture of several random variables
The generalization to more than two components is straightforward:

fX(x) = Σ_{j=1}^{k} αj fXj(x)

with "αj ∈ (0, 1)" and "Σ_j αj = 1". Here the "fXj(x)" are, for example, normal densities, each with its own parameters.
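A minimal sketch of the two-stage interpretation of a (two-component) normal mixture: first pick the component with probabilities (α, 1 − α), then draw from the corresponding normal. All parameter values below are purely illustrative:

set.seed(1)
alpha <- 0.3
k <- rbinom(1000, 1, alpha)                     #stage 1: component label#
x <- ifelse(k == 1, rnorm(1000, mean = 0, sd = 1),
                    rnorm(1000, mean = 4, sd = 1.5))   #stage 2: draw from the chosen component#
hist(x, freq = FALSE, breaks = 40)              #typically shows two bumps#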
The problem is: if we want to estimate the density of a continuous variable "X" on the basis of a sample of size "n",
both the histogram and the empirical cumulative distribution function (cdf) are discrete in nature. This seems to
be bad, because we would like to have a continuous object.
4.11.2 Histogram
Let us give a deeper look into the histogram definition. If "f (x)" is a smooth density, we have that:
P(x − h/2 < X < x + h/2) = F(x + h/2) − F(x − h/2) = ∫_{x−h/2}^{x+h/2} f(z) dz ≃ h f(x)
where "h > 0" is a small (positive) scalar called "bin width". If "F (x)" were known, we could estimate "f (x)" using:
f̂(x) = [ F(x + h/2) − F(x − h/2) ] / h
but this is not practical because if we know "F" we don't need to estimate "f". Different choices for the number of bins "m" and the bin width "h" will produce different estimates of "f(x)":
x = rnorm(100)        #x: any numeric sample; rnorm(100) is just an example#
par(mfrow=c(1,2))     #two histograms side by side#
hist(x,main="Sturges")
hist(x,breaks="FD",main="FD")
par(mfrow=c(1,1))
• K(x) ≥ 0 ∀x ∈ R It is non-negative
• K(x) = K(−x) ∀x ∈ R The kernel is symmetric around "0"
• ∫_{−∞}^{+∞} K(x) dx = 1    The integral is "1"
In other words, "K" is a non-negative function that is symmetric around "0" and integrates to "1": therefore the
expected value is "0". A simple example is the uniform (or box) kernel:
K(x) = 1    if −1/2 ≤ x ≤ 1/2
K(x) = 0    otherwise
Another popular kernel function is the Normal distribution kernel with "µ = 0" and fixed variance "σ²":

K(x) = 1/(√(2π) σ) e^(−x²/(2σ²))
We could also use a triangular kernel function:
K(x) = 1 − |x|    if x ∈ [−1, 1]
K(x) = 0          otherwise
http://upload.wikimedia.org/wikipedia/commons/4/47/Kernels.svg
Given a random sample "X1 , . . . , Xn " i.i.d. from an unknown density "f (x)", the KDE of "f " is:
f̂(x) = (1/n) Σ_{i=1}^{n} K_{xi}(x) = (1/(nh)) Σ_{i=1}^{n} K((x − xi)/h)
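A hand-made sketch of this estimator with a Gaussian kernel, only to make the formula concrete; in practice one would simply use density(). The sample and the bandwidth value are illustrative:

kde <- function(x, data, h) {
  #for each evaluation point x0: (1/(n*h)) * sum of K((x0 - xi)/h)#
  sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)
}
set.seed(1)
data <- rnorm(100)
xs <- seq(-4, 4, 0.05)
plot(xs, kde(xs, data, h = 0.4), type = "l", lwd = 3)
lines(density(data, bw = 0.4), col = "magenta", lwd = 3)   #built-in estimate for comparison#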
4.11.5 Bias-variance trade-off
In this example, if we choose a very small "h" (which is basically the standard deviation of the kernel) we obtain very concentrated, identical kernels around each data point. The sum of these kernels would not be a "smooth" graph as in the previous figure: we would instead obtain 6 separate peaks (low bias but high variance). On the contrary, if we choose a very large "h" (a very large dispersion for each kernel) we would obtain 6 very flat kernels, whose sum would be a density close to a flat line (low variance but high bias). The optimal choice lies, of course, between these two extreme situations.
x = rnorm(20)
plot(density(x))
plot(density(x,kernel="epanechnikov"))
plot(density(x,kernel="epanechnikov",adjust=0.5))
There's a clever way to tune the bandwidth: we can move away from the optimal bandwidth (enlarging or shrinking the kernel) by using the "adjust" parameter (Software R multiplies the optimal value by that parameter). For small values we obtain a peak at every data point (not a useful density estimate, as we get several local maxima), whereas for larger values we oversmooth and lose relevant information about the data points (we may also lose the outliers) [try for example values of "0.1" and "2"].
4.12 Checking for normality
Our objective here is to check whether our sample is normal or not. There are basically two methods. The first one is based on a graphical comparison of the data and is called the Q-Q plot (Quantile-Quantile plot): it compares the empirical quantiles of our sample with the corresponding quantiles of the standard normal distribution, one set on each axis. If our data are normal the two sets of quantiles should agree (the points follow a straight line).
Q-Q plots display the observed values against normally distributed data (the line represents perfect normality).
Another way to check for normality is to use statistical tests. Statistical tests for normality are more precise since
actual probabilities are calculated. Tests for normality calculate the probability that the sample was drawn from a
normal population.
4.13 Anderson-Darling test
The Anderson-Darling statistic belongs to the class of quadratic EDF statistics (tests based on the Empirical
Distribution Function, not on the moments). Quadratic EDF statistics look at the "distance":
n ∫_{−∞}^{+∞} (Fn(x) − F(x))² w(x) dF(x)
where we have:
• "w(x)" is a weight function which slightly reduces the weight on the extremes (technical factor to obtain a
better test)
• "Fn " is the Empirical Cumulative Distribution Function
• "F " is the Theoretical Cumulative Distribution Function under "H0 "
Thus, the Anderson-Darling distance places more weight on observations in the tails of the distribution.
The Anderson-Darling tests computes the distance (the area) between the Empirical Cumulative Dis-
tribution Function and the Theoretical Cumulative Distribution Function: the more the area, the less
the correspondence of the functions (the null hypothesis corresponds to an area equal to "0").
where "ϕ" is the cdf of the standard normal distribution "N (0, 1)"
• The critical values are in suitable tables
Exercise - Anderson-Darling test statistic
Let us consider the following data (population density of the villages in a Mexican state). Import the data
from the file "mexvillages.txt":
Village PopulationDensity
Aranza 4.13
Corupo 4.53
SanLorenzo 4.69
Cheranatzicurin 4.76
Nahuatzen 4.77
Pomacuaran 4.96
Sevina 4.97
Arantepacua 5.00
Cocucho 5.04
Charapan 5.10
Comachuen 5.25
Pichataro 5.36
Quinceo 5.94
Nurio 6.06
Turicuaro 6.19
Urapicho 6.30
Capacuaro 7.73
Load (and install if necessary) the "nortest" library and use the Software R "help" in order to find how to
perform the Anderson-Darling test.
First of all we perform the Q-Q plot (quantile-quantile compared with the normal distribution).
hist(mexvillages$PopulationDensity)
View(mexvillages)
attach(mexvillages)
###Q-Q PLOT###
qqnorm(PopulationDensity)
qqline(PopulationDensity)
We then perform the Anderson-Darling normality test:
ad.test(PopulationDensity)
data: PopulationDensity
A = 0.77368, p-value = 0.03543
The p-value is "3%" so, considering a standard "5%" threshold level, we reject the null hypothesis: this
means that we reject the normality (our data is non-normal). The non-normality problem in this example is
basically caused by the last outlier point we find near "7.5": if we remove that point we would probably accept
the null hypothesis.
5 Lab 2
5.1 Continuous random variables
• Normality test
1. Import the data in the file "cust_sat.xlsx". The data set contains the satisfaction score assigned to an
internet banking service by the customers, together with some information about the customers (place
of residence, gender, age). The score is given on a scale 0–1000, and it is the result of the sum of 100
scores on a scale 0–10.
cust_sat <- read_excel(". . ./cust_sat.xlsx")
View(cust_sat)
table(gender)
table(gender)/length(gender)*100 #percentages#
summary(age)
boxplot(age, horizontal=TRUE)
summary(score)
boxplot(score, horizontal=TRUE)
In this particular case we don't actually observe a "normal" behavior: particularly in both tails there is some departure from the straight line. If we then perform the Anderson-Darling normality test on the "age" data we obtain a very small p-value, which leads us to reject the null hypothesis: our data is not normally distributed.
Anderson-Darling normality test
data: age
A = 1.7119, p-value = 0.0002166
From the graphical point of view we can see again that we have some variations in the tails, mostly
on the left one. We then perform the Anderson-Darling normality test on the "score" data and again,
considering a threshold of "5%", we reject the null hypothesis: our data is not normally distributed.
Anderson-Darling normality test
data: score
A = 1.4235, p-value = 0.001108
5. The place of residence is a variable with three levels. Take the score separately for each place of residence
and do the normality tests.
ad.test(score[Country=="UK"])
ad.test(score[Country=="GER"])
ad.test(score[Country=="IT"])
So now we compute the same test but separately for each country. From the results we notice that we
accept the normality for all the three countries:
Anderson-Darling normality test
data: score[Country == "UK"]
A = 0.3574, p-value = 0.4519
So the question now is: why does the pooled sample fail the normality test while its three sub-samples pass it?
6. Compare also the QQ-plots.
boxplot(score~Country)
COMPLETE EXERCISE
###1)
cust_sat <- read_excel("M. . ./cust_sat.xlsx")
View(cust_sat)
###2)
#Dimension check: number of rows and columns#
dim(cust_sat)
attach(cust_sat)
#Categorical variables#
table(Country)
table(Country)/length(Country)*100 #percentages#
table(gender)
table(gender)/length(gender)*100 #percentages#
summary(age)
boxplot(age, horizontal=TRUE)
summary(score)
boxplot(score, horizontal=TRUE)
###3)
#Normality check for the variable age#
qqnorm(age)
qqline(age)
ad.test(age)
###4)
#Normality check for the variable score#
qqnorm(score)
qqline(score)
ad.test(score)
###5)
#Normality check for the variable Country#
ad.test(score[Country=="UK"])
ad.test(score[Country=="GER"])
ad.test(score[Country=="IT"])
###6)
#Compare the Q-Q plots#
boxplot(score~Country)
6 Multivariate random variables
6.1 Bivariate random variables
The basic idea of considering multivariate random variables is to measure two or more variables on the same sample
space which means, in statistical terms, on the same statistical individuals. Let us consider two (discrete) r.v.’s "X"
and "Y" on the same sample space.
Joint density
The joint density of "X" and "Y" is the density of the two variables taken together. It is a function of the pair "(X, Y)":

f(X,Y)(x, y) = P(X = x, Y = y)

i.e. the probability of the intersection of the two events "{X = x}" and "{Y = y}".
If the supports of "X" and "Y" are finite (finite sample space), the function "f(X,Y)" is usually summarized in a two-way table. In this case we consider a variable "X" (Bernoulli distribution) and a variable "Y" with 3 levels. This is a joint density, as each number in the table gives the probability of the intersection of two events. Since it is a density, all the values inside the table sum up to "1".
Y
X 0 1 2
0 0.11 0.09 0.20
1 0.30 0.14 0.16
Given a joint density "f(X,Y)(x, y)", the marginal densities of "X" and "Y" are the densities of each variable taken alone, without any information about the other variable. They are defined by:

fX(x) = P(X = x) = Σ_y f(X,Y)(x, y)        fY(y) = P(Y = y) = Σ_x f(X,Y)(x, y)

The name "marginal" comes from the previous table: we obtain the marginal densities just by summing the joint density by row (or by column) and writing the results in the margins.
Joint density
Y
X 0 1 2
0 0.11 0.09 0.20 0.40
1 0.30 0.14 0.16 0.60
0.41 0.23 0.36
The joint density is more informative than the marginal distributions: given a joint density we can compute the marginal distributions, but we cannot do the converse (there are infinitely many joint densities with the same marginals).
6.1.2 Independence
Independence means that the knowledge about the variable "X" doesn’t affect the density distribution (probability)
of the variable "Y ". Two random variables are independent if the realization of one does not affect the probability
distribution of the other.
Independence
The r.v.’s "X" and "Y" are independent if for all "∀A, B ∈ F":
i.e.
f(X,Y ) (x, y) = fX (x)fY (y) ∀x, ∀y
Two r.v.’s are independent if and only if the joint density is the product of the marginal densities.
Given two r.v.’s "X" and "Y" the conditional distributions of "Y" given "X" are the distributions of "Y" for fixed
values of "X". Formally:
Conditional distribution
The conditional distribution of one variable "Y " given "X = x" (fixed), is obtained by the joint distribution
divided by the marginal distribution of the fixed value:
fY|X=x(y) = f(X,Y)(x, y) / fX(x)
It is a univariate distribution of the "Y " but we have several of them. Indeed we have one conditional
distribution of the "Y " for each possible value of "X = x".
Compute the conditional distributions of "Y" given "X" (it has two possible values) for:
Y
X 0 1 2
0 0.11 0.09 0.20 0.40
1 0.30 0.14 0.16 0.60
0.41 0.23 0.36
This means that we obtain one different conditional distribution for each row of the table (for the different
values of the fixed variable).
This concept is pretty simple in the discrete case, since "X" can only take a few values; it becomes a bit more delicate in the continuous case, as "X" can take infinitely many values.
6.1.4 Conditional expectation
The expected value of "(Y | X = x)" is the conditional expectation of "Y" given "X = x". In the table above it is computed row by row: for each fixed value of "X" we take the expected value of the corresponding conditional distribution of "Y".
First of all we generate our data (the table) and then we compute the marginals (which are simply the
columns and rows sums)
x=c(0,1)
y=c(0,1,2)
p=matrix(c(0.11,0.09,0.20,0.30,0.14,0.16),nrow=2,byrow=T) #joint distr#
mX=rowSums(p) #marginal distribution X#
mY=colSums(p) #marginal distribution Y#
We can then compute the conditional distribution for the "Y " when "X = 0" and "X = 1", and then the
conditional expectation:
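A possible completion of the computation, reusing the objects x, y, p and mX defined above:

condY_X0 <- p[1,] / mX[1]      #f(y | X = 0): 0.275 0.225 0.500#
condY_X1 <- p[2,] / mX[2]      #f(y | X = 1): 0.500 0.233 0.267#
EY_X0 <- sum(y * condY_X0)     #E(Y | X = 0) = 1.225#
EY_X1 <- sum(y * condY_X1)     #E(Y | X = 1) = 0.767 (approximately)#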
The covariance is a measure of the strength of the (linear) association between two random variables; its normalized version, the correlation, ranges in "[−1, 1]": a value of "−1" describes a perfect negative linear relationship whilst a value of "1" describes a perfect positive one. For two r.v.'s we define:

COV(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y)

and

Cor(X, Y) = COV(X, Y) / (σX σY)
with the same meaning as in exploratory statistics.
Remark: some useful formulae (for constants "a, b" and r.v.'s "X, Y"):

COV(aX + b, Y) = a COV(X, Y)        VAR(X + Y) = VAR(X) + VAR(Y) + 2 COV(X, Y)

For random vectors "X = (X1, . . . , Xp)" such numbers are usually arranged into matrices. The covariance matrix has the variances on the diagonal and the covariances off the diagonal:

COV(X) = [ VAR(X1)        COV(X1, X2)   . . .   COV(X1, Xp)
           COV(X2, X1)    VAR(X2)       . . .   COV(X2, Xp)
           . . .          . . .         . . .   . . .
           COV(Xp, X1)    . . .         . . .   VAR(Xp)     ]

and similarly

Cor(X) = [ 1              Cor(X1, X2)   . . .   Cor(X1, Xp)
           Cor(X2, X1)    1             . . .   . . .
           . . .          . . .         . . .   . . .
           Cor(Xp, X1)    . . .         . . .   1           ]
Covariance matrix
The covariance matrix can be written as:

COV(X) = E[(X − E(X))(X − E(X))^t] = E(XX^t) − E(X)E(X)^t

and it is positive semi-definite:

a^t COV(X) a ≥ 0    ∀a ∈ R^p
Compute the correlation matrix for the bivariate density. In this case we have two random variables "X"
and "Y ":
Y
X 0 1 2
0 0.11 0.09 0.20 0.40
1 0.30 0.14 0.16 0.60
0.41 0.23 0.36
Cor = [  1         −0.2562
        −0.2562     1       ]
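A sketch of how the value −0.2562 can be obtained in R, reusing x, y, p, mX and mY from the previous chunk:

EX <- sum(x * mX); EY <- sum(y * mY)           #marginal means: 0.6 and 0.95#
EXY <- sum(outer(x, y) * p)                    #E(XY) = 0.46#
covXY <- EXY - EX * EY                         #-0.11#
sdX <- sqrt(sum(x^2 * mX) - EX^2)
sdY <- sqrt(sum(y^2 * mY) - EY^2)
covXY / (sdX * sdY)                            #-0.2562#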
6.2 The multinomial distribution
The multinomial scheme is a sequence of i.i.d. random variables:
X1 , X2 , . . . , Xn , . . .
each taking "k" possible values: this means that the sample space of each "k" variable is "ω = {1, . . . , k}". Thus,
the multinomial trials process is a simple generalization of the Bernoulli (binomial) scheme (which corresponds to
"k = 2"). Let us denote:
pi = P(Xj = i) i ∈ {1, . . . , k}
As with our discussion of the binomial distribution, we are interested in the random variables that count the number of times each outcome occurred. Thus, let:

Yn,i = #{ j ∈ {1, . . . , n} : Xj = i }

where "#" denotes the cardinality, the number of elements of a set. Note that "Σ_i Yn,i = n", so if we know the values of "k − 1" of the counting variables, we can find the value of the remaining counting variable (by difference).
Example
Consider 20 tosses of a dice "ω = {1, 2, 3, 4, 5, 6}". The sample we consider is:
2, 4, 3, 1, 6, 6, 1, 3, 5, 5, 2, 1, 3, 4, 5, 1, 5, 6, 1, 1
So we have:
Y20,1 = 6 Y20,2 = 2 Y20,3 = 3
Y20,4 = 2 Y20,5 = 4 Y20,6 = 3
We don't even have to count the occurrences of the last outcome to know "Y20,6": the 6 counting variables aren't independent, and the knowledge of the first five determines the last one.
Indicator variables
Prove that we can express "Yn,i " as a sum of indicator variables:
Yn,i = Σ_{j=1}^{n} 1(Xj = i)
Multinomial coefficient
The distribution of "Yn = (Yn,1 . . . , Yn,k )" is called the multinomial distribution with parameters "n" and
"p = (p1 , . . . , pk )". Its density is:
P(Yn,1 = j1, . . . , Yn,k = jk) = (n; j1, . . . , jk) p1^j1 · · · pk^jk

for "(j1, . . . , jk)" such that "j1 + · · · + jk = n". Here the symbol

(n; j1, . . . , jk) = n! / (j1! · · · jk!)

is the multinomial coefficient.
We write "multinom(n, p)" for the multinomial distribution with parameters "n" and "p". For a multinomial
distribution "Y" with parameters "n" and "p", the expected value and variance are indeed:
Note that the number of times outcome "i" occurs and the number of times outcome "j" occurs are negatively
correlated, but the correlation does not depend on "n" or "k". Does this seem reasonable?
The important point is that the "COV " depends on "n" whilst the "COR" does not: the correlation is independent
on "n" and only depends on the structure of the "p".
Show that "Yn,i " has the binomial distribution with parameters "n" and "pi ":
n j
P(Yn,i = j) = pi (1 − pi )n−j j ∈ {0, . . . , n}
j
Suppose that we roll 4 ace-six flat dice (faces 1 and 6 have probability "1/4" each; faces 2, 3, 4, and 5
have probability "1/8" each). Write the covariance and correlation matrices of the relevant multinomial
distribution.
Use the Software R function "rmultinom" to generate data from a multinomial distribution and in particular:
• Toss 1 fair die
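A possible sketch of the use of "rmultinom"; the second call (20 tosses repeated 5 times) is just an illustrative extension:

rmultinom(1, size = 1, prob = rep(1/6, 6))     #toss 1 fair die once: a single column of counts#
rmultinom(5, size = 20, prob = rep(1/6, 6))    #5 experiments of 20 tosses each, one column per experiment#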
6.3 Chi-square goodness-of-fit test
The chi-square is a distribution derived from the standard normal "N(0, 1)" distribution (and it is related to the "t" distribution). "Goodness of fit" means that we take a probability distribution on a finite sample space and we check whether our sample is drawn from that distribution under "H0". This means that we compare an empirical distribution (a set of observed frequencies) with a theoretical distribution (a set of theoretical frequencies, which are probabilities).
Suppose we have a sample of size "n" for a categorical r.v. "X" with a finite number of levels, and denote with
"{1, . . . , k}" such set of levels (frequency table of our sample). Let us denote with "ni " the observed frequency for
the i-th level. The probability distribution of "X" is a multinomial distribution with parameters "n" and "p1 , . . . , pk ",
i.e., the probabilities of the "k" levels. The chi-square goodness-of-fit test is a test to check if the variable "X"
has fixed probabilities "q1 , . . . , qk " or not. We want to check if our frequency table (empirical distribution) comes
from our set of fixed probabilities or not.
The null hypothesis of our test is to set a fixed vector and we assume that our empirical distribution comes
from a multinomial random variable with probabilities "q1 , q2 , . . . , qk ". The alternative hypothesis is every
other possible distribution, so every other possible vector of probabilities.
H0 : p1 = q1 , . . . , pk = qk
The Pearson test statistic is:

T = Σ_{i=1}^{k} (ni − n qi)² / (n qi)

i.e. the scaled squared distance between the observed counts "ni" and the expected counts "n qi". In the limit case of a perfect fit between "H0" and the observed counts, the expected frequencies are exactly equal to the observed counts and "T = 0"; under "H0" this test statistic should be near "0". When "T" is large enough we reject the null hypothesis, as there is too much difference between the observed and the expected counts.
Under the null hypothesis "H0 ", "T " has an asymptotic chi-square distribution with "(k − 1)" degrees of freedom
(they slightly change the shape of the distribution even if the form remains unchanged). The test is one-sided
(right), and thus:
R = (cα , +∞)
The critical value "cα " [χ2c ] can be found in statistical tables, or with Software R. Otherwise we can operate through
75
the p-vale’s study:
Suppose we want to check if a die is fair. We toss the die 200 times (this is the empirical distribution):
Outcome 1 2 3 4 5 6 Total
ni 36 30 25 30 38 41 200
In a fair die (this is our theoretical distribution under "H0 ") all outcomes have the same probability, so
we use the uniform distribution as the null hypothesis:
q1 = · · · = q6 = 1/6
We perform the chi-square test at the 5% level.
First, compute the expected counts:

n̂i = n qi = 200 · (1/6) = 33.33

and compare the table of the expected counts:
Outcome 1 2 3 4 5 6 Total
n̂i 33.33 33.33 33.33 33.33 33.33 33.33 200
with the table of the observed counts:
Outcome 1 2 3 4 5 6 Total
ni 36 30 25 30 38 41 200
The Pearson test statistic is:

T = (36 − 33.33)²/33.33 + (30 − 33.33)²/33.33 + (25 − 33.33)²/33.33 + (30 − 33.33)²/33.33 + (38 − 33.33)²/33.33 + (41 − 33.33)²/33.33 = 5.38
The value "T = 5.38" must be compared with the suitable critical value of the chi-square distribution: we
have "6 − 1 = 5" (k − 1) degrees of freedom so that:
> qchisq(0.95,5)
[1] 11.0705
the rejection half-line is "R = (11.07; +∞)". The observed value does not belong to the rejection half-line
(11.0705 > 5.38), thus we accept "H0 " and conclude that the die is fair.
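The same conclusion can be reached with the built-in R function (a sketch):

obs <- c(36, 30, 25, 30, 38, 41)
chisq.test(obs, p = rep(1/6, 6))
#X-squared = 5.38, df = 5, p-value about 0.37, well above 0.05: H0 is not rejected#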
6.4 Bivariate continuous random variables
For continuous random variables the approach is basically similar, we just replace the "sums" with "integrals".
The joint density of "X" e "Y" is the function "fX,Y (x, y)" such that:
Z b Z d
f(X,Y ) (x, y) dydx = P(a ≤ X ≤ b, c ≤ Y ≤ d)
a c
What we obtain in this case is a figure whose total volume is equal to "1". A graphical representation is for
example the following:
Given two r.v.’s "X" and "Y" the conditional distributions of "Y" given "X" are the distributions of "Y" for fixed
values of "X". Formally it is the ration between the joint density and the marginal density:
Conditional distribution
The conditional distribution of "Y" given "X = x" is:
fY|X=x(y) = f(X,Y)(x, y) / fX(x)
The expected value of "(Y |X = x)" is the conditional expectation of "Y" given "X = x".
The main difference in the interpretation here is that in continuous case we can’t interpret this expression
in terms of probability as we previously did in the discrete case. In the continuous case we can’t do this as
the probabilities of a single value are "0" (we would be dividing "0/0"), so we can only work with densities.
In the discrete case we can use both definitions.
6.4.2 Independence and correlation
Again also in this case we have to work with densities and not with probabilities as we can’t translate the expression
(all the probabilities are "0").
Two random variables are independent if and only if the joint density is the product of the marginal densities:

f(X,Y)(x, y) = fX(x) fY(y)    ∀x, ∀y
Covariance and correlation have the same definitions as in the discrete setting.
Write the joint distribution of the pair "(X, Y )" in the following situations:
1. "X, Y " are exponentially distributed "ε(λ)", independent
In this case we consider two independent functions distributed:
X ∼ ε(λ) Y ∼ ε(λ)
Since they are independent it's very easy to describe their joint density:

f(x, y) = fX(x) fY(y) = λ² e^(−λ(x+y))    x > 0, y > 0

So we understand that independence is a very easy case where the joint density is derived directly from the marginal distributions.
2. "X, Y " are uniformly distributed "U[0, 1]", independent
In this case we consider two independent functions distributed:
X ∼ U[0, 1] Y ∼ U[0, 1]
Since they are independent it’s very easy to describe their joint density:
f (x, y) = 1 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
So we see that the density is always equal to "0" except in the square "1 × 1" where it is equal to "1".
Joint distributions of exponential and uniform functions
Write the joint distribution of the pair "(X, Y )" in the following situations:
3 "X" is exponentially distributed "ε(λ)", "Y " is uniformly distributed "U[0, 1]", independent.
In this case we consider the mix of the previous cases:
X ∼ ε(λ) Y ∼ U[0, 1]
The idea behind this is that, when assuming independence, it's pretty easy to define the joint density starting from the marginal densities. In most cases, without independence, we are not able to do inference and statistical modeling so easily.
We now move to the multivariate normal distribution, which generalizes the univariate one, whose parameters are:

E(X) = µ        VAR(X) = σ²

• The dispersion is now measured through the covariance matrix "Σ" (remember that "Σ" must be positive definite).
• The exponent term becomes now the quadratic form:

(x − µ)^t Σ^(−1) (x − µ)

so that the density is:

f(x) = 1 / ( (2π)^(p/2) det(Σ)^(1/2) ) · exp( −(1/2) (x − µ)^t Σ^(−1) (x − µ) )

for "x ∈ R^p". This is denoted with "Np(µ, Σ)", where the "p" means that we have a normal distribution in "p" dimensions. Again, from the previous construction, the mean vector and the covariance matrix are the parameters of the distribution:

E(X) = µ        COV(X) = Σ

The special case "µ = 0", "Σ = Ip" is named the multivariate standard normal distribution.
• Density and contour plot for a bivariate normal with zero covariance
The "circular" shape means "independence". Standard normal distribution means that the marginal distribu-
tion has mean "0", variance "1" and in addition we have no covariance between the two variables "x1 " and
"x2 ". This means that the plot is perfectly symmetric around the center "(0, 0)": this leads to the formation
of circular level-lines.
In the univariate case we know there is a standardization: from a normal distribution "X ∼ N(µ, σ²)" we can obtain the standard one, defined as "z = (X − µ)/σ", which is by definition "z ∼ N(0, 1)". Moreover, if we start from "z ∼ N(0, 1)" we can generate general univariate normal distributions simply by taking "X = µ + σz": in this way we obtain a distribution with mean "µ" and variance "σ²" [this shows that in the univariate case it's easy to generate every possible normal distribution starting from the standard one].
Basically the same holds even for the multivariate case. The generalization of the linear transformation in
the multivariate case is the matrix multiplication: matrix multiplication, for the multidimensional case, is
what the linear transformation is for the univariate case.
With matrix multiplication we can move from a plot of the previous case where the two components are
independent (the distribution is perfectly symmetric), to distributions like the following ones where we have
dependence between the two variables "x1 " and "x2 " (we obtain ellipses).
• Density and contour plot for a bivariate normal with positive covariance [positive correlation]
• Density and contour plot for a bivariate normal with negative covariance [negative correlation]
The first relevant property of the multivariate normal distribution concerns linear combinations: if "X ∼ Np(µ, Σ)" and "a ∈ R^p", then

a^t X ∼ N(a^t µ, a^t Σ a)
Exercise - Multivariate normal function distribution
Given a multivariate normal:

X ∼ N2( (5, 10)^t , [ 16  12
                      12  36 ] )

and "a^t = (3, 2)", find the distribution of "Y = a^t X".
Expanding the product:

Y = a^t X = 3X1 + 2X2

then the expected value of "Y" is:

E(Y) = a^t µ = (3, 2) (5, 10)^t = 3 · 5 + 2 · 10 = 35

and the variance is:

σ²Y = a^t Σ a = (3 · 16 + 2 · 12, 3 · 12 + 2 · 36) (3, 2)^t = (72, 108) (3, 2)^t = 72 · 3 + 108 · 2 = 432
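A quick check of these two numbers in R:

a <- c(3, 2)
mu <- c(5, 10)
Sigma <- matrix(c(16, 12, 12, 36), nrow = 2)
t(a) %*% mu            #35#
t(a) %*% Sigma %*% a   #432#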
More generally, if "X ∼ Np (µ, Σ)", "A" is matrix with dimension "q × p", and "d" is a q-dimensional vector, then:
Y = AX + d
1 1
A=
1 −1
All random vectors obtained as sub-vectors of a multivariate normal vector are again multivariate normal. So if we take any component (any marginal distribution) of a multivariate normal distribution we again obtain a normal distribution.
If "X ∼ Np (µ, Σ)" then all the marginal distributions are normally distributed, and this is also true for pairs
"(Xi , Xj )", and so on. This is an easy consequence of the previous property. Why?
For multivariate normal distributions, having zero-correlation implies having independent components. This
normal distribution case is basically the unique one where zero-correlation implies independence: for general
random variables we previously saw that independence implies zero-correlation but the contrary wasn’t true
in general.
In formulae, if we consider a normal distribution "X ∼ Np(µ, Σ)" and we assume zero correlation (which means that the VAR-COV matrix "Σ" is diagonal [it contains only the variances of the components and no covariances]), then all the components of "X" are pairwise independent.
Exercise - Zero-correlation (bivariate distribution)
Prove this statement in the case "p = 2". To ease computations take a zero-mean bivariate distribution.
Conditional distributions of a multivariate normal distribution are again (multivariate) normal distributions.
This means that if we fix the value of one of the variables (for example in the following plot we set "x1 ")
and we consider the conditional distribution of the second variable ("x2 ") we obtain a normal distribution.
Let "X1 , . . . , Xn " be independent and identically distributed random vectors with "E(X) = µ". Then
n
1X
x̄n = xi
n i=1
That is
x̄n −→ µ and Sn −→ Σ
and these are true regardless of the true distribution of the "Xi "’s.
Central limit theorem
For i.i.d. random vectors with mean "µ" and covariance matrix "Σ", the multivariate central limit theorem states that:

√n (x̄n − µ) −→ Np(0, Σ)

Multivariate R packages
In Software R, multivariate normal distributions are handled through the "mvtnorm" package.
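A minimal sketch of the package in use (the mean vector and covariance matrix below are purely illustrative):

library(mvtnorm)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
x <- rmvnorm(1000, mean = mu, sigma = Sigma)    #random generation#
dmvnorm(c(0, 0), mean = mu, sigma = Sigma)      #density at a point#
plot(x[,1], x[,2])                               #elliptical scatter#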
7 Lab 3
7.1 Multivariate random variables
• Linear transformations of the Multinormal distribution
In this exercise you need to (install and) load the libraries "MASS" and "mvtnorm". Let:
Y = AX [1]
where:
X ∼ N2( (0, 0)^t , [ 1  0
                     0  1 ] )

is the standard multinormal distribution and:

A = [ 1  c
      c  1 ]
So the idea is simple: we start from the "X" (a standard normal distribution because for the mean we have a 0-vector
and the VAR-COV matrix is an identity) in two dimensions, and then we consider the "A" matrix. The model we
consider is then:
Y = AX = [ 1  c ] [ X1 ]  =  [ X1 + cX2 ]  =  [ Y1 ]
         [ c  1 ] [ X2 ]     [ cX1 + X2 ]     [ Y2 ]
1. Compute the theoretical values of the expected value and the variance/covariance matrix of "Y ".
This part can be computed both by hand and by using Software R. First of all we import all the packages we
need and then we compute:
µX = (0, 0)^t        ΣX = I2        A = [ 1  c
                                          c  1 ]
Y = AX ∼ N2 (Aµ, AΣAt )
We know that the distribution of the transformed variable "Y " is a bivariate normal distribution with mean
"Aµ" and VAR-COV matrix "AΣAt " (notice that, since the "A" matrix is symmetric, then "A = At "). We
can then compute the expected value:
Aµ = [ 1  c ] [ 0 ]  =  [ 0 ]
     [ c  1 ] [ 0 ]     [ 0 ]

ΣY = A ΣX A^t = [ 1  c ] [ 1  0 ] [ 1  c ]  =  [ 1  c ] [ 1  c ]  =  [ 1 + c²   2c      ]
                [ c  1 ] [ 0  1 ] [ c  1 ]     [ c  1 ] [ c  1 ]     [ 2c       c² + 1  ]
Notice that the multiplication for the identity matrix doesn’t change the matrix.
2. Choose different values for "c" (e.g. "c = −4, −2, 0, 2, 4"), generate "1000" observations from the distribution
of "Y " using the function "mvrnorm" in MASS and plot the observations. The code for doing this is:
The idea of this lab is to find an empirical counterpart of this small computation: in the 2-dimensional case it's difficult to plot the theoretical densities of the random variables, because they are 3-dimensional objects. The best idea is to move from the theoretical densities to the empirical ones via random-number generation. We generate random numbers from the relevant normal distribution and then draw the two-dimensional scatter plot of the data points, to see the ellipses based not on the theoretical distribution but on the data. We can achieve this by using the "mvrnorm" command.
Here we need to define the vector of the mean (it’s always the null vector) and the variance matrix (which is
the previous "c2 + 1" and "2c" we defined) and then generate 1000 random vectors from a multivariate normal
distribution with the previous parameters. In this case we consider "c = 2" but we can try also with different
values.
library(mvtnorm)
library(nortest)
library(MASS)
meanvec=c(0,0)
varmat=matrix(c(5,4,4,5),nrow=2)
#it is the equivalent of "varmat=matrix(c(c^2+1,2c,2c,c^2+1),nrow=2)"#
>varmat
[,1] [,2]
[1,] 5 4
[2,] 4 5
If for example we consider "c = −2", so if we change the direction of the linear transformation, we basically
obtain the same matrix (the variance is still "5")but the correlation (the covariance) has opposite sign ("−4").
In order to generate random numbers we just need to copy the proposed lines of code:
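(These are the same lines used in the complete exercise at the end of this lab, with the "c = 2" matrix defined above.)

binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)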
where we take the binorm sample and then we set as "y1" and "y2" the two dimensions (the two components
of the vector). What we obtain is a sort of ellipse on the "positive" direction (which is right since we have a
positive ("+4") covariance):
3. Choose a value for "c" (different from "0" and "1"). Estimate the mean vector, covariance matrix and the
correlation matrix of your bivariate normal sample. Use the functions "mean", "cov", "cor".
If we use these functions on "y1" and "y2" we have estimates based on a sample size of "1000" so we expect to
have a quite precise estimation. We can then check if our computations are correct by calculating the mean,
variance, covariance and by comparing them with the theoretical model:
4. Also make histograms and normal QQ-plot for each variable. What does the normal QQ-plot show? Use the
functions "plot", "hist", "qqnorm", "qqline" .
Here we check that the components of our rotated vector "y" are again normally distributed. This means that
if we take a normality test for both "y1" and "y2" we should accept the normality hypothesis. We can also use
the QQ-plot function:
hist(y1)
hist(y2)
qqnorm(y1)
qqline(y1,col="red")
qqnorm(y2)
qqline(y2,col="red")
Both from the histograms, and more precisely from the QQ-plots, we can verify that both variables are normally
distributed:
5. What does it happens in the previous items when "c → ±∞"? And when "c = 1"?
Here we investigate what happens when we move to special cases. Special cases are indeed when:
• "c = 0" : in this case we obtain an identity matrix. It is a very special case because this matrix doesn’t
alter the vector so the "y" is equal to the "x".
• "c = ±∞" : in this case the correlation coefficient goes to "0" (almost no correlation)
• "c = 1" : this is the opposite case since the correlation coefficient is equal to "1". Graphically this means
we would have a perfect line.
These results can be demonstrated by considering again the "Σ" covariance matrix of "Y". The correlation coefficient is the ratio between the covariance and the two standard deviations:

ρ = COV(Y1, Y2) / (σY1 σY2) = 2c / ( √(c² + 1) √(c² + 1) ) = 2c / (c² + 1)
If we then consider the limit as "c → ∞" of the correlation coefficient we obtain the limit of two polynomials
(numerator and denominator). Notice that the denominator goes to infinity much faster than the numerator
so the limit goes to zero:
lim_{c→∞} 2c / (c² + 1) = 0
We can also test these results also with empirical functions. For example we should consider "c = 100" and
"c = 1":
###CASE "c=100"
varmat=matrix(c(10001,200,200,10001),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)
###CASE "c=1"
varmat=matrix(c(2,2,2,2),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)
For the first case we obtain a scatter plot with almost no correlation, whilst in the second case we obtain a perfect linear correlation.
6. Equation [1] is the basis to define generic multivariate Gaussian random vectors. Construct the vector
"binorm.sample" by means of two independent Gaussian vector and then applying rotation:
c <- 2                           #any value of c; 2 is just an example#
A <- matrix(c(1,c,c,1),nrow=2)   #the transformation matrix#
mu <- 0; sigma <- 1              #X has standard normal components#
x1 <- rnorm(1000,mu,sigma)
x2 <- rnorm(1000,mu,sigma)
binorm.sample.ind=matrix(c(x1,x2),byrow=F,ncol=2)
binorm.sample=binorm.sample.ind%*%t(A)
Check with the analysis in items 3,4,5 that we obtain the same result as in the case of direct multivariate
generation.
COMPLETE EXERCISE
###2)
library(mvtnorm)
library(nortest)
library(MASS)
meanvec=c(0,0)
varmat=matrix(c(5,4,4,5),nrow=2)
#it is the equivalent of "varmat=matrix(c(c^2+1,2c,2c,c^2+1),nrow=2)"#
varmat
binorm.sample <- mvrnorm(1000,meanvec,varmat)   #generation step needed for items 3 and 4#
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)
###3)
mean(y1)
mean(y2)
var(y1)
var(y2)
cov(y1,y2)
###4)
hist(y1)
hist(y2)
qqnorm(y1)
qqline(y1,col="red")
qqnorm(y2)
qqline(y2,col="red")
###5)
###CASE "c=100"
varmat=matrix(c(10001,200,200,10001),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)
###CASE "c=1"
varmat=matrix(c(2,2,2,2),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)
8 Likelihood
We now move from probability to statistics (estimation). We saw that in general in the probability theory we have
a wide range of possible probability distributions, each of them defined with different parameters: this means that
we need a general method to estimate these parameters. The first method we consider is based on the "likelihood".
The idea behind this method is that we have an objective function, the "likelihood", and we want to maximize it
(we consider the first derivative).
8.1 Introduction
The notion of likelihood is the most important (and practical) tool to analyze parametric statistical models. We see it now for simple distribution functions, and we will investigate its use for regression models in the next lectures. The idea of likelihood has a long history in statistics, but the formal, rigorous study of the likelihood and of its properties as the foundation for inference is largely due to the work of R. A. Fisher in 1922.
8.2 Definitions
Remember that a parametric statistical model is a triple:

(Ω, F, (Pθ)θ∈Θ)

or

(Ω, F, (Fθ)θ∈Θ)

where "F" denotes the distribution function of a random variable. Notice that the parameter "θ" ranges in the set "Θ".
Likelihood
Let "f (x|θ)" denote the joint density of the random sample, the set of observations "(X1 , X2 , . . . , Xn )". The
joint density of the sample is the multivariate density of the random vector "(X1 , X2 , . . . , Xn )". Then, given
that sample "x" is observed, the likelihood function of "θ" is defined to be:
L(θ|x) = f(x|θ)

(the left-hand side is the likelihood, the right-hand side is the joint density)
So the likelihood function is exactly the joint distribution of the sample. The difference is in the meaning
of the two arguments: in probability theory if we take a density or a probability function we consider the
density or probability given the parameter "θ" (it is a known quantity and my unknowns are the results); in
statistics the data is given and the unknown is the function of "θ". In probability we consider the "x" given
the "θ", whilst in likelihood we work with the "θ" given the data.
It is crucial to keep in mind, however, that "f (x|θ)" is a function of "x" with "θ" known, while "L(θ|x)" is a function
of "θ" with "x" known.
Remark: notice that the symbol L(θ|x)" is simply a notation and not a conditional probability (we are not in
a probability space). Indeed the equation we find in the definition in some books is also written as:
Lx (θ) = fθ (x)
This is why independent random variables are particularly important in statistics: for independent samples it’s easy
to describe the likelihood.
Two further basic definitions on statistical models are the following:
Identifiable model
A statistical model:
(Ω, F, (Pθ )θ∈Θ )
is identifiable if the map:
θ −→ (Pθ )
is injective.
The map assigning to each "θ" its distribution (probability) function must be injective, because our only object is the data, which come from a density. If the same density comes from two different values of "θ" (i.e. "θ1" and "θ2" give the same density "f(x|θ)"), we can't decide which is the right one: we can't estimate the true value between the two parameters, because they correspond to the same density. The basic requirement to perform statistical inference is then to have identifiable models, meaning that two different parameters generate two different distributions (fortunately, in our framework all the models are identifiable).
Regular model
A statistical model:
(Ω, F, (Pθ )θ∈Θ )
is regular if all the densities (probability distributions) "(Pθ∈Θ )" have the same support [the support is the
set where the density is strictly positive].
For example:
Poisson model
Under the Poisson distribution, all the densities have the same support "N", independent of the value of the
parameter "λ", thus the model is regular.
Uniform model
Under the uniform distribution "U[0, θ]", the support of the densities clearly depends on the value of "θ",
thus the model is not regular.
8.2.1 Interpretation
While "f (x|θ)" measures how probable various values of "x" are for a given value of "θ", "L(θ|x)" measures how
likely the sample we observed was for various values of "θ".
So, if we compare two values and "L(θ1 |x) > L(θ2 |x)", this suggests that it is more plausible, in light of the data
we have gathered, that the true value of "θ" is "θ1 " than it is that "θ2 " is the true value. Of course, we need to
address the question of how meaningful a given difference in likelihoods is, but certainly it seems reasonable to ask
how likely it is that we would have collected the data we did for various values of the unknown parameter "θ".
8.3 Maximum likelihood estimation [MLE]
Perhaps the most basic question is: which value of "θ" maximizes "L(θ|x)"?
This is known as the maximum likelihood estimator, or MLE, and typically abbreviated as "θ̂" (or "θ̂M LE "
if there are multiple estimators we need to distinguish between). Provided that the likelihood function is differen-
tiable and with a unique maximum, we can obtain the MLE by taking the derivative of the likelihood and setting
it equal to "0" (see below the Bernoulli and Normal examples):
(d/dθ) L(θ|x) = 0
Consider a coin, which can be fair (θ1 = 1/2) or unfair (θ2 = 1/3), where "θ" is the probability of Head. We
flip the coin 100 times, obtaining 43 Heads. What can we conclude?
In this case we don’t have to compute derivatives as we simply have to check if the likelihood in "θ1 " is greater
(lesser) than the likelihood in "θ2 ".
L(θ1 |x) > L(θ2 |x)
Here we have "Θ = {1/3, 1/2}". Using the likelihood approach we compute:
Since "L(43|θ = 1/2) > L(43|θ = 1/3)" the maximum likelihood principle is in favor of "θ = 1/2". If we take
the parameter space equipped with a probability function, then one would compute:
• for "θ = 12 "
P(43|θ = 1/2)P(θ = 1/2)
P(θ = 1/2|43) =
P(43)
Here the Bayes theorem is used; if the prior probability on "Θ" is uniform, then the comparison of the posterior probabilities reduces to the comparison of the likelihoods. This paradigm is used in Bayesian statistics, a theoretical framework not covered in this class.
In general the value of "θ" ranges on some intervals or in the whole line: this means we need the derivatives. The
first example we are going to see it’s the MLE for the Bernoulli distribution in the general case.
8.3.1 MLE: example for a Bernoulli distribution
Suppose that "X" follows a Bernoulli distribution with probability of success given by "θ".
Bern(θ) 0<θ<1
Find the MLE of "θ". Given data "x = (x1 , . . . , xn )" for a sample of size "n", the likelihood is:
L(θ|x) = C(n, s) θ^s (1 − θ)^(n−s)

where "s = Σ_{i=1}^{n} xi" is the number of successes. Take the derivative of "L(θ|x)" with respect to "θ" and set it equal to zero:
C(n, s) [ s θ^(s−1) (1 − θ)^(n−s) + (n − s) θ^s (1 − θ)^(n−s−1) (−1) ] = 0

θ^(s−1) (1 − θ)^(n−s−1) [ s(1 − θ) − (n − s)θ ] = 0

thus:

θ̂ = s/n
The MLE estimator is therefore the well-known sample proportion:
θ̂ = S/n = Σ_{i=1}^{n} Xi / n
8.4 Log-likelihood
Note that, for independent and identically distributed (i.i.d.) samples (data), the likelihood is equal to the
product of the densities of the marginal distributions. We have:
L(θ|x) = Π_{i=1}^{n} f(xi|θ)
It is difficult to take its derivative, as it requires extensive use of the product rule. An alternative that is almost always simpler to work with is to maximize the log of the likelihood, or log-likelihood (this means that we work with a sum and not with a product). The log-likelihood is usually denoted with:
ℓ(θ|x) = Σ_{i=1}^{n} log[f(xi|θ)]
Of course the results (the maxima) are the same as the logarithm is a monotone function. Note that the Bernoulli
example from before is a bit easier when working with the log-likelihood (try it). For other distributions, such as
the normal, it is much easier than working with the likelihood directly.
Find the MLE for the mean "θ" of the normal distribution with known variance "σ 2 ".
We then have to compute the MLE for a normal distribution "N (µ, σ 2 )" with "σ 2 " known. So the density
for the one dimensional case is:
f(x|µ) = 1/(√(2π) σ) e^(−(x−µ)²/(2σ²))
we then consider the logarithm:
log f(x|µ) = log( 1/(√(2π) σ) ) − (x − µ)²/(2σ²)
We now move from the one dimensional density to the sample "X1 , . . . , Xn " and we compute the log-
likelihood:
ℓ(µ|x) = n log( 1/(√(2π) σ) ) − Σ_{i=1}^{n} (xi − µ)²/(2σ²)
We now consider the derivative of the log-likelihood:
dℓ(µ|x)/dµ = −(1/(2σ²)) Σ_{i=1}^{n} 2(xi − µ)(−1) = (1/σ²) Σ_{i=1}^{n} (xi − µ) = 0

In this equation the constant "1/σ²" can be cancelled, so we obtain:

Σ_{i=1}^{n} (xi − µ) = 0

and therefore the MLE is the sample mean, µ̂ = x̄.
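A numerical counterpart of this result (a sketch, with an illustrative simulated sample and σ assumed known and equal to 2): maximizing the log-likelihood gives back the sample mean.

set.seed(1)
x <- rnorm(50, mean = 3, sd = 2)
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 2, log = TRUE))   #log-likelihood in mu#
optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum
mean(x)    #the two values coincide up to numerical tolerance#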
Example 2 - MLE for the Normal distribution
If for example we consider again a normal distribution with both the mean "µ" and the variance "σ 2 ", the
density and the log-likelihood of the sample are the same:
ℓ(µ, σ²|x) = n log( 1/(√(2π) σ) ) − Σ_{i=1}^{n} (xi − µ)²/(2σ²)
but now we can't drop the first term as we previously did, since it is no longer a constant (it contains "σ"). In this case we have to use the partial derivatives with respect to the two parameters "µ" and "σ²":
∂ℓ(µ, σ²|x)/∂µ = . . .

∂ℓ(µ, σ²|x)/∂σ² = . . .
MLE Invariance
If "θ̂" is the MLE of a parameter "θ" and "g" is a real function, then "g(θ̂)" is the MLE of "g(θ)".
So if we compute the MLE "θ̂" of a parameter then we can compute the MLE for every possible function of
the parameter "θ" simply by applying the same function to the MLE.
Biasedness
Remark: note that the MLE is preserved under transformations, while unbiasedness is not (except for linear functions), so transformed MLEs are in general biased. For example, in the normal case above the MLE of "µ" is the sample mean (unbiased), whilst the MLE of "σ²" is biased, as the denominator is just "n" and not "n − 1":

∂ℓ(µ, σ²|x)/∂µ = 0    ⟹    µ̂ = Σ_{i=1}^{n} xi / n = x̄            Unbiased

∂ℓ(µ, σ²|x)/∂σ² = 0   ⟹    σ̂² = (1/n) Σ_{i=1}^{n} (xi − x̄)²     Biased
The point here is that the MLE usually needs a further step to correct this bias.
Asymptotic unbiasedness
Under mild regularity conditions on "L(θ|x)", if "θ̂" is the MLE of a parameter "θ", then "θ̂" converges to "θ" as "n" goes to infinity. So for large samples it holds:

θ̂MLE −→ θ
Asymptotic normality
Under mild regularity conditions on "L(θ|x)", if "θ̂" is the MLE of a parameter "θ", then, as "n" goes to
infinity (it is a sort of generalized Central Limit Theorem):
(θ̂ − θ) / σ(θ̂) −→ N(0, 1)
8.4.2 Sufficiency
Our problem now is to estimate a parameter and our idea is to use Likelihood through the data (the likelihood is
a function of the parameter "θ" given the data). For the data we considered the vector set "X1 , . . . , Xn " , so the
whole set of data. The question is: do we actually need the entire set of data (all the information) to compute our
estimate? The answer is no: we don’t need the whole information on the sample to compute the MLE (for example
we consider the mean of the normal distribution: it is estimated by using the sample mean, which means it doesn’t
need the "n" dimensional vector as it’s enough to have the sum of the data and all the discarded information is not
essential).
The point here is that we don’t need the whole "n" dimensional set of numbers and that we can reduce our
information to a relevant number (the sum of our vector).
Example: we consider our "n" dimensional vector (sample) "X1 , . . . , Xn ". When we write "L(µ|X)" we are
taking an "n" dimensional vector as we consider all the values of our experiment:
L(µ|X) = L(µ|S)        where        S = Σ_{i=1}^{n} Xi
so we can rewrite the likelihood function not as a function of "X" (we don’t store all the "n" numbers) but as a
function of the sum, which is just one value.
A rough definition is: a statistic "T (X)" is sufficient for a parameter "θ" if it contains all the information to write
the likelihood (and to compute the MLE). The formal definition of sufficiency is based on conditional distributions.
Sufficiency
A statistic "T (X)" is sufficient for a parameter "θ" if the conditional distribution of the data "X" given "T "
does not depend on "θ".
A statistic "T (X)" is sufficient for a parameter "θ" if and only if the likelihood function can be factorized as:
for some non-negative functions "f " and "gθ ", where "f " does not depend on "θ".
So we can write the likelihood of "θ" given the "X" as a function of "θ" given "T ":
L(θ|X) = L(θ|T )
so we can describe the density of our random variable (remember that the joint density of the sample is
indeed the likelihood) as a function of "T ": we don’t need the whole sample.
Exercise - Estimation problem
The statistic "T = X̄" for the parameter "λ" of the Poisson distribution is a sufficient statistic.
So we have a sample from a Poisson distribution with unknown "λ" that we have to estimate:

X1, . . . , Xn ∼ P(λ)

The likelihood is:

L(λ|X1, . . . , Xn) = Π_{i=1}^{n} e^(−λ) λ^(xi) / xi! = e^(−nλ) λ^T / (x1! . . . xn!)        where T = Σ_{i=1}^{n} xi

Its derivative with respect to "λ" is:

(d/dλ) L(λ|X1, . . . , Xn) = −n e^(−nλ) λ^T / (x1! . . . xn!) + e^(−nλ) T λ^(T−1) / (x1! . . . xn!)

and now we compute where it vanishes:

[ e^(−nλ) / (x1! . . . xn!) ] λ^(T−1) [ T − nλ ] = 0

The first two factors are strictly positive, therefore:

T − nλ = 0        ⟹        λ̂ = T/n = X̄
so the MLE of the parameter "λ" of the Poisson distribution is the sample mean "X̄": this is not surprising
as the expected value of a Poisson distribution is indeed "λ". The information on "T " is enough to compute
the MLE: sufficient statistic.
Thus, the derivative of the log-likelihood w.r.t. the parameter is given its own name: the score.
Score function
The score, commonly denoted with "U ", is:
UX(θ) = (d/dθ) ℓ(θ|X)
Note that "U " is a random variable (like in the case of ML estimator), as it depends on "X", and is also
a function of "θ".
Remark: with i.i.d. data, the score of the entire sample is the sum of the scores for the individual observations:
U = Σ_i Ui
In view of this remark, in the following we will work in general with the score for samples of size "1".
8.6 Score and MLE
It is easy to observe that the MLE is found by setting the sum of the observed scores equal to "0":
Σ_i Ui(θ̂) = 0
Find the score for the normal distribution. More precisely find the score for the model "N (θ, σ 2 )", with "σ 2 "
known.
Ui = (Xi − θ)/σ²

U = ( Σ_{i=1}^{n} Xi − nθ ) / σ²

θ̂ = X̄
We take one random variable. Remember that its likelihood is:

L(θ|X) = 1/(√(2π) σ) e^(−(x−θ)²/(2σ²))

so the log-likelihood is:

ℓ(θ|X) = log( 1/(√(2π) σ) ) − (x − θ)²/(2σ²)

and its derivative is:

dℓ/dθ = 0 − [ 2(x − θ)/(2σ²) ] (−1)

so the score is:

UX(θ) = dℓ/dθ = (x − θ)/σ²
The score of the whole sample is the sum of the single score components. The computational improvement
of the score function is that we consider the derivatives at the stage where the sample size is equal to "1" and
we sum them up: previously we defined instead the likelihood of the sample and then we took the derivative
(it was a more difficult approach).
8.6.2 Mean
We now turn our attention to the theoretical properties of the score. It is worth noting that there are some regularity
conditions that "f (x|θ)" must meet in order for these theorems to work. For the purposes of this class we will assume
that we are working with a distribution for which these hold.
Theorem - Expected value
Under the usual regularity conditions the score has zero mean: E(U) = 0.
The variance of "U" is given a special name in statistics: it is called the Fisher information or simply the information. The Fisher information is the amount of information (contained in a sample) that an observable random variable "X" carries about an unknown parameter "θ" of the distribution that models "X".
For the one-parameter normal model of the previous exercise:

Ui = (Xi − θ)/σ²        E(Ui) = 0        VAR(Ui) = VAR( (Xi − θ)/σ² ) = (1/σ⁴) VAR(Xi) = 1/σ²

Remark: remember that the variance of a linear transformation is "VAR(aX + b) = a² VAR(X)".
So this is the information for a sample of size "1". If we consider the information of a sample of larger size we just have to multiply:

I(θ) = n/σ²

Remark: notice that the information is the reciprocal of the variance of the sample mean.
8.6.4 Information
In the case of the one-parameter normal model the information does not depend on "θ":
I(θ) = 1/σ²
This is not true in general, as shown below.
8.7 The Cramer-Rao lower bound
Under some regularity conditions (basically the same for which "E(U ) = 0"), the following theorem gives the clear
meaning of the word "information".
Let "T (X)" be any statistic with finite variance for all "θ". Denote "E(T (X)) = ψ(θ)". Then, for all "θ", if
"ψ" is the identity function, then:
1
V AR(T (X)) ≥
I(θ)
so the variance of an estimator is always greater or equal to the reciprocal of the information. So if we
consider the sample mean of a normal distribution we obtain the equality: we are in the best case. In some
sense the sample mean is the best possible estimator to estimate the mean of a normal distribution: every
other possible estimator of the mean have higher variance (worse performance).
This theorem then tells us that the information is linked with the variance of the estimator and, as we
previously said, the variance of every estimator is always greater or equal to the reciprocal of the information.
8.7.1 Information
For i.i.d. data, once again we use additivity: the information for a sample is the sum of the information of the
components.
Remark: each observation contains the same amount of information, and they add up to the total
information in the sample.
Another property of the score function which is often useful is the following:
I(θ) = −E[U′]
8.7.3 Asymptotic distribution
One final, very important theoretical result for the score may be obtained by applying the central limit theorem:
( U − E[U] ) / √n  −→d  N(0, Ii(θ))

or equivalently:

(1/√n) U  −→d  N(0, Ii(θ))

where the expression "−→d" means that the quantity on the left converges in distribution to the distribution on
the right as the sample size "n → ∞". We will write concisely "U ≈ N (0, I(θ))".
The preceding results all describe a situation in which we are interested in a single parameter "θ". It is often the
case (and always the case in regression modeling) that "f (x)" depends on multiple parameters. All of the preceding
results can be extended to the case where we are interested in a vector of parameters "θ = (θ1 , θ2 , . . . , θp )".
The score is now defined as
U(θ) = ∇ℓ(θ|x)
where "∇ℓ(θ|x)" is the gradient (a vector) of the log-likelihood (we consider all the partial derivatives):
∇ℓ(θ|x) = ( ∂ℓ(θ|x)/∂θ1 , ∂ℓ(θ|x)/∂θ2 , . . . , ∂ℓ(θ|x)/∂θp )
Note that the score is now a "p × 1" vector: to denote this we use the bold notation "U". The MLE is now found
by setting each component of the score vector equal to zero; i.e. solving the (sometimes linear) system of equations
"U= 0" (all the partial derivatives equal to zero), where "0" is a "p × 1" vector of zeros.
The score still has mean zero:
E(U) = 0
The variance of the score is now intended as the covariance matrix, and it is still the information:
although the information "I(θ)" is now a "p × p" matrix. We have that:
I(θ) = −E(JU)

where "JU" is the "p × p" matrix of second derivatives with "(i, j)"-th element "∂²ℓ(θ|x)/∂θi∂θj". This matrix is called the Jacobian of the score or the Hessian matrix of the log-likelihood (remember that it contains all the second-order partial derivatives, including the cross ones).
For i.i.d. data additivity still holds: if we have a sample of size "n" we can sum the scores of the components:

U = Σi Ui

and:

I(θ) = Σi Ii(θ)

and also in this framework the asymptotic normality of the score holds:

U ≈ Np(0, I(θ))
Exercise - Normal model with "σ 2 " unknown
Write the score and the information matrix for a sample "X1 , . . . , Xn " of a normal distribution "N (θ, σ 2 )"
with both parameters unknown.
X1 , . . . , Xn ∼ N (µ, σ 2 )
So we use the property of the score function: we consider a single component of the sample and then the score
of the sample is the sum of the scores and the information matrix is the sum of the information matrices.
We can then work with a single density:

f(xi) = (1/(√(2π)σ)) e^(−(xi − µ)²/(2σ²))

and then we take its logarithm:

ℓ(µ, σ|xi) = log f(xi) = log(1/√(2π)) − log(σ) − (xi − µ)²/(2σ²)

and so we can compute the score (we have two partial derivatives):

Ui = ( ∂ℓ/∂µ , ∂ℓ/∂σ ) = ( (xi − µ)/σ² ,  −1/σ + (xi − µ)²/σ³ )
For the information matrix we have to consider the variance of "U ". There are two methods to compute the
matrix:
• The first one, which is more difficult, is to compute the variance of the score directly:

  I(µ, σ²) = VAR(Ui)

• The second one is easier and involves the Jacobian matrix of the score (we take the second derivatives of the score) [we will apply this]:

  I(µ, σ²) = −E(JU)
In the matrix we take the derivatives of the first score component with respect to "µ" and "σ", and then the derivatives of the second score component with respect to "µ" and "σ". So in the first row we consider the first element "(xi − µ)/σ²" and compute its partial derivatives: in the first column of the matrix we differentiate with respect to "µ" [∂/∂µ], whilst in the second column we differentiate with respect to "σ" [∂/∂σ]. In the second row we do the same for the second element of the score "−1/σ + (xi − µ)²/σ³":

JU = [ −1/σ²             −2(xi − µ)/σ³
       −2(xi − µ)/σ³     1/σ² − 3(xi − µ)²/σ⁴ ]

and taking expectations:

I(µ, σ²) = −E(JU) = [ 1/σ²    0
                      0       2/σ² ]
Notice that in the off-diagonal elements we use "E(xi − µ) = 0", and that in the last element the expected value "E((xi − µ)²) = σ²" is the variance. The information matrix we obtained is a diagonal matrix: the "0" means that in the normal model the sample mean and the sample variance are independent (a special property of the normal distribution). This means that the estimation of the mean and the estimation of the variance are two independent processes (there is no correlation between the estimation of the parameter "µ" and that of the parameter "σ").
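A minimal simulation sketch (assumed values "µ = 0", "σ = 2"): the covariance matrix of the score components "(Uµ, Uσ)" is close to the information matrix computed above.

set.seed(1)
mu = 0; sigma = 2; B = 100000
x = rnorm(B, mean = mu, sd = sigma)
u_mu    = (x - mu) / sigma^2                 # first score component
u_sigma = -1 / sigma + (x - mu)^2 / sigma^3  # second score component
cov(cbind(u_mu, u_sigma))   # approximately diag(1/sigma^2, 2/sigma^2)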
8.8 Exponential families
Now we introduce a special family of probability distributions, namely the exponential family, we examine some
special cases, and we see what it is about members of the exponential family that makes them so attractive to work
with. Since we work with i.i.d. samples, we write the likelihood for a single observation. It is easy to write down
the likelihood of a sample of size "n" by taking a product of n terms (or the sum when using the log-likelihood).
The exponential family is a set of probability distributions where it’s easy to find the likelihood and, more impor-
tantly, it’s easy to find the score and the information matrix (which is usually difficult to compute since we have to
calculate an expected value or a variance).
Exponential families
A statistical model belongs to the exponential family if the likelihood can be written as (remember that "exp" just means "exponential"; it is a way to write the exponential more clearly):

L(θ|x) = exp( θ T(x) − ψ(θ) )

and so we have a linear expression in "θ" multiplied by the sufficient statistic, and a term "−ψ(θ)" which does not depend on the sample: it is the deterministic function. The parameter "θ" is named the canonical parameter, and "T" is of course the sufficient statistic.
In the multi-parameter case we make use of the inner product [prodotto scalare]:

L(θ|x) = exp( θᵗ T(x) − ψ(θ) )

where "θᵗ T(x) = θ1 T1(x) + θ2 T2(x) + . . . ". The name also becomes clearer now, because the likelihood is expressed as the exponential function of this special argument. It is also easy to compute the log-likelihood since we just have to take the exponent.
We only compute the one-dimensional case. The density of the geometric distribution is:

f(x|p) = p(1 − p)^x,   x = 0, 1, 2, . . .

remember that the parameter "p" is between "0" and "1", since it is the success probability of the underlying Bernoulli scheme. The likelihood is then the density itself (as we said, we consider size "1"):

L(p|x) = p(1 − p)^x = exp( log(p) + x log(1 − p) )

where "log(1 − p)" is "θ". Remember that our goal is to obtain the form "L(p|x) = exp(−ψ(θ) + θT)". So we have:

L(p|x) = exp(−ψ(θ) + θX)

So we have written this distribution as an exponential function. If we want to be more precise we can compute the explicit expression of the function "ψ" (we need to express "p" as a function of "θ"):

ψ(θ) = −log(p)
θ = log(1 − p)
e^θ = 1 − p
p = 1 − e^θ

L(p|x) = exp( log(1 − e^θ) + θX )   with   −ψ(θ) = log(1 − e^θ),   T(X) = X
Exercise - Normal distribution (exponential family)
Write the Normal distribution ("σ 2 " known) in the form of the exponential family.
X ∼ N(µ, σ²),   σ² known
Here we perform the same passages we saw in the previous exercise, computing the likelihood from the density:

L(µ|x) = f(x) = (1/(√(2π)σ)) e^(−(x−µ)²/(2σ²))

       = exp( log(1/(√(2π)σ)) − (x − µ)²/(2σ²) )

       = exp( log(1/(√(2π)σ)) − (x² − 2µx + µ²)/(2σ²) )

       = exp( . . . )

with

T(X) = X,   θ = µ/σ²

Notice that the linear term is "2µx/(2σ²) = (µ/σ²)x", so we have again that "T(X) = X" is the identity.
(A bit difficult) Write the Normal distribution (both "µ" and "σ 2 " unknown) in the form of the exponential
family with two parameters.
So the idea is that it’s not difficult to write densities in the form of exponential families but in most cases we need
to perform a change of parameter.
For applications in the theory of linear models we will use a slightly more general form of the exponential family,
introducing also a scale parameter or nuisance parameter "ϕ".
A distribution falls into the exponential family if its density (or probability) function can be written as:

f(x|θ, ϕ) = exp( (xθ − b(θ))/ϕ + c(x, ϕ) )
where the canonical or natural parameter "θ = h(µ)" depends on the expected value of "X", "ϕ" is a
positive scale parameter, and "b", "c" are arbitrary functions.
So the function has exactly the same structure:
• We have a linear part in "θ" (the canonical parameter)
• "b(θ)" is a function of the parameter "θ" but not of "x": it is the deterministic part, so there is no randomness in "b"
As we will see, if a distribution can be written in this manner, maximum likelihood estimation (MLE) and
inference are greatly simplified and can be handled in a unified framework.
Example 1 - Poisson distribution (exponential families)
To get a sense of how this expression of the exponential family works, let’s work out the representation of a
few common families, starting with the Poisson:
f(X|µ) = e^(−µ) µ^X / X!

This can be rewritten as:

f(X|µ) = exp( X log µ − µ − log X! )

so the canonical parameter is "θ = log µ". Observe that "(Xθ − b(θ))/ϕ = X log µ − µ" with "ϕ = 1", "b(θ) = e^θ", and "c(x, ϕ) = −log X!". Thus the Poisson falls into the exponential family with "θ = log µ" and "b(θ) = e^θ". Note that the Poisson does not have a scale parameter ("ϕ = 1"). For the Poisson distribution, the variance is determined entirely by the mean.
Example 2 - Normal distribution (exponential families)

f(x|µ, σ²) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) )

           = exp( (xµ − µ²/2)/σ² − (1/2)(x²/σ² + log(2πσ²)) )

which is in the exponential family with "θ = µ", "b(θ) = θ²/2" and "ϕ = σ²".
Example 3 - Bernoulli distribution (exponential families)

f(x|µ) = µ^x (1 − µ)^(1−x)

       = exp( x log(µ/(1 − µ)) + log(1 − µ) )

which is in the exponential family with canonical parameter, "b" and "ϕ" equal to:

θ = log(µ/(1 − µ)),   b(θ) = log(1 + e^θ),   ϕ = 1

Note that "c(x, ϕ) = 0". Even in the simple case of the Bernoulli distribution the canonical parameter is not equal to the expected value (the standard parameter).
Note that, like the Poisson, the Bernoulli distribution does not require a scale parameter. The more general
case of the binomial distribution with "n > 1" is also in the exponential family ("n" fixed).
8.9 Properties of the score statistic
8.9.1 Score statistic for exponential families
Why is it important to work with densities in the form of exponential families? We use this form because we have
simple expressions for the score and the information matrix. As we have seen maximum likelihood theory revolves
around the score. The score is the derivative of the log-likelihood with respect to the parameter "θ". Consider,
then, the score for a distribution in the exponential family:
ℓ(θ, ϕ|x) = (θx − b(θ))/ϕ + c(ϕ, x)

which gives:

U = (d/dθ) ℓ(θ, ϕ|x) = (x − b′(θ))/ϕ

and in the case of a sample "X1, . . . , Xn" the score "U" is just the sum of the single scores:

U = (Σi xi − n b′(θ)) / ϕ

and we know that the expected value is:

E(U) = 0

this means that the sample mean "(Σi xi)/n" is an unbiased estimator of "b′(θ)".
Recall from our previous lecture that the score has the following properties:
E(U) = 0,   VAR(U) = −E(U′)

VAR(U) = b″(θ)/ϕ

Thus, for the exponential family the mean and variance of "X" can be computed through derivatives of "b":

E(X) = b′(θ) = (d/dθ) b(θ),   VAR(X) = ϕ b″(θ)
so we need to differentiate with respect to the canonical parameter "θ": if our parameter isn't the canonical one we first need to rewrite the expression in canonical form (otherwise we cannot take the derivative).
Mean and variance
We consider the distribution "X ∼ P(µ)" with density "f(x) = e^(−µ) µ^x / x!". We have:

L(µ|x) = exp( x log µ − µ − log x! )

So "θ = log µ" (where "µ = e^θ") and "b(µ) = µ". Then we have "b(θ) = µ(θ) = e^θ". We obtain:

L(µ|x) = exp( xθ − e^θ − log x! )

E(X) = b′(θ) = e^θ = µ,   VAR(X) = ϕ b″(θ) = e^θ = µ

Since the mean and the variance of the Poisson distribution are the same we can write:

E(X) = µ,   VAR(X) = µ

Notice that if in the previous passage we had taken directly the derivative of "b(µ) = µ" without passing through "θ" we would have obtained "E(X) = 1" and "VAR(X) = 0", which are wrong.
Note that the variance of "X" depends both on the scale parameter (a constant) and on "b", a function which controls the relationship between the mean and the variance. Letting "µ = b′(θ)" and writing "b″(θ)" as a function of "µ", with "V(µ) = b″(θ)" [the variance function], we have:

VAR(X) = ϕ V(µ)
This point is really important since it allows us to compute the moments of random variables (expected values of
powers or related functions of the random variable) by considering the derivatives:
• For the Normal distribution, "V (µ) = 1" since the mean and the variance are not related.
• For the Poisson distribution, "V (µ) = µ" since the variance increases with the mean.
• For the Binomial distribution, "V (µ) = µ(1 − µ)" since the variance is largest when "µ = 1/2" and
decreases as "µ" approaches "0" or "1".
In regression we will link the mean of the response to the predictors through a link function "g" applied to the expected value, "g(µ) = g(E(X))". Since ML estimation is particularly simple when the distributions are expressed in the canonical form of the exponential family, with natural parameter "θ = h(µ)", we state the following definition: the canonical link is the link function for which "g(µ) = h(µ) = θ", i.e. the link that maps the mean onto the natural parameter.
The use of the canonical link function corresponds to a reparametrization of the distribution in terms of the
natural parameter of the exponential family.
We can see that this way of writing distributions in terms of the canonical parameter and the exponential family
simplifies the research of the MLE and so on. There is, therefore, a reason to prefer the canonical link when specifying
the model. Although one is not required to use the canonical link, it has nice properties, both statistically and in
terms of mathematical convenience.
In linear regression for normal distributions we write the expected value of our response variable equal to:
E(Y) = β0 + β1 x1 + · · · + βp xp
and we haven’t explicitly seen this part because for the normal distribution, when estimating the mean, the canonical
function is the identity. So what we have is:
Y ∼ N (µ, σ 2 )
E(Y ) = β0 + β1 x1 + · · · + βp xP
But if for example we take a Bernoulli distribution, our response variable is no longer a quantitative variable, as it takes only the values "0" and "1":
Y ∼ Bern(µ)
µ = E(Y ) = β0 + β1 x1 + · · · + βp xP
But we know that "µ", the parameter of the Bernoulli distribution, satisfies "0 < µ < 1", while the linear predictor "β0 + · · · + βp xp" ranges over the whole real line (the regression line goes over "R"): this means that we can have predictions of the probability outside the "[0, 1]" range. So if we move from "µ" to the canonical parameter "θ = log(µ/(1 − µ))" we obtain a parameter "θ" that covers all the real line, "−∞ < θ < +∞", because the limits are:

lim(µ→0) θ = −∞,   lim(µ→1) θ = +∞
By operating this way we have a flexible expression of our density to use in our regression problems.
[We will analyze the last two points in full detail in the second part]
We now anticipate some topics from the second part of this class. In a regression context, we will see that the canonical link is the best way to link the expected value of a response variable to a linear predictor:
θi = g(µi ) = ηi = β0 + β1 xi1 + · · · + βp xip
Maybe you have not noticed this point in your previous regressions. Why? Because for the normal distribution the
canonical link function is the identity, so in practice you don’t need the notion of link function.
η = g(µ) = log( µ/(1 − µ) ),   µ = g⁻¹(η) = e^η / (1 + e^η)
As "η → −∞", then "µ → 0", whilst as "η → ∞", then "µ → 1". On the other hand, if we had chosen, say,
the identity link, "µ" could lie below "0" or above "1", clearly impossible for the binomial distribution.
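A minimal sketch of the logit link and its inverse (not part of the original notes): the inverse link maps any real-valued linear predictor back into the "(0, 1)" range required for a Bernoulli mean.

g    = function(mu)  log(mu / (1 - mu))     # canonical (logit) link
ginv = function(eta) exp(eta) / (1 + exp(eta))
eta = seq(-10, 10, by = 2)                  # linear predictor over the real line
round(ginv(eta), 4)                         # all values lie strictly between 0 and 1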
9 Simulation and the bootstrap
9.1 Motivation
Before the computer age, statistical analysis used probability theory to derive analytical expressions for standard errors (or confidence intervals) and testing procedures. For instance, in the Student's t-test normality is assumed: "Let "X1, . . . , Xn" be a sample from "N(µX, σX²)" and let "Y1, . . . , Ym" be a sample from "N(µY, σY²)"..."; or, in a regression model:
Y = Xβ + ε
the normality of the error term is assumed: "Let "εi " be i.i.d. with distribution "N (0, σ 2 )"...".
Most formulas are approximations, based on large samples. With computers, simulations and resampling methods can be used to produce (numerical) standard errors and testing procedures (without the use of formulas, but with a simple algorithm).
"A random sequence is a vague notion... in which each term is unpredictable to the uninitiated and whose digits
pass a certain number of tests traditional with statisticians.
Derrick Lehmer, quoted in Knuth, 1997
The first problem is then to show how computers generate "random" numbers. Computers are in general deterministic machines, so the first question is: how can a machine generate randomness? The answer is quite simple: computers, and machines in general, don't actually generate truly random numbers but pseudo-random numbers (they follow a rule for the generation). The goal of pseudo-random number generators is to produce a sequence of numbers in the interval "[0, 1]" that imitates the ideal properties of random numbers (with a uniform distribution).
LCG algorithm
LCG produces a sequence of integers "X1, X2, . . . " between "0" and "m − 1" following the recursive relationship:

Xi+1 = (a Xi + b) mod m

and sets:

ui = Xi / m

So the idea is to take a linear equation: we define the "(i + 1)"-th element as a linear function of "Xi" and then we apply the "mod" operation, which is the remainder of the division by "m". The sequence of "ui" is a sequence of pseudo-random numbers in "[0, 1]". They are pseudo-random because if you know the values of the parameters "a", "b", "m" and the starting point (the seed) "x0", then the sequence is deterministic and can be recovered exactly.
Usually the parameters "m" and "a" are set to large numbers to have good performance: we obtain quite a good approximation of the uniform [0, 1] distribution (e.g. "m = 2³² − 1", "a = 16807 (= 7⁵)" and "b = 0"). Also, "m" and "a" must be coprime (why?).
In Software R the random number generation is done with the "runif " function. This function "runif" is based on
the Mersenne-Twister algorithm (a refined version of LCG) and it generates numbers from the "U[0, 1]" distribution.
Try this code several times:
runif(10)
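The "second case" mentioned just below presumably refers to fixing the starting point before generating; a minimal sketch (the seed value "123" is an arbitrary choice):

set.seed(123)   # fix the starting point x0
runif(10)
set.seed(123)   # same seed again
runif(10)       # identical output to the previous call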
Notice that in the first case we always obtain different values, because the algorithm starts from a different "x0" at every call, whilst in the second case, where we fix the starting point with "set.seed", we always obtain the same output.
What we need to do is simulations, which means generating random numbers from a given probability distribution.
Sampling
Let us suppose we need to generate a sample of size 100 from a finite set. We have 3 possible outcomes:
S = {A, B, C}
where:
pA = P(A) = 0.13 pB = P(B) = 0.35 pC = P(C) = 0.52
First we generate random numbers "u" from the uniform "U[0, 1]" distribution:
• If "u ≤ 0.13" then we choose "A"
• If "0.13 < u ≤ 0.48" then we choose "B"
Try to do that in Software R with the "if " conditional statement (we sample a category variable):
u=runif(100) #We generate 100 uniform random numbers from the distribution#
catv=rep("C",100)
for (i in 1:100)
{
if (u[i]<=0.13){catv[i]="A"}
else if (u[i]<=0.48){catv[i]="B"}
}
barplot(table(catv),ylim=c(0,60))
u=runif(100)
catv=cut(u,c(0,0.13,0.48,1),labels=c("A","B","C"))
barplot(table(catv),ylim=c(0,60))
Software R has a built-in function for sampling from finite sets, the function "sample":
help(sample)
samp=sample(1:6,1000,replace=T)
#we can check the result by using the barplot#
barplot(table(samp))
3. Use the "sample" function to generate "1000" random tosses of two dice, taking the sum
samp1=sample(1:6,1000,replace=T)
samp2=sample(1:6,1000,replace=T)
samp=samp1+samp2
barplot(table(samp))
4. Use the "sample" function to generate a random permutation of the set "{1, . . . , 100}"
sample(1:100,100,replace=F)
9.6 Sampling from a parametric (discrete) distribution
For several parametric families the sampling problem is solved through the functions "rbinom", "rgeom" and so on. The first step is the generation of a "U[0, 1]" random variable; then optimized algorithms are used for the inversion of the cdf (in the discrete case the cdf is a step function, so it is not invertible in the usual sense).
A guy collects coupons. He buys one coupon at a time and all coupons have the same probability. Approx-
imate by simulation the distribution of the time needed to complete the collection of "n" coupons.
The first coupon is a good one for sure. For the second coupon, the waiting time is a geometric random variable with parameter "(n − 1)/n". So we are in the case of geometric distributions with a variable parameter (it changes at each step).
We fix "n = 100".
1. The parameters are:
n=100
p=(n:1)/n
2. Generate 100 geometric random variables with the appropriate parameters as follows:
x=rgeom(n,p)
3. Remember the meaning of the geometric parameters in Software R and define the waiting time as:
t=sum(x)+n
Of course this is the structure of the generation of one waiting time. If we want to approximate the
distribution of the waiting time we just need to replicate this code a certain number of times.
4. Repeat "1000" times and get the approximate distribution of the waiting time:
time=rep(0,1000)
for (i in 1:1000)
{
x=rgeom(n,p)
time[i]=sum(x)+n
}
mean(time)
var(time)
hist(time)
So the key point here is that we use simulations to approximate unknown distributions: in this case, for example, we don't know the exact distribution of the waiting time.
x=rmultinom(1,20,c(1,2,3))
• 1 vector
• the vector has sum 20
• there are three outcomes with probabilities (proportional to) "(1, 2, 3)".
The goal of the exercise below is to show that the empirical variances, covariances and correlations of the multinomial distribution converge to the values expected from the theoretical formulae:

VAR(Xi) = n pi (1 − pi),   COV(Xi, Xj) = −n pi pj   (i ≠ j)
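Point 1 of the exercise (generating the multinomial sample) is not reported above; a minimal sketch, assuming the sampled data are stored row-wise in a matrix named "rand" and that "1000" replications are used (both are assumptions):

rand = t(rmultinom(1000, 20, c(1, 2, 3)))   # 1000 multinomial vectors, one per row
dim(rand)                                   # 1000 x 3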
2. With the functions "cov" and "cor" compute the variance/covariance matrix and the correlation matrix
of the sampled data
mean(rand[,1])
var(rand[,1])
cov(rand)
cor(rand)
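Point 3 of the exercise (generating the exponential sample used in point 4) is also not reported; a minimal sketch, where the sample size "200" is an arbitrary choice and the rate "0.2" matches the "pexp" call below:

x = rexp(200, 0.2)   # exponential sample with rate 0.2 (mean 5)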
4. Plot the empirical cdf and compare with the theoretical cdf with
plot(ecdf(x))
x=sort(x)
lines(x,pexp(x,0.2))
If we have a random sample "X1, . . . , Xn" of size "n", the empirical distribution function "F̂n" is the cdf of the distribution that puts mass "1/n" at each data point "xi". Thus:

F̂n(x) = (1/n) Σ(i=1..n) 1(Xi ≤ x)
where "1(·)" stands for the indicator function, i.e. the function which equals "1" if the statement in brackets
is true, and "0" otherwise.
In other words, "F̂n (x)" shows the fraction of observations with a value smaller than or equal to "x".
Theorem - Unbiased and consistent estimator
The behavior of this empirical distribution function is exactly what we expect as it converges to the correct
value. It is an unbiased estimator of the theoretical distribution function because its expected value is
equal to the theoretical distribution function and it is a consistent estimator because its variance goes to
"0" as the sample size grows to infinity.
For all "x ∈ R":
F (x)(1 − F (X))
E[F̂n (x)] = F (X) V AR[F̂n (x)] =
n
Thus for "n → ∞" we have that "F̂n (x) → F (x)". Actually a stronger result holds.
Glivenko-Cantelli theorem
If "X1 , . . . , Xn " is a random sample from a distribution with cdf "F ", then the distance between the empirical
distribution function and the theoretical distribution function goes to zero as "n" goes to infinity:
Bootstrap
The bootstrap is a family of methods that resamples from the original data.
From the population (the first line of the figure), which is composed of balls with different colors, we consider a sample (the second line), which is our data of "N" elements.
If we can't assume any information about the probability distribution of the random variable over the population (we can't assume, for example, that it is a Poisson or a Normal distribution), then our only information is in the sample. The main idea of the bootstrap is that the best approximation of the underlying distribution is the empirical distribution of the sample itself. So if we have no information about the population (the first line), we assume that the sample is the best possible information about the shape of the distribution of the population. The motivation for this reasoning can be found exactly in the theorem we saw previously. So the idea is to use our sample as a population.
We draw samples from the original sample, and this is why we call this "resampling". We repeat this process several times and we compute the statistics of interest over the bootstrap samples; the values based on these samples are approximations of the corresponding quantities in the population. Of course we resample with replacement, because we draw samples of the same size as the original one.
With this procedure we can approximate several quantities and procedures, for example:
• The bias and the standard error of an estimator
• Confidence intervals
• Statistical tests
The bootstrap is non-parametric in the sense that we can do the previous tasks without making any as-
sumptions on the distributions of the variables (we don’t make any assumptions about the shape of our
distribution).
The bootstrap is based on the idea that the observed sample is the best estimate of the underlying distribution.
Thus, the bootstrap is formed by two main ingredients:
• The plug-in principle: the sample we take and consider is taken as a population. This means that the em-
pirical estimate (for example for a parameter) on the sample is now the unknown parameter in the population.
This means that we need a further estimation step which is the following point.
• A Monte Carlo method to approximate the quantities of interest based on samples taken with replacement
from the original sample and with the same size. This means that we do computations via resampling and
simulations and not with the exact analytical argument.
If we consider the "real world" we follow a precise path: we take the population, we consider a random sample and
from it we estimate a function: we estimate some parameters using an estimator which is a function of the sample.
If we consider the "bootstrap world" we have an estimated population which is our original sample. Here the
estimation is performed by considering bootstrap datasets (bootstrap samples).
The real world is a good way of reasoning if operating in a parametric framework but can’t be applied in a non-
parametric framework: this obliges us to move to the bootstrap world. Notice that in the bootstrap "universe" we
denote the elements with the notation "∗ ".
9.10.2 How to do it
We can now investigate this technique applied to the simplest case which is the estimation of a parameter.
We take a sample "X1 , . . . , Xn " and suppose that "θ" is our parameter of interest, and let "θ̂" be its estimate on
the sample. In order to obtain an approximate distribution of the parameter "θ̂" we can use the bootstrap method
without making any assumption on the distribution of the sample. The steps are:
1. Take a bootstrap sample "X1∗ , . . . , Xn∗ " from "X1 , . . . , Xn " of the same sample size
2. Compute the estimate "θ∗ " on the bootstrap sample
3. We repeat the previous steps a large number of times in order to have enough values to obtain a good
approximation of the distribution.
How to approximate the distribution of the sample mean "X" in a non-parametric framework? With the
bootstrap.
A standard number for the bootstrap replication is "5000": we then define a vector which will contain the
values of the average of all the bootstrap samples and then, each bootstrap sample in the "for" cycle is defined
as a sample with replacement from our population with the same sample size of our original sample. In the
following code we sample from the vector "x" a sample of the same size "8" with replacement.
We then consider the mean of each bootstrap sample and we store each value inside the vector "bootx". In
the end we plot the graph to visualize the results:
x=c(12,14,15,15,20,21,30,47)
B=5000
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T) #This is the M.C. method#
bootx[i]=mean(boots)
}
plot(density(bootx),lwd=2)
Here is the plot:
Since it is an empirical function it is not perfectly estimated: we see it looks like a right tailed distribution.
Remark: due to the randomness of the simulations, different executions give rise to (slightly) different plots
(at each replication we obtain different random numbers).
We can also, of course for a much smaller number of iterations, look inside our "for" cycle, printing each bootstrap sample and the corresponding mean. We consider for example "B = 10":
x=c(12,14,15,15,20,21,30,47)
B=10
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
print(boots)
bootx[i]=mean(boots)
print(bootx[i])
}
Doing so we can see that we obtain different bootstrap samples with different means: the histogram (or the density estimate) of all these values approximates the distribution of our sample mean. In this case we don't plot the density, as "10" replications are not sufficient.
So the bootstrap methodology is especially useful when dealing with small samples (in the last example we had just "8" elements), for which we have few other tools. Of course the bootstrap is still useful and interesting for large samples, in particular for parameters to which the Central Limit Theorem does not apply.
9.11 Estimating the bias of an estimator
Let "θ" be a parameter of interest and "θ̂" its sample estimate. We want to provide an estimate of the bias of "θ̂".
The bootstrap estimates the bias of an estimator based on "B" bootstrap samples is:
where "θi∗ " is the estimate on the i-th bootstrap sample and "θ̄∗ " is the average of the bootstrap estimates.
x=c(12,14,15,15,20,21,30,47)
B=5000
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
bootx[i]=median(boots)
}
sd(bootx)
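With the bootstrap medians stored in "bootx" by the cycle above, a minimal sketch of the bias estimate defined by the formula in this section is:

mean(bootx) - median(x)   # bootstrap estimate of the bias of the sample median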
9.14 Confidence intervals: the percentile and basic methods
There are several methods to define confidence intervals based on the bootstrap.
Based on "B" bootstrap samples, compute the distribution of "θ̂∗ ". A confidence interval for "θ" with
confidence level "1 − α" is:
∗ ∗
θ̂α/2 , θ̂1−α/2
This is the trivial way to define a confidence interval, just take the relevant percentiles of the empirical
bootstrap distribution "θ̂∗ ".
The bootstrap method is very flexible for every parameter we want to estimate: it’s an empirical distribution so
we don’t need any information a priori. For example now we try to compute the confidence interval for the median:
x=c(12,14,15,15,20,21,30,47)
B=5000
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
bootx[i]=median(boots)
} #we compute the interval#
int=c(quantile(bootx,0.025),quantile(bootx,0.975)) #2.5% in both tails#
Basic method: here we work with the differences "δ∗ = θ̂∗ − θ̂" and look for an interval of the form "(θ̂ + δ1, θ̂ + δ2)". Based on "B" bootstrap samples, compute the distribution of "δ∗". A confidence interval for "θ" with confidence level "1 − α" is:

( θ̂ − δ∗(1−α/2) , θ̂ − δ∗(α/2) )

Note that in this expression the two quantiles appear in reverse order.
We can easily repeat the process using Software R:
x=c(12,14,15,15,20,21,30,47)
B=5000
med=median(x)
bootdelta=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
bootdelta[i]=median(boots)-med
}
int=c(med-quantile(bootdelta,0.975), med-quantile(bootdelta,0.025))
Every bootstrap procedure is based on a simple "for" cycle where we draw a bootstrap sample with replacement from the original sample (and with the same size). Then we compute the statistic of interest, we store
the values on a vector and then we can compute, for example, the standard deviation, the variance, the CI and so
on.
1. Generate a sample of size "15" from an exponential distribution with mean "1.4".
2. Use the bootstrap percentile method to define a confidence interval for the non-parametric skewness:
W = ( µ − Me(X) ) / σ

Remark: the non-parametric skewness "W" is a measure of asymmetry. If "W > 0" we have a right-skewed distribution, while "W < 0" means we have a left-skewed distribution. "W = 0" corresponds to a symmetric distribution, since the mean and the median coincide.
Since the exponential distribution is skewed to the right, we should obtain a positive value for the parameter "W". A sketch of a solution is given below.
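A minimal sketch of a possible solution (the seed and the number of bootstrap replications are arbitrary choices):

set.seed(1)
x = rexp(15, rate = 1/1.4)   # sample of size 15 with mean 1.4
B = 5000
bootW = rep(NA, B)
for (i in 1:B)
{
  boots = sample(x, 15, replace = T)
  bootW[i] = (mean(boots) - median(boots)) / sd(boots)   # non-parametric skewness W
}
int = c(quantile(bootW, 0.025), quantile(bootW, 0.975))  # percentile CI for W
int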
• For multivariate analyses, resampling must be for paired observations (we will consider this case in the
Lab)
• For time series, resampling must consider blocks of consecutive observations.
10 Lab 4
This lab is an exercise about bootstrap: in particular it’s a bootstrap for the computation of confidence intervals.
1. Import the dataset "mustang.csv":
library(readr)
mustang <- read_csv(". . ./mustang.csv")
View(mustang)
2. Compute the basic descriptive statistics for the three variables, including the correlation matrix. What is the
average price of a used Mustang in this sample?
We now take a look at the data and give a basic description: we consider the mean, the standard deviation
(variance) and we check for outliers with the boxplot. In this case we just have three quantitative variables
but for categorical variables we can also perform the "table" and "barplot" to check how many levels have each
variable. Here we also compute the correlation matrix.
summary(age)
sd(age) #sd to check the dispersion of the r.v.#
boxplot(age, horizontal=TRUE) #graphical summary of the values#
summary(miles)
sd(miles)
boxplot(miles, horizontal=TRUE)
summary(price)
sd(price)
boxplot(price, horizontal=TRUE)
mean(age)
mean(miles)
mean(price)
var(age)
var(miles)
var(price)
cor(mustang)
From the output of the correlation matrix we notice that the "price" is negatively correlated both with "age"
and "miles".
3. Using this sample we would like to construct a bootstrap confidence interval for the average price of all used Mustangs sold on this website. To take for instance "1000" bootstrap samples and record the means, use the "boot_means" cycle reported in the complete code at the end of this lab, and then compare with the sample mean:
mean(price)
Since we are operating a simulation we will obtain slightly different results every time we run the code. To
reduce the variability in this output we could compute an increased number of replications (for example 5000).
Remember that "1000" is the number of samples for which we estimate the density of the estimator: by
increasing its value we obtain a better density estimation.
4. Write your confidence interval based on "B = 1000" bootstrap samples, and compare it with the confidence
interval given by the "t.test" function where normality is assumed.
We compute the "t.test" function (the Student t Test) for the mean, which also calculates the confidence interval
(it computes the default 95% interval). So the idea here is to compare the bootstrap confidence interval with
the parametric confidence interval. Here the normality of the data is assumed.
t.test(price)
The null hypothesis of the test is that the mean is equal to zero, whilst the alternative hypothesis is that the mean is different from zero. Here we are only interested in the part of the output regarding the confidence interval: notice that with the bootstrap simulation we obtained a slightly narrower confidence interval (a better result).
We can conclude that bootstrap is a good family of techniques which leads to an adaptation of the results to the
actual shape of the data. This means that usually the results are better with bootstrap than with the normality
assumption: there’s of course one exception which is when the data are exactly normal.
plot(density(price))
If we consider the density estimation we see that the distribution has a wide right tail, but in general the shape recalls the normal distribution.
5. Plot the bootstrap distribution using the density estimation.
We now plot the density estimation of the sample mean based on the bootstrap replications:
plot(density(boot_means))
6. Compute a bootstrap confidence interval for the standard deviation of the price and compare it with the sample standard deviation (see the "boot_sds" cycle in the complete code).
The lower and upper bounds are respectively "(7.069764, 13.84862)": we can see that the standard deviation of the sample is "11.11362", a value between the two bounds of our interval.
7. Try to compute a confidence interval for the minimum and compare with the actual minimum in the original
sample.
This last point is to check the case where bootstrap isn’t actually the correct answer for the estimation of a
parameter. If we try to compute the bootstrap estimation for the minimum (or the maximum as it is the
same for our analysis) we see that the parameter is really difficult to estimate. We then compute the actual
minimum (maximum) of our sample and then do a comparison:
int=c(quantile(boot_mins,0.025), quantile(boot_mins,0.975))
min(price)
int
As we can see, the bootstrap confidence interval for the minimum is between "3" and "7", but the actual minimum of our sample is "3". Consider the sample points in the figure: the sample minimum is the red point. Since it is an actual value of the distribution, the minimum in the population must lie below (or at) the sample minimum: we are certain that the population minimum lies below the red point. But if we resample from our sample and then compute the minimum, the minima of the bootstrap samples can only lie to the right of it.
The "min" and the "max" are then two special parameters of the distribution: even using things like normal
confidence intervals based on large samples (we assume for example a distribution like the green one and then
we consider the confidence interval) we obtain wrong results because one half of the distribution lays on the
wrong part of the graph [the central limit theorem does not hold].
In this case the right distribution would be the so called Gumbel distribution: we need an asymmetric probability
distribution with respect to the data in order to consider the confidence interval. The same holds of course
symmetrically for the maximum.
Notice that even in the parametric case using likelihood we can’t use the Central Limit theorem to approximate
the distribution of the estimator because the minimum (maximum) isn’t the center of the distribution. As we
just said before in order to perform this we need completely asymmetric distributions (on the left and on the
right).
COMPLETE EXERCISE
###1)
library(readr)
mustang <- read_csv(". . ./mustang.csv")
View(mustang)
###2)
dim(mustang) #check the dimensions#
attach(mustang)
summary(age)
sd(age) #sd to check the dispersion of the r.v.#
boxplot(age, horizontal=TRUE) #graphical summary of the values#
summary(miles)
sd(miles)
boxplot(miles, horizontal=TRUE)
summary(price)
sd(price)
boxplot(price, horizontal=TRUE)
mean(age)
mean(miles)
mean(price)
var(age)
var(miles)
var(price)
cor(mustang)
###3)
boot_means = rep(NA, 1000) #we replicate 1000 times#
for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_means[i] = mean(boot_sample)
}
int=c(quantile(boot_means,0.025), quantile(boot_means,0.975)) #take the quantiles#
int #compute the interval#
mean(price)
###4)
t.test(price)
plot(density(price))
###5)
plot(density(boot_means))
###6)
boot_sds = rep(NA, 1000)
for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_sds[i] = sd(boot_sample)
}
int=c(quantile(boot_sds,0.025), quantile(boot_sds,0.975))
sd(price)
int
###7)
boot_mins = rep(NA, 1000)
for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_mins[i] = min(boot_sample)
}
int=c(quantile(boot_mins,0.025), quantile(boot_mins,0.975))
min(price)
int
11 Regression: parametric and nonparametric approaches
11.1 Introduction
Linear regression is the first statistical model used to investigate relationships, dependencies and causality. Linear regression is designed for the case when the response variable is quantitative (with some further assumptions we will discuss later). For linear regression we need a quantitative response variable "Y" and a set of predictors "X1, . . . , Xp".
If "p = 1" (the number of linear predictors) we have a simple regression, if "p > 1" we have a multiple regression.
Y = β0 + β1 X1 + β2 X2 + . . . βp Xp + ε
The parameters "β0 , . . . , βp " are unknown and must be estimated from the data. In vector notation we write:
β = (β0 , . . . , βp )
Remark
In regression analysis the predictors are considered as deterministic (fixed values). The random variable is
only on the residuals, and thus on the response variable.
We choose the "β̂" that minimizes the sum of the squared residuals (least squares method). The residual sum of squares (RSS) is the sum, over all the observations (statistical units), of the squared difference between the observed value and the predicted value of the response variable:

RSS = Σ(i=1..n) (yi − ŷi)²
For the simple regression (only one regressor "X1 ") the expressions are simple:
Slope: β̂1 = COV(X1, Y) / VAR(X1),   Intercept: β̂0 = ȳ − β̂1 x̄1
while for multiple regression, the expression is somewhat more complicated and it is usually expressed in terms of
the design matrix. This matrix contains all the predictors column by column and then a column made up just
by "1" for the intercept. It is defined as:
X = [1, X1 , X2 , . . . , Xp ]
Also, let "y = (y1 , . . . , yn )t " the vector of the observed responses. Assume for example that we have:
p=2 n=3
Y = Xβ + ε
Y = (y1, y2, y3)ᵗ,   β = (β0, β1, β2)ᵗ,   ε = (ε1, ε2, ε3)ᵗ

X = [ 1  x11  x12
      1  x21  x22
      1  x31  x32 ]
Then the minimization of the RSS is reached with:
β̂ = (X t X)−1 X t y
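As a minimal check (not part of the original notes), we can verify the matrix formula on a small simulated dataset and compare it with the "lm" output; all the values below are arbitrary:

set.seed(1)
n = 20
x1 = runif(n); x2 = runif(n)
y = 1 + 2*x1 - 3*x2 + rnorm(n, sd = 0.5)
X = cbind(1, x1, x2)                        # design matrix with the column of 1's
beta_hat = solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))      # the two columns coincide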
The standard distributional assumption is that the error terms are independent with "εi ∼ N(0, σ²)". Notice that the variance is the same for all the statistical units.
Thus we have (in a more concise form) that each response variable is a normal random variable because it’s the
sum of a constant plus a normal random variable. So our response variable in the sample is normally distributed:
Yi ∼ N ((Xβ)i , σ 2 )
Incidentally, in this case the LS estimator is also the solution of the MLE problem (the equations and solutions
become the same). This is the first time we deal with a sample of independent random variables but not identically
distributed: this is because each response variable has its own "mean" (we don’t have the same distribution for
all the random variables). If we try to plot this in the simple regression case we have that for each value of the
predictor we have a normal distribution which moves along the regression line.
So our sample of response variables "Y" is a set of normally distributed random variables whose centers follow the regression line: for each value of the predictors we have a different mean of the response variable (the shape of the Gaussian distribution is the same, since the variance is constant).
B = (X t X)−1 X t Y
• The LS estimator "B" is unbiased (the expected value of the estimator is equal to the parameter):

E(B) = β

Indeed, substituting "Y = Xβ + ε" we obtain "B = (XᵗX)⁻¹Xᵗ(Xβ + ε) = β + (XᵗX)⁻¹Xᵗε". The first term is a constant (it involves neither "Y" nor "ε"), so taking expectations:

E(B) = β + (XᵗX)⁻¹Xᵗ E(ε) = β

and since the mean of the error is "0" the last component disappears.
• If the error terms are uncorrelated with equal variance "σ²", the variance/covariance matrix of "B" is:

VAR(B) = σ² (XᵗX)⁻¹

From this equation we can see that for uncorrelated predictors the components of the vector "B" are uncorrelated. This is important because if we have predictors with a large correlation we face serious problems:
if we have correlation on the "B" and we have an error in the component, there’s a high probability to have
an error in all the components. It is indeed very important in regression models to avoid highly correlated
predictors (this problem is called multicollinearity). This is why regression modeling is problematic in very
large datasets: if we have a large number of predictors there’s a high probability of having correlation between
the predictors (this may have an impact on the estimation and on the testing) [the last part of this course
will be about the model selection].
11.6 The Gauss-Markov theorem
This last point is about the efficiency of the estimator.
Gauss-Markov Theorem
The Gauss-Markov theorem says that the LS estimator "B" is BLUE: under the assumption of error terms
uncorrelated and with equal variance "σ 2 ", the LS estimator:
B = (X t X)−1 X t Y
is BLUE (Best Linear Unbiased Estimator). This means in particular that, for every other linear unbiased estimator "B̃ = CY", the LS estimator has, component by component, the lowest variance:

VAR(Bj) ≤ VAR(B̃j)
11.7 Questions
1. Is at least one of the predictors "X1 , X2 , . . . , Xp " useful in predicting the response?
If the answer is no we can’t use our data since there’s no significant connection between the "Y " and the
predictors. This means that the model isn’t useful at all.
2. Do all the predictors help to explain "Y ", or is only a subset of the predictors useful?
Given that we have at least one useful predictor in our set of predictors, can we use a smaller subset of predictors? Smaller sets of predictors are always preferable.
3. How well does the model fit the data?
Here we compute the goodness of fit of the model to the data.
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
5. Can we incorporate categorical predictors and interactions between the predictors?
Advertising (ISLR)
Here we have a sample of "200" markets: our objective is to explore the relations among the "sales" of a prod-
uct and the advertising budget on "TV ", "radio", and "newspaper". The data is in the file "Advertising.csv".
• Response variable: "sales"
• Predictors: "TV ", "radio", "newspaper"
Here we perform a linear regression by using the "lm" function: notice that in the formula the response variable "sales" is separated from the regressors by the "∼", and the regressors are separated by "+".
model=lm(sales~TV+radio+newspaper,data=Advertising)
summary(model)
anova(model)
Call:
lm(formula = sales ~ TV + radio + newspaper, data = Advertising)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
> anova(model)
11.8 F-Test
The first question is: "is at least one of the predictors "X1 , X2 , . . . , Xp " useful in predicting the response?".
Let us consider two nested models (nested means that we have a large model with a smaller model inside of it).
So we have the complete model, with all the "p" predictors, and the reduced model, which contains only the intercept.
Ideally we want to compare the residuals of the two models. So we define the residual sum of squares for both the complete and the reduced model:

RSS_C = Σ(i=1..n) (yi − ŷi^(c))²,   RSS_R = Σ(i=1..n) (yi − ŷi^(r))² = SSY
Of course the complete model fits the data more closely than the reduced model, since it has more parameters (and fewer residual degrees of freedom):

RSS_R ≥ RSS_C
We again use the hypothesis test method in order to test the validity of the reduced model. If we accept the reduced
model it means that there’s no significant predictor in our model: we accept a model with only the intercept. We
define:
H0 : β1 = · · · = βp = 0 H1 : ∃βj ̸= 0
The test statistic is based exactly on the difference between the RSS of the reduced (intercept-only) model and of the complete model:

F = [ (RSS_R − RSS_C) / p ] / [ RSS_C / (n − p − 1) ]

If this difference is sufficiently small then we accept "H0" (there are no significant predictors). Fortunately we don't have to compute this value by hand, since we can read the p-value in the last line of the R output.
If the errors "εi" are i.i.d. normally distributed, then "F" has a known distribution under "H0", namely the Fisher (F) distribution with "p" degrees of freedom in the numerator and "(n − p − 1)" degrees of freedom in the denominator. The test is a right-tailed test.
The F-test can be used in general for testing the nullity of a set of "q" coefficients. In such a case we consider two nested models and test:

H0: β1 = · · · = βq = 0,   H1: ∃ βj ≠ 0

The test statistic has a Fisher distribution with "q" degrees of freedom in the numerator and "(n − p − 1)" degrees of freedom in the denominator.
For a single coefficient we can instead use the t-test reported in the "summary" output:

H0: βj = 0,   H1: βj ≠ 0
There are also other possible choices:
• The adjusted "R²", which takes into account the number of predictors:

R²_adj = 1 − [ RSS_C / (n − p − 1) ] / [ SSY / (n − 1) ]

The residuals of the model are defined as:

e = Y − Ŷ
So the idea is to check possible non-linear relationships between the "Y " and the predictors by using the residuals.
Independency
The residuals, unlike the errors, do not all have the same variance: it decreases as the corresponding x-values get farther from their means.
Studentized residuals
The Studentized residuals are (we standardize them by dividing the residuals by their standard deviation):
ẽi = ei / ( RSE √(1 − hii) )

where "hii" is the i-th diagonal element of the hat matrix:

H = X(XᵗX)⁻¹Xᵗ
It is very important to use Studentized residuals instead of the standard residuals because with them we
can check if the residual is large or not: for Studentized residuals the standard bounds are "[−2, 2]" and so
all the values outside this interval are considered "unusual" (a sort of outlier of the residuals).
In a good regression model the Studentized residuals should not present any pattern and, as we just said, observations with a Studentized residual outside the interval "[−2, 2]" should be regarded as outliers. In the plot for our model we have the predicted values "Ŷ" on the abscissa axis and the Studentized residuals "ẽ" on the ordinate axis. We have a good regression when the plot does not show any shape (we should obtain a regular cloud of points inside the standard interval).
In our case we have two main problems: first, the cloud of values has a particular shape, and second, we have several outliers (values outside the standard interval "[−2, 2]") in the bottom part of the graph.
11.12 Prediction
In this last point we see how to use the regression for the prediction. This is the first part of "true statistical model":
we have data points, from the data points we compute an equation and then we use the equation to predict values
of the response variable, also for data for which we don’t have the response variable. While computing the predicted
value isn’t that hard (we just need to plug the "X" into the equation and obtain the prediction), in this framework
computing the variance of the prediction (computing the confidence interval) is a bit more difficult. Prediction for
new observations is indeed made through confidence intervals. There are two kinds of such intervals:
• CI’s for the mean predicted value [prediction of the random variable]
• CI’s for a specific new value (forecast which is the new observation) [it is a confidence interval for the outcome
of the random variable]
We use the Software R function "predict" with "interval="confidence"" for means and "interval="prediction""
for forecasts. For instance:
predict(model,newdata=data.frame(TV=150,radio=50,newspaper=50), interval="prediction")
where "model" is the output of the "lm" function containing the equation. The output is the prediction for a new
observation which invests "150" for "TV", "50" for "radio" and "50" for "newspaper". From the output we see that
the prediction for this new observation is "19.12", with a prediction interval of "[15.82; 22.54]".
11.13 Working with qualitative predictors
In our previous example we only worked with quantitative predictors, but predictors are not random variables (there's no randomness in them). This means that as predictors we can use quantitative variables as well as qualitative ones (for example "the type of the company" or its "location"). In this last case we do have a predictor but we don't have any "number" for it, so we can't plug it directly into the equation. When we need to use a qualitative predictor
with "k" levels, we need:
• to create "k − 1" dummy variables (indicator variables with "1" on a given level, and "0" otherwise). Notice
that we can’t take all the "k" elements because we would obtain a "non full rank" matrix and we wouldn’t
compute the MLE and the LS estimator.
• to test the significance of the predictor with an F-test, since "(k − 1)" coefficients are involved simultaneously
This step is done by the "lm()" function, provided that the system correctly classifies the qualitative variable.
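A minimal sketch of how "lm" handles a qualitative predictor (the data and the variable names, such as "location", are hypothetical; the factor has 3 levels, so 2 dummy variables are created):

set.seed(1)
location = factor(sample(c("North", "Center", "South"), 50, replace = TRUE))
sales = 10 + 2*(location == "South") + rnorm(50)   # simulated response
model.q = lm(sales ~ location)
summary(model.q)   # two dummy coefficients, one level acts as the baseline
anova(model.q)     # F-test for the overall significance of the qualitative predictor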
11.14 Interactions
To check for interactions between the predictors we need to add the product of two (or more) columns to our design matrix. For instance, if we include the interaction "TV-radio" we write (we replace the "+" with the "∗", which in the R formula adds the main effects together with their interaction, i.e. the product term):
model.wi=lm(sales~TV*radio+newspaper,data=data)
summary(model.wi)
> Call:
lm(formula = sales ~ TV * radio + newspaper, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.2929 -0.3983 0.1811 0.5957 1.5009
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.728e+00 2.533e-01 26.561 < 2e-16 ***
TV 1.907e-02 1.509e-03 12.633 < 2e-16 ***
radio 2.799e-02 9.141e-03 3.062 0.00251 **
newspaper 1.444e-03 3.295e-03 0.438 0.66169
TV:radio 1.087e-03 5.256e-05 20.686 < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.9455 on 195 degrees of freedom
Multiple R-squared: 0.9678, Adjusted R-squared: 0.9672
F-statistic: 1466 on 4 and 195 DF, p-value: < 2.2e-16
This model has 4 predictors which are "TV ", "radio", "newspaper" and the interaction between "TV*radio". From
the output we notice that all the estimates are positive numbers: this means that each expenses in adv causes an
increase in the sales (the response variable). We then notice that "newspaper" is not significant and that there’s a
significant positive interaction (we have to look at the "TV:radio" estimate): the effect is more than linear. This
means that it’s better for example to split the budget among these two variables (and not only one of them) since
their interaction generates more sales (has a positive effect due to their "correlation" [of course not the correlation
of random variables but as we mean their "association"]).
Notice that we could also consider multiple interactions but also remember that the more interactions we add,
the less "useful" the model is. Usually when variables are non-significant we don’t consider them in the interaction
analysis: interaction is defined only for significant predictors.
11.15 Nonparametric approach: the k-NN method
Just as for parameter estimation we considered two approaches (the parametric approach based on the likelihood and the non-parametric one based on the bootstrap), we do the same in the regression framework. The regression line based on the LS estimation and the MLE is the parametric approach: we write a linear equation and the parameters are the coefficients inside the equation.
There’s a simple non-parametric counterpart for the regression which is called "k-NN". The "k-NN " (abbreviation
for k-Nearest Neighbors) algorithm is the easiest non-parametric solution for regression (and classification). This
method computes predicted values based on a similarity measure on the regressors. It is a "lazy learning" algorithm:
it does computations only when the computation of a predicted value is requested.
k-NN method
In this framework we don't use an equation for the computation of the predicted values but, whenever we have a new observation to predict, we find its "k" nearest neighbors in the space of the predictors and use them to compute the predicted value:

ŷ(n+1) = (1/k) Σ(i ∈ N(n+1)) yi

where "N(n+1)" is the set of the "k" observed points nearest to the "(n + 1)"-th (the neighbors).
k-NN
We consider an example where we have two predictors "X1 " and "X2 ": our goal is to predict a response
variable "Y " (we suppose that the observed values are the black triangles in the figure). We need to compute
our predicted value for the values of the predictors given by the red point in the graph. In order to define the predicted value we just need to consider the nearest "k" points and then compute the mean of the response variable of these points. So if for example we consider "k = 2" and "k = 5" it means that, starting from the red point, we consider its "2" and "5" nearest points respectively.
For "k = 1", we simply divide our plane into several sectors. The predicted values form a piece-wise constant
function over polygons: such regions are called Voronoi cells.
Regression line
Try to figure out what is the "regression line" in the simple linear regression (one predictor) when "k = 1".
In this case for each point we need to consider the nearest point: we have a step function. Here for each
value of the "x" we consider the nearest value on the "x" axis and then we take the corresponding value of
the response variable.
11.16 Distance
Another important practical problem is how to compute the distance between the points. Usually the distance is the Euclidean distance in "Rp" between two vectors (the square root of the sum of all the squared differences):

d(xi, xj) = √( Σ(h=1..p) (xi,h − xj,h)² )

More generally, one can use the Minkowski distance:

d_r(xi, xj) = ( Σ(h=1..p) |xi,h − xj,h|^r )^(1/r)

Special cases:
• "r = 1" is know as the city-block distance, or Manhattan distance (used especially for categorical predictors)
• "r = 2" is the Euclidean distance
• "r → +∞" is the Supremum distance:
The problem here is the standardization of the variables because if we take multidimensional "X" (we have more
than one predictor) the distances over one axis may become irrelevant (for example on one axis we have a great
distance and on the other a very small one). The solution to this problem is the standardization of the variables
before starting the analysis. We replace each regressor (predictor) with its standardized version. We rescale the
data so that the mean is zero and the standard deviation is "1":

xh → (xh − x̄h) / sh

In the k-NN algorithm, however, the usual choice is the min-max rescaling:

xh → (xh − min(xh)) / (max(xh) − min(xh))
The k-NN regression can be performed with a number of different machine learning packages. Here we use the
"caret" package and the function "knnreg".
We use the previous "advertising.csv" dataset, trying to predict the response for the last "5" rows (using the
first "195" examples). In order to check the goodness of fit of our model we adopt the standard
strategy used when dealing with complicated regression models: the idea is to divide the dataset into two
parts:
• The first one (larger) to train the model [the algorithm]
• The second one to test the model (to see if we obtain a good performance)
In this case we have "200" obs: we use "195" rows for the training and "5" for the testing phase [usually the
standard split is 70% for the training set and 30% for the testing set]. If the last "5" predicted values are
close to the observed values then the model has a good fit.
stdize=function(x)return((x-min(x))/(max(x)-min(x)))
sc.Advertising=as.data.frame(lapply(Advertising[,2:4],stdize))
Note that the standardization is only applied to the predictors. Select the examples of the training set:
tr.set.x=sc.Advertising[1:195,]
tr.set.y=Advertising[1:195,5]
tr.set=data.frame(tr.set.x,tr.set.y)
and the predictors of the test set (given by the last rows of the dataframe):
test.set.x=sc.Advertising[196:200,]
Now execute:
knnmodel=knnreg(tr.set.x,tr.set.y)
pred_y = predict(knnmodel, test.set.x)
and check:
pred_y
Advertising$sales[196:200]
And so we obtain the set of 5 predicted values. We have a quite good prediction: for example the first observed
value is "7.6" and we predicted "7.66" (and so on). If we want a numerical summary of our goodness of fit we
can take the square correlation between the two vectors. In this case the default value for the "k" is "k = 5".
We can then also check the "R2 ":
Rsq=cor(pred_y,Advertising$sales[196:200])^2
[Figure: classification example with a response variable with 2 levels, coded with the colors green and black.]
11.18 The effect of "k"
In classification, it is simple to show the effect of "k" on the classification rule.
Increasing "k" yields smoother predictions, since we average over more data. When "k = 1" the prediction is a
piecewise constant labeling; when "k = n" the prediction is the globally constant (majority) label.
The Software R package for doing classification with k-nearest neighbors is "class". Here a "knn" function is
available (similar to the "knnreg" we have analyzed earlier). Again, there is a large number of functions to perform
k-NN classification, so always use the documentation to understand the syntax of your function and the data
structure for the input.
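As a minimal, hypothetical sketch (the data here are simulated only for illustration, and the object names are not from the course material), a k-NN classification with the "class" package could look like this:
library(class)
set.seed(1)
#simulated training data: two standardized predictors and a two-level label#
train.x=data.frame(x1=runif(100),x2=runif(100))
train.y=factor(ifelse(train.x$x1+train.x$x2>1,"green","black"))
test.x=data.frame(x1=runif(10),x2=runif(10))
#knn() returns the predicted label of each test point, using its k nearest training points#
pred=knn(train=train.x,test=test.x,cl=train.y,k=5)
pred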
12 Lab 5
12.1 Multiple Linear Regression - Boston Housing
In this lab we review the basics of Multiple Linear Regression using a classical data set, the "Boston housing"
sample. To use the data, just load the package "MASS", and the dataset named "Boston" (for the description
of the variables type "help(Boston)"). The response variable in this data set is "medv", the median value of
owner-occupied homes (in thousands of dollars).
1. Do some basic descriptive statistics for all the variables (use first the "summary" function). Look at the
boxplots to detect outliers in the (univariate) distributions and to see the symmetry/asymmetry of the dis-
tributions.
The "Boston Housing" dataset was made to predict the price (the "value") of the houses in Boston, starting
from geographical and socio-economic predictors. The response variable is "medv" which is (we can also see
this from the "help" tab) the median value of owner-occupied homes in thousands of dollars. So in our dataset
we have a list of regressors and in the end the response variable.
Here we just import the libraries we need and then compute a basic description of our data (we explore the
random variables). Remember that in regression problems the most important random variable is the response
variable because it is the only variable with a random component (the predictors are fixed).
library(MASS)
library(corrplot)
library(nortest)
###1)
data(Boston)
View(Boston)
From the output ("summary") we notice that the response variable "medv" is highly skewed and asymmetric
(check the values of the quantiles and so on). We can of course see this more clearly by using the boxplot
function. From the plot we also notice that there are several outliers on the right side of the boxplot and the
same conclusion is also given by the plot for the density estimation.
As we can see we have a quite strange distribution which doesn’t look like a Gaussian distribution: this is not a
problem since in regression we don’t need to have normal distributions. The response variable is not normally
distributed but the conditional distribution of the response variable given the predictors is. In this case the
only normal distribution we actually need to have is the distribution of the residuals.
We can then explore also the "quantitative" and "categorical" predictors (in the output we just showed one
example for each case):
• The quantitative predictor "crim" is again highly skewed and highly asymmetric (we again use the
"summary" and the "boxplot" functions). We again notice that there are several outliers on the right
side of the box.
• For the categorical predictors we considered the "chas" predictor (it is a dummy variable where "0" means
that the geographical zone does not bound the Charles River). Here we can just use the "barplot" or the "table"
functions in order to check the proportions of the observations in the two groups (a short code sketch follows).
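A minimal sketch of these descriptive checks (the boxplots, the density estimate and the frequency table mentioned above) could be:
summary(Boston$medv)
boxplot(Boston$medv,main="medv")      #several outliers on the right side#
plot(density(Boston$medv))            #kernel density estimate of the response#
summary(Boston$crim)
boxplot(Boston$crim,main="crim")
table(Boston$chas)                    #frequencies of the dummy variable#
barplot(table(Boston$chas))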
2. Compute the correlation matrix. If you would like to have a plot of the correlation matrix, use the function
"corrplot" in the package "corrplot".
In our dataset we have only numerical variables (even the categorical predictors are dummy variables with
"0" and "1" values) and so we can directly compute the correlation of the entire dataframe. By default the
correlation of the entire dataframe (we use the function "cor") is the correlation between all the possible pairs
of predictors.
#Explore correlations#
cor(Boston)
corrplot(cor(Boston))
Of course, since we have the response variable, 13 predictors and 506 observations, it's not easy to directly
comprehend the output of the correlation: in order to obtain a clearer output we use a graphical
representation of the correlation given by the "corrplot" function:
Here we have a picture representing the correlation between all the variables. The correlation ranges in the
interval "[−1, 1]" and we have different colors for negative (red) and positive (blue) correlations (of course
notice that along the diagonal every variable has a perfect correlation with itself). We are then interested in
all the non-diagonal points: for example between "indus" and "tax" we have a strong positive correlation.
Another interesting example is the strong positive correlation between the industry variable "indus"
and the air-pollution variable "nox". It's also important to notice that in this plot of the correlation matrix
all the variables are represented, including the dummy variables of categorical variables (here we have the "chas"
dummy variable). This means that for this kind of variable the use of the correlation is not totally correct
from a statistical point of view (they are not quantitative).
3. Use the "lm" function to define a model with all the predictors. Look at the "R2 ", the plot of the residuals,
the Fisher and Student’s tests. Remove the non-significant predictors, and look again at the diagnostics. In
your final model, test the normality of the residuals.
Here we start to do the regression. The complete model is obtained with the "lm" function, which takes a particular
formula: first of all we insert the response variable "medv" and then, after the tilde, all the predictors separated
by a "+" sign. Remember that in regression it is not recommended to use the "attach" function because, as we
will see, it creates problems when dealing with several dataframes which use the same variables.
model_c=lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat,data=Boston)
summary(model_c)
As we can see from the output the model is significant since the p-value of the Fisher test is very small. The
multiple R-squared in this case is sufficiently large and only a few predictors are not significant ("indus" and
"age").
Call:
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age +
dis + rad + tax + ptratio + black + lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If we want to create a reduced model starting from this complete model we can then remove these two non-
significant predictors (we will just have 11 predictors instead of 13). This is still not a fully satisfactory model, since
we seek small models and 11 predictors are still many (we will address this later). In order to perform this we just
use the same lines of code but this time we remove the unwanted predictors:
model_r=lm(medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat,data=Boston)
summary(model_r)
Again we obtain a very similar output. Now that we have our "final" model we compute and test the normality
of the residuals:
Here we have the fitted (predicted) values on the "x" axis and the Studentized residuals on the "y" axis. From
the plot we notice some "shape" and so we can't be really satisfied with our residuals. Also notice that there
are several outliers outside the standard boundaries "[−2, 2]" (we have a very "large/wide" residual plot).
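A minimal sketch of how such a residual plot could be produced (assuming it shows the Studentized residuals against the fitted values) is:
plot(fitted(model_r),rstudent(model_r),xlab="fitted values",ylab="Studentized residuals")
abline(h=c(-2,0,2),lty=2)    #reference lines at the standard boundaries#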
As we can see from the plot we can easily state that the residuals are not normally distributed but in order to
double-check the non-normality of the residuals we can then use the "ad.test" function:
data: model_r$residuals
A = 10.483, p-value < 2.2e-16
As we predicted the Anderson-Darling test rejects the normality of the residuals (the p-value is very small).
4. Based on the correlation matrix, select the four predictors most correlated with the response variable, and
find a possible reduced model as above. If a nonlinear relationship between the response and a predictor is
likely (e.g. if the response and the predictor have different shapes of the distribution), you can add a quadratic
term. If the regressor is "x" the quadratic term is "I(x^2)". As always look at the "R^2", the plot of the residuals,
etc.
The idea behind this point is that the model-selection based on the p-value (the Student t test) didn’t give a
satisfactory answer because we reduced the model but not in a considerable way as we still have 11 predictors
(we are still in a large model). So we now try a different solution based on this remark: since in terms
of regression the most relevant correlations are the ones in the last row (column) of the matrix, we check
the correlation between the response variable and the predictors. So the idea is to take just the 4 largest
correlations (of course we consider their absolute value) and to define a model where we consider just those
particular predictors.
In order to perform this we don’t display again the whole matrix but we just consider the last column (row)
and check which are the "greatest" values:
cor(Boston)[,14]
As we can see from the output the best correlations (in absolute value) are (in order) "lstat", "rm", "ptratio"
and then "indus". And so now we define a model only based on these four predictors:
model_1=lm(medv~lstat+rm+ptratio+indus,data=Boston)
summary(model_1)
Again from the output of the summary we can check that all the predictors are significant except "indus": this
means that in our final model we will just remove it:
model_2=lm(medv~lstat+rm+ptratio,data=Boston)
summary(model_2)
Notice that in this passage from "11" to just "3" predictors we record a very small decrease in the value of
the multiple R-squared. So we obtained, at the cost of a slight loss of precision, a much leaner and simpler
model. The idea of model selection we will study in the last part of the course will indeed be about the trade-off
between these two extremes: the best (most precise) prediction and the simplest model.
The question about the "non-linear" relationship is not so relevant for this exercise but it's important for the
theory of this course. If we find a "non-linear" relation between the response variable and the predictors (for
example the plot of the residuals has a parabolic shape) we can try to add a quadratic term to the predictors:
for example we don't consider "lstat" but "lstat^2". The important point here is that the model is still linear
because it is linear in the parameters, but the predictors can be transformed as we want (we can take the
"log", the "sin", the "cos" and so on). For example we now add some quadratic terms, using the function
"I(x^2)", to the last final model we computed:
model_3=lm(medv~lstat+I(lstat^2)+rm+I(rm^2)+ptratio+I(ptratio^2),data=Boston)
summary(model_3)
Here the output changes as the "ptratio" becomes non-significant even in its linear form.
5. Comment the results of your analysis.
As we previously said adding non-linear forms to the model doesn’t change the linearity of the model, as it is
linear in the "β", but we can "change" (with functions) the predictors and we don’t have to set any probability
distribution as they are fixed.
We also noticed that adding or removing predictors from the model can actually change the significance for
other predictors (as we saw before) [this is due to the multicollinearity of the correlation matrix].
COMPLETE EXERCISE
library(MASS)
library(corrplot)
library(nortest)
###1)
data(Boston)
View(Boston)
###2)
#Explore correlations#
dim(Boston)
cor(Boston)
corrplot(cor(Boston))
###3)
#Multiple linear regression - Complete model#
model_c=lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat,data=Boston)
summary(model_c)
13 Generalized Linear Models
13.1 Introduction
In past lectures we recalled the basics about linear regression. Let us now take up the question of building models
where the standard assumptions of the linear model are impossible or not plausible. For instance, the standard linear model assumes:
• A random component:

Y_i \sim N(\mu_i, \sigma^2)

• A systematic component:

\eta_i = x_i^t \beta
[We will see later that the mathematics of generalized linear models work out nicely only for a special class of
distributions, namely the distributions in the exponential family.]
This is not as big a restriction as it sounds, however, as most common statistical distributions fall into this family,
such as the Normal, Binomial, Poisson, Gamma, and others.
13.5 The systematic component – overview
The quantity:

\eta_i = x_i^t \beta

is referred to as the linear predictor for observation "i". The linear predictor is the way to re-use the techniques
of linear models in GLMs. From:

E(Y_i | x_i) = x_i^t \beta

we now need to move to:

g(E(Y_i | x_i)) = x_i^t \beta
We need to introduce the function "g" to take into account the special features of the response variable.
We will see again later that we obtain some mathematical advantages when writing the distribution of the "Y"
in exponential form and then deriving from that the link function.

\mu_i = \gamma \cdot \exp(\delta a_i)

Taking logs gives "\log(\mu_i) = \log(\gamma) + \delta a_i", which is linear in the parameters. Furthermore, since the outcome is a count, the Poisson distribution seems reasonable. Thus, this model
fits into the GLM framework with a Poisson outcome distribution, a log link, and a linear predictor given by
"\beta_0 + \beta_1 a_i" (with "\beta_0 = \log(\gamma)" and "\beta_1 = \delta").
Predator-prey model
The rate of capture of prey, "yi " , by a hunting animal increases as the density of prey, "xi " , increases, but
will eventually level off as the predator has as much food as it can eat. A suitable model is:
\mu_i = \frac{\alpha x_i}{h + x_i}

This model is not linear, but taking the reciprocal of both sides:

\frac{1}{\mu_i} = \frac{h + x_i}{\alpha x_i} = \beta_0 + \beta_1 \frac{1}{x_i}
Because the variability in prey capture likely increases with the mean, we might use a GLM with a reciprocal
link and a gamma distribution.
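A minimal, hypothetical sketch of such a fit (the data frame "prey", with columns "density" and "captures", is made up only for illustration) could be:
#hypothetical data: prey density and rate of capture#
prey=data.frame(density=c(2,5,10,20,40,80),captures=c(0.8,1.7,2.6,3.4,3.9,4.2))
#Gamma GLM with the reciprocal (inverse) link: 1/mu = beta0 + beta1*(1/density)#
fit.prey=glm(captures~I(1/density),family=Gamma(link="inverse"),data=prey)
summary(fit.prey)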
Consider now a Bernoulli response, "Y_i \sim \mathrm{Bern}(p_i)". If we write:

Y_i | x_i \sim \mathrm{Bern}(\mu_i), \qquad E(Y_i | x_i) = x_i^t \beta

where for notational convenience we set "p_i = \mu_i = x_i^t \beta", we can compute the estimate "\hat\beta_{OLS}", but the estimator
"B_{OLS}" is not minimum-variance. In fact, we have:

VAR(Y_i | x_i) = \mu_i (1 - \mu_i)

and, except for the trivial case of constant "\mu_i", there is heteroskedasticity: heteroskedasticity is an intrinsic property
of GLMs.
WLS estimator
Define "W = VAR(Y)^{-1}", a diagonal matrix with the reciprocals of the variances as diagonal elements.
Then, by solving:

\sum_{i=1}^{n} x_i w_i (y_i - x_i^t \beta) = 0

we obtain:

B_{WLS} = (X^t W X)^{-1} X^t W Y

where "B_{WLS}" is the best linear unbiased estimator for the heteroskedastic regression problem and
"W" is the matrix of the weights, a diagonal matrix with the reciprocals of the variances on the main
diagonal and "0" elsewhere:

W = \begin{pmatrix} 1/\sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1/\sigma_n^2 \end{pmatrix}

So in general this is the solution of the least squares estimation for heteroskedastic problems, i.e. when the
variance is different for each observation.
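When the variances (and hence the weights) are known, the estimator above can be computed directly; a minimal sketch with simulated data (the object names are made up only for illustration) is:
set.seed(1)
x=1:20; sigma2=x^2                      #known variances, increasing with x#
y=1+0.5*x+rnorm(20,sd=sqrt(sigma2))
X=cbind(1,x); W=diag(1/sigma2)          #weights: reciprocals of the variances#
solve(t(X)%*%W%*%X)%*%t(X)%*%W%*%y      #matrix form of the WLS estimator#
lm(y~x,weights=1/sigma2)                #the same estimate with lm() and weights#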
The problem here is that we assumed that the weights are known. In order to compute the estimator (the weighted
least squares estimator for the "\beta") we need to know the matrix of regressors "X", the vector of responses
"Y" and also the weights. The problem is that in general we have no information about the "\sigma_i".
We now consider the simplest case, in which the response variable has a Poisson distribution (we consider this
distribution since it is the simplest one, as the mean is equal to the variance):

Y_i \sim \mathcal{P}(\mu_i), \qquad \sigma_i^2 = VAR(Y_i) = \mu_i, \qquad \mu_i = g^{-1}(x_i^t \beta)
But the mean of the response variable is exactly what we want to estimate with our linear model (we don’t know
its value): in this framework we operate with unknown weights.
Gaussian regression (VAR depends on the mean)
The cross-section data set "foodexpenditure" contains average weekly expenditure on food and average
weekly income in dollars for households that have three family members. "40" households have been randomly
selected from a larger sample. The data set is from William E. Griffiths, R. Carter Hill and George G. Judge,
Learning and Practicing Econometrics 1993, Wiley (Table 5.2, p. 182). The variables in the data set are "income" and "foodexp".
reg=lm(foodexp~income,data=foodexpenditure)
plot(foodexp~income,data=foodexpenditure)
abline(reg, col="blue")
13.11 Iterative re-weighting
Roughly speaking, if we assume that the variance increases with the expected value, we can suppose that the
standard deviation "\sigma" is proportional to the mean, and hence that the variance is proportional to the square of the
mean [this is our assumption: it is suggested by the previous plot (the megaphone effect). The lines imply that
"\sigma" is linear with respect to the expected value "\mu". Since we noticed this behavior of our data points, the
assumption is that the standard deviation is linearly dependent on the mean, and therefore that the variance is
proportional to the square of the expected value]:

\sigma \propto \mu, \qquad VAR(Y) \propto \mu^2
Here we don’t know the mean but we know the function which links the variance to the mean.
So the idea is to use the WLS method iteratively: we start from an arbitrary set of weights (maybe constant)
[this means that the first step is basically an OLS]. With the "β" we can have the first estimate of the expected
values (we use the inverse "g"). Then, by using the expected values, we can refine the estimate of the weights. We
use the second estimate of the weights and we compute again the "β" and so on.
Once we fit the model, we have estimates for "{\mu_i}" and therefore for "{w_{ii}}", where "\hat{w}_{ii} = 1/\hat{\mu}_i^2". However, once
we change "{w_{ii}}", we change the fit, and thus we change "{\hat{\mu}_i}", which changes "{w_{ii}}", and so on. Graphically we
have:

W_1 = \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix}
\;\longrightarrow\; B_1 = (X^T W_1 X)^{-1} X^T W_1 Y
\;\longrightarrow\; \mu^{(1)} = g^{-1}(X B_1)
\;\longrightarrow\; W_2 = \begin{pmatrix} 1/\mu_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1/\mu_n^2 \end{pmatrix}
When we compute "\mu^{(1)}" we then compute the second matrix "W_2", because we have the first estimate of the expected
value (the weights are the reciprocals of the variances, and the variance is proportional to the square of the mean in our assumption).
In the end we have:

B_2 = (X^T W_2 X)^{-1} X^T W_2 Y

and we repeat this process until convergence.
IRLS algorithm
1. Start from an arbitrary (e.g. constant) weighting matrix "W"
2. Fit the model, obtaining the estimates "\hat\beta" (estimate of the "\beta") and "\hat\mu" (estimate of the expected
value)
3. Use the expected value "\hat\mu" to recalculate (update) the weighting matrix "W"
4. Repeat steps 2–3 until convergence
This approach is known as the Iteratively Reweighted Least Squares (IRLS) algorithm.
Gaussian regression (VAR depends on the mean)
We start from the standard OLS linear model (first step) where the response variable is the "foodexp"
and the predictor is the "income". This implementation assumes that 20 iterations is enough to achieve
convergence. This is fine for this case, but in general, it is better to write a "repeat" or "while" loop
which checks for convergence at each iteration and terminates the loop when convergence is reached (we are
done when estimates are constant and don’t change anymore). Notice again that the weights "w" are the
reciprocal of the fitted values.
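A minimal sketch of the two fits referenced in the code below ("fit" for the first OLS step and "fit.w" for the reweighted fit; following the text, the weights are taken as the reciprocals of the fitted values) could be:
fit=lm(foodexp~income,data=foodexpenditure)      #first step: ordinary least squares#
fit.w=fit
for (i in 1:20)                                  #20 reweighting iterations#
{
w=1/fitted(fit.w)                                #weights: reciprocals of the fitted values#
fit.w=lm(foodexp~income,data=foodexpenditure,weights=w)
}
summary(fit)
summary(fit.w)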
plot(foodexp~income,data=foodexpenditure,pch=19,ylab="food expenditure")
abline(fit,col="red")
abline(fit.w,col="blue")
The IRLS method is better because it reduces the weights of the data points with large dispersion. Also notice
that the standard error of the estimator drops from "0.05529" to "0.04643". Furthermore, the "R2" increases from
"0.3171" to "0.4671".
13.13 Choice of the weights
In general the choice of the weights is difficult. Here is a statistician (on the right) asking the Delphi’s Pythia (on
the left) what kind of weights should he use.
Remark
The log-likelihood of a sample is simply the sum of the log-likelihoods of the single observations, because of
the independence assumption.
The first derivative with respect to "\theta_i" (i.e. the score function for "\theta_i") is:

\frac{\partial}{\partial \theta_i} \ell(\theta_i, \phi \,|\, y_i) = U(\theta_i) = \frac{y_i - b'(\theta_i)}{\phi}

Remember also that the score function has expected value and variance equal to:

E(U(\theta_i)) = 0, \qquad VAR(U(\theta_i)) = -E(U'(\theta_i))

the latter being the (i,i)-th component of the Fisher information matrix (the variance of the score function).
Since the score has mean zero, we find that:

E(Y_i) = b'(\theta_i)
The second partial derivative of the log-likelihood (which is the first derivative of the score function) is:

\frac{\partial^2}{\partial \theta_i^2} \ell(\theta_i, \phi \,|\, y_i) = -\frac{b''(\theta_i)}{\phi}
which is (minus) the observed information for the canonical parameter "θi " pertaining the i-th observation.
Exponential families are important, as we already said, because the score function and the information matrix are
very easy to compute as we don’t need to compute the expectation or variance. We have the function "b": by taking
its first derivative we obtain the "mean" of the response variable and by taking its second derivative we directly
define the information matrix. The variance of the score is the information matrix, so it is the second derivative of
"b" divided by "ϕ". So we have that:
VAR(U(\theta_i)) = VAR\left( \frac{Y_i - b'(\theta_i)}{\phi} \right) = \frac{VAR(Y_i)}{\phi^2} = \frac{b''(\theta_i)}{\phi}

from which we also have:

VAR(Y_i) = \phi \, b''(\theta_i)
The variance of "Y_i" is therefore a function of both "\theta_i" and "\phi". Note that the canonical parameter is a function of "\mu_i":

\mu_i = b'(\theta_i), \qquad \theta_i(\mu_i) = (b')^{-1}(\mu_i)

so that we can write:

VAR(Y_i) = \phi \, b''(\theta(\mu_i)) = \phi \, V(\mu_i)
where "V " is the variance function dictating the link between the variance and the mean.
Remark
Note that here we specify a transformation of the conditional mean "\mu_i", but we do not make any
transformation of the response (random) variable "Y_i". This means that we have:

g(E(Y_i | x_i)) = x_i^t \beta

which is different from the expected value of the transformation, "E(g(Y_i) | x_i)". Remember that expected values and
transformations can be switched only for linear transformations.
The inverse of the link function provides the specification of the model on the scale of "\mu_i":

\mu_i = g^{-1}(x_i^t \beta)

We can do this because here we don't have any probability distribution (and hence no likelihood): the
predictors "x_i" are not random (we have no random components).
Canonical link function
The canonical link is:
g(µi ) = θ(µi )
Of course we can use any function as "link" function but the computation becomes easy if we consider the
expression above, where we have the canonical parameter as a function of the expected value "µi ". This is
the most powerful choice because most of computations become easy (on the contrary in the general case
the computations may become difficult and in some cases the solution can’t be computed explicitly).
• The function derives from the canonical form of the distribution as a member of the exponential family
• The function is the inverse of "b'"
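As a quick check, for the common families in Software R the default link of the family object coincides with the canonical link; for instance:
gaussian()$link     #"identity"#
poisson()$link      #"log"#
binomial()$link     #"logit"#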
Credit card payments during Black Friday (I)
For "7" customers we observe the annual income ("x") and the number of credit card payments ("y"):
y=c(0,0,2,1,5,7,12)
x=c(14.1,16.3,16.8,20.4,30.2,38.0,39.9)
plot(x,y,ylab="N° Credit Payments",xlab="Annual Income")
The response variable is a count and we may suppose that variance increases with the mean, so a Poisson
response variable appears as a suitable choice. So we write the Poisson distribution in the form of exponential
families. In this case we have a sample of "7" observations:
Yi ∼ P(λi ) i = 1, . . . , 7
We then compute the log-likelihood for each i-th element (remember that the density of a Poisson distribution
is "f_Y(y) = e^{-\lambda} \frac{\lambda^y}{y!}"):

\ell(\lambda_i | y_i) = y_i \log(\lambda_i) - \lambda_i - \log(y_i!) = \frac{y_i \theta_i - b(\theta_i)}{\phi} - c(y_i)

but in the Poisson distribution we know that the dispersion parameter is "\phi = 1" and so we can remove
the denominator. Here the mean is "\mu_i = \lambda_i" and the canonical parameter is "\theta_i = \log(\lambda_i)" (with "b(\theta_i) = e^{\theta_i}" and "c(y_i) = \log(y_i!)"), so the canonical link is the logarithm and the model is:

g(\mu_i) = \log(\mu_i) = \beta_0 + \beta_1 x_i, \qquad i = 1, \ldots, 7
or in matrix form:
g(µ) = Xβ
where "X" is the design matrix with dimensions "7 × 2" (it contains one column for each quantitative
predictor and a column of all "1" for the intercept):
X = (1|x)
as in the multiple regression framework. In Software R we can use the following script:
y=c(0,0,2,1,5,7,12)
x=c(14.1,16.3,16.8,20.4,30.2,38.0,39.9)
X=cbind(rep(1,length(x)),x) #Here we bind the columns#
> X
x
[1,] 1 14.1
[2,] 1 16.3
[3,] 1 16.8
[4,] 1 20.4
[5,] 1 30.2
[6,] 1 38.0
[7,] 1 39.9
We can use the chain rule (derivative of composite functions) to obtain the expression for the i-th contribution to
the score function for "βj ":
\frac{\partial}{\partial \beta_j} \ell(\beta, \phi | y_i) = \frac{\partial}{\partial \theta_i} \ell(\beta, \phi | y_i) \, \frac{\partial \theta_i}{\partial \mu_i} \, \frac{\partial \mu_i}{\partial \eta_i} \, \frac{\partial \eta_i}{\partial \beta_j}

but (remember that previously we showed some properties):

\frac{\partial}{\partial \theta_i} \ell(\beta, \phi | y_i) = \frac{y_i - \mu_i}{\phi}, \qquad \frac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i) = V(\mu_i), \qquad \frac{\partial \eta_i}{\partial \beta_j} = (x_i)_j = X_{ij}
If we plug these expressions into the previous derivative we obtain the so-called "normal equations" (they derive
from the log-likelihood as they are its derivatives with respect to the "\beta"), which are the equations we need to
solve in order to obtain the MLE for the "\beta":

\frac{\partial}{\partial \beta_j} \ell(\beta, \phi | y) = \sum_{i=1}^{n} (y_i - \mu_i) \, \frac{\partial \mu_i}{\partial \eta_i} \, \frac{X_{ij}}{\phi V(\mu_i)} = 0

For the Gaussian distribution with the identity link these equations reduce to the ordinary least squares solution:

\hat{\beta}_{MLE} = (X^t X)^{-1} X^t y
With similar arguments we can also obtain the MLE estimate "ϕ̂" of "ϕ".
So we have seen that this set of equations is the OLS for the Gaussian distribution: this happens because the
variance function is constant and the derivative of "µ" with respect to "η" is "1". In general we have a WLS (weighted
least squares) problem because we have weights.
Non linearity
The equations:

\frac{\partial}{\partial \beta_j} \ell(\beta, \phi | y) = \sum_{i=1}^{n} (y_i - \mu_i) \, \frac{\partial \mu_i}{\partial \eta_i} \, \frac{X_{ij}}{\phi V(\mu_i)} = 0

incorporate a non-linear part in the dependence "\mu_i = \mu_i(\beta)", and therefore also in the variance
function "V(\mu_i) = V(\mu_i(\beta))". In the sum the weight is represented by the expression "\frac{\partial \mu_i}{\partial \eta_i} \, \frac{1}{\phi V(\mu_i)} = w_{ii}".
We need to use the IRLS method because the weight depends on the mean.
Vector notation
In vector notation:
U = X^t W (y - \mu) = 0

where "W" is a diagonal matrix with diagonal elements:

w_{ii} = \frac{\partial \mu_i}{\partial \eta_i} \, \frac{1}{\phi V(\mu_i)}
13.18 Newton-Raphson
To find the MLE of the parameters means to solve the likelihood equations:

\frac{\partial}{\partial \beta_j} \ell(\beta, \phi | y) = \sum_{i=1}^{n} (y_i - \mu_i) \, \frac{\partial \mu_i}{\partial \eta_i} \, \frac{X_{ij}}{\phi V(\mu_i)} = 0

This is tricky because "\beta" appears in several places. We use a special version of the Newton-Raphson
algorithm, an iterative algorithm for finding (and approximating) the zero of a function, based on
its derivative. Here is a picture of how the Newton-Raphson method works in the univariate case.
The basic idea is to take a complicated function and simplify it by approximating it with a straight line:
f(x) \approx f(x^{(0)}) + f'(x^{(0)})(x - x^{(0)})

where "x^{(0)}" is the point we are basing the approximation on. Thus an iteration moves from "x^{(0)}" to:

x^{(1)} = x^{(0)} - \frac{f(x^{(0)})}{f'(x^{(0)})}
So now we need to use this general rule with the score function (as a function of "\beta") in place of "f". Starting
from a parameter vector "\beta^{(0)}", we define a sequence:

\beta^{(r+1)} = \beta^{(r)} + I(\beta^{(r)})^{-1} U(\beta^{(r)})

where, in the Fisher scoring variant, "I(\beta)" is the expected (Fisher) information matrix:

I(\beta) = X^t \tilde{W} X

and the matrix "\tilde{W}" is a diagonal matrix of weights, with diagonal elements:

\tilde{w}_{ii} = \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \frac{1}{\phi V(\mu_i)}
We can now insert our information matrix inside the iterative rule. So we have that the "\beta" at the "(r + 1)"-th iteration
is equal to the "\beta" at the previous step updated with the rule:

\beta^{(r+1)} = \beta^{(r)} + (X^t \tilde{W} X)^{-1} X^t W (y - \mu^{(r)})

With this formula we can define a sequence of parameter vectors which converges to the exact solution of our
problem, i.e. the vector for which the score is null (the MLE of the parameter).
Now we have two weighting matrices (the one in the score function and the one in the information matrix), but they are close
relatives since they only differ by one exponent:

\tilde{w}_{ii} = \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \frac{1}{\phi V(\mu_i)}, \qquad w_{ii} = \frac{\partial \mu_i}{\partial \eta_i} \, \frac{1}{\phi V(\mu_i)}

and thus:

W = \tilde{W} M

where "M" is a diagonal matrix with generic diagonal element:

m_{ii} = \frac{\partial \eta_i}{\partial \mu_i}
The equation:

\beta^{(r+1)} = (X^t \tilde{W} X)^{-1} (X^t \tilde{W} X)\beta^{(r)} + (X^t \tilde{W} X)^{-1} (X^t W)(y - \mu^{(r)})

can be rewritten in terms of the matrix "\tilde{W}" alone:

\beta^{(r+1)} = (X^t \tilde{W} X)^{-1} (X^t \tilde{W}) X \beta^{(r)} + (X^t \tilde{W} X)^{-1} (X^t \tilde{W} M)(y - \mu^{(r)}) = (X^t \tilde{W} X)^{-1} (X^t \tilde{W}) \{\eta^{(r)} + M (y - \mu^{(r)})\}

Here we replaced the expression "W = \tilde{W} M" in the second term: we do this in order to have the expression
in terms of just one weighting matrix. This is an IRLS problem with respect to the new response variable:

z^{(r)} = \eta^{(r)} + M (y - \mu^{(r)})

with:

z_i^{(r)} = \eta_i^{(r)} + (y_i - \mu_i^{(r)}) \left. \frac{\partial \eta_i}{\partial \mu_i} \right|_{\eta_i^{(r)}}
where "z" is called the working response variable.
Now, apply the IRLS step. The updated value of "\hat\beta" is obtained as the WLS estimate of the regression of "z"
on "X":

\hat{\beta}^{(r+1)} = (X^t \tilde{W}^{(r)} X)^{-1} X^t \tilde{W}^{(r)} z^{(r)}

and then compute "\eta_i^{(r+1)}" and "\mu_i^{(r+1)}", ..., until convergence.
In practice, the convergence is reached when the difference between "β̂ (r+1) " and "β̂ (r) " is "small".
Credit card payments during Black Friday (II)
Let us consider again the Black Friday data and write explicitly the ingredients of the Fisher scoring
algorithm (remember that for the Poisson distribution "ϕ = 1"):
• \eta_i^{(r)} = x_i^t \hat{\beta}^{(r)}
• \mu_i^{(r)} = g^{-1}(\eta_i^{(r)}) = \exp(\eta_i^{(r)})
• \tilde{w}_i^{(r)} = \left( \left. \frac{\partial \mu_i}{\partial \eta_i} \right|_{\eta^{(r)}} \right)^2 \frac{1}{V(\mu_i^{(r)})} = \exp(\eta_i^{(r)})^2 \, \frac{1}{\exp(\eta_i^{(r)})} = \exp(\eta_i^{(r)})
y=c(0,0,2,1,5,7,12)
x=c(14.1,16.3,16.8,20.4,30.2,38.0,39.9)
X=cbind(rep(1,length(x)),x)
beta=c(0,0)
#beta=c(log(mean(y)),0) with a clever choice of init values#
for (i in 1:20) #unconditional cycle with 20 iterations#
{
eta=X%*%beta #the "X" is the design matrix#
mu=exp(eta)
tw=exp(eta)
TW=diag(as.vector(tw))
z=eta+(y-mu)*exp(eta)^(-1)
beta=solve(t(X)%*%TW%*%X)%*%t(X)%*%TW%*%z #the WLS solution#
print(list(i,beta))
}
and so we printed a list with the iteration number and the "\beta": for each iteration we have the intercept
and the slope. From the printed output we notice that after the 11-th iteration the fourth decimal places
of both "\hat\beta_0" and "\hat\beta_1" have stabilized: this means that we have reached the convergence (the solution). In
order to define a criterion (a sort of "stopping rule" for the cycle) we can define the "error". The error is the
maximum absolute difference between two consecutive approximations: if it is greater than a certain threshold
we repeat the iterations, otherwise we stop the process because we have reached convergence.
beta=c(0,0)
#beta=c(log(mean(y)),0) #with a clever choice of init values#
err=+Inf
i=0
while(err>10^-4) #conditional cycle#
{
i=i+1
eta=X%*%beta
mu=exp(eta)
tw=exp(eta)
TW=diag(as.vector(tw))
z=eta+(y-mu)*exp(eta)^(-1)
beta1=solve(t(X)%*%TW%*%X)%*%t(X)%*%TW%*%z
err=max(abs(beta1-beta))
beta=beta1
}
print(i)
print(beta)
The output is:
> print(i)
[1] 11
> print(beta)
[,1]
-2.1347085
x 0.1137496
The coefficients yield a curve of expected values that we can plot on the scatter-plot. The best fit is given
by the curve "\hat{\mu}(x) = \exp(\hat\beta_0 + \hat\beta_1 x) = \exp(-2.1347 + 0.1137\, x)":
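A minimal sketch of this plot, using the "x", "y" and "beta" objects computed above, could be:
plot(x,y,xlab="Annual Income",ylab="N° Credit Payments")
curve(exp(beta[1]+beta[2]*x),add=TRUE,col="red")    #fitted curve: mu(x)=exp(beta0+beta1*x)#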
So now we will have to replicate the analysis we previously did for the Gaussian linear regression about the
significance of the parameters: we need to have a test for the global model and for each predictor.
Theorem
We know that "B", the ML estimator of "\beta", is asymptotically normal:

B \;\dot\sim\; N_{p+1}(\beta, I^{-1}(\beta))

Since the information matrix is estimated by plugging "\hat\beta" and "\hat\phi" into "I(\beta)", obtaining "\hat{I}(\beta)", we have:

VAR(B_j) = \hat{I}^{-1}(\beta)_{jj}
The only difference between the Gaussian theory and the GLM theory is that in the Gaussian model we have
exact distributional results for the Student-t and for the Fisher distribution: this means that we can check the
significance of all the parameters even for small samples. In GLMs, on the other hand, we work within the asymptotic
framework, so all the tests are valid only for large samples.
H_0: \beta_j = 0 \qquad H_1: \beta_j \neq 0

The Wald test statistic is:

z_j = \frac{\hat\beta_j}{\sqrt{VAR(B_j)}}

with asymptotic standard normal distribution under "H_0". The test can be easily generalized to a multi-parameter
version.
• Obtain the best fitting model in the large (complete) model: "\hat\beta_{MLE}"
• Obtain the best fitting model in the reduced model: "\hat\beta_{0,MLE}"
• Compute the likelihood ratio statistic "2\{\ell(\hat\beta_{MLE}) - \ell(\hat\beta_{0,MLE})\}", which under "H_0" is asymptotically chi-squared with degrees of freedom equal to the difference in the number of parameters
If the p-value is larger than the threshold we imposed then the likelihood ratio test accepts the null hypothesis
"H_0" (which establishes that the reduced model is good enough to describe our data), otherwise we reject it.
13.24 Wald test vs Likelihood ratio test
The Wald test and the likelihood ratio test measure the discrepancy between two nested models by checking different
quantities. The following plot (a sort of parabola) represents a typical log-likelihood function for a GLM:
So the log-likelihood ratio measures the difference on the "y" axis, i.e. the difference between the complete and
reduced models. The Wald test on the other hand measures the difference on the "x" axis.
13.25 Deviance
To actually compute the likelihoods it is useful to use the notion of deviance. For a given model compute:
\ell(\hat\mu, \hat\phi, y) = \sum_{i=1}^{n} \ell(\hat\mu_i, \hat\phi_i, y_i)
and contrast it with the likelihood of a saturated model [it is the model where we reach the perfect fit] (one
parameter for each observation), where the perfect fit is reached (i.e. "µ̂i = yi "):
\ell(y, \hat\phi, y) = \sum_{i=1}^{n} \ell(y_i, \hat\phi_i, y_i)
Deviance
The following quantity is the deviance of the model:

D = 2 \{ \ell(y, \hat\phi, y) - \ell(\hat\mu, \hat\phi, y) \}

The deviance is a good measure because it is based on the log-likelihood, which is additive over the
observations: this means that if we have a sample, the log-likelihood of the sample is equal to the sum of all
the single log-likelihood values.
Deviance for the Poisson distribution
Find the deviance for the Poisson distribution.
So we compute:

f_\lambda(y_i) = e^{-\lambda_i} \frac{\lambda_i^{y_i}}{y_i!}

\ell = \log L = -\lambda_i + y_i \log(\lambda_i) - \log(y_i!)

So for the current model (with "\hat\phi = 1"):

\ell(\hat\lambda_i, \hat\phi = 1, y_i) = -\hat\lambda_i + y_i \log(\hat\lambda_i) - \log(y_i!)

while for the saturated model "\hat\lambda_i = y_i", so that the deviance is:

D = 2 \sum_{i=1}^{n} \{ y_i \log(y_i / \hat\lambda_i) - (y_i - \hat\lambda_i) \}
Exercise (I)
• The parameter estimates (the "\beta" from the IRLS algorithm above):
       [,1]
  -2.1347085
x  0.1137496
• The corresponding standard errors:
                      x
0.93679732 0.02606235
• The log-likelihood:
> ll=sum(-mu+y*log(mu)-log(factorial(y)))
> print(ll)
[1] -10.59079
• The model deviance [just pay attention to the "0 · log(0)" problem]
rdev=0
for (i in 1:length(y))
{
if(y[i]==0){rdev=rdev-2*(y[i]-mu[i])}
else {rdev=rdev-2*(y[i]-mu[i]-y[i]*log(y[i]/mu[i]))}
}
> print(rdev)
[1] 4.94302
> print(ndev)
[1] 32.85143
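The null deviance "ndev" printed above is the deviance of the intercept-only model, whose fitted value is the sample mean of "y"; a minimal sketch of its computation (analogous to the loop above) could be:
mu0=mean(y)                     #fitted value of the intercept-only (null) model#
ndev=0
for (i in 1:length(y))
{
if(y[i]==0){ndev=ndev-2*(y[i]-mu0)}
else {ndev=ndev-2*(y[i]-mu0-y[i]*log(y[i]/mu0))}
}
print(ndev)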
• The deviance residuals
d.r=rep(NA,length(y))
for (i in 1:length(y))
{
if(y[i]==0)
{d.r[i]=sign(y[i]-mu[i])*sqrt(-2*(y[i]-mu[i]))}
else
{d.r[i]=sign(y[i]-mu[i])*sqrt(-2*(y[i]-mu[i]-y[i]*log(y[i]/mu[i])))}
}
Remark: our worked example has only the purpose of doing computations with a very small data set. The results
here cannot be used for inference, since the inference for GLMs is based on asymptotic distributions, and therefore
it requires at least a moderate sample size.
Exercise (II)
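The same model can be fitted directly with the built-in "glm" function; a minimal call producing the output below (with "y" and "x" as defined above, the object name being an arbitrary choice) is:
mod.bf=glm(y~x,family=poisson(link="log"))
summary(mod.bf)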
Call:
glm(formula = y ~ x, family = poisson(link = "log"))
Deviance Residuals:
1 2 3 4 5 6 7
-1.0845 -1.2291 1.1254 -0.1917 0.6569 -0.6668 0.2769
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.13471 0.93680 -2.279 0.0227 *
x 0.11375 0.02606 4.365 1.27e-05 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Notice that now we don't have the "t value" but the "z value", because in GLMs all the tests are
asymptotic, valid for large samples and based on the normal distribution (we no longer use the t-distribution).
While the computation of the estimates is reliable even for small samples, the "z value", the
"Pr(>|z|)" and the tests based on the "Null deviance" and "Residual deviance" are valid for large
samples only.
13.27 Computational remarks
Recall that, for linear regression, a full-rank design matrix "X" implied that there was exactly one unique solution
"β̂" which minimizes the residual sum of squares. A similar result holds for generalized linear models: if "X" is not
full rank, then there is no unique solution which maximizes the likelihood. However, two additional issues arise in
generalized linear models:
• Although a unique solution exists, the Fisher scoring algorithm is not guaranteed to find it
• It is possible for the unique solution to be infinite, in which case the estimates are not particularly useful and
inference breaks down
14 Logistic and multinomial regression
We now investigate special examples of generalized linear models: the logistic and the multinomial regression.
• The Logistic regression is a regression model where the response variable is a binary variable, i.e. a
categorical variable with two levels "0" and "1".
• The Multinomial regression is a generalization of the logistic regression for categorical response variables
with "k" levels. We now start investigating the simplest
example, which is the logistic regression.
For the purpose of using a Bernoulli distribution as the response variable, we conventionally use the two levels "0"
and "1".
Let us consider a simple logistic regression (meaning that we have just one response variable and one predictor)
with:
• Response variable: "highbld" ("1" if high lead level in the blood, "0" otherwise)
• Predictor: "soil", the lead level in the soil
So we now try to investigate if there’s a connection between the lead level in the blood and in the soil:
plot(highbld~soil,data=soillead)
plot(jitter(highbld,factor=0.1)~soil,data=soillead,ylab="(jittered) highbld")
In the scatter-plot of the data we consider the response variable on the "y" axis and the predictor on the
"x" axis. Since in these kinds of graphs we usually get some overlapping points we also plot a slightly different
representation (the graph on the right side), which is a jittered version of the plot (we add a small amount of
random noise in order to have the points plotted at slightly different levels).
From the plot we notice that when the lead level in the blood is normal (so the response variable is "0")
we mostly have small values of the predictor, whilst when it is high (response "1") the soil values tend to be
larger: the data suggest a sort of positive association.
Of course we can’t apply the standard linear regression because it would define a straight line (for example
for positive association we would get a positive line from "−∞" to "+∞") over the previous plot which makes
no sense. So when a linear regression is applied, several unwanted things happen:
• The assumption on the errors

\varepsilon_i \sim N(0, \sigma^2)

is not reasonable (try to plot the regression line...) because we have different values of the variance
for different values of the mean of the response variable
• Predicted values are not "0" or "1" and can also lie outside the interval "[0, 1]". So if for example we
mistakenly consider the linear regression, the regression line we obtain is:
wmod=lm(highbld~soil,data=soillead)
summary(wmod)
abline(wmod, col="blue")
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3178371 0.0506531 6.275 4.29e-09 ***
soil 0.0002754 0.0000426 6.465 1.65e-09 ***
and for "x = 5000" one obtains "ŷ = 1.6948", which is a value outside our boundaries and so not
reasonable for our case.
14.2 Logistic regression parametrization
So we can use our knowledge about the GLM and apply it with the correct distribution for the response variable.
Here the correct response variables is the Bernoulli distribution (it is a distribution with two possible outcomes
which are "0" and "1"):
Yi ∼ Bern(pi )
where the parameter of the distribution "pi = P(Yi = 1)" is the expected value and also the probability of success
of the i-th trial. In other words:
pi = P(Yi = 1|Xi )
From the plots we see that "pi " depends on the value of "xi ". So we model each response variable with a Bernoulli
distribution of probability "pi ": it is the conditional probability of success for the i-th element of our sample given
the value of the predictor "xi ". Remember as we said that "pi " is the expected value of "Yi ":
pi = E(Yi )
Exploiting our examples about the canonical form of the exponential families, we already have the canonical
link function for the Bernoulli distribution:
g(p_i) = \log \frac{p_i}{1 - p_i}
Thus:
logit(pi ) = ηi = β0 + β1 xi
We now use the "glm" function in Software R in order to compute the estimates of the "β"’s, to make inference.
Lead in blood dataset (II)
We store the results into an object called "mod" and then we can explore the results by using the "summary"
function:
mod=glm(highbld~soil,data=soillead,family=binomial(link="logit"))
summary(mod)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.26400 -0.79968 0.06168 0.85754 1.80346
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5160568 0.3380483 -4.485 7.30e-06 ***
soil 0.0027202 0.0005385 5.051 4.39e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
In this output the "slope" (the coefficient of the predictor "soil") cannot be interpreted as in linear regression, since
the relation between the expected response and the predictor is not linear. Notice that we have no dispersion parameter since in the Bernoulli distribution we know
that "ϕ = 1" in the exponential family representation. We then have the values for the "null deviance"
(the deviance in model with only the intercept) and the "residual deviance" (the deviance in the current
model). Remember that from these two values of deviance we can compute the value of the Likelihood Ratio
Test. The last line is about the convergence of the Fisher scoring: as we said in the previous part there are
cases where the Fisher scoring doesn’t converge even when the "X" matrix is non-singular. So in order to
be sure that the algorithm reached the convergence the output prints the number of iterations.
• Read the parameter estimates, their standard errors, and the corresponding Wald tests (this test considers
one parameter and is used for testing the significance of a predictor)
• Read the value of the dispersion parameter (here equal to "1", since the variance is a function of the mean)
• Deviance to be used in the likelihood ratio test
> 1-pchisq(191.82-139.62,1) #Likelihood Ratio Test using the Deviance#
[1] 5.012657e-13
The likelihood ratio test is a one-tailed (right-tailed) test: we need to compute the p-value, which is the probability
on the right side (we compute "1" minus the cumulative distribution function).
We can perform the same test by defining the empty model (the model where we just have the intercept [we have
no predictors]). In this case we fit the GLM with only the intercept and then we perform the "anova"
test, based on the Chi-squared distribution, between the two models:
mod0=glm(highbld~1,data=soillead,family=binomial(link="logit"))
anova(mod0,mod,test="Chisq") #Likelihood Ratio Test using the empty model and Anova#
Model 1: highbld ~ 1
Model 2: highbld ~ soil
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 138 191.82
2 137 139.62 1 52.201 5.009e-13 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
So we have a significant likelihood ratio test as well as significant values for the Wald test: the model is
significant. The probability of a high lead level in the blood ("highbld" = 1) increases significantly
as the predictor (the lead level in the soil) increases.
We can then also compute the "predicted values", the "deviance" and the "deviance residuals" by running:
yhat=predict(mod,type="response")
dind=soillead$highbld*log(yhat)+(1-soillead$highbld)*log(1-yhat)
dev=-2*sum(dind)
dev
devres=sign(soillead$highbld-yhat)*sqrt(-2*dind)
So we have the predicted expected value for each value of "x",

\hat{p}(x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x}}{1 + e^{\hat\beta_0 + \hat\beta_1 x}}

This equation describes a curve called the "logistic curve", whose plot is:
of course here from the plot we can’t see the values before "0" but we know that this function asymptotically goes
to "0" when approaching "x = −∞". So if we take the estimates "β̂0 " and "β̂1 " in the previous output and we plug
them into the logistic function we obtain this plot which represents the expected value of the "y" as a function of
the values of the predictor.
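A minimal sketch of this plot (overlaying the fitted logistic curve on the jittered scatter-plot) could be:
plot(jitter(highbld,factor=0.1)~soil,data=soillead,ylab="(jittered) highbld")
newdat=data.frame(soil=seq(min(soillead$soil),max(soillead$soil),len=300))
lines(newdat$soil,predict(mod,newdata=newdat,type="response"),col="red",lwd=2)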
From:
p_i = g^{-1}(\eta_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{1}{1 + e^{-\eta_i}}
and "ηi = β0 + β1 xi " we see that:
• When "β̂1 " is positive: as "X" gets larger, "p̂i " goes to "1", as "X" gets smaller, "p̂i " goes to "0"
• When "β̂1 " is negative: as "X" gets larger, "p̂i " goes to "0", as "X" gets smaller, "p̂i " goes to "1"
The intercept "β̂0 " is the estimate of the linear predictor when "xi = 0", and thus:
\hat{p}_i = \frac{e^{\hat\beta_0}}{1 + e^{\hat\beta_0}}
is the estimate of the mean response when the predictor is set to "0". From this we don’t get exactly an interpretation
of the "β̂1 " but of its sign as we obtain two different plots: one for positive and one for negative values of the
parameter.
So we have two different interpretations:
• Positive sign of "β̂1 " means that the probability of success "1" increases when the "X" increases (the case of
the previous graph)
• Negative sign of "β̂1 " means that the probability of success decreases when the "X" increases (the opposite
case: the horizontal symmetric graph)
Odds of an event
The odds of an event "E" is the ratio between the probability of the event and its complement to "1":

\mathrm{odds}(E) = \frac{P(E)}{1 - P(E)}

and the odds ratio of two events is the ratio of their odds:

\mathrm{O.R.}(E_1, E_2) = \frac{\mathrm{odds}(E_1)}{\mathrm{odds}(E_2)}

In logistic regression ("Y_i" has a Bernoulli distribution with parameter "p_i") we write:

\mathrm{odds}(Y_i = 1) = \frac{P(Y_i = 1)}{1 - P(Y_i = 1)} = \frac{P(Y_i = 1)}{P(Y_i = 0)}

and:

\mathrm{O.R.}(Y_i = 1, Y_j = 1) = \frac{\mathrm{odds}(Y_i = 1)}{\mathrm{odds}(Y_j = 1)}
This means that:
• O.R. ≥ 0
• If "O.R. = 1", then "Y_i" and "Y_j" have the same expected value, "p_i = p_j"
• If "O.R. > 1", then the mean of "Y_i" is greater than that of "Y_j", so success is more probable for observation
"i"
• If "O.R. < 1", then the mean of "Y_i" is less than that of "Y_j", so success is less probable for observation "i"
It's important to move from the sign of "\beta_1" to the odds ratio because the sign of "\beta_1" can only be interpreted as
a sign: this means we can't say anything about its absolute value. On the other hand with the odds ratio we can also
give an interpretation of the absolute value of the parameter.
Now we take two events (probabilities of success given a value "X") and we consider a unitary increase of "X":
• p0 = P(Y = 1|X = x)
• p1 = P(Y = 1|X = x + 1)
Since "\mathrm{logit}(p_1) - \mathrm{logit}(p_0) = \beta_1", we can now give an interpretation of "\beta_1" in terms of its absolute value:
• "\beta_1" is the change in the log-odds when the predictor increases by "1" unit (i.e. the log-odds-ratio for a "1" unit difference).
• "e^{\beta_1}" is the odds ratio comparing responses with "1" unit of difference in the predictor (a quick numerical check follows after the output below).
> summary(mod)
[1] Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5160568 0.3380483 -4.485 7.30e-06 ***
soil 0.0027202 0.0005385 5.051 4.39e-07 ***
---
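Using the fitted model "mod" above, a quick numerical check of this interpretation is:
exp(coef(mod)["soil"])          #odds ratio for a 1 unit increase of the soil lead level#
exp(100*coef(mod)["soil"])      #odds ratio for a 100 unit increase#
With the printed estimate "0.0027202" the first value is about "1.0027", so the change in the odds per single unit of soil lead is small and is easier to appreciate over larger increments.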
What is the difference between the probability "P(E)" and the odds-ratio?
In the previous example we have seen that for small values of "P" the probability and the odds are
very similar (since "P/(1 - P) \approx P"): this means that the odds ratio is a good approximation of the ratio of the
probabilities. In most cases of logistic regression we have a small number of successes and a high number of failures,
so the use of the odds ratio is perfectly consistent with the structure of the data.
14.6 The deviance
The deviance for the logistic regression is computed from the individual contributions

d_i = -2 \{ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \}

with the convention "0 \cdot \log(0) = 0".
The data are in the file "islands.txt" and the first rows are here:
islands <- read.delim(". . . /islands.txt")
View(islands)
So first of all we consider the scatter-plot with one predictor at the time (we consider the response variable
on the "y" axis and the predictor on the "x" axis):
plot(jitter(incidence,factor=0.1)~area,data=islands,ylab="(jittered) incidence")
plot(jitter(incidence,factor=0.1)~isolation,data=islands,ylab="(jittered) incidence")
The first graph suggests an "S" shaped logistic curve (the probability of success increases when "x" increases)
[positive association between the "incidence" response variable and the "area" predictor], whilst the second
graph suggests the horizontal-symmetric shape [negative association between the response variable and the
"isolation" predictor].
When using a model with two predictors, we expect a positive coefficient for "area" and a negative coefficient
for "isolation":
mod=glm(incidence~area+isolation,data=islands, family=binomial(link="logit"))
summary(mod)
#single-predictor models (an assumption here: they are needed below for the marginal plots)#
mod1=glm(incidence~area,data=islands,family=binomial(link="logit"))
mod2=glm(incidence~isolation,data=islands,family=binomial(link="logit"))
newdat1=data.frame(area=seq(min(islands$area),max(islands$area),len=300))
hatp=predict(mod1, newdata=newdat1, type="response")
plot(incidence~area,data=islands, col="red")
lines(hatp~area, data=newdat1, col="black", lwd=2)
newdat2=data.frame(isolation=seq(min(islands$isolation),max(islands$isolation),len=300))
hatp=predict(mod2, newdata=newdat2, type="response")
plot(incidence~isolation,data=islands, col="red")
lines(hatp~isolation, data=newdat2, col="black", lwd=2)
14.8 The output
Call:
glm(formula = incidence ~ area + isolation, family = binomial(link = "logit"), data = islands)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8189 -0.3089 0.0490 0.3635 2.1192
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.6417 2.9218 2.273 0.02302 *
area 0.5807 0.2478 2.344 0.01909 *
isolation -1.3719 0.4769 -2.877 0.00401 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
So here we find all the coefficients of the predictors and their corresponding Wald test (test for the significance of
the single coefficient). On the last part of the output we find the deviances which are used in the LR test for the
global significance of the model. Notice that here we have different degrees of freedom: the difference is "2" because
we have "2" predictors (in addition to the intercept) in our complete model. We can perform the LR test in two different ways:
either from the difference of the deviances with "pchisq", or with the "anova" function and "test="Chisq"", as before.
The LR test yields a p-value of "2.484e − 09", so the model is globally significant. In general when several predictors
are involved we can use the LR test for any reduced model (not only the empty model). Both predictors are
significant (at the 5% level); the use of the Wald test remains unchanged.
The interpretation of the log-odds-ratio must be taken one predictor at a time, the others remaining
fixed (it's similar to the mechanism we use for partial derivatives). For instance:

e^{\hat\beta_{\mathrm{isolation}}} = e^{-1.3719} = 0.2536

means that a "1 km" increase in isolation is associated with a decrease in the odds of seeing a bird by a factor of
"0.2536".
14.9 Model with interaction
We can add an interaction parameter by using the multiplicative notation "∗" in the Software R formula:
modwi=glm(incidence~area*isolation,data=islands,family=binomial(link="logit"))
summary(modwi)
so for example in the previous script we get a first parameter for "area", a second for "isolation" and
a third parameter for the interaction between the first two. This means that in our output we will have one
more line:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.0313 7.1747 0.562 0.574
area 1.3807 2.1373 0.646 0.518
isolation -0.9422 1.1689 -0.806 0.420
area:isolation -0.1291 0.3389 -0.381 0.703
Notice that, since we added the interaction between the predictors, we observe a strange behavior: the introduction of
the interaction led to a completely non-significant model. The interaction and also the predictors (which were
significant before) are now non-significant [this clearly suggests removing the interaction from the model].
This is immediate from the Wald test (last line), but we can also try the LR test (see the sketch below).
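A minimal sketch of this LR test, comparing the models with and without the interaction, could be:
anova(mod,modwi,test="Chisq")    #LR test: no interaction vs interaction#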
14.10 Multinomial logistic regression
When the response variable is categorical with three or more categories (levels) we need to use a multinomial
logistic regression model. We have two types of multinomial response: with ordered categories and with unordered (nominal) categories.
The difference between these two is that in ordered random variables the possible outcomes have a "natural order":
this means that we can sort them between two extremes. On the contrary, unordered random variables can't be
sorted.
In this class we consider the case of unordered categories and we’ll give some insight on how to deal with ordered
categories.
The British Election Study dataset is a cross-sectional study on the 2019 elections freely available upon
registration at:
https://www.britishelectionstudy.com/
The complete set of answers is in the file "bes_rps_2019_1.1.1.dta", a foreign data format from STATA (use
the R-Studio import menu to import the data). In the dataset there are several possible predictors but here
we just consider a simple regression:
• Response variable: "b01 " the party voted in the 2019 election
• Predictor: "Age" the age (here negative values are missing data)
The response variable "b01 " has indeed several levels (it’s pretty complicated):
Labels:
value label
-999 Not stated
-2 Prefer not to say/Refuse
-1 Don’t know
1 Labour Party
2 Conservative Party
3 Liberal Democrats
4 Scottish National Party
5 Plaid Cymru
6 Green Party
7 United Kingdom Independence Party (UKIP)
8 Brexit Party
9 Other
10 An independent candidate
11 Specified name- no party mentioned
12 Spoilt ballot paper
13 None
If we look at the data and consider the frequencies of the outcomes we notice that the greatest part of the
sample is concentrated in only 3 parties. So in order to simplify this example a bit (reduce its complexity)
we restrict it to the main categories "1 - Labour Party", "2 - Conservative Party" and "3 - Liberal
Democrats". After removing the missing values we have a dataset with 2543 voters. We now want to study
the effect of the "Age" on the voting preferences.
We now have to incorporate a dataset of this kind into the logistic regression.
14.11 The multinomial model
Let us consider a response variable "Yi " for the i-th statistical unit with "k" categories and probabilities:
pi,1 , . . . , pi,k
We choose a reference category and then we define a number of models (2 in our case) to compare each category to
the baseline category. So to extend the logistic regression model to the multinomial case, we need to fix a reference
or baseline category. By default we choose the last category "k" as the "reference category" and we write "k − 1"
logistic regressions:
\log \frac{p_{i,1}}{p_{i,k}} = \beta_{0,1} + \beta_{1,1} x_{1,i} + \beta_{2,1} x_{2,i} + \ldots

\log \frac{p_{i,2}}{p_{i,k}} = \beta_{0,2} + \beta_{1,2} x_{1,i} + \beta_{2,2} x_{2,i} + \ldots

\vdots

\log \frac{p_{i,k-1}}{p_{i,k}} = \beta_{0,k-1} + \beta_{1,k-1} x_{1,i} + \beta_{2,k-1} x_{2,i} + \ldots
When "k ≥ 3" we have that the denominator is not the complement of the numerator. We also obtain different
coefficients for each model: this means that we obtain different "β0 " for each model we have.
The interpretation of the coefficients is given by the log-odds-ratios:
• "β1,1 " is the change in the log-odds of category "1" as opposed to (vs) category "k" for a "1" unit increase of
the predictor "x1 "
• ...
• "β1,k−1 " the change in the log-odds of category "k − 1" as opposed to category "k" for a "1" unit increase of
the predictor "x1 "
• ...
British Election Study (II)
We use the "multinom" function in the "nnet" Software R package. First, we need to choose the level of
our outcome that we wish to use as our baseline and specify this in the "relevel" function. Then, we run our
model using "multinom". The "multinom" function does not include p-value calculation for the regression
coefficients, so we calculate p-values using Wald tests (here z-tests).
Since we have 3 levels, we have to define 2 logistic regression equations: from the first one
we obtain the probability of the first level (here the "Labour Party"), from the second one the probability
of the second level (the "Conservative Party"), and the probability of the last level (the baseline "Liberal Democrats") is obtained by difference.
bes2=bes_rps_2019_1_1_1[(bes_rps_2019_1_1_1$b02>=1)&(bes_rps_2019_1_1_1$b02<=3)
&(is.na(bes_rps_2019_1_1_1$b02)==F),]
table(bes2$b02)
bes3=bes2[bes2$Age>0,]
table(bes3$b02)
bes3$b02= factor(as.character(bes3$b02))
bes3$b02= relevel(bes3$b02, ref = 3)
mod= multinom(b02~Age, data=bes3)
summary(mod)
Call:
multinom(formula = b02 ~ Age, data = bes3)
Coefficients:
(Intercept) Age
1 1.9009766 -0.02082697 #
2 0.3198647 0.01677515 ##
Std. Errors:
(Intercept) Age
1 0.1978886 0.003611568 #
2 0.1976460 0.003425222 ##
187
For the Wald tests we have:
z = summary(mod)$coefficients/summary(mod)$standard.errors
p = (1 - pnorm(abs(z), 0, 1)) * 2
> p #p-values#
(Intercept) Age
1 0.0000000 8.081969e-09
2 0.1055825 9.704498e-07
Remember that the test statistic for the Wald test is very simple as we just need to divide the estimate by
the standard error: this means that we can take all the values from the first output. If we perform this we
obtain 4 values for the test statistic and 4 p-values for "β0,Labour ", "β1,Labour ", "β0,Cons " and "β1,Cons ". From
this we can see for example that the predictor "Age" is significant for both the log-odds-ratio equations.
It’s important to underline that we introduced the Wald test for the logistic regression in order to check the
significance of the predictors. Now if we decide that "Age" is significant (or non-significant) we must keep
(or remove) it in both equations simultaneously. This means that the individual p-values are less important,
because the predictor "Age" as a whole is either kept or removed. Problems arise when a predictor is
significant in only one of the equations and we don't know whether to keep it or remove it. In that case we
change strategy and use the likelihood-ratio test.
So we consider the "residual deviance" and then we define the empty-model as the model where we remove
the "age" predictor. We compute the residual deviance of the null-model and then perform the likelihood-
ratio test. The degrees of freedom in the difference between the number of parameters so in this case it is
equal to "4 − 2 = 2".
Here are the graphs of the log-odds of "Labour" and "Conservative" versus the baseline "Lib-Dem", as a function of age (red for "Labour", blue for "Conservative"):
age.d=seq(18,100,0.1)
beta0.1=summary(mod)$coefficients[1,1]
beta1.1=summary(mod)$coefficients[1,2]
beta0.2=summary(mod)$coefficients[2,1]
beta1.2=summary(mod)$coefficients[2,2]
plot(age.d,(beta0.1+beta1.1*age.d),type="l",col="red",ylim=c(-0.5,2.1),ylab="log-odds")
lines(age.d,(beta0.2+beta1.2*age.d),col="blue")
188
And here are the graphs of the predicted probabilities as a function of age (red for "Labour", blue for
"Conservative", yellow for "Lib-Dem").
p1=exp(beta0.1+beta1.1*age.d)/(1+exp(beta0.1+beta1.1*age.d)+exp(beta0.2+beta1.2*age.d))
p2=exp(beta0.2+beta1.2*age.d)/(1+exp(beta0.1+beta1.1*age.d)+exp(beta0.2+beta1.2*age.d))
p3=1-p1-p2
plot(age.d,p1,type="l",col="red",ylim=c(0,1),ylab="pred prob")
lines(age.d,p2,col="blue")
lines(age.d,p3,col="darkgoldenrod1")
Here the predicted probability of the "Conservative" vote [blue curve] as a function of age is
p̂Cons = e^(β̂0,Cons + β̂1,Cons · age) / (1 + e^(β̂0,Lab + β̂1,Lab · age) + e^(β̂0,Cons + β̂1,Cons · age)),
the analogous expression holds for "Labour" [red curve], and the yellow line is obtained by difference:
p̂LibDem = 1 − p̂Labour − p̂Cons
189
14.12 Ordered multinomial model
We now investigate how to analyze multinomial regression models when the underlying distribution has ordered
categories: the idea here is to use the cumulative probabilities instead of simple probabilities.
When there is an underlying ordering to the categories, a convenient parameterization is to work with cumulative
probabilities. For instance, suppose that the response variable has "k = 4" levels "1, 2, 3, 4", ordered from
lowest to highest. For the i-th individual we define the cumulative probabilities
γi,j = P(Yi ≤ j) = pi,1 + · · · + pi,j ,     j = 1, . . . , k.
As in the previous example, for "k" categories only "k − 1" equations are needed, because the last cumulative
probability is always "γi,k = 1". We then write a model equation for each cumulative probability:
log( γi,1 / (1 − γi,1) ) = β0,1 + hi
. . .
log( γi,k−1 / (1 − γi,k−1) ) = β0,k−1 + hi
where "β0,k " is the intercept of each equation and "hi " is a common linear predictor:
hi = β1 xi,1 + · · · + βp xi,p
So this model sits somewhere between the logistic regression and the regression for quantitative responses (for
example the Gaussian): as in the latter, there is only one coefficient for each predictor. This means, for example, that
the same "β1" coefficient appears in all the equations, so we can perform the Wald test for that single parameter.
Common linear predictor means that the log-odds-ratios for threshold category membership are independent
of the predictors. For instance the log-odds-ratio for categories "1" and "2":
(β0,2 + hi) − (β0,1 + hi) = log( (odds of Yi ≤ 2) / (odds of Yi ≤ 1) ) = β0,2 − β0,1
is constant for all values of the predictors, and similarly for the other log-odds-ratios. This is called the proportional
odds assumption. We could relax it by allowing different slopes in the defining equations, but the
model would become less parsimonious and harder to apply.
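The notes do not show code for this model; a common implementation is the "polr" function in the "MASS" package. Below is a minimal sketch on simulated data (the data and the object names are made up for illustration; note that "polr" parameterizes the model as logit P(Y ≤ j) = ζj − ηi, so the signs of the slopes are reversed with respect to the equations above):
library(MASS) #for polr#
set.seed(1)
dat=data.frame(x1=rnorm(200)) #simulated data, for illustration only#
dat$y=cut(dat$x1+rnorm(200),breaks=c(-Inf,-1,0,1,Inf),labels=c("1","2","3","4"),ordered_result=TRUE)
mod.po=polr(y~x1,data=dat,Hess=TRUE) #proportional-odds model#
summary(mod.po) #one slope for x1 and k-1=3 intercepts (thresholds)#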
190
15 Lab 6
15.1 Logistic Regression
15.1.1 Exercise 1 - Credit card default
The data to be analyzed are in the file "default.txt" in the data folder on AulaWeb and the description of the
data is in the file "default_info.txt" in the same folder. The data file contains "10′000" observations of credit card
holders with 4 variables: "default", "student", "balance", and "income".
1. Load the data in Software R. First, give a univariate analysis of all the variables.
First of all as always we load the data into the software:
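The import code is not reproduced in the notes; one possibility (the function and its arguments are an assumption, and the path is left truncated as in the rest of the notes) is:
default=read.table(".../default.txt",header=TRUE) #or use the R-Studio import menu#
str(default) #check the 4 variables: default, student, balance, income#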
Here the response variable is "default", described by 3 predictors: "student" (a categorical variable), "balance",
and "income". The main thing to notice (we will deal with it later in the exercise) is that the response
variable is a character variable with two possible outcomes, "Yes" and "No": this is a "problem" because the "glm"
function needs a "0/1" response. This is not an issue on the predictor side: the linear predictor can handle
categorical variables, since Software R automatically creates the dummy variables for the "k − 1" levels.
We can then analyze our variables by running the following code:
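The analysis code itself is not printed in the notes; a possible sketch is:
table(default$default) #counts of "No"/"Yes"#
table(default$student)
summary(default$balance); hist(default$balance)
summary(default$income); hist(default$income)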
2. Fit a logistic regression model to classify the response variable default with respect to the explanatory variables
"student", "balance", and "income" (re-code the variables if needed). Evaluate the significance of the model
and of the regressors, and define an appropriate reduced model in the case of non-significant regressors.
Here we have a possible solution for the re-definition of the response variable:
default$def=rep(0,length(default$default))
for (i in 1:length(default$default))
{if (default$default[i]=="Yes"){default$def[i]=1}}
So we define a vector of "0"s and then we check whether "default" is equal to "Yes": if so, we set the value
to "1", otherwise it stays "0". Remember that there are also functions and packages which allow us to
191
re-code categorical variables automatically. The only important point is that the new variable must be
included in the data frame, so that we can use it with the "glm" function.
We can now define the GLM to explain the response variable: the response has been re-coded with the
values "0" and "1", since the "glm" function needs a response of this kind (it cannot handle character
variables). So the generalized linear model is:
mod=glm(def~student+balance+income,data=default,family=binomial(link="logit"))
summary(mod)
Call:
glm(formula = def ~ student + balance + income, family = binomial(link = "logit"),
data = default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The first important part of the output is the table of coefficients, with one line for each predictor
(each explanatory variable). Since "student" is a categorical variable, a dummy variable has been created
(for the other two quantitative predictors it's just the standard output). From the sign of the estimates we can see
the "direction of the dependence" of the response variable "default" on the predictors: the response is
positively associated with "income" and "balance", and negatively associated with "student". In the last
column we find the p-values of the Wald tests: we skip the intercept since it's not interesting (we keep it even
when it's not significant) and we just consider the 3 predictors. From the p-values we notice that they are
all significant except for "income" (this means that if we want to define a reduced model we will have to
discard this predictor).
The second part of the output concerns the deviances: these values allow us to perform the likelihood-ratio test.
This test is not displayed automatically, so we compute it by taking the difference between the
"null deviance" and the "residual deviance" and comparing it with a chi-squared distribution (in this case with 3
degrees of freedom):
p=1-pchisq(2920.6-1571.5,3)
The p-value we obtain is essentially "0", which means that the model is highly significant (well below the
threshold of "5%"). Of course the displayed "0" is an approximation (the p-value is simply smaller than the machine
precision), but we can conclude that the full model is a large improvement over the null model.
We now define the reduced model as the model where we discard the non-significant "income" predictor:
192
mod.red=glm(def~student+balance,data=default,family=binomial(link="logit"))
summary(mod.red)
Call:
glm(formula = def ~ student + balance, family = binomial(link = "logit"),
data = default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4578 -0.1422 -0.0559 -0.0203 3.7435
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here, as expected, we obtain essentially the same value of the "residual deviance" as in the complete model:
removing the non-significant "income" predictor costs almost nothing in terms of fit.
3. To validate the model, compute the predicted classification, classifying "1" when the probability is greater
than "0.5" and "0" otherwise. Compute in particular the percentages of false-positives and false-negatives
observed. Try with different cutoffs, for example "0.2" and "0.8".
This point is about the validation of the model. In linear regression we perform the tests and
then check the R-squared (which is, in some sense, the squared correlation between the predicted values and
the observed values). Here we can't do that, since the response variable has only the two possible values
"0" and "1", while the prediction is the estimated mean of the response (so it is a probability).
The idea is therefore to find a way to check whether the model correctly predicts the outcomes of the
response variable. To do this we dichotomize the probabilities: we set a threshold and, if the predicted
probability exceeds it, we classify the observation as "1", otherwise as "0".
193
So, to recap: the observed response variable takes only the values "0" and "1", while the fitted values do not,
as they lie on a curve of this kind (see the graph). We therefore have to define a threshold to discriminate
the values.
We can, for example, define a cut-off at "0.2", "0.5", "0.8" (and so on) and check
the results. For instance, to set a cut-off of "0.5" (the simplest threshold) we write:
y.pred=(mod$fitted.values>0.5)
table(y.pred,default$default)
What we obtain is a logical vector, where "TRUE" corresponds to predicting "Yes" and "FALSE" to predicting "No". We then cross-classify the predicted values with the observed values (this is done by the "table" function):
y.pred No Yes
FALSE 9627 228
TRUE 40 105
From this output we can compute the "false negative rate" (the proportion of observations predicted "FALSE"
that are actually "Yes") and the "false positive rate" (the proportion of observations predicted "TRUE" that
are actually "No"):
t=table(y.pred,default$default)
false.neg=t[1,2]/rowSums(t)[1] #228/(228+9627)#
false.pos=t[2,1]/rowSums(t)[2] #40/(40+105)#
false.neg
false.pos
We can also change the cut-off value to investigate whether we can obtain better performance, in the sense of
reducing the proportion of "false positives" (the larger of the two proportions in this case). Remember that when we
move the threshold we improve one of the two rates (for example we reduce the false positive rate) but we
worsen the other (we increase the false negative rate).
194
The output we obtain is:
> false.neg
FALSE
0.02313546
> false.pos
TRUE
0.2758621
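The notes do not show the code for the other cutoffs suggested by the exercise; a possible sketch (the loop is our addition, and the factor with fixed levels keeps the table 2 × 2 even if one class is never predicted) is:
for (cut in c(0.2,0.5,0.8))
{
y.pred=factor(mod$fitted.values>cut,levels=c(FALSE,TRUE))
t=table(y.pred,default$default)
cat("cutoff:",cut," false.neg:",t[1,2]/sum(t[1,])," false.pos:",t[2,1]/sum(t[2,]),"\n")
}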
4. This item is an introduction to cross-validation (we will develop this topic in the last section). Split the
dataset into two datasets (with "5′ 000" observations each) and repeat item 2 using the first dataset and
item 3 using the second one.
Cross-validation is a set of procedures used to obtain a better evaluation of a statistical model. The basic
method is to split the dataset into two parts:
• The first part is used to estimate the parameters
• The second part is used to check the goodness of fit of the model (!NOT ON THE SAME DATA!)
So the idea is to repeat the previous point but on different datasets. We then split our "10′ 000" observations
into two subsets (in this case we chose the 50/50 proportion): with the first set we estimate the parameters
and with the second one we compute the false positive and false negative rate. This means that the predicted
values are computed on the second subset using the coefficients estimated on the first one. So,
first of all, we randomly sample "5′000" rows from the dataset and then split it into
two subsets:
selection=sample(1:10000,5000,F)
training.set=default[selection,]
test.set=default[-selection,]
Now we use the first set to compute the coefficients of the model and then we use the second one to perform
the tests (table of cross-classification). Remember that every run gives slightly different results!
Here we compute the parameters on the training set:
mod.tr=glm(def~student+income+balance,data=training.set,family=binomial(link="logit"))
summary(mod.tr)
and we obtain:
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.013e+01 6.568e-01 -15.429 <2e-16 ***
studentYes -6.001e-01 3.175e-01 -1.890 0.0587 .
income -7.254e-06 1.147e-05 -0.633 0.5270
balance 5.503e-03 3.117e-04 17.653 <2e-16 ***
. . .
195
We now use this set of coefficients to define the predicted values on the test subset:
coef=mod.tr$coefficients
test.set$prob=rep(NA,5000) #initialize the probability column (the original notes initialize "pred" here)#
for (i in 1:5000)
{
lin.pred=coef[1]+coef[2]*(test.set$student[i]=="Yes")+coef[3]*test.set$income[i]+
coef[4]*test.set$balance[i] #the "+" must end the line so that Software R continues the expression#
test.set$prob[i]=exp(lin.pred)/(1+exp(lin.pred)) #inverse of the logit link#
}
test.set$pred=(test.set$prob>0.5)
table(test.set$pred,test.set$default)
The only problem from the computational point of view is the fact that we have a categorical predictor. If we
only have quantitative predictors the machinery is easy since we just have to define the linear function and
then change it by using the inverse of the link-function. This isn’t the case since "student" is a categorical
predictor and we can’t perform the multiplication. We then have to define two different functions in order to
incorporate the predictor.
By using the coefficients we found we can then compute the linear predictors for each line of our test-set.
From this we compute the inverse of the link function and then the estimate of the probability. We find the
fitted value, we define a threshold and then compute the prediction and the cross-classification.
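Note that, as an alternative to the explicit loop, the same probabilities can be obtained directly with the "predict" function (a shortcut, not used in the notes):
test.set$prob=predict(mod.tr,newdata=test.set,type="response")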
196
16 Regression for count data
16.1 Introduction
In the previous chapter the response variable was a finite random variable (first binary [logistic regression], then
multinomial [multinomial logistic regression]). We now move
to the framework of counts: the response variable is a non-negative integer without an upper bound.
When the response variable is a count, the first option is to model it with a Poisson distribution (we already met
this when we dealt with Fisher scoring).
Remember
• The Poisson distribution belongs to the exponential family. Let Y ∼ P(λ).
• The special property of the Poisson distribution is that the variance is equal to the mean:
E(Y) = λ ,     VAR(Y) = λ
This is a good property but also a limitation since we can only consider counts for which the variance
must be at least similar to the expected value (mean). This means that when the difference between
these two values becomes relevant we are forced to consider another distribution.
Given a sample "(yi , xi )" with "i = 1, . . . , n", the model (the link function of the expected value):
is equal to the linear predictor. This function is defined as Poisson regression model.
Some remarks:
• This function is simpler than the logit function of the logistic regression: instead of a "logit function" we have
now the "log" and still we have
λi > 0 =⇒ −∞ < log(λi ) < +∞
thus the linear predictor (the "log(λi )") can range over the whole real numbers.
• In Poisson regression we assume that "log(λi )" is linearly related to the predictors.
197
• The Poisson distribution is heteroskedastic by definition, because the variance is equal to the expected value:
this means that if the expected value increases, so does the variance.
Let us consider a simple Poisson regression to model the number of matings of elephants as a function
of their age.
• Response variable: "Number_of_Matings"
• Predictor: "Age_in_Years"
plot(Number_of_Matings~Age_in_Years,data=elephant)
From the plot we already observe that the number of matings (the count response variable) seems to
increase with the age. Notice also that the variance seems to increase with the expected value: again a
sort of megaphone effect. This means that a Poisson distribution may be a good representation (a good
probability model) for this dataset:
• Elephants around 30 years old have between 0 and 3 matings
198
16.3 Poisson regression parametrization
Let us consider the response variables as:
Yi ∼ P(λi )
where "λi = E(Yi )" is the mean response for the i-th trial (here we are using the Poisson distribution inside the
GLM framework).
GLM model
g(λi ) = log(λi ) = β0 + β1 xi
mod=glm(Number_of_Matings~Age_in_Years,data=elephant, family=poisson(link="log"))
summary(mod)
Call:
glm(formula = Number_of_Matings ~ Age_in_Years, family = poisson(link = "log"),
data = elephant)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.80798 -0.86137 -0.08629 0.60087 2.17777
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201 0.54462 -2.905 0.00368 **
Age_in_Years 0.06869 0.01375 4.997 5.81e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
(Dispersion parameter for poisson family taken to be 1)
As usual we have the Wald tests: in this case we have one test for the intercept (usually it is not relevant
and we skip the analysis of this since we never remove the intercept from the model even when it is non-
significant) and one for the predictor (which is significant).
From the table of the deviances we can also compute the likelihood-ratio test. Since the dispersion parameter
is "ϕ = 1", it's enough to compute the difference between the deviances (null minus residual) and compare it
with the chi-square distribution with one degree of freedom (the difference in the number of parameters between
the complete and the null model).
199
From the output we have that:
• The dispersion parameter is not estimated, since Poisson doesn’t have an unknown dispersion parameter
(the dispersion parameter is equal to "1" since the variance is equal to the expected value)
• The likelihood ratio test gives:
> 1-pchisq(75.372-51.012,1)
[1] 7.991084e-07
• The Wald test here is equivalent to the likelihood ratio test, because we work with one predictor. From
this we have that "Age" is a significant predictor of the number of mates: we confirm the positive
association.
ŷ = e^(β̂0 + β̂1 x)
For instance, if "x = 30" we have:
> exp(-1.58201+0.06869*30)
[1] 1.613959
This means that for an age of "30" the expected number of matings is "1.61", and the variance is also "1.61". The plot
of the fitted curve (an increasing exponential, since "β̂1 > 0") is:
newdat=data.frame(Age_in_Years=seq(min(elephant$Age_in_Years),max(elephant$Age_in_Years),
len=300))
hatlambda=predict(mod, newdata=newdat, type="response")
plot(Number_of_Matings~Age_in_Years,data=elephant, col="red")
lines(hatlambda~Age_in_Years, data=newdat, col="black", lwd=2)
200
16.5 Interpretation of the parameters
In the logistic regression framework we have seen that the interpretation of the parameters is given by means of
the odds-ratios.
Here we are in a simpler case, as we don't have to compute odds-ratios: we just consider the expected
value "λ" for a given level of the predictor and the expected value for the same predictor increased by one unit. We
then take the ratio between these two values (notice that the predicted values are the exponential
of the linear predictor):
λ(x + 1) / λ(x) = e^(β0 + β1(x+1)) / e^(β0 + β1 x) = ( e^β0 · e^(β1 x) · e^β1 ) / ( e^β0 · e^(β1 x) ) = e^β1
This means that when we consider the expected value for a predictor increase by a unit we have:
λ(x + 1) = λ(x)eβ1
So the difference between the linear regression and the Poisson regression is that in linear regression the
slope "β1 " represents a linear increase of the response variable "y" given a unit increase of the "x". On the contrary
in the Poisson case we have a multiplicative effect: when we increase the predictor by a unit the response variable
is multiplied by a factor equal to "eβ1 ". From this we notice again that if "β1 > 0" then the factor "eβ1 > 1" and
then we have a positive association between the "x" and the "y". Otherwise if "β1 < 0" the factor "0 < eβ1 < 1" and
so we have an inverse association.
An increase of the "X" by "1" unit has a multiplicative effect on the mean by "eβ1 ".
• If "β1 > 0", the expected value of "Y " increases with "X"
• If "β1 < 0" the expected value of "Y " decreases as "X" increases
The "eβ0 " is the mean of "Y " when "X = 0". For the "elephant" data we have (we look again at the table of the
coefficients):
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201 0.54462 -2.905 0.00368 **
Age_in_Years 0.06869 0.01375 4.997 5.81e-07 ***
• "β1 > 0" so the number of mates increases with the age (from the Wald test this increase is significant). More
precisely, the increase of "1" year in age yields an increase of the number of mates by a multiplicative factor
of "e0.06869 "
• The "β0 " is not meaningful in this example since the age "0" makes no sense.
201
We now consider an example with several predictors:
The data is in the file "Video_Games.csv": it is a quite big dataset and it also contains several other
variables. Note that the "User_Score" has a lot of missing values.
Remark: first of all we have to visualize the data and perform a data-cleaning. For instance take:
library(readr)
Video_Games <- read.csv(". . ./Video_Games.csv")
View(Video_Games)
sort(table(Video_Games$Publisher))
There are "581" different publishers (a categorical predictor): this means that we have to define "k − 1"
dummy variables. This is not actually difficult to implement but it’s quite difficult to interpret as the
number is too high. In order to overcome this issue we then consider only the 9 most represented (with at
least 3% of the games).
Video_Games_red=Video_Games[(Video_Games$Publisher=="Sega" |
Video_Games$Publisher=="Sony Computer Entertainment" |
Video_Games$Publisher=="Nintendo" |
Video_Games$Publisher=="THQ" |
Video_Games$Publisher=="Konami Digital Entertainment" |
Video_Games$Publisher=="Ubisoft" |
Video_Games$Publisher=="Namco Bandai Games" |
Video_Games$Publisher=="Activision" |
Video_Games$Publisher=="Electronic Arts"),]
table(Video_Games_red$Publisher)
barplot(table(Video_Games_red$Publisher),las=2)
202
Moreover we make "integer" the response variable "sales" (we multiply for "1000"):
Video_Games_red$NA_Sales=1000*Video_Games_red$NA_Sales
Video_Games_red$Global_Sales=1000*Video_Games_red$Global_Sales
Now that we have a "clean" dataset we consider the GLM function define the model. First We consider the
Poisson regression with two predictors "Genre" and "Publisher":
mod=glm(Global_Sales~Genre+Publisher,data=Video_Games_red,family=poisson(link="log"))
summary(mod)
Call:
glm(formula = Global_Sales ~ Genre + Publisher, family = poisson(link = log),
data = Video_Games_red)
Deviance Residuals:
Min 1Q Median 3Q Max
-90.65 -23.09 -13.05 2.04 633.99
Coefficients:
Estimate Std. Error z value
(Intercept) 6.386848 0.001482 4309.64
GenreAdventure -0.562623 0.003060 -183.88
GenreFighting 0.421894 0.002232 188.98
GenreMisc -0.040151 0.001754 -22.89
GenrePlatform 0.472822 0.001653 285.97
GenrePuzzle -0.291474 0.002796 -104.25
GenreRacing 0.327363 0.001790 182.85
GenreRole-Playing 0.232490 0.001812 128.33
GenreShooter 0.657660 0.001615 407.21
GenreSimulation 0.090501 0.002180 41.52
GenreSports 0.225532 0.001512 149.21
GenreStrategy -0.504285 0.003259 -154.75
PublisherElectronic Arts 0.088556 0.001552 57.05
PublisherKonami Digital Entertainment -0.703406 0.002247 -313.02
PublisherNamco Bandai Games -0.907799 0.002356 -385.28
PublisherNintendo 1.306718 0.001461 894.46
PublisherSega -0.513940 0.002271 -226.30
PublisherSony Computer Entertainment 0.205951 0.001760 117.03
PublisherTHQ -0.398753 0.002113 -188.72
PublisherUbisoft -0.309771 0.001887 -164.17
Since both predictors are categorical, we have a lot of parameters in the output (for each predictor we
have to define dummy variables).
Notice that we did not report the p-values in the output: when dealing with categorical predictors, each
predictor gives rise to a whole set of parameters, so it's not interesting to assess the significance of each
single parameter. Remember that we perform the tests in order to obtain a reduced model
(to eliminate predictors from the complete model).
203
In this particular case we can’t remove a single predictors (for example "PublisherElectronic Arts") and keep
maintaining all the other predictors of the same category ("publisher"): if we want to remove a predictor we
have to remove the entire category (in this case we have two "blocks").
And so from the output we can see that the predictor "publisher" is highly significant (the p-value is less
than the machine precision):
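The code for this block-wise likelihood-ratio test is not shown in the notes; a possible sketch (the object name "mod.genre" is our own) is:
mod.genre=glm(Global_Sales~Genre,data=Video_Games_red,family=poisson(link="log")) #model without the Publisher block#
anova(mod.genre,mod,test="Chisq") #LRT for the whole "Publisher" block#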
If for example we want to investigate a model with the interaction between the predictors we just have to
replace the "+" with a "∗" inside the script:
mod.int=glm(Global_Sales~Genre*Publisher,data=Video_Games_red,family=poisson(link="log"))
summary(mod.int)
Here we don’t print the output since is quite long. If we want to check the significance of the interaction
we again use the "anova" function using this last model as the complete model and the model without
interaction as the reduced model:
anova(mod,mod.int,test="Chisq")
From the output we see that the interaction is highly significant but it’s not easy at all to give an interpre-
tation of the results as we need to check every possible level.
As an exercise, perform the other tests. What are the equations for this model, i.e. how do we use the coefficients to
obtain predicted values? What happens if you add the interaction term(s)?
Repeat the analysis using also the "User_Score" predictor (be careful, as there are several missing
values).
204
16.6 Poisson regression for rates: the offset
In some applications the count response variable is the numerator of a rate. In such cases what matters is not
the absolute count but the count relative to some exposure quantity (which is different for each
observation).
Remark: remember that the Poisson distribution is a good approximation of the binomial distribution when "n"
is large and "p" is small.
In such a case the response "Y" has a Poisson distribution, but we want to model the rate
Yi / ti
where "ti" is an exposure measuring time or space. If we want to model the expected value of the rate "Y/t" by means of a
set of predictors "X1, . . . , Xp", the link function must be the logarithm of the expected value of the rate. The
Poisson GLM regression model for the expected rate of occurrence of the event is:
log( µi / ti ) = ηi = β0 + β1 xi,1 + · · · + βp xi,p
Equivalently, log(µi) = log(ti) + ηi, so the term "log(ti)" enters the linear predictor as an offset, i.e. a known term with coefficient fixed to 1.
COVID-19 mortality
We want to model the COVID-19 mortality rate in various European regions as a function of the air pollution.
All the relevant data are in the Excel file "COVID19_Europeandata_new.xlsx". The columns we use
here are:
• "COVID19_D" (number of deaths): this isn’t our real response variable as the number of deaths de-
pends on the number of inhabitant of a particular region (the correct response variable is the proportion
between the number of deaths and the population).
• "NO2_avg" (average NO2 concentration, as an indicator of the air pollution).
library(readxl)
COVID19_European_data_new<-read_excel(".../COVID19_European data_new.xlsx",sheet="Data")
View(COVID19_European_data_new)
After removing rows with missing values of the response variable, we perform Poisson regression
with the offset:
dat2=COVID19_European_data_new[is.na(COVID19_European_data_new$COVID19_D)==F,]
dim(dat2)
205
So, in order to obtain the model with the offset, we just need to add a further argument to the "glm"
function:
mod_wo=glm(COVID19_D~NO2_avg,data=dat2,family = poisson(link ="log"),
offset=log(POPULATION))
summary(mod_wo)
Call:
glm(formula = COVID19_D ~ NO2_avg, family = poisson(link = "log"), data = dat2,
offset = log(POPULATION))
Deviance Residuals:
Min 1Q Median 3Q Max
-38.602 -8.289 -4.897 1.506 111.521
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.167130 0.006608 -1235.98 <2e-16 ***
NO2_avg 0.051509 0.001457 35.34 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
From the table of coefficients we see that the "pollution" is significant: this means that we have a significant
(and positive) effect of the air pollution over the Covid-19 mortality.
Note that using a linear model on the proportion (so we don’t consider the offset) gives a worse model:
mod_an=lm((COVID19_D/POPULATION)~NO2_avg,data=dat2)
summary(mod_an)
Call:
lm(formula = (COVID19_D/POPULATION) ~ NO2_avg, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-0.0003046 -0.0002246 -0.0001161 0.0001264 0.0042280
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.248e-04 2.575e-05 8.730 <2e-16 ***
NO2_avg 1.407e-05 5.940e-06 2.368 0.0181 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.000331 on 816 degrees of freedom
(222 observations deleted due to missingness)
Multiple R-squared: 0.006827,Adjusted R-squared: 0.00561
F-statistic: 5.609 on 1 and 816 DF, p-value: 0.0181
206
16.7 Overdispersion
A strong limitation of the Poisson distribution is that the (conditional) variance is equal to the (conditional) mean.
When the variance exceeds the mean (expected value), we say that there is overdispersion (of course there’s also
the phenomenon of underdispersion but it is not that usual). There are at least two possible solutions for dealing
with overdispersion:
• To use a Negative Binomial distribution for the response variable (it has 2 parameters, so the
expected value and the variance can take different values):
E(Y) = r(1 − p) / p ,     VAR(Y) = r(1 − p) / p²
• To use a Zero-Inflated Poisson distribution for the response variable (we use this distribution when
there is an excess of zeros in the data; see the ZIP regression section below).
Remember that in both cases we move outside the framework of exponential families so the estimation
of the parameters and the hypothesis test is more complicated with respect to the strategy we saw for the GLM
theory.
The parameter "θ" is a positive parameter and has the role of a dispersion parameter, but it is not the
dispersion parameter of the exponential family.
The parameter "θ" is estimated separately and then used in the distribution of the Negative Binomial as a known
parameter. In this sense, the Negative Binomial is a one-parameter exponential family. The idea is to consider the
two values of separately: we estimate (fix) the "θ" and then estimate the GLM with only the mean "µ". From the
previous equation:
VAR(Y) = µ + µ² / θ
we see that:
• The variance is always greater than the mean, because "θ" is positive and so the second term "µ²/θ"
is positive (the Negative Binomial is used only for overdispersion, not for underdispersion)
• As the parameter goes to infinity "θ → ∞" the Negative Binomial distribution has a limit which is the Poisson
distribution (the second term goes to "0" and so the variance becomes equal to the expected value)
• As the parameter goes to zero "θ → 0" we have that the variance goes to infinity "V AR(Y ) → ∞"
Depending on where the estimate "θ̂" lies on the positive real line, we can decide whether the Poisson distribution is enough
to describe the data (large values of "θ̂" point towards the Poisson model, i.e. small overdispersion).
207
In practice, the Negative Binomial regression works like a Poisson regression, with link function "log(µ)".
The Negative Binomial regression is not in the base Software R packages: one possible implementation is the
function "glm.nb" in the package "MASS".
Let us reconsider the dataset on "Videogames" under the Negative Binomial regression. So first of all we
import the data and then we compute the GLM:
library(readr)
Video_Games <- read_csv(". . ./Video_Games.csv")
View(Video_Games)
library(MASS)
mod.nb1=glm.nb(Global_Sales~Genre+Publisher,data=Video_Games_red)
summary(mod.nb1)
......
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6823 -1.1216 -0.5844 0.0623 6.4889
Coefficients:
Estimate Std. Error z value
(Intercept) 6.359054 0.043378 146.596
GenreAdventure -0.530008 0.067314 -7.874
......
(Dispersion parameter for Negative Binomial(0.758) family taken to be 1)
Null deviance: 12116.2 on 7792 degrees of freedom
Residual deviance: 9295.1 on 7773 degrees of freedom
AIC: 114705
Number of Fisher Scoring iterations: 1
Theta: 0.7580 #Dispersion parameter#
Std. Err.: 0.0105
2 x log-likelihood: -114663.3600
The important point here is the line concerning the dispersion parameter. It is estimated as "θ̂ = 0.7580",
a small value that indicates strong overdispersion. This tells us that it is better to keep the Negative
Binomial distribution rather than the Poisson distribution. In the same line we also see that the exponential-family
dispersion parameter is taken to be "ϕ = 1". In the last line we obtain the "2 x log-likelihood" value used for the likelihood-ratio test.
If we want to compare our model with any other model we have to compare the log-likelihood (we can use
the "logLik" function). For example we compare our model with the Poisson model:
l1=logLik(mod.nb1)
l2=logLik(mod)
chi2 = as.numeric(-2*(l2-l1))
pchisq(chi2,1,lower.tail=F)
The resulting p-value is essentially "0": this means that the reduced model (the Poisson) is not
adequate and we have to stick to the Negative Binomial distribution (we reject the Poisson).
If instead the p-value were greater than the "5%" threshold, we could assume the Poisson distribution.
208
Negative Binomial regression
209
16.9 ZIP regression
ZIP regression
Let us consider the following data concerning ear infections in swimmers. The relevant variables here
are:
• Response variable "Infections": the number of infections
• Predictors: "Swimmer" (frequent "Freq" vs occasional "Occas") and "Location" ("Beach" vs "NonBeach")
mod=glm(Infections~Swimmer+Location,data=earinf,family=poisson(link="log"))
summary(mod)
table(earinf$Swimmer,earinf$Location) #to count the observed counts vs the predicted#
Call:
glm(formula = Infections ~ Swimmer + Location,
family = poisson(link = "log"), data = earinf)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1266 -1.5652 -1.2137 0.5128 6.2538
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3058 0.1059 -2.887 0.00389 **
SwimmerOccas 0.6130 0.1050 5.839 5.24e-09 ***
LocationNonBeach 0.5087 0.1028 4.948 7.49e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 824.51 on 286 degrees of freedom
Residual deviance: 764.65 on 284 degrees of freedom
AIC: 1143
Number of Fisher Scoring iterations: 6
and we notice that both predictors are significant. From the coefficients we can compute the expected
value of each Poisson distribution. Since the two predictors are both categorical with 2 levels, there are only 4 distinct
Poisson distributions for the response variable (one for each combination of levels).
210
So we compute the parameter for each one:
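The four expected values used (hard-coded) in the "dpois" calls below can be obtained from the coefficients; a short sketch:
b=coef(mod) #(Intercept), SwimmerOccas, LocationNonBeach#
exp(b[1]) #Freq + Beach: 0.7365#
exp(b[1]+b[3]) #Freq + NonBeach: 1.2249#
exp(b[1]+b[2]) #Occas + Beach: 1.3596#
exp(b[1]+b[2]+b[3]) #Occas + NonBeach: 2.2612#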
par(mfrow=c(2,2))
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,50),xlab="Number of infections",ylab="Counts",space=0,main="Freq+Beach")
points((0:17)+0.5,72*dpois(0:17,0.7365),pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,40),xlab="Number of infections",ylab="Counts",space=0,main="Freq+NonBeach")
points((0:17)+0.5,71*dpois(0:17,1.2249),pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,45),xlab="Number of infections",ylab="Counts",space=0,main="Occas+Beach")
points((0:17)+0.5,75*dpois(0:17,1.3596),pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,30),xlab="Number of infections",ylab="Counts",space=0,main="Occas+NonBeach")
points((0:17)+0.5,69*dpois(0:17,2.2612),pch=18,cex=1.5)
par(mfrow=c(1,1))
211
What we notice is that the predicted frequency of "0 infections" largely underestimates the observed one (there is
an excess of zeros).
So, when there is an excess of observed zeros, as in this case, we may use a Zero Inflated Poisson (ZIP)
model.
Remember
A Zero Inflated Poisson distribution is the mixture of a constant (non-random) variable in "0" (with
a given probability) and a Poisson random variable with density "P(λ)".
212
ZIP model
The idea is that we have two nested GLMs: first a logistic regression tells us whether an observation is a
structural "0" or comes from the Poisson part; then, for the Poisson part, a Poisson regression estimates the mean.
So there are two steps: a logistic regression to discriminate the observations (structural "0" vs
Poisson), and a standard Poisson regression to estimate "λ" [the issue is
not with the positive counts, which of course come from the Poisson distribution, but with the "0"s, which may
come either from the Poisson distribution or be structural zeros].
So the ZIP model is:
Yi | xi = 0 with probability pi ,     Yi | xi ∼ P(λi) with probability 1 − pi
The ZIP model with this equation is estimated with a two-step procedure:
1. First we estimate the probability "pi" with a logistic regression model:
logit(pi) = xiᵀ γ
2. Second we estimate the parameter "λi" with a Poisson regression model:
log(λi) = xiᵀ β
To perform ZIP regression in Software R, we use the function "zeroinfl" in the package "pscl" (analog of
the "glm" function for the ZIP). We now use a ZIP regression model on the ear infection data:
library(pscl)
mod.zi=zeroinfl(Infections~Swimmer+Location,data=earinf,dist="poisson")
summary(mod.zi)
Note that we must specify "dist="poisson"", because the same function also models zero inflation in
negative binomials and other discrete distributions.
The output is:
Call:
zeroinfl(formula = Infections ~ Swimmer + Location, data = earinf, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-0.9896 -0.7199 -0.5957 0.3513 7.4226
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6079 0.1224 4.965 6.87e-07 ***
SwimmerOccas 0.5154 0.1229 4.194 2.74e-05 ***
LocationNonBeach 0.1221 0.1151 1.061 0.289
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3868 0.2441 1.585 0.11295
SwimmerOccas -0.1957 0.2748 -0.712 0.47640
LocationNonBeach -0.7543 0.2695 -2.799 0.00513 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Number of iterations in BFGS optimization: 13
Log-likelihood: -475.2 on 6 Df
213
As we can see, the output has exactly two sections: the first one concerns the estimation of the parameters
of the Poisson part (the standard Poisson regression), whilst the second one is the logistic regression
that models whether an observation is a structural "zero" or a "0" coming from the
Poisson distribution. The parameter estimates are (for each of the 4 combinations of levels we obtain the probability
of a structural "0" and the expected value of the Poisson distribution):
• For "Frequent/Beach":
e0.3868
⋄ p̂ = (1+e0.3868 ) = 0.5955
⋄ λ̂ = e0.6079 = 1.8366
• ...
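A sketch of how the remaining (p̂, λ̂) pairs used in the plots below can be computed (the "count" and "zero" components of the coefficient list are part of the "zeroinfl" output; here for "Occasional/NonBeach"):
co.count=mod.zi$coefficients$count #(Intercept), SwimmerOccas, LocationNonBeach#
co.zero=mod.zi$coefficients$zero
eta.zero=co.zero[1]+co.zero[2]+co.zero[3]
p.hat=exp(eta.zero)/(1+exp(eta.zero)) #about 0.3628#
lambda.hat=exp(co.count[1]+co.count[2]+co.count[3]) #about 3.4743#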
table(earinf$Swimmer,earinf$Location)
c1=mod.zi$coefficients[1]
c2=mod.zi$coefficients[2]
par(mfrow=c(2,2))
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,50),xlab="Number of infections",ylab="Counts",space=0,main="Freq+Beach")
points((0:17)+0.5,72*(0.5955*c(1,rep(0,17))+(1-0.5955)*dpois(0:17,1.8366)),
pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,40),xlab="Number of infections",ylab="Counts",space=0,main="Freq+NonBeach")
points((0:17)+0.5,71*(0.4091*c(1,rep(0,17))+(1-0.4091)*dpois(0:17,2.0751)),
pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,45),xlab="Number of infections",ylab="Counts",space=0,main="Occas+Beach")
points((0:17)+0.5,75*(0.5476*c(1,rep(0,17))+(1-0.5476)*dpois(0:17,3.0750)),
pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,30),xlab="Number of infections",ylab="Counts",space=0,main="Occas+NonBeach")
points((0:17)+0.5,69*(0.3628*c(1,rep(0,17))+(1-0.3628)*dpois(0:17,3.4743)),
pch=18,cex=1.5)
par(mfrow=c(1,1))
214
The fit is now improved, as the "0" counts are correctly estimated (the points superimposed on the bars are the fitted
ZIP distributions, i.e. Poisson distributions inflated with an excess of zeros):
215
17 Lab 7
17.1 Regression for count data
This is an exercise on the Negative Binomial regression.
In this exercise we model the spatial distribution of AirBnB’s in Nanjing (China) using several possible predictors.
The city has been divided into "177" square zones (see the picture in the next page) and for each zone the data in
the file "nanjing.xlsx" records several variables.
Sun S, Zhang S, Wang X (2021) Characteristics and influencing factors of Airbnb spatial distribution in China’s
rapid urbanization process: A case study of Nanjing. PLoS ONE 16(3): e0248647
[https://doi.org/10.1371/journal.pone.0248647]
• Response variable - "Num_airbn" : the number of AirBnB’s (the number of red dots)
216
1. Import the data into a Software R data frame. Give some descriptive analysis of the variables.
What we do now is import the data and give a general description of the dataset (we use the "summary"
function for the quantitative variables, we plot the histogram or the boxplot [to check whether there are outliers], and
so on):
library(readxl)
nanjing <- read_excel(". . ./nanjing.xlsx")
View(nanjing)
summary(nanjing$Num_airbn)
boxplot(nanjing$Num_airbn,main="Num_airbn")
summary(nanjing$Subway_Station)
boxplot(nanjing$Subway_Station,main="Subway_Station")
summary(nanjing$Living_Facility)
boxplot(nanjing$Living_Facility,main="Living_Facility")
summary(nanjing$Cultural_Attraction)
boxplot(nanjing$Cultural_Attraction,main="Cultural_Attraction")
summary(nanjing$Percentage_construction_area)
boxplot(nanjing$Percentage_construction_area,main="Percentage_construction_area")
2. Use a Negative Binomial "glm" to model the number of AirBnB’s as a function of the other four variables,
and define a reduced model if you think that one or more predictors can be removed.
We now perform the "negative binomial GLM" (we run this on the complete model) [since it is a Negative
Binomial we are using it’s specific function which doesn’t need any specification for the distribution] and we
perform the log-likelihood (we will use it later):
library(MASS)
mod=glm.nb(Num_airbn~Subway_Station+Living_Facility+Cultural_Attraction+
Percentage_construction_area,data=nanjing)
summary(mod)
llmod=logLik(mod)
217
From the output we see that there are 3 predictors with significant p-values, whilst the predictor "cultural_attraction"
is not: this means that we can define a reduced model removing this last predictor.
Call:
glm.nb(formula = Num_airbn ~ Subway_Station + Living_Facility +
Cultural_Attraction + Percentage_construction_area, data = nanjing,
init.theta = 0.7070913205, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.30795 -1.03389 -0.54729 0.05536 3.12145
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.526e+00 2.799e-01 12.596 < 2e-16 ***
Subway_Station -6.689e-04 8.232e-05 -8.126 4.44e-16 ***
Living_Facility 3.019e-03 8.122e-04 3.717 0.000201 ***
Cultural_Attraction -1.386e-04 9.058e-05 -1.529 0.126142
Percentage_construction_area 1.417e-02 3.260e-03 4.347 1.38e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.7071) family taken to be 1)
Null deviance: 425.37 on 176 degrees of freedom
Residual deviance: 208.30 on 172 degrees of freedom
AIC: 1562.5
Number of Fisher Scoring iterations: 1
Theta: 0.7071
Std. Err.: 0.0775
2 x log-likelihood: -1550.4550
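The code that fits this reduced model is not printed in the notes; it is presumably along these lines (the names "mod.r" and "llmod.r" match the ones used later):
mod.r=glm.nb(Num_airbn~Subway_Station+Living_Facility+
Percentage_construction_area,data=nanjing)
summary(mod.r)
llmod.r=logLik(mod.r)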
Call:
glm.nb(formula = Num_airbn ~ Subway_Station + Living_Facility +
Percentage_construction_area, data = nanjing, init.theta = 0.6962030633,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3144 -1.0097 -0.5880 0.0817 3.2812
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.255e+00 2.368e-01 13.748 < 2e-16 ***
Subway_Station -6.928e-04 8.147e-05 -8.504 < 2e-16 ***
Living_Facility 3.009e-03 8.091e-04 3.718 2e-04 ***
Percentage_construction_area 1.572e-02 3.281e-03 4.792 1.65e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.6962) family taken to be 1)
Null deviance: 419.50 on 176 degrees of freedom
Residual deviance: 207.99 on 173 degrees of freedom
218
AIC: 1562.8
Number of Fisher Scoring iterations: 1
Theta: 0.6962
Std. Err.: 0.0760
2 x log-likelihood: -1552.8060
As we can see from the output, in the reduced model all 3 predictors are significant.
3. For your reduced model, compute the log-likelihood of the model, the log-likelihood of the null model, and do
the likelihood ratio test.
We now perform the likelihood ratio test to check the significance of our reduced model: we need to compare
the reduced model with the null model (we need the log-likelihood). So we run the "glm.nb" function for the
null-model (which is the model with no predictors: "∼ 1"). We then compute the log-likelihood for this model
and we compare it with the first one (reduced model):
mod.null=glm.nb(Num_airbn~1,data=nanjing)
llmod.null=logLik(mod.null)
llmod.r
llmod.null
-2*(llmod.null-llmod.r) #is positive#
p=1-pchisq(as.numeric(-2*(llmod.null-llmod.r)),3)
The first log-likelihood (reduced model) is based on 5 degrees of freedom (3 predictors + 2 parameters
of the Negative Binomial distribution).
The second log-likelihood (null model) is based on only 2 degrees of freedom (2 parameters of the
Negative Binomial distribution) since in this case we don’t have the predictors.
The difference between the log-likelihoods is positive "143.4091" and is based on 3 degrees of freedom (5-2).
In the end the p-value we obtain is equal to "0".
4. Using the variables in your reduced model, compare with a Poisson model defining the appropriate likelihood
ratio test.
In this point we compare our reduced model with a Poisson distribution. We can foresee that the Poisson
distribution is not a good idea here by looking at the value of "θ" in our output. The value for the dispersion
parameter for the Negative Binomial is indeed "θ = 0.6962", which is a very small value "θ < 1". This is not
a good result since in our theory we studied that the Poisson distribution is the limit of the Negative Binomial
distribution as "θ → ∞": this means that the Poisson distribution is not a good approximation of our model.
We can conclude the same by performing the test: first of all we define the Poisson model (we use the standard
"glm" function):
mod.pois=glm(Num_airbn~Subway_Station+Living_Facility
+Percentage_construction_area,data=nanjing,family=poisson(link="log"))
llmod.pois=logLik(mod.pois)
llmod.r
llmod.pois
p=1-pchisq(as.numeric(-2*(llmod.pois-llmod.r)),1)
The p-value we obtain is "0" which confirms what we previously said about the Poisson distribution. So in our
example the Negative Binomial distribution is needed to model the overdispersion of the response variable.
Here the Negative Binomial distribution is the big model and the Poisson distribution is the reduced model:
since the p-value is small we reject the reduced model (Poisson) and we accept the large model (Negative
Binomial) with the two parameters of the Negative Binomial.
219
5. In the reduced model compute the predicted values and plot "predicted vs observed".
We can find the fitted values in the output of the "glm" function. We can plot the figure:
mod.r$fitted.values
nanjing$Num_airbn
plot(nanjing$Num_airbn,mod.r$fitted.values)
cor(nanjing$Num_airbn,mod.r$fitted.values)
From the graph we notice that there is some overdispersion, with several outliers towards large observed
values. We also compute the correlation and we see that there is a high correlation between the observed and
the predicted values.
To sum up, the general strategy for count-data regression is:
1. Define the model and check its significance with the likelihood-ratio test
2. Check the significance of the predictors with the Wald test, and with the likelihood-ratio test for categorical
predictors (as we have seen for the logistic regression and for count data)
3. Check the overdispersion by comparing (in this case) the Negative Binomial distribution and the Poisson
distribution
220
18 Cross validation and model selection
18.1 Introduction
Model selection is a set of statistical techniques used to guide the choice of a reduced model in regression or classification problems where
the number of predictors is large.
For a small number of predictors in the complete model we can perform the Student-t tests for the linear regression, or the
Wald tests for the GLM, and then select a reduced model at the end. In general this is not feasible when the
number of predictors is large.
So the idea is that we need to work out a "tradeoff" between two opposite situations: a very small
model (handy, but we lose information) or a very big model (the complete model carries all the
information, but it is not easy to handle).
When many predictors are available, the great temptation is to use all of them in order to create a big model
which fits the data with great precision. In practice, however, a smaller model may be more appropriate, as we discuss below.
Remember our earlier discussion on the figure below: the first picture contains a lot of information but is chaotic,
whilst the second one is clearer but carries far less information:
We are now ready to approach this problem with a strong statistical background.
221
Single regression example
In this simple regression case we have one regressor "x" and one response variable "y". We now consider the
7 points in the figure below:
x=c(1,2,3,4,5,6,7)
y= c(-1.31,1.49,2.30,3.70,0.18,4.91,1.30)
plot(x,y,pch=20,cex=2)
From mathematics we know that through "7" points we can fit a polynomial of degree 6 and
obtain a perfect fit (a perfect interpolation) [since we have just one predictor, we use its powers to define
multiple parameters]:
mod7=lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5)+I(x^6)) #degree-6 fit, implicit in the notes#
coe=mod7$coefficients
x1=seq(0.5,7.5,0.01)
y1=coe[1]+coe[2]*x1+coe[3]*x1^2+coe[4]*x1^3+coe[5]*x1^4+coe[6]*x1^5+coe[7]*x1^6
plot(x,y,pch=20,cex=2)
lines(x1,y1,col="red",lwd=2)
This means that in this example the sum of the squared errors is null and the R² is equal to 1:
Σ ei² = 0 ,     R² = 1
so we have the best possible model. However it’s not very practical to use such a polynomial to predict
values of the response variable: in this case a simple linear function is better (we consider a simpler model).
Of course the regression line seems to be a worse model since each data point presents some error but we
moved to a much simpler model.
222
mod1=lm(y~x)
plot(x,y,pch=20,cex=2)
lines(x1,y1,col="red",lwd=2)
abline(mod1,col="blue",lwd=2)
But if we add a few new points (the 3 new observations in green) it is no longer obvious that the red curve is still
the best model: for these new observations the regression line seems to be a better approximation.
xnew=c(1.5,3.5,6.5)
ynew=c(1.13,2.06,4.49)
plot(x,y,pch=20,cex=2)
lines(x1,y1,col="red",lwd=2)
abline(mod1,col="blue",lwd=2)
points(xnew,ynew,col="green",pch=17)
So, in terms of fit to the original data the red curve is better (it has a perfect fit), whilst if we seek the best prediction
on new data the blue line is better, as it has a lower error.
Here we compute the test error for polynomials of different degrees:
mod1=lm(y~x) #Degree1#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew
MSE1=mean((ynew-yhattest)^2)
mod1=lm(y~x+I(x^2)) #Degree2#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2
MSE2=mean((ynew-yhattest)^2)
mod1=lm(y~x+I(x^2)+I(x^3)) #Degree3#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2+coe[4]*xnew^3
MSE3=mean((ynew-yhattest)^2)
223
mod1=lm(y~x + I(x^2) + I(x^3) + I(x^4)) #Degree4#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2+coe[4]*xnew^3+coe[5]*xnew^4
MSE4=mean((ynew-yhattest)^2)
18.2 Overfitting
This is a general problem in the model selection framework and is known as overfitting: if we use a large
number of predictors and parameters, we obtain a good performance on the data used to
compute the parameter estimates, but the same model becomes worse at forecasting new observations. The standard way
to address this problem is to perform model-building and model-testing on two different sets of data. This
means that we divide our dataset into two subsets:
• The first subset is used to perform model-building, so to compute the parameter estimation
• The second subset is used to perform model-testing, so to check the goodness of fit using the parameters
computed on the first subset
[We already saw one example of cross-validation in the lab of the Logistic regression]
Overfitting
Overfitting means that the fit is very good (and actually exact in our previous example) but the model
has bad performance in terms of prediction.
How to overcome this problem?
224
In the validation-set approach the basic idea is to use two different data sets to estimate the parameters of
the model and to measure the fit. We divide the data into:
• A training set, used for parameter estimation: using this set we compute the estimates "β̂0 , β̂1 , . . . , β̂p "
• A test set, used to evaluate the fit of the model: using this set we compute the mean square error, i.e. the
mean of
(Y − (β̂0 + β̂1 x1 + · · · + β̂p xp))²
We separate the two sets because, in order to avoid overfitting, we compute the parameter estimates on the training set
and then we use them on the test set. The important point is that when we compute the predictions
"β̂0 + β̂1 x1 + · · · + β̂p xp" we operate on the test set, but we use the coefficient values coming from the training set.
In the small introductory example, compute the mean square error (the mean of the squared residuals)
on the test set (green data points) for all possible polynomial models (the actual data are in the R script).
The mean square error on the test set captures the two components of bias and variance: the rule here
is to consider as "best model" the one with the lowest mean square error on the test set.
• For a low number of parameters "p", the model is simple, and so it cannot capture the full complexity of the
data: this is called bias.
• For a great number of parameters "p", the model is complex, and it tends to "overfit" to spurious properties
of the data: this is called variance.
Best model
The rule is that the best model is the model with lowest MSE on the test set.
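For the toy example above, this amounts to comparing the four test errors computed earlier (a quick check using the objects MSE1, MSE2, MSE3 and MSE4 defined in the previous chunk):
c(MSE1,MSE2,MSE3,MSE4) #test MSE for the degrees 1 to 4#
which.min(c(MSE1,MSE2,MSE3,MSE4)) #degree with the lowest test MSE#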
In most cases the bias-variance trade-off is thought of by drawing a picture like this (notice that the "error" in
prediction is the sum of the two other components: by minimizing this "error" we obtain a compromise between
the low and high complexity):
• The training error is the average error on the data points used to estimate the parameters (and it is what
we have computed in all our models until now).
• The test error is the average error in estimating the response variable of the test set using the model
parameters computed with the training set.
The two errors are often quite different (look at our 7 points example), and in particular the training error can
dramatically underestimate the test error.
Auto
We use the dataset "Auto" in the package "ISLR". The aim is to predict the fuel consumption "mpg"
(miles per gallon) as a function of the engine power.
• There are "392" observations. We split them into two halves (50/50 here) by sampling "196" row indices
at random from the entire set (we sample at random because in most datasets the rows are sorted in some
way, and a random split prevents the training set from being chosen in a systematic, non-random way):
library(ISLR)
View(Auto)
train=sample(1:392,196,replace=F)
• We use the training set to build the model [here we just used one predictor to have a simple model]:
lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
• We use the "predict" function to estimate the response and then we compute the MSE on the test set (the
mean of the squared differences between the observed and the predicted response) for the "196" observations
in the validation set:
mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)
With my random selection, the result is "24.33445" but of course you’ll have different values due to the
random split.
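To make the split reproducible one can fix the random seed before sampling (a small sketch; the seed value is arbitrary):
set.seed(1) #any fixed seed makes the split reproducible#
train=sample(1:392,196,replace=F)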
Since we just considered one predictor we can add some complexity to the model by adding powers: we now
consider a polynomial model of degree-2:
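A sketch of the degree-2 fit and its test MSE, assuming the same train indices as above (the object name lm.fit2 is only for illustration):
lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train) #degree-2 polynomial on the training set#
mean((Auto$mpg-predict(lm.fit2,Auto))[-train]^2) #MSE on the validation set#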
With the degree-2 model we obtain a slight improvement, since the test MSE decreases to "21.97939" (a better
model). So we have an improvement, but what happens if we keep adding terms to the polynomial model?
Here for example we compute and plot the test MSE for the degrees from 1 to 10:
MSEs=rep(NA,10)
for (i in 1:10)
{
lm.fit=lm(mpg~poly(horsepower,i),data=Auto,subset=train)
MSEs[i]=mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)
}
plot(1:10,MSEs,type="b",pch=20)
We can see the results in the following plot:
From the plot we see that the test MSE improves markedly up to degree 2: beyond the second degree we
don't have any relevant improvement. This means that it is not useful to consider higher degrees, since
they only complicate the model without improving its predictive performance.
We can also try this procedure for different selections of the training set.
• The validation estimate of the test error can be highly variable, depending on precisely which
observations are included in the training set and which observations are included in the validation set (it
depends on the choice of the sets)
• In the validation approach, only a subset of the observations (those that are included in the training
set rather than in the validation set) is used to fit the model.
This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire
data set. A possible solution is the K-fold cross validation process.
We divide our dataset into "K" equally sized parts (approximately) and then use "K − 1" blocks for the training
phase and the last one for the validation part [we repeat this process "K" times in order to use the whole dataset
in both phases].
Let the "K" parts be "C_1, . . . , C_K", where "C_k" denotes the indices of the observations in part "k". There are "n_k"
observations in part "k". If "n" is a multiple of "K", then "n_k = n/K".
K-fold computation
CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \, MSE_k
\qquad \text{where} \qquad
MSE_k = \frac{1}{n_k} \sum_{i \in C_k} (y_i - \hat{y}_i)^2
and "ŷi " is the fit for observation "i" computed from the data with part "k" removed.
The idea is that the cross validation is simply the (weighted) mean of the MSE in each of the configurations: indeed
we compute the MSE of each model in each possible configuration (in our previous example we had 5 configurations).
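As an illustration, here is a minimal by-hand K-fold computation for the Auto example (a sketch, assuming K = 5; the object names fold, MSEk, nk and CVK are only for illustration):
K=5
n=nrow(Auto)
fold=sample(rep(1:K,length.out=n)) #random fold assignment#
MSEk=rep(NA,K)
nk=rep(NA,K)
for (k in 1:K)
{
fit=lm(mpg~poly(horsepower,2),data=Auto[fold!=k,]) #fit without fold k#
pred=predict(fit,Auto[fold==k,]) #predict on fold k#
MSEk[k]=mean((Auto$mpg[fold==k]-pred)^2)
nk[k]=sum(fold==k)
}
CVK=sum(nk/n*MSEk) #weighted mean of the fold MSEs#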
A special case is leave-one-out cross-validation (LOOCV), where "K = n", i.e. each fold contains a single observation.
For linear models the LOOCV error does not require "n" separate fits, since
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_{i,i}} \right)^2
where "ŷi" is the predicted value from the original least squares problem (with the full data set), and "h_{i,i}" is the
i-th diagonal element of the hat matrix "H = X(X t X)−1 X t ".
LOOCV is simple to apply, but typically the estimates from each fold are highly correlated and hence their average
can have a high variance. A good trade-off is to choose "K = 5" or "K = 10".
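For a linear model the shortcut formula above can be evaluated directly from the hat values, without refitting (a sketch on the Auto data, using the full data set):
fit=lm(mpg~poly(horsepower,2),data=Auto)
h=hatvalues(fit) #diagonal of the hat matrix#
CVn=mean(((Auto$mpg-fitted(fit))/(1-h))^2) #LOOCV estimate of the test MSE#
CVn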
18.6.1 Recap
In standard regression (least squares algorithm) the idea was to use the whole set of information (the whole dataset)
to perform model estimation (to compute the estimates): this can lead to the overfitting problem (as we have seen,
complex models give a perfect fit, but their performance in forecasting new observations is poor). So the idea is:
instead of using the whole dataset for both the training and the testing phase, we divide it into several (possibly
just two) subsets and perform these two operations separately.
18.7 Cross-validation in Software R
We use the function "train" in the package "caret":
library(caret)
train.control=trainControl(method = "cv", number = 10)
model=train(mpg~poly(horsepower ,2), data = Auto, method = "lm", trControl = train.control)
print(model)
Linear Regression
392 samples
1 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 353, 353, 354, 353, 352, 353, ...
Resampling results:
RMSE Rsquared MAE
4.332105 0.6973377 3.257559
Tuning parameter ’intercept’ was held constant at a value of TRUE
train.control=trainControl(method = "LOOCV")
model=train(mpg~poly(horsepower ,2), data = Auto, method = "lm",
trControl = train.control)
print(model)
Linear Regression
392 samples
1 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 391, 391, 391, 391, 391, 391, ...
Resampling results:
18.8 Cross-validation for classification
Cross-validation can be applied in the same way to classification problems (for instance logistic regression). The
only difference is that the fit is evaluated through the misclassification proportion.
Classification problem
Compute:
CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \, Err_k
\qquad \text{where} \qquad
Err_k = \frac{1}{n_k} \sum_{i \in C_k} 1(y_i \neq \hat{y}_i)
library(e1071)
#Here "d" is assumed to be a data frame with a binary factor response "y" and a predictor "x", as in an earlier logistic regression example#
train_control = trainControl(method = "cv", number = 10)
model = train(y~x,data=d,trControl = train_control,method = "glm",family=binomial())
print(model)
100 samples
1 predictor
2 classes: ’0’, ’1’
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 89, 91, 91, 90, 90, 89, ...
Resampling results:
Accuracy Kappa
0.7226263 0.4300388
19 Model selection
19.1 Introduction
The problem
In model selection the idea is to find the smallest set of predictors which provide an adequate description of
the data.
When dozens, or hundreds, of predictors are available, we need to choose among the candidate predictors the
best subset to define a suitably small statistical model.
Computational issue
Model selection can be challenging: if we have "p" candidate predictors, there are potentially "2^p" models to
consider (i.e. each term being in or out of a given model) [so an exhaustive search is feasible only when the
number of predictors is small].
A common alternative is a stepwise strategy. The backward elimination algorithm works as follows:
1. Start with the largest possible model (the complete model, i.e. the one containing all the candidate
predictors)
2. Choose a measure to quantify what makes a good model (a measure for the goodness of fit)
3. Remove the "worst predictor"
4. Continue to remove terms one at a time while the removal still provides a "better model" (a better fit)
5. When the removal of another predictor would give a worse model, we stop the algorithm
For example, starting from the complete model with predictors "x_1, x_2, . . . , x_p", at the first step we consider all
the sub-models with one predictor removed, i.e. "{x_2, . . . , x_p}", "{x_1, x_3, . . . , x_p}", . . . , "{x_1, . . . , x_{p-1}}".
The forward selection algorithm proceeds in the opposite direction:
1. Start with the smallest possible model (the empty model: only the intercept is in).
2. Choose a measure to quantify what makes a good model.
3. Add the "best predictor".
4. Continue to add terms one at a time while the addition still provides a "better model"
5. When the addition of the next predictor would give a worse model, then stop.
It is the same procedure we saw for the backward algorithm but in this case we proceed in the opposite direction.
From linear regression we know a goodness of fit measure, the "R^2":
R^2 = 1 - \frac{RSS_C}{VAR(Y)}
but "R^2" is not a good choice because it systematically selects the largest model (and we need something that works
also in the GLM framework). Other possible choices are:
• The adjusted R^2:
R^2_{adj} = 1 - \frac{RSS_C / df_C}{RSS_R / df_R}
where "df_C = n − p − 1" are the degrees of freedom of the complete model, "df_R = n − 1" are the degrees
of freedom of the empty model, and "RSS_R" is the residual sum of squares of the empty model. We search
for a large "R^2_{adj}".
• The cross-validation criterion. With LOOCV:
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_{i,i}} \right)^2
• The Akaike information criterion (AIC):
AIC = -2\,\ell(y, \hat{y}) + 2(p + 1)
Note that the term "−2ℓ(y, ŷ)" is in favor of large models, while "2(p + 1)" is in favor of small models. The
AIC is a correction of the deviance: it is a compromise between the first part (which measures the goodness
of fit) and the second part, which takes into account the model complexity (the resulting curve is similar to
the one we saw for the bias-variance tradeoff). So when we look for a "good model" from the likelihood point
of view we want a small value of the AIC.
• The Bayesian information criterion (BIC). It is very similar to the AIC, but the penalty also takes into
account the size of the dataset:
BIC = -2\,\ell(y, \hat{y}) + \log(n)\,(p + 1)
BIC tends to favor smaller models than AIC, since the penalty for large "p" is heavier (log(n) > 2 as soon as
n ≥ 8). Again we look for a small value of BIC.
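Both criteria can be computed by hand from the log-likelihood and checked against the built-in R functions. A sketch, assuming a fitted lm or glm object such as the complete model mod.c defined in the next example; note that for lm the function logLik() also counts the error variance as a parameter, which shifts the value but not the ranking of the models:
ll=logLik(mod.c) #log-likelihood of the fitted model#
npar=attr(ll,"df") #number of estimated parameters counted by R#
n=nobs(mod.c)
aic=-2*as.numeric(ll)+2*npar
bic=-2*as.numeric(ll)+log(n)*npar
c(aic,AIC(mod.c),bic,BIC(mod.c)) #the hand-made values match the built-in ones#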
The dataset contains the crime rate "RATE" in 47 US States in 1960, together with 13 possible predictors
(this means that we have "2^{13}" possible sub-models):
• "Age": the number of males of age 14-24 per 1000 population
• "S": indicator variable for Southern states (0 = No, 1 = Yes)
• "Ed": Mean number of years of schooling (x10) for persons of age 25 or older
• "Ex0 ": 1960 per capita expenditure on police by state and local government
• "Ex1 ": 1959 per capita expenditure on police by state and local government
• "LF ": Labor force participation rate per 1000 civilian urban males age 14-24
• "U2 ": Unemployment rate of urban males per 1000 of age 35-39
• "W ": Median value of transferable goods and assets or family income in tens of dollars
• "Pov": The number of families per 1000 earning below 1⁄2 the median income
The data are in the file "uscrime.dat", where the response variable is the variable "Crime_rate": we
consider the Gaussian random variable so we use the standard linear regression. The first rows are here:
For the model selection we use the "step" function. For the backward selection we define the complete model
"mod.c" and the direction:
library(readr)
uscrime <- read.delim(".../uscrime.dat")
uscrime=as.data.frame(uscrime)
View(uscrime)
mod.c=lm(RATE ~ . ,data=uscrime) #This is the complete model#
summary(mod.c)
AIC(mod.c)
So with the "backward" option we start from the complete model, at each step we compute all the possible
sub-models obtained by removing one predictor, and we reduce the model accordingly (we iterate this until no
removal gives a better result).
mod.sel=step(mod.c,direction="backward")
Start: AIC=301.66
RATE~Age+S+Ed+Ex0+Ex1+LF+M+N+NW+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- NW 1 6.1 15885 299.68
- LF 1 34.4 15913 299.76
- N 1 48.9 15928 299.81
- S 1 149.4 16028 300.10
- Ex1 1 162.3 16041 300.14
- M 1 296.5 16175 300.53
<none> 15879 301.66
- W 1 810.6 16689 302.00
- U1 1 911.5 16790 302.29
- Ex0 1 1109.8 16989 302.84
- U2 1 2108.8 17988 305.52
- Age 1 2911.6 18790 307.57
- Ed 1 3700.5 19579 309.51
- Pov 1 5474.2 21353 313.58
From the output we obtain the column of the "AIC" values: the row "<none>" represents the current model (no
predictor removed), whilst all the other rows represent the sub-models where one single predictor is removed.
They are sorted in ascending order, so we can easily spot the removal with the smallest "AIC" value (in this
case the row "- NW"): this means that the first predictor we remove from the complete model is "NW". In the
next step we repeat the same procedure considering the new sub-model we have defined:
Step: AIC=299.68
RATE~Age+S+Ed+Ex0+Ex1+LF+M+N+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- LF 1 28.7 15913 297.76
- N 1 48.6 15933 297.82
- Ex1 1 156.3 16041 298.14
- S 1 158.0 16043 298.14
- M 1 294.1 16179 298.54
<none> 15885 299.68
- W 1 820.2 16705 300.05
- U1 1 913.1 16798 300.31
- Ex0 1 1104.3 16989 300.84
- U2 1 2107.1 17992 303.53
- Age 1 3365.8 19251 306.71
- Ed 1 3757.1 19642 307.66
- Pov 1 5503.6 21388 311.66
As we did before we remove another predictor, and in this case the best choice is to remove the "LF" predictor
[and we continue operating in this way].
Step: AIC=297.76
RATE~Age+S+Ed+Ex0+Ex1+M+N+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- N 1 62.2 15976 295.95
- S 1 129.4 16043 296.14
- Ex1 1 134.8 16048 296.16
- M 1 276.8 16190 296.57
<none> 15913 297.76
- W 1 801.9 16715 298.07
- U1 1 941.8 16855 298.47
- Ex0 1 1075.9 16989 298.84
- U2 1 2088.5 18002 301.56
- Age 1 3407.9 19321 304.88
- Ed 1 3895.3 19809 306.06
- Pov 1 5621.3 21535 309.98
Step: AIC=295.95
RATE~Age+S+Ed+Ex0+Ex1+M+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- S 1 104.4 16080 294.25
- Ex1 1 123.3 16099 294.31
- M 1 533.8 16509 295.49
<none> 15976 295.95
- W 1 748.7 16724 296.10
- U1 1 997.7 16973 296.80
- Ex0 1 1021.3 16997 296.86
- U2 1 2082.3 18058 299.71
- Age 1 3425.9 19402 303.08
- Ed 1 3887.6 19863 304.19
- Pov 1 5896.9 21873 308.71
Step: AIC=294.25
RATE~Age+Ed+Ex0+Ex1+M+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- Ex1 1 171.5 16252 292.75
- M 1 563.4 16643 293.87
<none> 16080 294.25
- W 1 734.7 16815 294.35
- U1 1 906.0 16986 294.83
- Ex0 1 1162.0 17242 295.53
- U2 1 1978.0 18058 297.71
- Age 1 3354.5 19434 301.16
- Ed 1 4139.1 20219 303.02
- Pov 1 6094.8 22175 307.36
Step: AIC=292.75
RATE~Age+Ed+Ex0+M+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- M 1 691.0 16943 292.71
<none> 16252 292.75
- W 1 759.0 17011 292.90
- U1 1 921.8 17173 293.35
- U2 1 2018.1 18270 296.25
- Age 1 3323.1 19575 299.50
- Ed 1 4005.1 20257 301.11
- Pov 1 6402.7 22654 306.36
- Ex0 1 11818.8 28070 316.44
Step: AIC=292.71
RATE~Age+Ed+Ex0+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- U1 1 408.6 17351 291.83
<none> 16943 292.71
- W 1 1016.9 17959 293.45
- U2 1 1548.6 18491 294.82
- Age 1 4511.6 21454 301.81
- Ed 1 6430.6 23373 305.83
- Pov 1 8147.7 25090 309.16
- Ex0 1 12019.6 28962 315.91
Step: AIC=291.83
RATE~Age+Ed+Ex0+U2+W+Pov #FINAL MODEL#
Df Sum of Sq RSS AIC
<none> 17351 291.83
- W 1 1252.6 18604 293.11
- U2 1 1628.7 18980 294.05
- Age 1 4461.0 21812 300.58
- Ed 1 6214.7 23566 304.22
- Pov 1 8932.3 26283 309.35
- Ex0 1 15596.5 32948 319.97
In the last output the row "<none>" (the current model, from which we do not remove any additional predictor)
has the smallest AIC. This means that, according to the AIC criterion, every sub-model is worse than the
current one, so the algorithm stops here: this is our final model.
From the different outputs you may notice that the same model can get different values of "AIC": this is because
there are several slightly different conventions for the likelihood term, and different functions may use different
conventions. This is not a problem as long as you consistently use the same function (and hence the same
definition of the likelihood) [here we are not interested in the single value of "AIC", only in the ranking of the
candidate models].
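For instance, the AIC displayed by "step" comes from the function extractAIC, which for linear models differs from AIC() by an additive constant; a quick check on the complete model:
AIC(mod.c) #likelihood-based AIC, as returned by AIC()#
extractAIC(mod.c) #(equivalent degrees of freedom, AIC up to an additive constant): this is what step() displays#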
Example 1
• Use the forward selection algorithm or the mixed one (both addition and removal are checked at
each step)
mod0=lm(RATE~1,data=uscrime)
mod.sel=step(mod0,scope=formula(mod.c),direction = "forward")
In the forward selection we start from the null model (model with only the intercept) and we need to
define the "scope", which is the formula of the largest possible model.
See the difference in the output in our example (sometimes they can lead to different results!).
Notice that there’s also a third possible procedure which is a mixture of the previous two: in this case at each
iteration we try to remove or add one predictor.
mod0=lm(RATE~1,data=uscrime)
mod.sel=step(mod0,scope=formula(mod.c),direction = "both")
The main difference we have to underline is that here, once a predictor has been removed from the model, it isn’t
"lost" forever as it can be re-integrated in further models, whilst in the backward method it is removed forever.
20 Lab 8
20.1 Model selection
Since the model selection based on stepwise procedures is quite simple, we consider two different datasets: one example
under the standard linear model with a Gaussian response variable and the other one based on logistic regression.
For this exercise we use the dataset "Credit" in the "ISLR" package. Load the package "ISLR" and the dataset
is automatically available. The aim here is to predict the "Balance" by means of several quantitative predictors:
"Age", "Cards" (number of credit cards), "Education" (years of education), "Income" (in thousands of dollars),
"Limit" (credit limit), "Rating" (credit rating). We also have four categorical predictors: "Gender", "Student"
(student status), "Married" (marital status), and "Ethnicity" (Caucasian, African American or Asian).
library(ISLR)
View(Credit)
summary(Credit$Limit)
boxplot(Credit$Limit,main="Limit")
summary(Credit$Rating)
boxplot(Credit$Rating,main="Rating")
summary(Credit$Cards)
boxplot(Credit$Cards,main="Cards")
summary(Credit$Age)
boxplot(Credit$Age,main="Age")
summary(Credit$Education)
boxplot(Credit$Education,main="Education")
table(Credit$Student)
table(Credit$Married)
table(Credit$Ethnicity)
Here we have the boxplots of the quantitative variables:
2. Fit a linear regression model with all the predictors. Note that the first column is simply a row-label, so that
it must be excluded from the analysis.
The first step is to compute the complete model. Since we have a linear model we use the "lm" function in
order to perform a linear regression.
mod=lm(Balance~Income+Limit+Rating+Age+Cards+Education+Gender+Student+Married+Ethnicity,
data=Credit)
summary(mod)
Call:
lm(formula = Balance ~ Income + Limit + Rating + Age + Cards +
Education + Gender + Student + Married + Ethnicity,
data = Credit)
. . .
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -479.20787 35.77394 -13.395 < 2e-16 ***
Income -7.80310 0.23423 -33.314 < 2e-16 ***
Limit 0.19091 0.03278 5.824 1.21e-08 ***
Rating 1.13653 0.49089 2.315 0.0211 *
Age -0.61391 0.29399 -2.088 0.0374 *
Cards 17.72448 4.34103 4.083 5.40e-05 ***
Education -1.09886 1.59795 -0.688 0.4921
GenderFemale -10.65325 9.91400 -1.075 0.2832
StudentYes 425.74736 16.72258 25.459 < 2e-16 ***
MarriedYes -8.53390 10.36287 -0.824 0.4107
EthnicityAsian 16.80418 14.11906 1.190 0.2347
EthnicityCaucasian 10.10703 12.20992 0.828 0.4083
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.79 on 388 degrees of freedom
Multiple R-squared: 0.9551, Adjusted R-squared: 0.9538
F-statistic: 750.3 on 11 and 388 DF, p-value: < 2.2e-16
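Equivalently, the complete model can be written with the dot notation, excluding the row-label column (a sketch, assuming the column is named ID, as in the ISLR package):
mod=lm(Balance~.-ID,data=Credit) #same complete model as above, written compactly#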
From the coefficient table we obtain one coefficient for each quantitative predictor ("Income", "Limit", "Rating",
"Age", "Cards" and "Education"), one dummy variable for each binary categorical predictor ("Gender", "Student"
and "Married"), and two dummy variables for the categorical predictor "Ethnicity", which has 3 levels. Here we
are not really interested in the p-values and in the significance of the predictors, since we are not going to perform
a "by-hand" selection of the predictors (we are going to use the stepwise algorithm).
3. Apply the stepwise algorithm to find reasonable reduced models. Use both the forward and the backward
directions, and try with both the "AIC" and the "BIC" criteria. Compare the results.
In order to perform this we use the "step" function (we use the complete model as input and then perform the
iterations). First of all we perform the backward-stepwise algorithm with the default penalty function "AIC"
(remember that the "backward" is the default direction of the function):
mod.bA=step(mod)
summary(mod.bA)
After several iterations the final model we obtain is the following where we have 6 predictors:
Call:
lm(formula = Balance ~ Income + Limit + Rating + Age + Cards +
Student, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-170.00 -77.85 -11.84 56.87 313.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -493.73419 24.82476 -19.889 < 2e-16 ***
Income -7.79508 0.23342 -33.395 < 2e-16 ***
Limit 0.19369 0.03238 5.981 4.98e-09 ***
Rating 1.09119 0.48480 2.251 0.0250 *
Age -0.62406 0.29182 -2.139 0.0331 *
Cards 18.21190 4.31865 4.217 3.08e-05 ***
StudentYes 425.60994 16.50956 25.780 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can perform the same (we iterate in the same direction) by using the "BIC" criterion. We define the
number of rows of our dataframe "n" and then we define the criterion by writing "k = log(n)" (the default
setting is "k = 2" as it corresponds to the Akaike information criterion):
n=dim(Credit)[1]
mod.bB=step(mod,k=log(n))
summary(mod.bB)
And so the output we obtain is the following:
Call:
lm(formula = Balance ~ Income + Limit + Cards + Student, data = Credit)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.997e+02 1.589e+01 -31.449 < 2e-16 ***
Income -7.839e+00 2.321e-01 -33.780 < 2e-16 ***
Limit 2.666e-01 3.542e-03 75.271 < 2e-16 ***
Cards 2.318e+01 3.639e+00 6.368 5.32e-10 ***
StudentYes 4.296e+02 1.661e+01 25.862 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What we obtain is a smaller model with respect to the previous output: the BIC has a higher penalty for large
models (it tends to select smaller models).
On the other hand, to perform the forward stepwise method we need to define the starting point (the null
model, i.e. the model which only contains the intercept) and then the scope (the largest possible formula, i.e.
the complete model). First of all we perform the algorithm using the "AIC":
mod.null=lm(Balance~1,data=Credit)
mod.fA=step(mod.null,formula(mod),direction="forward")
summary(mod.fA)
Step: AIC=3679.89
Balance ~ Rating + Income + Student + Limit + Cards + Age
So as we can see from the "summary" in our final model we have "6" predictors:
Call:
lm(formula = Balance ~ Rating + Income + Student + Limit + Cards +
Age, data = Credit)
. . .
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -493.73419 24.82476 -19.889 < 2e-16 ***
Rating 1.09119 0.48480 2.251 0.0250 *
Income -7.79508 0.23342 -33.395 < 2e-16 ***
StudentYes 425.60994 16.50956 25.780 < 2e-16 ***
Limit 0.19369 0.03238 5.981 4.98e-09 ***
Cards 18.21190 4.31865 4.217 3.08e-05 ***
Age -0.62406 0.29182 -2.139 0.0331 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can of course perform the same by using also the "BIC" criterion:
mod.null=lm(Balance~1,data=Credit)
mod.fB=step(mod.null,formula(mod),direction="forward",k=log(n))
summary(mod.fB)
We obtain:
Step: AIC=3706.46
Balance ~ Rating + Income + Student + Limit + Cards
and from the "summary" we see that in this case we obtained a final model with "5" predictors:
Call:
lm(formula = Balance ~ Rating + Income + Student + Limit + Cards,
data = Credit)
. . .
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -526.15552 19.74661 -26.645 < 2e-16 ***
Rating 1.08790 0.48700 2.234 0.026 *
Income -7.87492 0.23145 -34.024 < 2e-16 ***
StudentYes 426.85015 16.57403 25.754 < 2e-16 ***
Limit 0.19441 0.03253 5.977 5.10e-09 ***
Cards 17.85173 4.33489 4.118 4.66e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, depending on the criterion (the penalty applied) and on the direction of the algorithm, we obtain different
solutions (a different number of selected predictors) for our dataset.
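A quick way to compare the selected models on both criteria (a sketch using the four fitted objects above):
models=list(backAIC=mod.bA,backBIC=mod.bB,forwAIC=mod.fA,forwBIC=mod.fB)
sapply(models,AIC) #AIC of each selected model#
sapply(models,BIC) #BIC of each selected model#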
20.1.2 Exercise 2 - Cancer remission
In this exercise we use the data in the file "cancer_remission.txt" available in the Aulaweb folder. The data
come from an experiment to predict the "cancer_remission" (response variable with values "1" for remission and
"0" otherwise) on the basis of six possible predictors (the other columns in the data file).
table(cancer_remission$remiss)
summary(cancer_remission$cell)
boxplot(cancer_remission$cell)
summary(cancer_remission$smear)
boxplot(cancer_remission$smear)
summary(cancer_remission$infil)
boxplot(cancer_remission$infil)
summary(cancer_remission$li)
boxplot(cancer_remission$li)
summary(cancer_remission$blast)
boxplot(cancer_remission$blast)
summary(cancer_remission$temp)
boxplot(cancer_remission$temp)
2. Fit a logistic regression model with all the predictors.
For our regression we apply the function "glm":
mod=glm(remiss~cell+smear+infil+li+blast+temp,data=cancer_remission,
family=binomial(link="logit"))
summary(mod)
Call:
glm(formula = remiss ~ cell + smear + infil + li + blast + temp,
family = binomial(link = "logit"), data = cancer_remission)
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 58.0385 71.2364 0.815 0.4152
cell 24.6615 47.8377 0.516 0.6062
smear 19.2936 57.9500 0.333 0.7392
infil -19.6013 61.6815 -0.318 0.7507
li 3.8960 2.3371 1.667 0.0955 .
blast 0.1511 2.2786 0.066 0.9471
temp -87.4339 67.5735 -1.294 0.1957
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 21.751 on 20 degrees of freedom
AIC: 35.751
3. Apply the stepwise algorithm to find reasonable reduced models. Use both the forward and the backward
directions, and try with both the "AIC" and the "BIC" criteria. Compare the results.
As we did in the previous exercise we now perform the stepwise algorithm to find the reduced models (we use
both "AIC" and "BIC" criteria). We start with the backward stepwise with AIC:
mod.bA=step(mod)
summary(mod.bA)
The last reduced model we obtain is the following (it has 3 predictors):
Step: AIC=29.95
remiss ~ cell + li + temp
Df Deviance AIC
<none> 21.953 29.953
- temp 1 24.341 30.341
- cell 1 24.648 30.648
- li 1 30.829 36.829
Call:
glm(formula = remiss ~ cell + li + temp, family = binomial(link = "logit"),
data = cancer_remission)
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 67.634 56.888 1.189 0.2345
cell 9.652 7.751 1.245 0.2130
li 3.867 1.778 2.175 0.0297 *
temp -82.074 61.712 -1.330 0.1835
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We now repeat the backward selection with the "BIC" criterion (penalty "k = log(n)"):
n=dim(cancer_remission)[1]
mod.bB=step(mod,k=log(n))
summary(mod.bB)
The last iteration we obtain is the following. In this case we obtained a reduced model with just one predictor:
Step: AIC=32.66
remiss ~ li
Df Deviance AIC
<none> 26.073 32.665
- li 1 34.372 37.668
Call:
glm(formula = remiss ~ li, family = binomial(link = "logit"),
data = cancer_remission)
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073
Number of Fisher Scoring iterations: 4
We now perform the forward stepwise method. First of all we consider the "AIC":
mod.null=glm(remiss~1,data=cancer_remission,family=binomial(link="logit"))
mod.fA=step(mod.null,formula(mod),direction="forward")
summary(mod.fA)
Df Deviance AIC
<none> 26.073 30.073
+ cell 1 24.341 30.341
+ temp 1 24.648 30.648
+ infil 1 25.491 31.490
+ smear 1 25.937 31.937
+ blast 1 25.981 31.981
Call:
glm(formula = remiss ~ li, family = binomial(link = "logit"),
data = cancer_remission)
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073
Number of Fisher Scoring iterations: 4
And in the end we consider the forward stepwise with the "BIC":
mod.null=glm(remiss~1,data=cancer_remission,
family=binomial(link = "logit"))
mod.fB=step(mod.null,formula(mod),direction="forward",k=log(n))
summary(mod.fB)
The last iteration is:
Step: AIC=32.66
remiss ~ li
Df Deviance AIC
<none> 26.073 32.665
+ cell 1 24.341 34.228
+ temp 1 24.648 34.535
+ infil 1 25.491 35.378
+ smear 1 25.937 35.825
+ blast 1 25.981 35.869
And from the summary we see again that we obtained a model with one predictor:
Call:
glm(formula = remiss ~ li, family = binomial(link = "logit"),
data = cancer_remission)
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073
Number of Fisher Scoring iterations: 4
21 Last farewell . . .
First of all I would like to thank professor Rapallo for his magnificent lectures during this global pandemic, and then
I thank you, the students who have enjoyed these notes. It has been a pleasure and an honor to offer you this
manuscript, and I hope it will prove profitable.
So remember that, as Socrates said,
One last warm farewell from your dearest comrade G.M., aka Peer2PeerLoverz.