
Statistical Models

Giorgio Mantero

15-02-2022

Contents
1 Introduction to Statistical models and some recalls 9
1.1 Some experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Sample spaces and σ-fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Statistical models (non-parametric) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Statistical models (parametric) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Parametric simple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Statistical models (nonparametric) II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Nonparametric statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Statistical models in mathematical statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8.1 Population, sampling and sampling schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8.1.1 Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8.1.2 Sampling schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8.1.3 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.9 Aims of statistical inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.10 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.11 Estimator and estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.12 Sample mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.13 Unbiasedness - Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.13.1 Unbiased estimators and consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.13.2 Estimation of the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.13.3 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.13.4 CI for the mean of a normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.14 Sum up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.15 Testing statistical hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.15.1 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.15.2 Role of the hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.15.3 Level of the test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.15.4 One-tailed and two-tailed tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.15.5 Test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.15.6 Rejection region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.15.7 Large sample theory for the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.15.8 p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2 Discrete random variables 25


2.1 Why random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Expected value and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Cumulative distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Example in Software R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Bernoulli random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.5.1 Bernoulli scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Binomial random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Geometric random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8 Negative binomial random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.9 Generalization of "N Bin(r, p)" to real "r" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10 Poisson random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.10.1 Poisson is the limit of the binomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.10.2 Over-dispersed Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.11 Mixture of two random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.12 Mixture of several random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3 Lab 1 40
3.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4 Continuous random variables 46


4.1 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.1 Expected value and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Cumulative distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Quantiles and median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Uniform random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Exponential random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 Gamma random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 Gaussian (normal) random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8 Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8.1 Gaussian random variables in Software R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.9 Mixture of two random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.10 Mixture of several random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.11 Kernel density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.11.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.11.2 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.11.3 Kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.11.4 KDE construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.11.5 Bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.12 Checking for normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.12.1 Q-Q plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.12.2 Statistical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.13 Anderson-Darling test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.14 Anderson-Darling test of normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5 Lab 2 65
5.1 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6 Multivariate random variables 69
6.1 Bivariate random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.1 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1.3 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1.4 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1.5 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1.6 Properties of the covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 The multinomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2.1 Generate data from a multinomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Chi-square goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.4 Bivariate continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4.1 Conditional distributions and conditional expectation . . . . . . . . . . . . . . . . . . . . . . 77
6.4.2 Independence and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.5 Multivariate normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.6 Multivariate standard normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.6.1 Figures of multivariate normal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.6.2 Property - Linear combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.6.3 Property - Subsets of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.6.4 Property - Zero-correlation implies independence . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.7 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.7.1 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.7.2 Generate multivariate normal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

7 Lab 3 85
7.1 Multivariate random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

8 Likelihood 91
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.2.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.3 Maximum likelihood estimation [MLE] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.3.1 MLE: example for a Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.4 Log-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.4.1 Properties of the MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.4.2 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.5 Score function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.6 Score and MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.6.1 Role of score in MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.6.2 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.6.3 Variance and information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.6.4 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

8.7 The Cramer-Rao lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.7.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.7.2 Another information identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.7.3 Asymptotic distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.7.4 Multiple parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.8 Exponential families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.8.1 Exponential families – in view of GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.9 Properties of the score statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.9.1 Score statistic for exponential families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.9.2 Mean and variance for exponential families . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.10 Link functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.10.1 Benefits of canonical links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

9 Simulation and the bootstrap 110


9.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.2 Generating randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.3 Linear Congruential Generator (LCG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.4 Checking randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.5 Sampling from a finite set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.5.1 Function "sample()" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.6 Sampling from a parametric (discrete) distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.7 Sampling from a finite set - the multinomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.8 Sampling from a (continuous) parametric distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.9 The empirical cumulative distribution function (cdf) . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.10 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.10.1 A general picture for the bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9.10.2 How to do it . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.11 Estimating the bias of an estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.12 Estimating the standard deviation of an estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.13 Advantages of the bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.14 Confidence intervals: the basic method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.15 Confidence intervals: the percentile method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.16 Multivariate and time series and more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.17 Software R package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

10 Lab 4 123
10.1 Simulation - The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

11 Regression: parametric and nonparametric approaches 130


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
11.2 Linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
11.3 Estimation of "β" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
11.4 Estimation of "β" with Gaussian residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

11.5 Properties of the LS estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
11.6 The Gauss-Markov theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
11.7 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
11.8 F-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
11.9 The Student’s "t" test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
11.10"R2 " coefficient (coefficiente di determinazione) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
11.11The plot of the residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
11.12Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
11.13Working with qualitative predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.14Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.15Nonparametric approach: the k-NN method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
11.16Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.16.1 k-NN in Software R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
11.17The k-NN for classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.18The effect of "k" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.18.1 With Software R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.19Pros and cons of k-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

12 Lab 5 145
12.1 Multiple Linear Regression - Boston Housing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

13 Generalized Linear Models 151


13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
13.2 A review of the Gaussian model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
13.3 Generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
13.4 The random component – overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
13.5 The systematic component – overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
13.6 The link function – overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
13.7 How to extend the LM theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
13.8 Weighted least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
13.9 WLS – Known weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
13.10 WLS – Unknown weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
13.11 Iterative re-weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
13.12 OLS vs IRLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
13.13 Choice of the weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
13.14 Back to GLM’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
13.15 Random component – Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
13.16 The systematic component and the link function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
13.17 Solving the likelihood equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
13.18 Newton-Raphson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
13.19 Fisher scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
13.20 Fisher scoring with IRLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
13.21 Asymptotic distribution of the MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
13.22 The Wald test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
13.23 The likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
13.24 Wald test vs Likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.25 Deviance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.26 Deviance residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.27 Computational remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

14 Logistic and multinomial regression 173


14.1 Logistic regression – Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
14.2 Logistic regression parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
14.3 Read the output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
14.4 Interpretation of the parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
14.5 The odds ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
14.6 The deviance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
14.7 Multiple logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
14.8 The output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
14.9 Model with interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
14.10 Multinomial logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
14.11 The multinomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.12 Ordered multinomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.13 Ordered multinomial model with Software R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

15 Lab 6 191
15.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
15.1.1 Exercise 1 - Credit card default . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

16 Regression for count data 197


16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
16.2 Poisson regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
16.3 Poisson regression parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
16.4 Predicted values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
16.5 Interpretation of the parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
16.6 Poisson regression for rates: the offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
16.7 Overdispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
16.8 Negative Binomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
16.9 ZIP regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

17 Lab 7 216
17.1 Regression for count data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
17.1.1 Exercise 1 - AirBnB’s in Nanjing (China) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

18 Cross validation and model selection 221


18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

18.2 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
18.3 Bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
18.4 Drawbacks of the validation-set approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
18.5 K-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
18.6 Special case: LOOCV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.6.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.7 Cross-validation in Software R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
18.8 Cross-validation for classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
18.9 Cross-validation for classification in Software R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
18.10 Cp, AIC, BIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

19 Model selection 231


19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
19.2 Backward and forward stepwise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
19.3 How to measure a "good model" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
19.3.1 Mixed selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
19.4 Forward vs Backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

20 Lab 8 238
20.1 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
20.1.1 Exercise 1 - Credit card balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
20.1.2 Exercise 2 - Cancer remission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

21 Last farewell . . . 248

1 Introduction to Statistical models and some recalls
1.1 Some experiments
The first ingredient of a statistical model is the set of all possible observed outcomes of the random variables
involved in the problem. This is the sample space "Ω" of a random experiment.

Die roll
Roll a die:
Ω = {1, 2, 3, 4, 5, 6}

Coin toss
Toss a coin 10 times:
Ω = {H, T}^10 or Ω = {0, 1}^10
In this case we are considering the set of all possible vectors of length "10" of "0" and "1".

For discrete experiments with infinitely many possible outcomes:

Record of failures
Record the number of failures in an internet connection in a given time interval:

Ω=N

For continuous measures

Record the stock price (I)

Record the price of 100 stocks:


Ω = (0, +∞)^100

Record the stock price (II)

Record the price of a given stock for 100 working days:

Ω = (0, +∞)^100

1.2 Sample spaces and σ-fields

Remember
Given a sample space, the probability is defined over the events, i.e. subsets of the sample space.

Sample space and σ-field

Given a sample space "Ω", a σ-field of "Ω" is a family "F" of subsets such that:
• ∅∈F [Event = element of the σ-field F]
• If "E ∈ F", then "Ē ∈ F" ["E" is each element-event and "Ē" is its complement]

• If "(Ei )∞
i=1 ∈ F", then "∪i=1 Ei ∈ F"

If we have a (countable) family of events, then their union is again an event.

Typical examples:

Discrete case
For discrete experiments the discrete σ-field is the field of all the possible subsets. All the subsets of Ω are
events:
F = ℘(Ω)

Continuous case

For continuous experiments: "F" is the Borel σ-field (the σ-field generated by intervals). It contains:
• All intervals (open, closed, . . . )

• All half-lines (open or closed)


• Every union or intersection of the above

The construction of a σ-field is exactly the definition of the domain of a probability function.

1.3 Statistical models (non-parametric)

Remember

A probability distribution on "(Ω, F)" is a function "P":

P : F −→ R

such that:
• "0 ≤ P(E) ≤ 1" for all "E ∈ F"

• "P(Ω) = 1"
• "P(∪_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei)" for disjoint events "(Ei)_{i=1}^∞ ∈ F": for disjoint events the probability of the union is the sum of the probabilities

In statistics the function "P" (probability distribution) is usually unknown.

1.4 Statistical models (parametric)


To simplify the analysis, we can consider parametric families of probability distributions.

Parametric statistical model

A (parametric) statistical model is a triple:

(Ω, F, (Pθ )θ∈Θ )

or
(Ω, F, (Fθ )θ∈Θ )
where "F" denotes the cumulative distribution function of a random variable.

The probability distribution has known shape with unknown parameters:

Gaussian model 1
For quantitative variables:
X ∼ N (µ, σ 2 )
known "σ 2 ", is a 1-parameter statistical model with "θ = µ", and "Θ = R".

Gaussian model 2
For quantitative variables:
X ∼ N (µ, σ 2 )
both "µ, σ 2 " unknown, is a 2-parameter statistical model with "θ = (µ, σ 2 )", and "Θ = R × [0, +∞)".

1.5 Parametric simple regression

Remember

Remember that, given two quantitative variables "X" (regressor) and "Y", the regression line:

Y = b0 + b1 X

is the least squares solution, i.e. it minimizes the sum of squared errors Σ ε².

The model

Y = β0 + β1 X + ε
is a 3-parameter statistical model with "θ = (β0 , β1 , σ 2 )" where "σ 2 " is the variance of "ε".
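A minimal R sketch, with simulated data (the true values β0 = 2, β1 = 0.5, σ = 1 are chosen only for illustration), showing how the three parameters can be estimated with lm():

set.seed(1)
x = runif(100, 0, 10)                # regressor
y = 2 + 0.5*x + rnorm(100, sd=1)     # true beta0 = 2, beta1 = 0.5, sigma = 1
fit = lm(y ~ x)
coef(fit)                            # estimates of beta0 and beta1
summary(fit)$sigma^2                 # estimate of sigma^2 (residual variance)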

1.6 Statistical models (nonparametric) II


If no knowledge on "F" is available or reasonable, we use a nonparametric statistical model.

Nonparametric statistical model

A (nonparametric) statistical model is a triple:

(Ω, F, (F)_{F ∈ D})

Usually some restrictions on "F" are made, so that "F" is not as general as possible but belongs to a set "D"
of probability distributions.

1.7 Nonparametric statistical models


Remark: non-parametric models differ from parametric models in that the model structure is not specified a
priori but is instead determined from data. The term "non-parametric" is not meant to imply that such models
completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.
For example a histogram is a simple nonparametric estimate of a probability distribution. We use it to plot and
estimate the density of a random variable: the histogram is plotted without any assumption of the form of the
distribution:

Density estimation based on 500 sampled data (temperature in 500 weather stations) [a very naive nonparametric
density estimation: the histogram]:

12.07 7.70 10.62 13.57 8.07 12.23 8.61 12.25 18.17 11.10. . .

We use a Gaussian model: the estimated parameters are

x̄ = 11.253 s = 3.503

Parametric density estimation:

A bit more refined nonparametric density estimation:
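The corresponding figures are not reproduced here; a minimal R sketch of how such plots could be obtained, assuming the 500 temperatures are stored in a vector named temp (an illustrative name):

hist(temp, freq=FALSE)                            # naive nonparametric estimate: the histogram
curve(dnorm(x, mean=11.253, sd=3.503), add=TRUE)  # parametric (Gaussian) density estimate
lines(density(temp), lty=2)                       # a more refined nonparametric estimate (kernel density)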

1.8 Statistical models in mathematical statistics
Before analyzing situations where several variables (response and predictors) are involved we need to deepen our
mathematical knowledge about probability and statistics, and we start by considering only one variable at a time.
We start here with toy examples to fix the mathematical background. We will come back to the problems with
several variables in the second part of the lectures.

1.8.1 Population, sampling and sampling schemes

Population is the (theoretical) set of all individuals (or experimental units) with given properties. The population
is the target set of individuals of a statistical analysis. For instance, the following are examples of populations:

• the set of all inhabitants of Genova


• the set of all students of our Department
• the set of all people with a given disease
• ...

1.8.1.1 Samples

Sample

A sample is a subset of a given population.

The analysis of the whole population can be too expensive, or too slow, or even impossible. Statistical tools allow
us to understand some features of a population through the inspection of a sample, while controlling the sampling
variability. This is the basic principle of inference: we analyze the sample in order to generalize the information
we have on the sample to the whole population, under the assumption that the sample is a good representation of
the entire population.

1.8.1.2 Sampling schemes How to choose a sample? The sampling scheme affects the quality of the results
and therefore the choice of the sampling scheme must be considered with great care. Usually, one has to find a
tradeoff between two opposite requirements:

• To have an easy sampling scheme


• To have a sampling scheme which minimizes the sampling error

Sampling schemes are usually divided into two broad classes:

• Probability sampling schemes


• Nonprobability sampling schemes

Among probability sampling schemes (techniques based on randomization):

• Simple random sampling (without replacement): the elements of the sample are selected like the
numbers of a lottery.

• Simple random sampling (with replacement): the elements of the sample are selected like the num-
bers of a lottery, but each experimental unit can be selected more than once. Although this seems to be
a poor sampling scheme, it leads to mathematically easy objects (densities, likelihoods, distributions of the
estimators. . . ), so it is commonly used in the theory (see the R sketch after this list).

• Stratified sampling: the elements of the sample are chosen in order to reflect some major features of the
population (remember our discussion on "controlling for confounders"): the units are sampled proportionally
from each group.

• Systematic sampling: systematic sampling (also known as interval sampling) relies on arranging the study
population according to some ordering scheme and then selecting elements at regular intervals through that
ordered list.

• Cluster sampling: The sample is formed by clusters in order to speed up the data collection phase. In some
cases, cluster sampling has a two-stage procedure if a further sampling scheme is applied to each cluster.
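As an illustration of the first two schemes, simple random sampling can be mimicked with R's sample() function (the population below is just the integers from 1 to 1000, chosen only for illustration):

population = 1:1000
sample(population, 10)                # simple random sampling without replacement
sample(population, 10, replace=TRUE)  # simple random sampling with replacement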

1.8.1.3 Cluster sampling Among nonprobability sampling schemes:

• Accidental sampling (sometimes known as grab, convenience or opportunity sampling) is a type of non-
probability sampling which involves the sample being drawn from that part of the population which is close
to hand.
• Voluntary sampling: the sample is selected on a voluntary basis. This is quite common in social media
surveys. It is difficult to make inference from this sample because it may not represent the total population.
Often, volunteers have a strong interest in the main topic of the survey.

1.9 Aims of statistical inference


Statistical inference deals with the:

• definition of a sample from a population

• analysis of the sample


• generalization of the results from the sample to the whole population

Remark: in our theory, only samples with independent random variables will be considered.

1.10 Point estimation


Let us consider a population where a random variable "X" of interest is defined. We assume that the random
variable X has a density (discrete or continuous) denoted with "fX ". The sample is a sequence of random vari-
ables "X1 , . . . Xn " i.i.d. [independent and identically distributed] from "fX ". The simplest technique is parametric
estimation, where the density "fX " has a fixed shape and the unknowns are the parameters of the density. For
instance, for continuous random variables, you can fix a normal distribution with unknown mean "µ" and variance
"σ 2 ".

1.11 Estimator and estimates


Let us call "θ" the (unknown) value of the parameter of interest.

Estimator
An estimator of the parameter "θ" is a function of the sample:

T = T (X1 , . . . , Xn )

Note that:

• The estimator "T" is a function of "X1 , . . . , Xn " and not (explicitly) of "θ"
• The estimator "T" is a random variable

When the data on the sample are available, i.e., when we know the actual values "x1 , . . . , xn ", we obtain an estimate
of the parameter "θ".

Estimate
The estimate is a number:
θ̂ = t = t(x1 , . . . , xn )

1.12 Sample mean
To estimate the mean "µ" of a quantitative random variable "X" based on a sample "X1, . . . , Xn" we use the sample
mean, defined as:

X̄ = (X1 + · · · + Xn)/n = (1/n) Σ_{i=1}^n Xi

1.13 Unbiasedness - Consistency


We know that:

E(X̄) = (1/n) E(X1 + · · · + Xn) = (1/n) (E(X1) + · · · + E(Xn)) = (1/n) nµ = µ

which means that the expected value of the sample mean is the population mean: we have the unbiasedness property.

VAR(X̄) = VAR((X1 + · · · + Xn)/n) = (1/n²) (VAR(X1) + · · · + VAR(Xn)) = (1/n²) nσ² = σ²/n

which tells us that the variance of the sample mean goes to "0" when "n" goes to infinity: we have the consistency
property:

lim_{n→∞} VAR(X̄) = 0

For a Gaussian random variable X we have also:

X̄ ∼ N(µ, σ²/n)

that is, the sample mean is again a Gaussian random variable with expected value "µ" and variance "σ²/n".
Here are the plots of the densities of the sample mean for sample sizes "n = 2, 8, 32" (true mean "µ = 5").
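The figure is not reproduced here; a minimal R sketch that draws comparable densities of X̄ ∼ N(µ, σ²/n) for n = 2, 8, 32 (true mean µ = 5; the value σ = 2 is an assumption made only for illustration, since the original plot does not state it):

mu = 5; sigma = 2                    # sigma chosen only for illustration
curve(dnorm(x, mu, sigma/sqrt(2)), from=0, to=10, ylab="density")
curve(dnorm(x, mu, sigma/sqrt(8)), add=TRUE, lty=2)
curve(dnorm(x, mu, sigma/sqrt(32)), add=TRUE, lty=3)
# the density of the sample mean concentrates around mu = 5 as n grows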

1.13.1 Unbiased estimators and consistency

Unbiased estimator
An estimator "T" is an unbiased estimator of a parameter "θ" if its expected value is equal to the parameter
itself:
E(T ) = θ ∀θ

The mean square error of "T" is defined as:

MSE(T) = E[(T − θ)²]

and it is equal to the variance "VAR(T)" for unbiased estimators. The rule for the MSE is "the lower the better":

Consistent estimator
An estimator is consistent if:
lim_{n→∞} VAR(Tn) = 0

1.13.2 Estimation of the mean

The estimator "X̄" (sample mean) is:

• Unbiased
• Consistent

for the population mean, whatever the underlying distribution, provided that the mean and variance of "X" exist.

1.13.3 Confidence intervals

Confidence interval

A confidence interval (CI) for a parameter "θ" with a confidence level "1 − α ∈ (0, 1)" is a random interval
"(A, B)", computed from the sample, such that:

P(θ ∈ (A, B)) = 1 − α

From the definition one easily obtains:

P(θ ∉ (A, B)) = α

and thus "α" is the probability of error. The default value for "α" is 5% (sometimes "α = 10%" or "α = 1%" is used
too).

1.13.4 CI for the mean of a normal distribution

Let "X1 , . . . , Xn " be a sample of Gaussian random variables with distribution "N (µ, σ 2 )" (both parameters are
unknown). In such a case, the variance is estimated with the sample variance:

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

and:

T = (X̄ − µ)/(S/√n)

follows a Student’s "t" distribution with "(n − 1)" degrees of freedom.

It is easy to derive the expression of the CI for the mean:

1 − α = P(−tα/2 < T < tα/2)
      = P(−tα/2 < √n(X̄ − µ)/S < tα/2)
      = . . .
      = P(X̄ − tα/2 S/√n < µ < X̄ + tα/2 S/√n)

thus

CI = (X̄ − tα/2 S/√n, X̄ + tα/2 S/√n)

Exercise - Confidence Interval for the mean


Let us suppose we have collected data on a sample of size "12", recording the scores at the final exam:
23 30 30 29 28 18 21 22 18 27 28 30

Under the normality assumption, we compute:

x̄ = 25.33 s2 = 21.70

and therefore the 95% confidence interval for the mean is:

CI = (22.37, 28.29)

where the relevant quantile of the "t" distribution is

> qt(0.975,11)
[1] 2.200985
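The same interval can be obtained directly in R; a minimal sketch using the scores above:

x = c(23, 30, 30, 29, 28, 18, 21, 22, 18, 27, 28, 30)
mean(x) + c(-1, 1) * qt(0.975, 11) * sd(x)/sqrt(12)   # (22.37, 28.29)
# the same interval is returned by t.test(x)$conf.int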

1.14 Sum up
Our main goal is the estimation of parameters: the basic idea is that we have a population "Ω" (a large set of
individual statistical units). Making inference about a population means that we pick a sample and on that sample
we perform data analysis. Then we generalize the data analysis we did on the sample to the whole population
(inference).
"Parameter estimation" means that we assume that our data has a given particular analytical shape (for instance
a Gaussian distribution), and that the only "unknown" is the parameter (or parameters). "Parametric statistics"
means that we fix "f " and that θ is unknown.
How to measure the performance of an estimator:

• Unbiasedness of the estimator: the expected value of the estimator is equal to the parameter. The estimator
is centered around the parameter:
E(T ) = θ

• Consistency of the estimator: the MSE (mean square error) is equal to:

MSE = E((T − θ)²)

and we want to pick an estimator with the lowest possible MSE. When the sample size goes to infinity, n → ∞, the MSE (variance) goes to zero:

lim_{n→∞} VAR(Tn) = 0

• Confidence interval: an easy way to put these two pieces of information together is to use a CI for
the parameter: it gives information about both the position and the precision. A "Confidence Interval" provides,
at the same time:
– A point estimate
– The precision of the estimate

The definition for a general parameter, CI with level "(1 − α)", is an interval "(A, B)" such that:

P(θ ∈ (A, B)) = 1 − α

CI for the mean of a normal distribution:

X̄ ± tα/2 S/√n
we see that the interval is centered around the estimate (sample mean): the lower and the upper limits depend on
the precision of the estimate.

1.15 Testing statistical hypotheses
A statistical test is a decision rule. It is a completely different way of doing inference compared with confidence
intervals: in the end we have two opposite options, "accept" or "reject" a hypothesis.

1. In the first stage we state a hypothesis about the value of the parameter (or the distribution) under investigation

2. In the second stage we collect the data on a sample and do inference

3. Based on the data we collected, we decide whether the hypothesis can be accepted or rejected.

1.15.1 Hypotheses

There are two hypotheses in a test:

• Null hypothesis "H0 "

• Alternative hypothesis "H1 "

Example - statistical hypotheses (I)

We want to check if the mean score in an exam is higher than the historical value "23.4", after the imple-
mentation of new online teaching material.

H0 : µ ≤ 23.4 H1 : µ > 23.4

or
H0 : µ = 23.4 H1 : µ ̸= 23.4
The two formulations don't have exactly the same meaning, even if, from the mathematical point of view,
both can be used. Since we want to check whether the online material has increased the mean score of
the class, it is better to use the first statement.

Remember that when we set up a test, our objective (what we aim for) is to obtain the result of accepting "H1".

1.15.2 Role of the hypotheses

A statistical test is conservative: when constructed for a given nominal significance level, the true probability of
incorrectly rejecting the null hypothesis is never greater than the nominal level. We take "H0 " unless the data are
strongly in support of "H1 ". The statement to be checked is usually placed as the alternative hypothesis.
Statistical tests are very useful when dealing with regression models (testing the significance of variables).

Example - statistical hypotheses (II)

We want to check if the mean score in an exam is higher than the historical value "23.4", after the imple-
mentation of new online teaching material. The correct hypotheses here are:

H0 : µ ≤ 23.4 H1 : µ > 23.4

1.15.3 Level of the test

Any testing procedure has two possible errors:

• We reject "H0 " when "H0 " is true (Type I error)


• We accept "H0" when "H0" is false (Type II error)

The test is conservative, so we fix the probability of the Type I error (rejecting "H0") [of course, if we fix the Type I error we
cannot also fix the Type II error]:

α = PH0 (reject H0)

We fix the Type I error since we first set the value of "α" (we control the probability of a false positive) [a false positive
means that we introduce into the model a non-significant regressor]. Remember that these two errors compensate
each other: if we reduce one type of error we increase the other, as the two kinds of errors are related.

1.15.4 One-tailed and two-tailed tests

Remember
For a composite "H0" we can reduce it to a simple one by taking the value in "H0" nearest to "H1".

Thus, there are three possible settings:

• One-tailed left test


H0 : µ = µ0 H1 : µ < µ0

• One-tailed right test


H0 : µ = µ0 H1 : µ > µ0

• Two-tailed test
H0 : µ = µ0 H1 : µ ̸= µ0

1.15.5 Test statistic

The decision rule is given by a test statistic, which is a function of the sample:

Test statistic
A test statistic is a function "T" dependent on the sample "X1 , . . . , Xn " and the parameter "θ". The
distribution of "T" must be completely known "under H0 " (so when it’s true).

Thus:
T = T (X1 , . . . , Xn , θ)
Remark: note that "T" is not in general an estimator of the parameter "θ".

In the case of the mean of normal distributions we have a sample "X1 , . . . , Xn " from "N (µ, σ 2 )" with both "µ"
and "σ 2 " unknown.
H0 : µ = µ0 = 23.4 H1 : µ > 23.4
we could use the sample mean "X̄" but the distribution of "X̄" under "H0" (so when it is true) is:

X̄ ∼ N(µ0, σ²/n)

and "σ 2 " isn’t known (in general). But a good choice is:

X̄ − µ0
T = S/√n
∼ t(n−1)

This is why a statistical test is "easier" than a confidence interval. Here we have no "unknowns": we don’t have to "come
back" to "µ", as we can directly apply the formula to decide whether "H0" can be accepted or not.

1.15.6 Rejection region

The philosophy of the test statistic is as follows: if the observed value is "sufficiently far" from "H0" in the
direction of "H1", then we reject the null hypothesis; otherwise we don’t reject "H0". The possible values of T are
divided into two subsets:

• Rejection region [its form depends on the type of the alternative hypothesis we set in our test]
• Acceptance region (or better, a non-rejection region)

For scalar parameters such as the mean of a normal distribution we have three possible types of rejection regions:

• R = (−∞, a) for one tailed left tests

• R = (b, +∞) for one tailed right tests


• R = (−∞, a) ∪ (b, +∞) for two-tailed tests

The actual critical values are determined by:

PH0 (T ∈ R) = α

For the Student’s t test the critical values can be found on the Student’s t tables (a = −tα/2 , b = +tα/2 ).

Exercise - t-test (test statistic)

In our previous example


H0 : µ = µ0 = 23.4 H1 : µ > 23.4
suppose that on a sample of size 12 we observe the following scores:

23 30 30 29 28 18 21 22 18 27 28 30
We have "x̄ = 25.33", "s2 = 21.70" and for a one-tailed right test (level 5%):

R = (1.7959; +∞)

Since "t = 1.4378" (T = X̄−µ


S/√n ), we cannot reject "H0 ": there is no enough evidence against "H0 ".
0

Notice that the sample mean records a great increase (almost 2 points), but this is still not enough to accept the
alternative hypothesis. Probably the "problem" here is the small sample size, as we have very few observations:
with more observations the result might have become significant and we might have rejected the null hypothesis.
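A minimal R sketch reproducing this computation (observed test statistic and critical value) from the scores above:

x = c(23, 30, 30, 29, 28, 18, 21, 22, 18, 27, 28, 30)
t_obs = (mean(x) - 23.4) / (sd(x)/sqrt(length(x)))  # 1.4378
qt(0.95, df=11)                                     # critical value 1.7959
t_obs > qt(0.95, df=11)                             # FALSE: t_obs is not in the rejection region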

1.15.7 Large sample theory for the mean

The sample mean parameter has a special property established through the Central Limit Theorem:

CLT - Central Limit Theorem

Given a sample "X1 , . . . , Xn " i.i.d. from a distribution with finite mean "µ" and variance "σ 2 ", we have:

X̄ − µ
σ/√n
−→ N (0, 1)

Thus, for large "n", the distribution of the sample mean is approximately normal. We will come back later on the
Central Limit Theorem and its usefulness for statistical models.
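A minimal simulation sketch in R illustrating the CLT; here the data are drawn from a (skewed) exponential distribution with mean and standard deviation 1, chosen only for illustration:

set.seed(1)
n = 50
xbar = replicate(5000, mean(rexp(n, rate=1)))  # 5000 sample means (mu = sigma = 1)
z = (xbar - 1) / (1/sqrt(n))                   # standardized sample means
hist(z, freq=FALSE, breaks=40)
curve(dnorm(x), add=TRUE)                      # close to the standard normal density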

1.15.8 p-value

Statistical software does not usually compute the rejection region, but the "p-value" instead: we take the distribution of the
test statistic, we locate the observed value of the statistic inside that distribution, and we compute the probability of obtaining,
under "H0", values of the test statistic at least as extreme as the one actually observed.

P-value
The "p-value" is the probability of obtaining under "H0 " test results at least as extreme as the results actually
observed

In the case of a one-tailed test, the standard theory was about selecting a critical value "tα" in order to determine
the regions of acceptance and rejection. The p-value works differently: it does not compute "tα" but instead
computes the probability, under "H0", of values of the test statistic beyond the observed one in the direction of "H1".
The practical rule is:

• if the "p-value" is less than "α", then reject "H0 "


• otherwise, accept "H0 "

There are two main advantages when working with the p-value instead of the rejection region and test-statistics:

• First of all there are many ways to compute the same test (we can use several different test statistics
with different rejection regions): we don’t need to define a priori what kind of test we are going to adopt as
we will get the same p-value.
• We don’t need to define a priori the "α": the p-value is computed without fixing any probability of error,
so we can compute it and then check if it is less or greater than the standard thresholds.

This means that we can compute the "p-value" without saying what kind of test we use and without defining the
level of the test: it is completely independent of these choices.

Exercise - t-test (test statistic)

In our example we use the "t−test" [the test for the mean of a normal distribution in all possible combinations]
where we set:
• "X" which is the data we provide (in this case our 12 scores)

• "µ = 23.4" which represents our selected null hypothesis


• "alternative = ”greater”" is our selected alternative hypothesis (in this case we have a one-tailed right
test)
As we said before here we don’t need to specify any "α":

> x=c(23, 30 ,30, 29, 28, 18, 21 ,22 ,18, 27, 28, 30)
> t.test(x,mu=23.4,alternative="greater")

One Sample t-test


data: x
t = 1.4378, df = 11, p-value = 0.08916
alternative hypothesis: true mean is greater than 23.4
95 percent confidence interval:
22.9185 Inf
sample estimates:
mean of x
25.33333

we see we obtain:

• "t = 1.4378" which represents the test statistic (which is the same we previously computed)
• "df = 11" which represents the degrees of freedom ("n − 1")
• "p − value = 0.08916" which represents the p-value

Small p-values correspond to significant tests, while large p-values do not. As our p-value is greater than "α", our
test is not significant and so we accept the null hypothesis "H0". Another advantage of the p-value is that we can
detect borderline situations (where different values of "α" would make the difference between accepting
and rejecting): this happens when the p-value is very close to the value of "α".

2 Discrete random variables
2.1 Why random variables
Consider a random experiment modeled by a sample space "Ω": remember that the sample space is the set of
all possible outcomes in some experiment. Random variables, like probabilities, originated in gambling and
therefore some terminology comes from gambling. The actual value of a random variable is determined by the
sample point "ω ∈ Ω" that prevails, when the underlying experiment is conducted. We cannot know a priori
the value of the random variable, because we do not know a priori which sample point "ω" will prevail when the
experiment is conducted. We try to understand the behavior of a random variable by analyzing the probability
structure of that underlying random experiment. The typical random experiment in statistics is sampling.

2.2 Random variables


Remember

A random variable (r.v.) translates each possible outcome into a real number, so mathematically it is a
function:
X : Ω −→ R

When the image of "X" is finite or countably infinite, the r.v. is a discrete random variable. For example we
can think about:

• The outcome of a die roll [finitely many values]
• The number of Heads when flipping a coin 20 times [finitely many values]
• The number of defects from a production line [theoretically, infinitely many values]
• The number of children per family [theoretically, infinitely many values]
• ...

2.3 Expected value and variance


• The density of a discrete r.v. "X" is simply the computation of the probability of each possible value:

fX(x) = P(X = x)

• The expected value is the sum of all possible values, each multiplied by its probability:

E(X) = Σ_x x fX(x)

• The variance is:

σ²X = VAR(X) = Σ_x (x − E(X))² fX(x)

Notice that, as we compute squared values, the variance is never negative.

• The standard deviation is:

σX = √VAR(X)

2.4 Cumulative distribution function
The cumulative distribution function takes any real number "x" and computes the probability of values lower
than or equal to "x". The cdf has two main properties:

• The definition is the same for discrete and continuous r.v.


• This function is very useful, especially for continuous r.v., as it allows us to compute probabilities without using
integrals.

The cumulative distribution function (cdf) is defined by:

FX(x) = P(X ≤ x) = Σ_{t≤x} fX(t),    x ∈ R

Remark: notice that "FX " is defined for all "x ∈ R".

Properties of the cumulative distribution function (cdf)

• 0 ≤ FX(x) ≤ 1
• "FX" is non-decreasing: as we move towards "+∞", the value can never decrease
• lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1

• "FX " is right-continuous

2.4.1 Example in Software R

Let us define a r.v. "X" with density given by

x 1 2 3 4 5 6 7
fX(x) 0.1 0.3 0.2 0.1 0.1 0.1 0.1

Exercise - cumulative distribution function (cdf)

Use "R" to compute the expected value and the variance.

x=c(1,2,3,4,5,6,7) #vectors of values#


p=c(0.1,0.3,0.2,0.1,0.1,0.1,0.1) #vector of the probabilities#

ev= sum(x*p) #expected value#


vr= sum((x-ev)^2*p) #variance#
sd=sqrt(vr) #standard deviation#

Fx=cumsum(p) #cumulative sum (cdf)#

Exercise - barplot cumulative distribution function (cdf)

To plot the density and the cdf use for instance:

barplot(p,names.arg=x)
plot(x,Fx,pch=19)

#OR ALTERNATIVELY#
plot(stepfun(x,c(0,Fx)))

2.5 Bernoulli random variables


Consider a binary r.v. and denote the two outcomes with "1" ("success") and "0" ("failure"):

Bernoulli distribution

A r.v. "X" follows the Bernoulli distribution with parameter "p ∈ (0, 1)" if its density is:

fX (0) = 1 − p fX (1) = p

and we write:
X ∼ Bern(p)
The expected value and variance are indeed:

E(X) = p V AR(X) = p(1 − p)
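A minimal R sketch: Bernoulli draws can be generated as binomial draws with size 1 (the value p = 0.3 is chosen only for illustration):

set.seed(1)
x = rbinom(10000, size=1, prob=0.3)  # 10000 Bernoulli(0.3) draws
mean(x)                              # close to E(X) = p = 0.3
var(x)                               # close to VAR(X) = p(1-p) = 0.21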

2.5.1 Bernoulli scheme

Let us consider a sequence of binary random variables: we take an experiment with two possible outcomes
and we repeat it, obtaining a sequence

X1, X2, . . . , Xn, . . .

of i.i.d. Bernoulli random variables (independent [the experiments don’t influence each other] and identically
distributed [we repeat the exact same experiment]). Such a sequence of i.i.d. random variables is named a
Bernoulli scheme.
Questions:

1. What is the probability of "x" successes in the first "n" trials? [Binomial distribution]
2. What is the probability that the first success appears after "x" failures? [Geometric distribution]
3. What is the probability that the r -th success appears after "x" failures? [Negative binomial distribution]

2.6 Binomial random variables
To answer the first question we consider a Bernoulli scheme:

X1 , X2 , . . . , Xn , . . .

where the number of successes in the first "n" trials is:

X = X1 + · · · + Xn

the probability of observing "x" successes is:


 
f_X(x) = P(X = x) = C(n, x) p^x (1 − p)^(n−x)

where

• "px " is the probability that the first "x" trials are 1’s
• "(1 − p)n−x " is the probability that the remaining "n − x" trials are 0’s
 
n
• " " is the number of combinations of "x" successes within the "n" trials.
x

Binomial distribution
A discrete r.v. "X" has binomial distribution with parameters "n" and "p" if its density is:
 
f_X(x) = C(n, x) p^x (1 − p)^(n−x)    x ∈ {0, . . . , n}

and we write:
X ∼ Bin(n, p)
The two parameters are:
• the number of trials "n" (integer "n ≥ 1")
• the probability of success at each trial "p ∈ (0, 1)"
The expected value and the variance are indeed:

E(X) = np V AR(X) = np(1 − p)

Remember that the binomial random variable "X" is simply the sum of "n" i.i.d. Bernoulli r.v.’s.

Exercise - Binomial random variable


A questionnaire is formed by 10 questions with 3 choices each. If we answer at random, what is the
probability of correctly answering at least 9 questions?

X ∼ Bin(10, 1/3)

We need "P(X ≥ 9)":

P(X = 9) + P(X = 10) = C(10, 9) (1/3)^9 (2/3)^1 + C(10, 10) (1/3)^10 (2/3)^0 ≈ 0.0003556

Software R for the binomial distribution provides 4 functions:

• dbinom: density

• pbinom: cumulative distribution function


• qbinom: quantile function (pseudo-inverse cdf)
• rbinom: random sampling (random generator)

The previous computation (where we have "X", "n" and "p" respectively as parameters) is

dbinom(9,10,1/3) + dbinom(10,10,1/3)

#OR ALTERNATIVELY#
1 - pbinom(8,10,1/3)

Exercise - binomial distribution

Use the following Software R code to plot the densities of different binomial r.v.’s by changing "n" and/or
"p".

x = 0:20
y = dbinom(x,20,0.5)
plot(x,y,pch=15)

#OR#
barplot(y,names.arg=x)

2.7 Geometric random variables
We now answer the second question: let us consider the number of failures "X" until the first success in a Bernoulli
scheme with parameter "p" (this is the waiting time of the first success). For all "x ≥ 0", we have:

P(X = x) = p(1 − p)x

Geometric distribution
A discrete r.v. "X" has geometric distribution with parameter "p" if its density is:

fX (x) = P(X = x) = p(1 − p)x x≥0

and we write:
X ∼ Geom(p)
The expected value and the variance are indeed:
E(X) = (1 − p)/p    VAR(X) = (1 − p)/p^2

Property - Lack of memory

It means that, in a sequence of independent trials, the number of failures observed up to a given
time gives no information about the future: the classic example is the "lottery".
Let "X ∼ Geom(p)"; we have:

P(X = x + h | X ≥ h) = P(X = x)

We condition on "X ≥ h", which means that in the first "h" trials we know we had no
successes (all failures).

The memory-less property (also called the forgetfulness property) means that a given probability
distribution is independent of its history: any time may be marked down as time zero. If a probability
distribution has the memory-less property the likelihood of something happening in the future has no relation
to whether or not it has happened in the past. The history of the function is irrelevant to the future.
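
A quick numerical check of the lack-of-memory property with "dgeom" and "pgeom" (a sketch with arbitrary
values of "p", "x" and "h"; remember that "P(X ≥ h) = 1 − P(X ≤ h − 1)"):

p=1/6
x=3
h=4
dgeom(x+h,p)/(1-pgeom(h-1,p))   #P(X = x+h | X >= h)#
dgeom(x,p)                      #P(X = x): the same value#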

Notice that in most probability textbooks a slightly different definition of the geometric density is used, counting
the number of trials needed for the first success instead of the number of failures:

f_X(x) = p(1 − p)^(x−1)    x ≥ 1

Exercise - Geometric density - Cumulative distribution function

1 Plot the density of the "Geom(1/6)" (use "0 ≤ x ≤ 20")

x=0:20
y=dgeom(x,1/6) #density#
yc=pgeom(x,1/6) #cumulative distribution function#
par(mfrow=c(1,2))

plot(x,y,pch=19,main="Density")
plot(stepfun(x,c(0,yc)),lwd=2,main="CDF")
par(mfrow=c(1,1)) #to reset the plot window#

2 Plot the density of the "Geom(1/2)" (use "0 ≤ x ≤ 20") and the corresponding CDF’s

x=0:20
y=dgeom(x,1/2) #density#
yc=pgeom(x,1/2) #cumulative density distribution#
par(mfrow=c(1,2))

plot(x,y,pch=19,main="Density")
plot(stepfun(x,c(0,yc)),lwd=2,main="CDF")
par(mfrow=c(1,1)) #to reset the plot window#

Exercise - Geometric density - Cumulative distribution function

Toss a die until it gives "1" (this is a Bernoulli scheme with "p = 1/6").
• What is the probability that the game ends exactly at the sixth trial?
• What is the probability that the game ends within the first six trials?

With the R functions "dgeom" and "pgeom":

dgeom(5,1/6)
[1] 0.0669796

pgeom(5,1/6)
[1] 0.665102

2.8 Negative binomial random variables


Let us consider the last question about the Bernoulli scheme:

"what is the probability of "x" failures before the r-th success?"

We need:

• a success in the trial "r + x"


• "(r − 1)" successes in the first "r + x − 1" trials

Negative binomial distribution (Pascal random variable)

A discrete r.v. "X" has the negative binomial distribution with parameters "r" and "p" if its density is:

f_X(x) = P(X = x) = C(r + x − 1, r − 1) (1 − p)^x p^r    x ≥ 0

and we write:
X ∼ N Bin(r, p)
So the construction of the density of this r.v. is basically the same we performed for the geometric distribu-
tion, but now we set the success as the "r-th" element.
The expected value and the variance are indeed:

E(X) = r(1 − p)/p    VAR(X) = r(1 − p)/p^2

Why the expression "negative binomial"?

C(r + x − 1, x) = [(x + r − 1) · · · (r)] / x! = (−1)^x [(−r)(−r − 1)(−r − 2) · · · (−r − x + 1)] / x! = (−1)^x C(−r, x)

Waiting for the "r-th" success is just like waiting for the first one, then "lack-of-memory" and then starting over
with a new process and so on. The negative binomial distribution is simply the sum of "r" geometric distributions
and, based on the lack-of-memory, the sum is the sum of independent random variables.
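
A small simulation sketch (with arbitrarily chosen values) of this fact: summing "r" independent geometric
waiting times reproduces the "N Bin(r, p)" distribution.

set.seed(1)
r=3
p=0.5
sims=colSums(matrix(rgeom(r*10000,p),nrow=r))  #10000 sums of r geometric r.v.'s#
mean(sims)                                     #close to r*(1-p)/p = 3#
var(sims)                                      #close to r*(1-p)/p^2 = 6#
dnbinom(0:4,r,p)                               #compare with the empirical frequencies of 0:4#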

2.9 Generalization of "N Bin(r, p)" to real "r"
It is useful to express the factorial using the "Γ" function, which is a function that interpolates the values of the
factorial (we can extend the concept of factorial to every real positive number):

Gamma function
The Gamma function is defined by

Γ(s) = ∫_0^∞ t^(s−1) e^(−t) dt
for any positive real number "s > 0".

Notice that when "s" is integer:


Γ(s) = (s − 1)!
the factorial of "(s − 1)". It is possible to rewrite the density of "N Bin(r, p)" as:

f_X(x) = P(X = x) = [Γ(r + x) / (Γ(r) x!)] (1 − p)^x p^r    x ≥ 0

Remark: since "r" will be regarded as a dispersion parameter in a statistical model, it is crucial to have this
generalized definition available.
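
In Software R, "dnbinom" already accepts a non-integer "size" parameter, so we can check this generalized
density directly (a sketch with arbitrary values of "r" and "p"):

r=2.5
p=0.4
x=0:5
dnbinom(x,size=r,prob=p)                        #built-in density#
gamma(r+x)/(gamma(r)*factorial(x))*(1-p)^x*p^r  #same values from the Gamma-function formula#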

2.10 Poisson random variables


Poisson distribution
A discrete r.v. "X" has the Poisson distribution with parameter "λ > 0" if its density is:

f_X(x) = e^(−λ) λ^x / x!    x ≥ 0
and we write:
X ∼ P(λ)
The expected value and the variance are indeed:

E(X) = λ V AR(X) = λ

It is a very simple distribution as the expected value is equal to the variance (this can be either an advantage or a
disadvantage): since there is only one parameter, if you fix the expected value you also fix the variance.

Exercise- Poisson density - Poisson distribution

The annual number of earthquakes in Italy is a Poisson r.v. with parameter "λ = 4". Compute

1 The probability of observing no earthquakes in a year


2 The probability of observing more than 6 earthquakes in a year.
With the R functions "dpois" and "ppois":

dpois(0,4)
[1] 0.01831564

1-ppois(6,4)
[1] 0.110674

1-sum(dpois(0:6,4))
[1] 0.110674

Mimic the exercise on the geometric distribution to plot the densities of some Poisson distributions, for
instance "P(5)" and "P(2)".

2.10.1 Poisson is the limit of the binomial

The "Poisson distribution" is a good approximation of the "binomial distribution" when:

• "n" is large
• "p" is small

Theorem - Sequence of binomial random variables

Let "(Xn )n≥1 " be a sequence of binomial r.v.’s "Xn ∼ Bin(n, pn )" with parameters "n" and "pn ". We know
that if "npn → λ > 0" when "n → ∞" then:
lim_{n→∞} P(X_n = x) = e^(−λ) λ^x / x!

Exercise - Binomial and Poisson densities


Plot on the same graph the density of a binomial and its Poisson approximation. Try with different values
of "n" and "p".

n=30
p=0.1
x=0:n #vary if needed#
y1=dbinom(x,n,p)
lambda=n*p
y2=dpois(x,lambda)

plot(x,y1,col="red",pch=19)
points(x,y2,col="black",pch=22)

From the plot we can see that we obtained a good approximation even when "n" is not that large.

So the idea of the Poisson distribution is to have an approximation of the binomial distribution. This is valid as
"n → ∞" but it’s a reasonable approximation also for smaller "n".

2.10.2 Over-dispersed Poisson

While Poisson is the first choice to model count data, it has the constraint:

V AR(X) = E(X) = λ

The negative binomial distribution is especially useful for discrete data over an unbounded positive range whose
sample variance exceeds the sample mean. In such cases, the observations are over-dispersed with respect to a
Poisson distribution, for which the mean is equal to the variance. The two parameters "r" and "p" can be adjusted
to reach the desired mean and variance. The data are "over-dispersed" when the sample variance is greater than
the sample mean.
In the opposite case (under-dispersion), when the variance is less than the expected value, the negative binomial
cannot be used: since "p ∈ (0, 1)", its variance "r(1 − p)/p^2" is always larger than its mean "r(1 − p)/p". This is a
limitation, but in the majority of real data sets we only find over-dispersion (which we can model) and not
under-dispersion.

Exercise 1 - Poisson and negative binomial distributions

1 Prove that the parameters of the negative binomial density such that:

E(X) = 3 V AR(X) = 6

are indeed:
r=3 p = 0.5

We remember that the expected value and variance are indeed:

E(X) = r(1 − p)/p    VAR(X) = r(1 − p)/p^2

so we can easily compute "p" as the ratio:

E(X) / VAR(X) = [r(1 − p)/p] / [r(1 − p)/p^2] = p

2 Plot (in the same graph) the density of "P(3)" and of the negative binomial above, with the following
Software R code:

x=0:15
y1=dpois(x,3)
y2=dnbinom(x,3,0.5)

plot(x,y1)
points(x,y2,pch=16)

The plot of the density of "P(3)" and of the "N Bin(3, 0.5)" are:

The negative binomial (black dots) has a larger variance as we have a greater probability on the tail, while
the Poisson (white dots) is more concentrated around the mean.

Exercise 2 - Poisson and negative binomial distributions

1 Find the parameters of the negative binomial density such that:

E(X) = 10 V AR(X) = 14

We remember that the expected value and variance are indeed:

E(X) = r(1 − p)/p    VAR(X) = r(1 − p)/p^2

so we can easily compute "p" as the ratio:

p = E(X) / VAR(X) = 10/14 = 0.714286

and then "r = E(X) · p/(1 − p)". We then obtain:

p = 0.714286    r = 25

2 Plot (in the same graph) the density of "P(10)" and of the negative binomial above, with the R code
as above (use the x-range "0 ≤ x ≤ 20").

x=0:20
y1=dpois(x,10)
y2=dnbinom(x,25,0.714286)

plot(x,y1)
points(x,y2,pch=16)

3 Try with other values of mean and variance (for instance, try "E(X) = 10" and "V AR(X) = 100").

...

2.11 Mixture of two random variables


Mixture of two random variables
Given two r.v.’s "X1 " and "X2 " with density "fX1 " and "fX2 " respectively, a mixture of "X1 " and "X2 " is the
r.v. with density:
αfX1 (x) + (1 − α)fX2 (x)
with "α ∈ (0, 1)" being the mixture parameter.

Mixtures are used in statistics when, inside our population, we have sub-populations with different behaviors. The

best way to describe this kind of populations is by using the mixtures: this means that inside our populations we
have sub-classes with different distributions or with the same distribution but with different parameters. They also
model two-stage experiments.

2.12 Mixture of several random variables


The generalization to more than two variables is straightforward:

Mixture of several random variables


Given r.v.’s "X1 , . . . , Xk " with densities "fX1 , . . . , fXk ", a mixture of "X1 , . . . , Xk " is the r.v. with density

α1 fX1 (x) + · · · + αk fXk (x)

with "αj ∈ (0, 1)" and " j aj = 1".


P

We could for example take into consideration the mixture of two different Poisson distributions. In this case we
consider two populations with the same weight "α = 1/2":

Exercise - mixture of discrete random variables

Let "X ∼ P(5)" and "Y ∼ P(2)". The density of the mixture with "α = 12 " is:

1 −5 5x 1 2x
f (x) = e + e−2
2 x! 2 x!
1 Plot the mixture density above

x=0:20
y=1/2*exp(-5)*(5^x)/(factorial(x))+1/2*exp(-2)*(2^x)/(factorial(x))

plot(x,y)

2 Try with different parameters "λX " and "λY "

...

Exercise - Zero-inflated Poisson
In some cases, one component of the mixture is deterministic. This is for instance the case of the Zero-
Inflated Poisson. Zero-inflated means that we observe a Poisson distribution but we have an excess of zero.
The ZIP distribution is defined as the mixture:

f_X(x) = α δ_0(x) + (1 − α) f_Y(x)

where:
• "δ0 (x)" is a (non-random) zero
• "fY (x)" is a "P(λ)" distribution

The moments of the Zero-Inflated Poisson distribution are:


E(X) = (1 − α)λ = µ    VAR(X) = µ + [α/(1 − α)] µ^2
Try to compute the first two moments above using the definitions and plot the density of the Zero-Inflated
Poisson distribution in the cases:
• α = 0.1 λ=3
• α = 0.4 λ=6

Data of this kind are very common among count data. A real example of this distribution may be "car accidents":
car accidents follow a Poisson distribution plus an excess of zeros due to those who don’t drive cars.
Remember that, for example, with "α = 0.2" the probability of observing a zero is larger than "0.2", as it also
contains the probability that the Poisson component equals zero.
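
A possible Software R sketch for the first case ("α = 0.1", "λ = 3"); the domain "0:15" is an arbitrary truncation,
so the moments computed below are only approximate:

alpha=0.1
lambda=3
x=0:15
fzip=alpha*(x==0)+(1-alpha)*dpois(x,lambda)  #ZIP density: point mass at 0 plus Poisson#
plot(x,fzip,pch=19)
mu=sum(x*fzip)                               #approximately (1-alpha)*lambda = 2.7#
sum((x-mu)^2*fzip)                           #approximately mu + alpha/(1-alpha)*mu^2#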

3 Lab 1
3.1 Discrete random variables
• Poisson approximation to the binomial distribution
1. Fix "n = 50" and "p = 0.05". Plot on the same graph the density of the binomial "Bin(n, p)" and the
relevant approximating Poisson density [take "0 : 20" as domain of the density]. How much probability
is lost with this choice?
n=50 #number of obs#
x=0:20 #domain of the density#
p=0.05 #probability#
y1=dbinom(x,n,p) #binomial density 1#
plot(x,y1,col="red") #plot of the binomial density 1#
lambda=n*p #definition of the parameter lambda#
y2=dpois(x,lambda) #Poisson density 1#
points(x,y2,col="green") #plot of the Poisson#
1-sum(y1) #the probability lost: 1-sum of probabilities#

As you can see from the code the lost probability is a very small number: this means that we practically
have no more probability left in the range [21,50]. We can then reduce our domain to [0,20] to have a
better representation of our density. From the graphical point of view notice that beyond "9" we only obtain
the green points as the two distributions coincide (remember that this is a graphical approximation).
2. Vary the parameters "n" and "p", choosing a small "p" in order to improve the approximation.
n=500 #increased number of obs (x10)#
x=0:20
p=0.005 #decreased probability (:10)#
y1=dbinom(x,n,p)
plot(x,y1,col="red")
lambda=n*p
y2=dpois(x,lambda)
points(x,y2,col="green")
1-sum(y1)

Since we want to obtain a better approximation we increase the number of observations and we decrease
the probability. Indeed the approximation "Binomial to Poisson" is good as long as "n → ∞" and "p → 0".
In this case we try to replicate the example by multiplying and dividing for a "10" factor. From the plots
we notice that there’s a smaller difference in the densities (almost all the points graphically coincide).
Since the graphical feedback isn’t a great solution from a statistical point of view, the best way to assess
the goodness of the approximation is again to define a vector of the differences between the two densities.
For the Poisson distribution we know that the mean is equal to the variance (this isn’t a good property
when modeling discrete data).
3. When the differences become too small, define a vector "d" of difference and plot it (with barplot).
Observe that, for fixed "λ" the maximum difference decreases when "n" increases (and thus "p" decreases).
difference=y2-y1 #vector of difference (Poisson-Binomial)#
barplot(difference, names.arg=x)

When the approximation is very good the two plots are basically the same: in order to check the difference
we have to define a "difference" variable between the two densities. If we repeat the comparison with
the initial values of "n" and "p" we obtain the same plot (the same shape) but with roughly a tenfold
larger error (read the values on the "y" axis).
["names.arg": this parameter is a vector of names appearing under each bar in bar chart]
4. Do you find common behaviors in the plots obtained in item "3)" for different values of "n" and "p"? Can
you explain this? [Hint: compare the formulae of the two variances].
Notice from the plot that the binomial has a slightly lower variance with respect to the Poisson as red
points are a bit higher around the mean and lower over the tails. This happens because the two variances
are indeed:
Bin: VAR(X) = np(1 − p)    Pois: VAR(X) = np
The approximated Poisson is always an overestimation of the original variance of the binomial distribution
as it suppresses the "(1 − p)" factor.

COMPLETE EXERCISE
###1)
n=50 #number of obs#
x=0:20 #domain of the density#
p=0.05 #probability#

y1=dbinom(x,n,p) #binomial density 1#


plot(x,y1,col="red") #plot of the binomial density 1#

lambda=n*p #definition of the parameter lambda#


y2=dpois(x,lambda) #Poisson density 1#
points(x,y2,col="green") #plot of the Poisson#

1-sum(y1) #the probability lost: 1-sum of probabilities#

###2)
n=500 #increased number of obs#
x=0:20
p=0.005 #decreased probability#

y1=dbinom(x,n,p)
plot(x,y1,col="red")

lambda=n*p
y2=dpois(x,lambda)
points(x,y2,col="green")

1-sum(y1)

###3)
difference=y2-y1 #vector of difference (Poisson-Binomial)#
barplot(difference, names.arg=x)

• Poisson vs Negative binomial
1. In the file "discrete.txt" you find two variables "x" and "y" on a sample of 100 individuals. Import the
dataset in Software R.
In this case we can simply use the "import dataset" setting or we can also use the console command below
(of course remember to select the right directory):
data=read.table("E:/. . ./discrete.txt", header=T)
View(data)

2. Use "x". Use the barplot with the following code to have the distribution of the data in the range "0 : 20":

barplot(table(factor(x, levels = 0:20))/100)

Then use the function "points" to overlay the density of the best Poisson approximation, and the best
Negative binomial approximation (use different "pch" and/or different colors).
In this case we use the "attach()" function. By doing this the database is attached to the R search path.
This means that the database is searched by R when evaluating a variable, so objects in the database can
be accessed by simply giving their names. This means that if for example we have a dataframe "df" which
contains different variables (x1 , x2 , . . . , xn ), if we apply this function "attach(df) we obtain:

mean(x1 ) −→ mean(df $x1 )

so we don’t need to specify the name of the dataframe everytime we use a function. At the end of the
analysis we will of course use the "detach()" function. This workflow is useful just when operating with
one dataframe at the time as it becomes impossible to use when operating with several dataframes.
We then convert the variable "x" into a "factor": this allows us to perform graphical representations of
the data with labels also when we register "0" frequencies.
attach(data) #we attach the dataframe#

min(x) #find the minimum value of the "x"#


max(x) #find the maximum value of the "x"#
mean(x) #compute the mean#
var(x) #compute the variance#

barplot(table(factor(x,levels=0:20))/100)

lambda=mean(x)
points(dpois(0:20,lambda),pch=17,col="red")

p=mean(x)/var(x)
r=mean(x)*p/(1-p)
points(dnbinom(0:20,r,p),pch=19,col="blue")

From the values of "mean(x)" and "var(x)" we notice that our data suffer from some overdispersion: the
variance is significantly larger than the mean: from our theory the Negative Binomial should be a better
approximation of our data than the Poisson distribution.
This is the best approximation of our data using the Poisson distribution but as we can see from the
graphical point of view the 2 maxima don’t really correspond to each other.
As we can see the Negative Binomial distribution has a larger variance (flatter than the Binomial).

3. Repeat for "y"
We then repeat the same path for the "y":
mean(y)
var(y)

lambda=mean(y)
p=mean(y)/var(y)
r=mean(y)*p/(1-p)

barplot(table(factor(y,levels=0:20))/100)
points(dpois(0:20,lambda),pch=17,col="red")
points(dnbinom(0:20,r,p),pch=19,col="blue")

detach(data)

COMPLETE EXERCISE

###1)
data=read.table("E:/. . ./discrete.txt", header=T)
View(data)

###2)
attach(data) #we attach the dataframe#

min(x) #find the minimum value of the "x"#


max(x) #find the maximum value of the "x"#
mean(x) #compute the mean#
var(x) #compute the variance#

barplot(table(factor(x,levels=0:20))/100)

lambda=mean(x)
points(dpois(0:20,lambda),pch=17,col="red")

p=mean(x)/var(x)
r=mean(x)*p/(1-p)
points(dnbinom(0:20,r,p),pch=19,col="blue")

###3)
mean(y)
var(y)

lambda=mean(y)
p=mean(y)/var(y)
r=mean(y)*p/(1-p)

barplot(table(factor(y,levels=0:20))/100)
points(dpois(0:20,lambda),pch=17,col="red")
points(dnbinom(0:20,r,p),pch=19,col="blue")

detach(data)

4 Continuous random variables
4.1 Continuous random variables
Random variable in continuous case

The definition is exactly the same we previously saw: a random variable (r.v.) is a function:

X : Ω −→ R

The difference is in the image of "X": in the continuous case the image lies in a continuous set. When the
image of "X" is not countable (typically, the real line or a real interval) the r.v. is a continuous random
variable.

For example:

• The waiting time of some logistic processes (e.g. queues)


• The price of a stock
• The number of inhabitants of a country

Remark: notice that (last example) continuous random variables are used to model counts when numbers are
large.

4.1.1 Expected value and variance

The density of a continuous random variable "X" is a function:

fX : R −→ R

such that the probability of each interval is exactly the integral of the density between the extreme points, for all
"a, b ∈ R" (the area under the curve):
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx

So of course when moving from the discrete to the continuous case we swap from "sums" to "integrals".

• The expected value is:

E(X) = ∫_R x f_X(x) dx

• The variance is:

σ_X^2 = VAR(X) = ∫_R (x − E(X))^2 f_X(x) dx

• and the standard deviation is:

σ_X = √(VAR(X))
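
These integrals can be approximated numerically in Software R with the "integrate" function; a minimal sketch
using the standard normal density "dnorm" (introduced later in this chapter):

ev=integrate(function(x) x*dnorm(x),-Inf,Inf)$value         #E(X) = 0#
vr=integrate(function(x) (x-ev)^2*dnorm(x),-Inf,Inf)$value  #VAR(X) = 1#
ev
vr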

4.2 Cumulative distribution function
In the continuous case the cumulative distribution function (cdf) plays an even more important role. It is defined
by the probability of the left tail at each point:

F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} f_X(t) dt    x ∈ R

Remark: notice that "FX " is defined for all "x ∈ R".

Properties of the Cumulative random function

• 0 ≤ FX (x) ≤ 1

• "FX " is non-decreasing


• " lim FX (x) = 0" and " lim FX (x) = 1"
x→−∞ x→+∞

• "FX " is right-continuous

From the fundamental theorem of integral calculus (illustrated in the usual example with the density of the
standard Normal distribution):

P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx = F_X(b) − F_X(a)

This intuition translates into a property which says that the probability of an interval is computed from
the density with an integral. If we know the cumulative distribution function we can compute it: it is enough to
take the difference of the cumulative distribution functions in the upper and lower bounds (FX (b) − FX (a)). This
means that if the cumulative distribution function is available we don’t have to deal with integrals when computing
probabilities.

4.2.1 Quantiles and median

The quantile function is the (generalized) inverse of the cdf.

Definition
The quantile function of r.v. "X" is:

QX (p) = inf {x ∈ R : p ≤ FX (x)}

for all "0 < p < 1"

If "FX " is continuous and strictly monotone, then the definition simplifies to:
Q_X(p) = F_X^(−1)(p)

Here we have a graphical example:

Remember that the median is just the quantile of "0.5".
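
In Software R the quantile functions are the "q" functions; a quick sketch of the inverse relation with the cdf
(here using the normal distribution functions):

qnorm(0.5)      #the median of the N(0,1): 0#
q=qnorm(0.975)  #the 97.5% quantile: 1.959964#
pnorm(q)        #back to 0.975: Q_X is the inverse of F_X here#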

4.3 Uniform random variables


Definition - Uniform distribution

A continuous random variable "X" has uniform density distribution over the interval "[α, β]" if its density
is: (
1
for α ≤ x ≤ β
fX (x) = β−α
0 otherwise
and we write:
X ∼ U[α, β]
The expected value and variance are indeed:

E(X) = (α + β)/2    VAR(X) = (β − α)^2 / 12

Application

The uniform, and especially the "U[0, 1]", plays a central role in the framework of simulation. Let us suppose
we want to generate a random number with density "f_X". If the cdf "F_X" is known and "F_X^(−1)" can be
computed, then:

• Choose a random number "u" from a "U[0, 1]" distribution
• Compute "Q_X(u) = F_X^(−1)(u)"

[We will come back on this topic in the next lectures]
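
A minimal sketch of this inversion method for the exponential distribution (introduced in the next section),
whose quantile function is "Q(u) = −log(1 − u)/λ"; the sample size and "λ" below are arbitrary:

set.seed(1)
lambda=2
u=runif(10000)              #step 1: uniform random numbers#
x=-log(1-u)/lambda          #step 2: apply the quantile function#
hist(x,freq=FALSE,breaks=50)
curve(dexp(x,lambda),add=TRUE,lwd=2)  #target density for comparison#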

4.4 Exponential random variables

Definition
The continuous exponential distribution is the counterpart of the geometric random variable
in the discrete case, in the sense that it has the lack-of-memory property we previously defined for the
geometric distribution. A continuous random variable "X" has exponential distribution with parameter
"λ > 0" if its density is:
f_X(x) = λ e^(−λx)
when "x > 0" (and "0" otherwise). We write:

X ∼ ε(λ)

The expected value and variance are indeed:


E(X) = 1/λ    VAR(X) = 1/λ^2

Exercise - Exponential random variable

Let us consider a r.v. "X" representing the income distribution in a population. If "X" has exponential
distribution with mean 22 ("K$"), compute the probability that an income lies between "15" and "20".
The parameter of the exponential distribution is:
λ = 1/E(X) = 1/22

Using Software R we obtain:

pexp(20,1/22)-pexp(15,1/22)
[1] 0.1028064

remember that the prefix "p" gives the cumulative distribution function "cdf". The exponential distribution
has a simple expression of the cdf for "x ≥ 0":

FX (x) = 1 − e−λx

Plot in the same graph the densities of "ε(1)", "ε(1/2)" and "ε(1/5)". Use "[0, 10]" for the x-range.

x=seq(0,10,0.1)
y1=dexp(x,1)
y2=dexp(x,1/2)
y3=dexp(x,1/5)

plot(x,y1,type="l",lwd=3,col="cyan")
lines(x,y2,lwd=3,col="black")
lines(x,y3,lwd=3,col="magenta")

4.5 Exponential distribution
The exponential distribution is the continuous counterpart of the geometric distribution.

Remember

Let "X ∼ ε(λ)". We have:


P(X > x + t|X > t) = P(X > x)
for all "x, t > 0".

4.6 Gamma random variables


The reasoning behind the introduction of this distribution right after the exponential distribution is the same we
previously saw for the negative binomial distribution after the Poisson distribution. So the idea here is to define a
similar density but with two parameters in order to manage data where there’s a difference between the expected
value and the variance. The gamma distribution is a generalization of the exponential distribution as, if we consider
"α = 1", we obtain the exponential density with parameter "λ".
The origin of the Gamma random variables is in the sum of independent exponential distributions, but they
have been generalized and serve as the basic ingredient of several distributions.

Gamma distribution
A continuous r.v. "X" has Gamma distribution with parameters "α > 0" and "λ > 0" if its density is:

f_X(x) = λ^α x^(α−1) e^(−λx) / Γ(α)

when "x > 0" (and "0" otherwise). Here our "Γ" is again the function:
Z ∞
Γ(s) = ts−1 e−t dt
0

We write:
X ∼ Gamma(α, λ)
The expected value and variance are indeed:
E(X) = α/λ    VAR(X) = α/λ^2
We encounter a special case for "α = 1" as we have "Gamma(1, λ) = ε(λ)".
The sum of "n" independent exponential r.v.’s "E(λ)" is a "Gamma(n, λ)". However, the definition of the
Gamma distribution is given for a generic positive first parameter. The distribution "Gamma( n2 , 12 )" is a
well-known distribution, the Chi-square distribution "χ2 (n)".

To see what happens with different values of "α" and "λ", and to learn about a different parametrization of the
Gamma distribution, look at

[https://en.wikipedia.org/wiki/Gamma_distribution]

Exercise - Gamma random variables

Use (and adapt) the following R code to plot the densities of several gamma random variables:
1 For fixed "α = 2", varying "λ"
2 For fixed "λ = 1", varying "α"

3 Use the R help to learn about two different parametrizations of the Gamma r.v.’s

###Fixed shape###
x=seq(0,10,0.1)
y1=dgamma(x,shape=2,rate=1)
y2=dgamma(x,shape=2,rate=2)
y3=dgamma(x,shape=2,rate=0.4)

plot(x,y2,type="l",lwd=3,col="black",main="Fixed shape")
lines(x,y1,type="l",lwd=3,col="cyan")
lines(x,y3,type="l",lwd=3,col="magenta")

###Fixed rate###
y1=dgamma(x,shape=1,rate=1)
y2=dgamma(x,shape=2,rate=1)
y3=dgamma(x,shape=4,rate=1)

plot(x,y2,type="l",lwd=3,col="black",main="Fixed rate")
lines(x,y1,type="l",lwd=3,col="cyan")
lines(x,y3,type="l",lwd=3,col="magenta")

4.7 Gaussian (normal) random variables

Gaussian distribution

A continuous r.v. "X" has normal or Gaussian distribution with parameters "µ" and "σ 2 > 0" if its
density is:
f_X(x) = 1/(√(2π) σ) · e^(−(x − µ)^2 / (2σ^2))
for all "x ∈ R". We write:
X ∼ N (µ, σ 2 )
The expected value and variance are indeed:

E(X) = µ V AR(X) = σ 2

The two parameters thus account for the first two moments of the distribution. It is
important to notice that the graph of the density is symmetric around the expected value.

4.8 Gaussian random variables


The great popularity of the Gaussian distribution follows from three properties:

1. Closure with respect to the sum of independent random variables components:


If "X1 ∼ N (µ1 , σ12 )" and "X2 ∼ N (µ2 , σ22 )" are independent, then:

X1 + X2 ∼ N (µ1 + µ2 , σ12 + σ22 )

This means that if we know that the distribution of two independent random variables is normal, then we
also know the distribution of their sum.

2. Closure with respect to linear maps:


If "X1 ∼ N (µ, σ 2 )", then
Y = a + bX ∼ N (a + bµ, b2 σ 2 )
If we consider a normal random variable "X" and perform a linear transformation, we obtain again a normal
distribution.

3. Central limit theorem:
The Gaussian distribution is the limit of the (standardized) sample mean whatever the distribution of the data.
The sample mean of a normal sample is normally distributed for each "n", but for large values of "n" the sample
mean is approximately normally distributed whatever the distribution of our data, even for distributions with
properties far from the Gaussian.
If "X1 , X2 , . . . , Xn , . . . " is a sequence of i.i.d r.v.’s with "E(Xi ) = µ" and "V AR(Xi ) = σ 2 " not necessarily
normally distributed, then the sample mean:
X̄_n = (X_1 + · · · + X_n)/n

satisfies:

(X̄_n − µ) / (σ/√n) −→ N(0, 1)

when "n" goes to infinity.

4.8.1 Gaussian random variables in Software R

Let us consider the construction of the boxplot under the point of view of probability theory. One of the main
features of the boxplot is the detection of (univariate) outliers.

Boxplot for continuous random variables

Boxplots are the representation of continuous random variables based on the quartiles. The quartiles are
the quantiles which divide the statistical distribution into 4 parts, each of them having one fourth of the
distribution.

Graphically the outliers are the data points displayed with a special symbol: they are the data
points lying more than "1.5" times the interquartile range away from the box. These thresholds are calibrated on
the normal distribution: this means that the boxplot has been defined having in mind the Gaussian distribution.
The definition of outliers is optimal if our data come from a normal distribution.

Exercise - Boxplot for continuous random variables - computation of quantiles

Let’s now consider this example:


• Take a normal distribution, "N (0, 1)" to ease the computations. Use the R functions "pnorm" and
"qnorm" to compute the probability of the outlying zone.

• Repeat the same exercise with a "ε(1)" exponential distribution.


The corresponding plot would be the following "N (0, 1)" distribution summarized here:

Here we need to compute the probability of the outlying region. We compute the interquartile range by
considering "Q1" and "Q3" (the length of the box):

Q1=qnorm(0.25,0,1) #First quartile#


Q3=qnorm(0.75,0,1) #Third quartile#
IQR=Q3-Q1 #Interquartile range#
LL=Q1-1.5*IQR #Lower limit#
UR=Q3+1.5*IQR #Upper limit#

pnorm(LL,0,1) #Probability of the two tails where outliers are detected#

For the normal "N (0, 1)" distribution we obtain a "IQR = 1.34898". The lower and the upper limits are the
points "1.5" times the interquartile range far from the borders of the box. The probability of the lower zone
in this case is "0.003488302" (by symmetry we obtain the same thing on the other tail): so the probability of
the outline zone is approximately "0.7%". This means that if we generate for example 1’000 data we expect
to find 7 outliers.

4.9 Mixture of two random variables


Mixture of two random variables
Given two r.v.’s "X1 " and "X2 " with density "fX1 " and "fX2 " respectively, a mixture of "X1 " and "X2 " is
the r.v. with density
αfX1 (x) + (1 − α)fX2 (x)
with "α ∈ (0, 1)" being the mixture parameter.

The mixtures are used to model the presence of two (or more) sub-populations with different parameters. They
also model two-stage experiments.

4.10 Mixture of several random variables
The generalization to more than two variables is straightforward:

Mixture of several random variables


Given r.v.’s "X1 , . . . , Xk " with densities "fX1 , . . . , fXk ", a mixture of "X1 , . . . , Xk " is the r.v. with density

α1 fX1 (x) + · · · + αk fXk (x)

with "αj ∈ (0, 1)" and " j αj = 1".


P

Another important example is the mixture of normal distributions:

Mixture of normal distributions


A normal mixture is a density of the form:

α1 fX1 (x) + · · · + αk fXk (x)

with "αj ∈ (0, 1)" and " j αj = 1". Here the "fXk (x)" are normal distributions with their own parameters:
P

X_1 ∼ N(µ_1, σ_1^2), . . . , X_k ∼ N(µ_k, σ_k^2)
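
A possible sketch of a normal mixture density in Software R (weights and parameters chosen arbitrarily):

x=seq(-4,8,0.01)
y=0.3*dnorm(x,0,1)+0.7*dnorm(x,4,1)  #mixture of N(0,1) and N(4,1)#
plot(x,y,type="l",lwd=2)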

4.11 Kernel density estimation


The basis of the Kernel density estimation is to consider the data not as simple points over the real line but as
the centers of Gaussian densities. This is a way to regularize the representation: instead of having a histogram,
which is not continuous, we can use a sum of normal densities, which is a continuous function.

4.11.1 Problem statement

The problem is: if we want to estimate the density of a continuous variable "X" on the basis of a sample of size "n",
both the histogram and the empirical cumulative distribution function (cdf) are discrete in nature. This seems to
be bad, because we would like to have a continuous object.

4.11.2 Histogram

Let us give a deeper look into the histogram definition. If "f (x)" is a smooth density, we have that:
P(x − h/2 < X < x + h/2) = F(x + h/2) − F(x − h/2) = ∫_{x−h/2}^{x+h/2} f(z) dz ≃ h f(x)

where "h > 0" is a small (positive) scalar called "bin width". If "F (x)" were known, we could estimate "f (x)" using:

F ( x+h
2 ) − F( 2 )
x−h
fˆ(x) =
h
but this is not practical because if we know "F " we don’t need to estimate "f ". Different choices for "m" and "h"
will produce different estimates of "f (x)":

• Freedman and Diaconis method (1981)
We set "h = 2(IQR) n^(−1/3)" where "IQR" is the interquartile range. Then divide the range of the data by "h" to
determine "m".
• Sturges method (1929) [default in Software R’s "hist" function]
We set "m = [log2 (n) + 1]" where "[·]" denotes the ceiling function. It may over-smooth for non-normal data
(i.e. use too few bins).

Exercise - Freedman and Diaconis method// Sturges method

To see the differences use this code:


par(mfrow=c(1,2))
set.seed(1)
x = rnorm(20)

hist(x,main="Sturges")
hist(x,breaks="FD",main="FD")
par(mfrow=c(1,1))

Repeat with different sample sizes and different distributions.

4.11.3 Kernel functions

A kernel function "K" is a function such that:

• K(x) ≥ 0 for all x ∈ R (it is non-negative)
• K(x) = K(−x) for all x ∈ R (the kernel is symmetric around "0")
• ∫_{−∞}^{+∞} K(x) dx = 1 (the integral is "1")

In other words, "K" is a non-negative function that is symmetric around "0" and integrates to "1": therefore the
expected value is "0". A simple example is the uniform (or box) kernel:
K(x) = 1  if −1/2 ≤ x ≤ 1/2,    K(x) = 0  otherwise

Another popular kernel function is the Normal distribution kernel with "µ = 0" and fixed variance "σ^2":

K(x) = 1/(√(2π) σ) · e^(−x^2/(2σ^2))
We could also use a triangular kernel function:
K(x) = 1 − |x|  if x ∈ [−1, 1],    K(x) = 0  otherwise

So the important point here is that the kernel always has its mean at "0".

[Common kernel functions: http://upload.wikimedia.org/wikipedia/commons/4/47/Kernels.svg]

4.11.4 KDE construction

If "K" is a kernel function, then the scaled version of "K":


1
 
x
Kh (x) = K
h h
is also a kernel function, where "h > 0" is some positive scalar. We can center a scaled kernel function centered at
any data point "xi ", such as:
1
 
x − xi
Khxi (x) = K
h h
to create a kernel function that is symmetric around "xi ".

KDE (Kernel Density Estimate)

Given a random sample "X1 , . . . , Xn " i.i.d. from an unknown density "f (x)", the KDE of "f " is:
f̂(x) = (1/n) Σ_{i=1}^{n} K_h^{x_i}(x) = (1/(nh)) Σ_{i=1}^{n} K((x − x_i)/h)

where "h" is now referred to as the "bandwidth" (instead of "bin width").

An example with 6 points
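
A minimal sketch of this construction "by hand" with a Gaussian kernel on 6 hypothetical data points (the
bandwidth "h" is chosen arbitrarily); the last line overlays the built-in estimator with the same bandwidth:

xi=c(1.2,2.1,2.4,3.8,5.0,5.3)  #hypothetical data points#
h=0.6                          #bandwidth (chosen by hand)#
x=seq(-1,8,0.01)
fhat=sapply(x,function(t) mean(dnorm((t-xi)/h))/h)  #sum of scaled Gaussian kernels#
plot(x,fhat,type="l",lwd=2)
points(xi,rep(0,6),pch=19)
lines(density(xi,bw=h),col="red")  #built-in KDE for comparison#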

4.11.5 Bias-variance trade-off

In this example if we choose a very small "h", which is basically the standard deviation of the kernel, we would
obtain very concentrated kernels around each data point. The result of the sum of these kernels wouldn’t be
a "smooth" graph as in the previous figure: instead we would obtain 6 separate peaks (low bias, but high variance).
On the contrary, if we choose a very large "h" (a very large dispersion for each kernel) we would obtain 6 very flat
kernels, whose sum would be a density close to a flat line (low variance, but high bias). The optimal choice lies of
course between these two extreme situations.

Exercise - Kernel density estimation

Here we estimate the Kernel density:

x = rnorm(20)
plot(density(x))
plot(density(x,kernel="epanechnikov"))
plot(density(x,kernel="epanechnikov",adjust=0.5))

There’s a clever way to set the bandwidth: we can move away from the optimal bandwidth (enlarging or restricting
the kernel) by using the "adjust" parameter (Software R multiplies the optimal value by that parameter).
For small values we obtain a peak at every data point (not a useful density estimate, as we obtain several
local maxima), whereas for bigger values we get an over-smoothed estimate and we lose relevant information
about the data points (we may also lose the outliers) [try for example values of "0.1" and "2"].

4.12 Checking for normality
Our objective here is to check if our sample is (or not) a normal one. There are basically two methods: the first
one is based on a graphical comparison of the data and is called Q-Q plot (Quantile-Quantile plot). It is based
on the comparison of two quantiles over the horizontal (empirical quantiles of our distribution) and vertical
(corresponding quantile of the standard normal distribution) axes. If our data is normal then the two distributions
should coincide (the data follow a straight line).

4.12.1 Q-Q plots

Q-Q plots display the observed values against normally distributed data (the line represents perfect normality).

4.12.2 Statistical tests

Another way to check for normality is to use statistical tests. Statistical tests for normality are more precise since
actual probabilities are calculated. Tests for normality calculate the probability that the sample was drawn from a
normal population.

Hypotheses of the tests

The hypotheses used are:


• H0 : (the null hypothesis) [Normality of the data]
The sample data are not significantly different than a normal population: we have the normality
of the data.
• H1 : (the alternative hypothesis) [Non-normality of the data]
The sample data are significantly different than a normal population.

4.13 Anderson-Darling test
The Anderson-Darling statistic belongs to the class of quadratic EDF statistics (tests based on the Empirical
Distribution Function, not on the moments). Quadratic EDF statistics look at the "distance":
n ∫_{−∞}^{∞} (F_n(x) − F(x))^2 w(x) dF(x)

where we have:

• "w(x)" is a weight function which slightly reduces the weight on the extremes (technical factor to obtain a
better test)
• "Fn " is the Empirical Cumulative Distribution Function
• "F " is the Theoretical Cumulative Distribution Function under "H0 "

The Anderson-Darling test is based on the distance:


A^2 = n ∫_{−∞}^{∞} (F_n(x) − F(x))^2 / [F(x)(1 − F(x))] dF(x)

Thus, the Anderson-Darling distance places more weight on observations in the tails of the distribution.

The Anderson-Darling tests computes the distance (the area) between the Empirical Cumulative Dis-
tribution Function and the Theoretical Cumulative Distribution Function: the more the area, the less
the correspondence of the functions (the null hypothesis corresponds to an area equal to "0").

4.14 Anderson-Darling test of normality


To perform the Anderson-Darling test:

• Sort the data "x1 ≤ · · · ≤ xn "


• Standardize the data
x1 − x̄ xn − x̄
y1 = , . . . , yn =
s s
• Compute the test-statistic:
n
1X
A2 = −n − (2i − 1)(ln ϕ(Yi ) + ln(1 − ϕ(Yn+1−i )))
2 i=1

where "ϕ" is the cdf of the standard normal distribution "N (0, 1)"
• The critical values are in suitable tables
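
A sketch of the computation of "A^2" by hand on a simulated sample (assuming the standardization above with
"x̄" and "s"); the value can be compared with the statistic reported by "ad.test" from the "nortest" library:

set.seed(1)
x=rnorm(30)
y=sort((x-mean(x))/sd(x))  #sorted standardized data#
n=length(y)
i=1:n
A2=-n-mean((2*i-1)*(log(pnorm(y))+log(1-pnorm(rev(y)))))
A2
#library(nortest); ad.test(x)$statistic should give essentially the same value#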

Exercise - Anderson-Darling test statistic

Let us consider the following data (population density of the cities in a Mexican state). Import the data
from the file "mexvillages.txt":

Village PopulationDensity
Aranza 4.13
Corupo 4.53
SanLorenzo 4.69
Cheranatzicurin 4.76
Nahuatzen 4.77
Pomacuaran 4.96
Sevina 4.97
Arantepacua 5.00
Cocucho 5.04
Charapan 5.10
Comachuen 5.25
Pichataro 5.36
Quinceo 5.94
Nurio 6.06
Turicuaro 6.19
Urapicho 6.30
Capacuaro 7.73

Load (and install if necessary) the "nortest" library and use the Software R "help" in order to find how to
perform the Anderson-Darling test.
First of all we perform the Q-Q plot (quantile-quantile compared with the normal distribution).

hist(mexvillages$PopulationDensity)
View(mexvillages)
attach(mexvillages)

###Q-Q PLOT###
qqnorm(PopulationDensity)
qqline(PopulationDensity)

We then perform the Anderson-Darling normality test:

ad.test(PopulationDensity)

> Anderson-Darling normality test

data: PopulationDensity
A = 0.77368, p-value = 0.03543

The p-value is "3%" so, considering a standard "5%" threshold level, we reject the null hypothesis: this
means that we reject the normality (our data is non-normal). The non-normality problem in this example is
basically caused by the last outlier point we find near "7.5": if we remove that point we would probably accept
the null hypothesis.

5 Lab 2
5.1 Continuous random variables
• Normality test
1. Import the data in the file "cust_sat.xlsx". The data set contains the satisfaction score assigned to an
internet banking service by the customers, together with some information about the customers (place
of residence, gender, age). The score is given on a scale 0–1000, and it is the result of the sum of 100
scores on a scale 0–10.
cust_sat <- read_excel(". . ./cust_sat.xlsx")
View(cust_sat)

2. Compute the basic descriptive statistics for each variable.


#Dimension check: number of rows and columns#
dim(cust_sat)
attach(cust_sat)

#Compute the table for every categorical variables#


table(Country)
table(Country)/length(Country)*100 #percentages#

table(gender)
table(gender)/length(gender)*100 #percentages#

summary(age)
boxplot(age, horizontal=TRUE)

summary(score)
boxplot(score, horizontal=TRUE)

We obtain the basic information about all the variables.


3. Do the normality test to check the normality of the "age".
qqnorm(age)
qqline(age)
ad.test(age)

In this particular case we don’t actually notice a "normality" behavior as we have, particularly in both
tails, some variation from the straight line. If we then perform the Anderson-Darling normality test on

the "age" data we obtain a very small p-value, which leads us to the rejection of the null hypothesis: our
data is not normally distributed.
Anderson-Darling normality test
data: age
A = 1.7119, p-value = 0.0002166

4. Do the normality test to check the normality of the "score".


qqnorm(score)
qqline(score)
ad.test(score)

From the graphical point of view we can see again that we have some variations in the tails, mostly
on the left one. We then perform the Anderson-Darling normality test on the "score" data and again,
considering a threshold of "5%", we reject the null hypothesis: our data is not normally distributed.
Anderson-Darling normality test
data: score
A = 1.4235, p-value = 0.001108

5. The place of residence is a variable with three levels. Take the score separately for each place of residence
and do the normality tests.
ad.test(score[Country=="UK"])
ad.test(score[Country=="GER"])
ad.test(score[Country=="IT"])

So now we compute the same test but separately for each country. From the results we notice that we
accept the normality for all the three countries:
Anderson-Darling normality test
data: score[Country == "UK"]
A = 0.3574, p-value = 0.4519

Anderson-Darling normality test


data: score[Country == "GER"]
A = 0.32684, p-value = 0.5173

Anderson-Darling normality test


data: score[Country == "IT"]
A = 0.29728, p-value = 0.587

So the question now is: why does the full sample fail the normality test while its 3 sub-samples
pass it?

6. Compare also the QQ-plots.
boxplot(score~Country)

7. Give a (possible) explanation of the results.


So from the figure we can distinguish the 3 different boxplots, one for each population. We can see that
there are major differences in the 3 plots as, for example, Italy has a higher score or Germany has a
wider variance. These differences may have different causes: for example differences in the service or,
more trivially, the same question may have been translated in slightly different ways in the 3 countries,
acquiring different meanings and, in the end, leading to different answers.

COMPLETE EXERCISE

###1)
cust_sat <- read_excel("M. . ./cust_sat.xlsx")
View(cust_sat)

###2)
#Dimension check: number of rows and columns#
dim(cust_sat)
attach(cust_sat)

#Categorical variables#
table(Country)
table(Country)/length(Country)*100 #percentages#

table(gender)
table(gender)/length(gender)*100 #percentages#

summary(age)
boxplot(age, horizontal=TRUE)

summary(score)
boxplot(score, horizontal=TRUE)

###3)
#Normality check for the variable age#
qqnorm(age)
qqline(age)
ad.test(age)

###4)
#Normality check for the variable score#
qqnorm(score)
qqline(score)
ad.test(score)

###5)
#Normality check for the variable Country#
ad.test(score[Country=="UK"])
ad.test(score[Country=="GER"])
ad.test(score[Country=="IT"])

###5)
#Compare the Q-Q plots#
boxplot(score~Country)

6 Multivariate random variables
6.1 Bivariate random variables
The basic idea of considering multivariate random variables is to measure two or more variables on the same sample
space which means, in statistical terms, on the same statistical individuals. Let us consider two (discrete) r.v.’s "X"
and "Y" on the same sample space.

Joint density

The joint density of "X" e "Y" is the density of the two variables taken together. It is the function of the
pair "X, Y ". We consider the probability of the intersection of:

f(X,Y ) (x, y) = P(X = x ∩ Y = y)

If the supports of "X" and "Y" are finite (finite sample space), the function "f(X,Y ) " is usually summarized in a
two-way table. In this case we consider a variable "X" (Bernoulli distribution) and a variable "Y " with 3 levels.
This is a joint density as each number in the table gives the probability of the intersection of two events. Since it
is a density, of course all the values inside the table sum up to "1".

Y
X 0 1 2
0 0.11 0.09 0.20
1 0.30 0.14 0.16

6.1.1 Marginal distributions

Given a joint density "f(X,Y ) (x, y)" the marginal densities of "X" and "Y" the densities of the variable without
having any information of the other variable. They are defined by:
X X
fX (x) = P(X = x) = fX,Y (x, y) fY (y) = P(Y = y) = fX,Y (x, y)
y x

The name "marginal" comes from the previous tabular: we can obtain the joint density just by summing up by row
(or column) all the singular densities.

Joint density

Y
X 0 1 2
0 0.11 0.09 0.20 0.40
1 0.30 0.14 0.16 0.60
0.41 0.23 0.36

The joint density is more informative than the marginal distributions as, given a joint density, we can compute
the marginal distributions but not the other way around (there are infinitely many compatible joint densities).

6.1.2 Independence

Independence means that the knowledge about the variable "X" doesn’t affect the density distribution (probability)
of the variable "Y ". Two random variables are independent if the realization of one does not affect the probability
distribution of the other.

Independence

The r.v.’s "X" and "Y" are independent if for all "∀A, B ∈ F":

P(X ∈ A ∩ Y ∈ B) = P(X ∈ A)P(Y ∈ B)

So the joint density is the product of the marginal densities.

For discrete r.v.’s if "A = {x}" and "B = {y}" we get:

P(X = x ∩ Y = y) = P(X = x)P(Y = y)

i.e.
f(X,Y ) (x, y) = fX (x)fY (y) ∀x, ∀y

Theorem - Independence of random variables

Two r.v.’s are independent if and only if the joint density is the product of the marginal densities.

6.1.3 Conditional distributions

Given two r.v.’s "X" and "Y" the conditional distributions of "Y" given "X" are the distributions of "Y" for fixed
values of "X". Formally:

Conditional distribution

The conditional distribution of one variable "Y " given "X = x" (fixed), is obtained by the joint distribution
divided by the marginal distribution of the fixed value:

f_{Y|X=x}(y) = f_(X,Y)(x, y) / f_X(x)

It is a univariate distribution of the "Y " but we have several of them. Indeed we have one conditional
distribution of the "Y " for each possible value of "X = x".

Computation of conditional distributions

Compute the conditional distributions of "Y" given "X" (it has two possible values) for:

Y
X 0 1 2
0 0.11 0.09 0.20 0.40
1 0.30 0.14 0.16 0.60
0.41 0.23 0.36

This means that we obtain one different conditional distribution for each row of the table (for the different
values of the fixed variable).

This concept is pretty simple in the discrete case, since the "X" variable can only assume "few" values: it becomes
a bit more complicated in the continuous case, as the possible values of "X" are infinite.

6.1.4 Conditional expectation

The expected value of "(Y |X = x)" is the conditional expectation of "Y " given "X = x". The conditional
expectation is simply the set of all the conditional distributions row by row.

Exercise - Computation of conditional expectation

Compute the conditional expectations of "Y" given "X" for:


Y
X 0 1 2
0 0.11 0.09 0.20 0.40
1 0.30 0.14 0.16 0.60
0.41 0.23 0.36

First of all we generate our data (the table) and then we compute the marginals (which are simply the
columns and rows sums)

x=c(0,1)
y=c(0,1,2)
p=matrix(c(0.11,0.09,0.20,0.30,0.14,0.16),nrow=2,byrow=T) #joint distr#
mX=rowSums(p) #marginal distribution X#
mY=colSums(p) #marginal distribution Y#

We can then compute the conditional distribution for the "Y " when "X = 0" and "X = 1", and then the
conditional expectation:

cY0=p[1,]/mX[1] #conditional distribution Y given X=0#


cY1=p[2,]/mX[2] #conditional distribution Y given X=1#

sum(y*cY0) #conditional expectation of Y given X=0#


sum(y*cY1) #conditional expectation of Y given X=1#

6.1.5 Covariance and correlation

The covariance is a measure of the strength of the (linear) association between two random variables; the
correlation is its normalized version, where a value of "−1" describes a perfect negative correlation whilst a value
of "1" describes a perfect positive correlation. For two r.v.’s we define:

COV(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y)

and

Cor(X, Y) = COV(X, Y) / (σ_X σ_Y)
with the same meaning as in exploratory statistics.
Remark: some useful formulae:

• E(X + Y ) = E(X) + E(Y )

• If "X, Y " are independent, then "COV (X, Y ) = 0"


• V AR(X + Y ) = V AR(X) + V AR(Y ) + 2COV (X, Y )
• Therefore, for independent variables: "V AR(X + Y ) = V AR(X) + V AR(Y )"

For random vectors "X = (X1 , . . . , Xp )" such numbers are usually arranged into matrices:

V AR(X1 ) COV (X1 , X2 ) . . . COV (X1 , Xp )


 
..
COV (X2 , X1 ) V AR(X2 ) .
 
...
COV (X) = 

.. .. .. ..

. . . .
 
 
COV (Xp , X1 ) ... ... V AR(Xp )

and similarly
1 Cor(X1 , X2 ) . . . Cor(X1 , Xp )
 
..
Cor(X2 , X1 ) 1 .
 
...
Cor(X) = 

.. .. .. .

. . . ..
 
 
Cor(Xp , X1 ) ... ... 1

6.1.6 Properties of the covariance matrix

Covariance matrix
The covariance matrix can be written as:
 
COV(X) = E[(X − E(X))(X − E(X))^t] = E(XX^t) − E(X)E(X)^t

• The covariance matrix is symmetric
• The covariance matrix is positive semi-definite, i.e. for all "a ∈ R^p":

a^t · COV(X) a ≥ 0

The same applies to the correlation matrix. Why?
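
A quick numerical check of positive semi-definiteness on simulated data (a sketch, with arbitrary data and an
arbitrary vector "a"):

set.seed(1)
X=matrix(rnorm(300),ncol=3)  #100 observations of 3 variables#
S=cov(X)
eigen(S)$values              #all eigenvalues are non-negative#
a=c(1,-2,0.5)
t(a)%*%S%*%a                 #the quadratic form is non-negative#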

Exercise - Compute the Correlation matrix for the bivariate density

Compute the correlation matrix for the bivariate density. In this case we have two random variables "X"
and "Y ":
Y
X 0 1 2
0 0.11 0.09 0.20 0.40
1 0.30 0.14 0.16 0.60
0.41 0.23 0.36

[ 1         −0.2562 ]
[ −0.2562    1      ]

Exercise - Compute the correlation matrix for a random vector

Suppose we flip a coin 3 times, and denote with


• "X1 " the number of "Heads" in the first two tosses
• "X2 " the number of "Heads" in the last two tosses
• "X3 " the number of "Heads" in the last toss

Compute the correlation matrix for the vector "(X1 , X2 , X3 )".

6.2 The multinomial distribution
The multinomial scheme is a sequence of i.i.d. random variables:

X1 , X2 , . . . , Xn , . . .

each taking "k" possible values: this means that the sample space of each "k" variable is "ω = {1, . . . , k}". Thus,
the multinomial trials process is a simple generalization of the Bernoulli (binomial) scheme (which corresponds to
"k = 2"). Let us denote:
pi = P(Xj = i) i ∈ {1, . . . , k}
As with our discussion of the binomial distribution, we are interested in the random variables that count the number
of times each outcome occurred. Thus, let:

Y_{n,i} = #({j ∈ {1, . . . , n} : X_j = i})    i ∈ {1, . . . , k}

where "#" denotes the "cardinality", the number of elements. Note that "Σ_i Y_{n,i} = n", so if we know the values of
"k − 1" of the counting variables, we can find the value of the remaining counting variable (by difference).

Example

Consider 20 tosses of a die, with "Ω = {1, 2, 3, 4, 5, 6}". The sample we consider is:

2, 4, 3, 1, 6, 6, 1, 3, 5, 5, 2, 1, 3, 4, 5, 1, 5, 6, 1, 1

So we have:
Y20,1 = 6 Y20,2 = 2 Y20,3 = 3
Y20,4 = 2 Y20,5 = 4 Y20,6 = 3
We don’t have to count the last value "Y_{20,6}" separately, as the 6 variables aren’t independent: knowing the
first five counting variables determines the last one.

Indicator variables
Prove that we can express "Yn,i " as a sum of indicator variables:
Y_{n,i} = Σ_{j=1}^{n} 1(X_j = i)

where "1(Xj = i)" is "1" if "Xj = i", and is "0" otherwise.

Multinomial coefficient

The distribution of "Yn = (Yn,1 . . . , Yn,k )" is called the multinomial distribution with parameters "n" and
"p = (p1 , . . . , pk )". Its density is:
 
n
P (Yn,1 = j1 , . . . , Yn,k = jk ) = pj1 . . . pjkk
j1 , . . . , j k 1

for "(j1 , . . . , jk )" such that "j1 + · · · + jk = n". Here the symbol
 
n n!
=
j1 , . . . , jk j1 ! . . . jk !

is the multinomial coefficient.


Multinom(n, p)

We write "Multinom(n, p)" for the multinomial distribution with parameters "n" and "p". For a multinomial
distribution "Y" with parameters "n" and "p", the expected value and variance are indeed:

E(Y_{n,i}) = n p_i    VAR(Y_{n,i}) = n p_i (1 − p_i)

The covariance and correlation are:

COV(Y_{n,i}, Y_{n,j}) = −n p_i p_j    Cor(Y_{n,i}, Y_{n,j}) = −√[ p_i p_j / ((1 − p_i)(1 − p_j)) ]

Note that the number of times outcome "i" occurs and the number of times outcome "j" occurs are negatively
correlated, but the correlation does not depend on "n" or "k". Does this seem reasonable?
The important point is that the covariance depends on "n" whilst the correlation does not: the correlation is
independent of "n" and only depends on the structure of "p".

Exercise - Marginal distributions

Show that "Yn,i " has the binomial distribution with parameters "n" and "pi ":
 
P(Y_{n,i} = j) = C(n, j) p_i^j (1 − p_i)^(n−j)    j ∈ {0, . . . , n}

Give a probabilistic proof, by defining an appropriate sequence of Bernoulli trials.

Exercise - Covariance matrix and correlation matrix

Suppose that we roll 4 ace-six flat dice (faces 1 and 6 have probability "1/4" each; faces 2, 3, 4, and 5
have probability "1/8" each). Write the covariance and correlation matrices of the relevant multinomial
distribution.
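
One possible way to carry out this exercise in R (a sketch, not the official solution) is to build both matrices directly from the formulas above, with "n = 4" and "p = (1/4, 1/8, 1/8, 1/8, 1/8, 1/4)":

n = 4
p = c(1/4, 1/8, 1/8, 1/8, 1/8, 1/4)
Sigma = -n * outer(p, p)           #off-diagonal entries -n*p_i*p_j#
diag(Sigma) = n * p * (1 - p)      #diagonal entries n*p_i*(1-p_i)#
Rho = -sqrt(outer(p, p) / outer(1 - p, 1 - p))
diag(Rho) = 1
Sigma
Rho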

6.2.1 Generate data from a multinomial distribution

Exercise - Multinomial distribution generation

Use the Software R function "rmultinom" to generate data from a multinomial distribution and in particular:
• Toss 1 fair die

• Toss 20 fair dice


• Toss 20 fair dice, 50 replicates
• Toss 20 unfair dice

x=rmultinom(1,1,c(1,1,1,1,1,1)) #toss of one fair die; rmultinom normalizes the weights internally#
#equivalently: x=rmultinom(1,1,c(1/6,1/6,1/6,1/6,1/6,1/6))#
y=rmultinom(1,20,c(1,1,1,1,1,1)) #toss of 20 fair dice#
z=rmultinom(50,20,c(1,1,1,1,1,1)) #toss of 20 fair dice, 50 replicates#

x1=rmultinom(1,20,c(1,2,3,4,5,6)) #toss of 20 unfair dice#

6.3 Chi-square goodness-of-fit test
The chi-square is a distribution derived from the standard normal "N (0, 1)" distribution (as is the "t" distribution). The
goodness of fit means that we take a probability distribution on a finite sample space and we check whether our sample
could have been generated by that distribution under "H0 ". This means that we compare an empirical distribution function (a set of
observed frequencies) with a theoretical distribution function (a set of theoretical frequencies, which are probabilities).
Suppose we have a sample of size "n" for a categorical r.v. "X" with a finite number of levels, and denote with
"{1, . . . , k}" such set of levels (frequency table of our sample). Let us denote with "ni " the observed frequency for
the i-th level. The probability distribution of "X" is a multinomial distribution with parameters "n" and "p1 , . . . , pk ",
i.e., the probabilities of the "k" levels. The chi-square goodness-of-fit test is a test to check if the variable "X"
has fixed probabilities "q1 , . . . , qk " or not. We want to check if our frequency table (empirical distribution) comes
from our set of fixed probabilities or not.

Hypotheses and test statistic

The null hypothesis of our test is to set a fixed vector and we assume that our empirical distribution comes
from a multinomial random variable with probabilities "q1 , q2 , . . . , qk ". The alternative hypothesis is every
other possible distribution, so every other possible vector of probabilities.

H0 : p1 = q1 , . . . , pk = qk

H1 : pi ̸= qi for at least one "i"


The test is based on the computation of the expected counts, obtained by rescaling the probability vector under
"H0 " (which sums up to "1") to the total of the observed table (which sums up to "n"):

n̂_i = n q_i
the expectations of "Yn,i " under "q1 , . . . , qk ". So now we have two vectors, the observed frequencies and the
expected counts: both vectors sum up to "n" so we can compare them.
The test statistic is the Pearson’s test statistic. Here we compare the distance between the observed and
the expected counts: when the distance is low we accept "H0 ":
T = \sum_{i=1}^{k} \frac{(n_i − \hat{n}_i)^2}{\hat{n}_i}

In the limit case when we have a perfect fit between our "H0 " and the observed counts, we should have the
expected frequencies exactly equal to observed counts. Under "H0 " this test statistic should be near to "0".
When "T " is large enough we reject the null hypothesis as there’s too much difference between the observed
and the expected counts.

Under the null hypothesis "H0 ", "T " has an asymptotic chi-square distribution with "(k − 1)" degrees of freedom
(the degrees of freedom change the shape of the distribution, but not its general form). The test is one-sided
(right), and thus:
R = (cα , +∞)

The critical value "cα " [χ²_c] can be found in statistical tables, or with Software R. Otherwise we can base the decision on the p-value:

• Accept the null hypothesis if "p-value ≥ α"

• Reject the null hypothesis if "p-value < α"

Remark: the chi-square test is asymptotic, thus we need large "n".


Rule-of-thumb: a standard requirement is that all the expected counts "n̂i " are greater than or equal to 5
(also Software R warns you if you try to use a too small sample size for the test).

Exercise - Pearson statistic - Chi Square

Suppose we want to check if a die is fair. We toss the die 200 times (this is the empirical distribution):

Outcome 1 2 3 4 5 6 Total
ni 36 30 25 30 38 41 200

In a fair die (this is our theoretical distribution under "H0 ") all outcomes have the same probability, so
we use the uniform distribution as the null hypothesis:
q_1 = · · · = q_6 = \frac{1}{6}
We perform the chi-square test at the 5% level.
First, compute the expected counts:
n̂i = nqi = 33.33
and compare the table of the expected counts.
Outcome 1 2 3 4 5 6 Total
n̂i 33.33 33.33 33.33 33.33 33.33 33.33 200
with the table of the observed counts:
Outcome 1 2 3 4 5 6 Total
ni 36 30 25 30 38 41 200
The Pearson test statistic is:
T = \frac{(36 − 33.33)^2}{33.33} + \frac{(30 − 33.33)^2}{33.33} + \frac{(25 − 33.33)^2}{33.33} + \frac{(30 − 33.33)^2}{33.33} + \frac{(38 − 33.33)^2}{33.33} + \frac{(41 − 33.33)^2}{33.33} = 5.38
The value "T = 5.38" must be compared with the suitable critical value of the chi-square distribution: we
have "6 − 1 = 5" (k − 1) degrees of freedom so that:

> qchisq(0.95,5)
[1] 11.0705

the rejection half-line is "R = (11.07; +∞)". The observed value does not belong to the rejection half-line
(11.0705 > 5.38), thus we accept "H0 " and conclude that the die is fair.
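
The same test can be run directly in R with the built-in "chisq.test" function, which reproduces the statistic and also returns the p-value:

obs = c(36, 30, 25, 30, 38, 41)
chisq.test(obs, p = rep(1/6, 6))
#X-squared = 5.38, df = 5, p-value approximately 0.37 > 0.05: H0 is not rejected#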

6.4 Bivariate continuous random variables
For continuous random variables the approach is basically similar, we just replace the "sums" with "integrals".

Bivariate continuous random variables - Joint density

The joint density of "X" and "Y " is the function "f(X,Y ) (x, y)" such that:
\int_a^b \int_c^d f_{(X,Y)}(x, y) \, dy \, dx = P(a ≤ X ≤ b, c ≤ Y ≤ d)

What we obtain in this case is a surface whose total volume is equal to "1".

The marginal densities of "X" and "Y" are defined by:


f_X(x) = \int_{\mathbb{R}} f_{(X,Y)}(x, y) \, dy        f_Y(y) = \int_{\mathbb{R}} f_{(X,Y)}(x, y) \, dx

6.4.1 Conditional distributions and conditional expectation

Given two r.v.’s "X" and "Y" the conditional distributions of "Y" given "X" are the distributions of "Y" for fixed
values of "X". Formally it is the ration between the joint density and the marginal density:

Conditional distribution
The conditional distribution of "Y" given "X = x" is:

f_{Y|X=x}(y) = \frac{f_{(X,Y)}(x, y)}{f_X(x)}

The expected value of "(Y |X = x)" is the conditional expectation of "Y" given "X = x".
The main difference in the interpretation here is that in continuous case we can’t interpret this expression
in terms of probability as we previously did in the discrete case. In the continuous case we can’t do this as
the probabilities of a single value are "0" (we would be dividing "0/0"), so we can only work with densities.
In the discrete case we can use both definitions.

6.4.2 Independence and correlation

Again also in this case we have to work with densities and not with probabilities as we can’t translate the expression
(all the probabilities are "0").

Theorem - Independence of random variables

Two random variables are independent if and only if the joint density is the product of the marginal densities:

f(X,Y ) (x, y) = fX (x) · fY (y)

Covariance and correlation have the same definitions as in the discrete setting.

Exercise - Joint distributions of exponential and uniform functions

Write the joint distribution of the pair "(X, Y )" in the following situations:
1. "X, Y " are exponentially distributed "ε(λ)", independent
In this case we consider two independent random variables distributed as:

X ∼ ε(λ) Y ∼ ε(λ)

We know the densities of these two functions:

fX (x) = λe−λx x≥0 fY (y) = λe−λy y≥0

Since they are independent it’s very easy to describe their joint density:

f (x, y) = λe−λx λe−λy = λ2 e−λ(x+y) x ≥ 0, y ≥ 0

So we understand that independence is a very easy case where the joint density is derived by the
marginal distributions.
2. "X, Y " are uniformly distributed "U[0, 1]", independent
In this case we consider two independent random variables distributed as:

X ∼ U[0, 1] Y ∼ U[0, 1]

We know the densities of these two functions are constant:

fX (x) = 1 0≤x≤1 fY (y) = 1 0≤y≤1

Since they are independent it’s very easy to describe their joint density:

f (x, y) = 1 0 ≤ x ≤ 1, 0 ≤ y ≤ 1

So we see that the density is always equal to "0" except on the unit square "[0, 1] × [0, 1]", where it is equal to "1".

Joint distributions of exponential and uniform functions

Write the joint distribution of the pair "(X, Y )" in the following situations:
3 "X" is exponentially distributed "ε(λ)", "Y " is uniformly distributed "U[0, 1]", independent.
In this case we consider the mix of the previous cases:

X ∼ ε(λ) Y ∼ U[0, 1]

We know the densities of these two functions are:

fX (x) = λe−λx x≥0 fY (y) = 1 0≤y≤1

So the joint density is:


f (x, y) = λe−λx x ≥ 0, 0 ≤ y ≤ 1
Graphically, this gives an exponential-shaped surface over the strip "x ≥ 0, 0 ≤ y ≤ 1".

The idea behind this is that when assuming independence it’s pretty easy to define the joint density, starting from
the marginal densities. In most cases, without independence, we are not able to do inference and statistical
modeling so easily.
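
As a small simulation sketch (assuming, arbitrarily, "λ = 2"), we can draw an independent pair and verify that the empirical correlation is close to "0":

set.seed(1)
x = rexp(10000, rate=2)   #X ~ Exp(2)#
y = runif(10000)          #Y ~ U[0,1]#
cor(x, y)                 #close to 0, as expected under independence#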

6.5 Multivariate normal distribution


Multivariate normal distribution is the generalization of the normal distribution. To be able to make inferences about
populations, we need a model for the distribution of random variables, and in most cases the normal distribution is
a reasonably good approximation of many phenomena. There is a multivariate version of the central limit theorem,
so that multivariate normality holds approximately for sums and averages. We will see in the next lectures that the
sampling distribution of estimators and (test) statistics are often approximately multivariate or univariate normal
due to the central limit theorem.
Remember that the density function of the univariate normal distribution is the classical normal distribution
centered in the "µ" parameter:
f(x) = \frac{1}{\sqrt{2π} \, σ} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right)
The parameters (expected value and variance) that completely characterize the distribution are:

E(X) = µ V AR(X) = σ 2

Examine the term:


\frac{(x − µ)^2}{σ^2} = (x − µ)(σ^2)^{−1}(x − µ)
It is a squared statistical distance between "x" and "µ" in standard deviation units. In the case of "p > 1" variables:

• The expected value is a p-dimensional vector

• The standard deviations units are measured through the covariance matrix "Σ" (remember that "Σ" must be
positive definite).

• The exponent term becomes now:
(x − µ)(Σ)−1 (x − µ)

By normalization we obtain the definition of a Multivariate normal distribution.

Density of the multivariate normal distribution

The density of the multivariate normal distribution is:

f(x) = \frac{1}{(2π)^{p/2} \det(Σ)^{1/2}} \exp\left( −\frac{1}{2} (x − µ)^t Σ^{−1} (x − µ) \right)

for "x ∈ Rp ".

This is denoted with "Np (µ, Σ)", where the "p" means that we have a normal distribution in "p" dimensions. Again
from the previous construction we notice that the mean and the variance are the parameters of the distribution:

• "µ" is the vector of the expected values


• "Σ" is the variance/covariance matrix

For "p = 1" this reduces to the univariate normal density.


The only mathematical difference between the univariate and the multivariate case is in the "σ": in
the univariate case we can divide by "σ 2 ", but in the multivariate case we cannot, since we have a matrix (we have to
use the inverse of "Σ").

6.6 Multivariate standard normal distribution


A special case arises when "µ = 0" (a vector of all "0") and "Σ" is the identity matrix "Σ = Ip " (which has "1"
on the main diagonal and "0" elsewhere):
f(x) = \frac{1}{(2π)^{p/2}} \exp\left( −\frac{1}{2} \sum_{j=1}^{p} x_j^2 \right)

for "x ∈ Rp ". This is named as the multivariate standard normal distribution.

6.6.1 Figures of multivariate normal distributions

• Density and contour plot for a bivariate normal with zero covariance

The "circular" shape means "independence". Standard normal distribution means that the marginal distribu-
tion has mean "0", variance "1" and in addition we have no covariance between the two variables "x1 " and
"x2 ". This means that the plot is perfectly symmetric around the center "(0, 0)": this leads to the formation
of circular level-lines.

In the univariate case we know there is a standardization, so that from a normal distribution "X ∼ N (µ, σ 2 )"
we can take the standard one defined as "z = (X − µ)/σ", which is by definition a normal distribution
"z ∼ N (0, 1)". Moreover, if we start from a normal distribution "z ∼ N (0, 1)" we can generate general univariate
normal distributions simply by taking "X = µ + σz": in this way we obtain a distribution of "X" with mean
"µ" and variance "σ 2 " [this to prove that in the univariate case it’s easy to generate every possible normal
distribution starting from the standard one].
Basically the same holds even for the multivariate case. The generalization of the linear transformation in
the multivariate case is the matrix multiplication: matrix multiplication, for the multidimensional case, is
what the linear transformation is for the univariate case.
With matrix multiplication we can move from a plot of the previous case where the two components are
independent (the distribution is perfectly symmetric), to distributions like the following ones where we have
dependence between the two variables "x1 " and "x2 " (we obtain ellipses).
• Density and contour plot for a bivariate normal with positive covariance [positive correlation]

• Density and contour plot for a bivariate normal with negative covariance [negative correlation]
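
The plots described in the list above can be reproduced, for instance, with the "mvtnorm" package; the covariance values "0", "0.7" and "−0.7" below are only illustrative choices.

library(mvtnorm)
grid = seq(-3, 3, length.out=100)
plot_contour = function(rho) {
  Sigma = matrix(c(1, rho, rho, 1), nrow=2)
  z = outer(grid, grid, function(x1, x2) dmvnorm(cbind(x1, x2), sigma=Sigma))
  contour(grid, grid, z, xlab="x1", ylab="x2", main=paste("covariance =", rho))
}
plot_contour(0)     #circular level curves: zero covariance#
plot_contour(0.7)   #ellipses along the positive direction#
plot_contour(-0.7)  #ellipses along the negative direction#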

6.6.2 Property - Linear combinations

The first relevant property of the multivariate normal distribution concerns linear combinations.

Property 1 - Linear combination

A linear combination of a multivariate normal distribution is again a multivariate normal distribution.

In formulae, if "X ∼ Np (µ, Σ)" then:


at X = a1 X1 + · · · + ap Xp
where "at X" stands for the "⟨a, X⟩", is distributed as:

at X ∼ N (at µ, at Σa)

Exercise - Multivariate normal function distribution
Given a multivariate normal:
X ∼ N_2\left( \begin{pmatrix} 5 \\ 10 \end{pmatrix}, \begin{pmatrix} 16 & 12 \\ 12 & 36 \end{pmatrix} \right)
and "at = (3, 2)", find the distribution of "Y = at X".
If we take the transposed:
Y = at X = 3X1 + 2X2
then the expected value of "Y " is:

E(Y) = a^t µ = (3, 2) \begin{pmatrix} 5 \\ 10 \end{pmatrix} = 3 · 5 + 2 · 10 = 35

σ_Y^2 = a^t Σ a = (3, 2) \begin{pmatrix} 16 & 12 \\ 12 & 36 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = (3 · 16 + 2 · 12, \; 3 · 12 + 2 · 36) \begin{pmatrix} 3 \\ 2 \end{pmatrix} = (72, 108) \begin{pmatrix} 3 \\ 2 \end{pmatrix} = 72 · 3 + 108 · 2 = 432
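
A quick numerical check of this computation in R (with the values of the exercise):

a = c(3, 2)
mu = c(5, 10)
Sigma = matrix(c(16, 12, 12, 36), nrow=2)
sum(a * mu)              #E(Y) = 35#
t(a) %*% Sigma %*% a     #VAR(Y) = 432#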

More generally, if "X ∼ Np (µ, Σ)", "A" is a matrix with dimension "q × p", and "d" is a q-dimensional vector, then:

Y = AX + d

is a multivariate normal distribution:

Y ∼ Nq (Aµ + d, AΣAt )

Exercise - Function distribution


Take the "X" of the previous exercise and compute the distribution of "Y = AX" with:

A = \begin{pmatrix} 1 & 1 \\ 1 & −1 \end{pmatrix}
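
A minimal R sketch to check the hand computation for this exercise (by the previous property, "Y ∼ N_2(Aµ, AΣA^t)"):

A = matrix(c(1, 1, 1, -1), nrow=2, byrow=TRUE)
mu = c(5, 10)
Sigma = matrix(c(16, 12, 12, 36), nrow=2)
A %*% mu                 #mean vector of Y#
A %*% Sigma %*% t(A)     #covariance matrix of Y#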

6.6.3 Property - Subsets of variables

Property 2 - Subsets of variables

All multivariate random variables defined as subsets of a multivariate normal distribution are again multivari-
ate normal. So if we take any component (any marginal distribution) of a multivariate normal distribution
we again obtain a normal distribution.

If "X ∼ Np (µ, Σ)" then all the marginal distributions are normally distributed, and this is also true for pairs
"(Xi , Xj )", and so on. This is an easy consequence of the previous property. Why?

6.6.4 Property - Zero-correlation implies independence

Property 3 - Zero-correlation implies independence

For multivariate normal distributions, having zero-correlation implies having independent components. This
normal distribution case is basically the unique one where zero-correlation implies independence: for general
random variables we previously saw that independence implies zero-correlation but the contrary wasn’t true
in general.

In formulae, if we consider a normal distribution "X ∼ Np (µ, Σ)" and we assume zero-correlation (which
means that the VAR-COV matrix "Σ" is a diagonal matrix [there are only the variances of the components and
no covariance]), then all the components of "X" are pairwise independent.

Exercise - Zero-correlation (bivariate distribution)

Prove this statement in the case "p = 2". To ease computations take a zero-mean bivariate distribution.
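
A possible solution sketch (assuming zero means and a diagonal matrix "Σ = diag(σ_1^2, σ_2^2)"): the joint density factorizes as

f(x_1, x_2) = \frac{1}{2π σ_1 σ_2} \exp\left( −\frac{x_1^2}{2σ_1^2} − \frac{x_2^2}{2σ_2^2} \right) = \frac{1}{\sqrt{2π} σ_1} e^{−x_1^2/(2σ_1^2)} · \frac{1}{\sqrt{2π} σ_2} e^{−x_2^2/(2σ_2^2)} = f_{X_1}(x_1) \, f_{X_2}(x_2)

and a joint density equal to the product of the marginal densities is exactly the characterization of independence.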

6.7 Conditional distributions


Property 4 - Conditional distribution

Conditional distributions of a multivariate normal distribution are again (multivariate) normal distributions.
This means that if we fix the value of one of the variables (say "x1 ") and we consider the conditional
distribution of the second variable ("x2 "), we obtain a normal distribution.

6.7.1 Central limit theorem

Data are not always multivariate normal, but:

Law of Large Numbers

Let "X1 , . . . , Xn " be independent and identically distributed random vectors with "E(X) = µ". Then
\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i

converges to "µ" as n goes to infinity.


So if we take the sample mean of random vectors (component for component: we consider the sample mean
of "X1 ", of "X2 " and so on), we have a convergence to the expected value (the theoretical mean) when "n"
is large. The same also holds for the empirical VAR-COV matrix as it converges to "Σ" for large "n".

That is
x̄n −→ µ and Sn −→ Σ
and these are true regardless of the true distribution of the "Xi "’s.

Central limit theorem

Let "X1 , . . . , Xn " be independent and √


identically distributed random vectors with "E(X) = µ" and non-
singular covariance matrix "Σ". Then " n(X̄n − µ)" converges to a "Np (0, Σ)" distribution if "n ≫ p" (it is
an asymptotic theorem).
This means that the sample mean, for large "n", can be approximated with the standard normal distribution
"Np (0, Σ)".

So for "large n":


X̄_n ≈ N_p\left( µ, \frac{1}{n} Σ \right)

and when "Σ" is unknown:

X̄_n ≈ N_p\left( µ, \frac{1}{n} S_n \right)
Remark: pay attention to the condition "n ≫ p": we will come back on this in statistical model theory.

6.7.2 Generate multivariate normal data

In Software R, multivariate normal distributions are handled through the "mvtnorm" package.

Multivariate R packages

• Install and load the "mvtnorm" package


• Use the help to see how "mvtnorm" works and use the "rmvnorm" function to generate data from
different bivariate normal distributions.
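
A minimal sketch for the second item (the parameter values are just illustrative):

library(mvtnorm)
mu = c(0, 0)
Sigma = matrix(c(1, 0.8, 0.8, 1), nrow=2)
sample_bvn = rmvnorm(1000, mean=mu, sigma=Sigma)
plot(sample_bvn, xlab="x1", ylab="x2")
colMeans(sample_bvn)   #close to mu#
cov(sample_bvn)        #close to Sigma#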

7 Lab 3
7.1 Multivariate random variables
• Linear transformations of the Multinormal distribution

In this exercise you need to (install and) load the libraries "MASS" and "mvtnorm". Let:

Y = AX [1]

where:
X ∼ N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right)
is the standard multinormal distribution and:
A = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}
So the idea is simple: we start from the "X" (a standard normal distribution because for the mean we have a 0-vector
and the VAR-COV matrix is an identity) in two dimensions, and then we consider the "A" matrix. The model we
consider is then:
Y = AX = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} X_1 + cX_2 \\ cX_1 + X_2 \end{pmatrix} = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}

1. Compute the theoretical values of the expected value and the variance/covariance matrix of "Y ".
This part can be computed both by hand and by using Software R. First of all we import all the packages we
need and then we compute:

µ_X = \begin{pmatrix} 0 \\ 0 \end{pmatrix}        Σ_X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}        A = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}

Y = AX ∼ N2 (Aµ, AΣAt )
We know that the distribution of the transformed variable "Y " is a bivariate normal distribution with mean
"Aµ" and VAR-COV matrix "AΣAt " (notice that, since the "A" matrix is symmetric, then "A = At "). We
can then compute the expected value:

Aµ = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

A bit more complicated is the VAR-COV matrix:

Σ_Y = AΣ_X A^t = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix} = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix} \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix} = \begin{pmatrix} 1 + c^2 & 2c \\ 2c & c^2 + 1 \end{pmatrix}

Notice that the multiplication for the identity matrix doesn’t change the matrix.
2. Choose different values for "c" (e.g. "c = −4, −2, 0, 2, 4"), generate "1000" observations from the distribution
of "Y " using the function "mvrnorm" in MASS and plot the observations. The code for doing this is:

binorm.sample <- mvrnorm(1000,meanvec,covarmat)


y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)

The idea of this lab is to find an empirical counter part of this small computation: in the 2-dimensional case
it’s difficult to plot the theoretical densities of the random variables because we have 3 dimensional objects. The
best idea is to move from the theoretical densities to the empirical ones: we can do this via a random-numbers
generation. We generate random numbers from the relevant normal distribution and then we compute the
scatter plot to see the ellipses based not on the theoretical distribution but by plotting the two dimensional
scatter plot of the data points. We can achieve this by using the "mvrnorm" command.

Here we need to define the vector of the mean (it’s always the null vector) and the variance matrix (which is
the previous "c2 + 1" and "2c" we defined) and then generate 1000 random vectors from a multivariate normal
distribution with the previous parameters. In this case we consider "c = 2" but we can try also with different
values.

library(mvtnorm)
library(nortest)
library(MASS)

meanvec=c(0,0)
varmat=matrix(c(5,4,4,5),nrow=2)
#it is the equivalent of "varmat=matrix(c(c^2+1,2c,2c,c^2+1),nrow=2)"#

and so if we check the matrix we obtain:

>varmat

[,1] [,2]
[1,] 5 4
[2,] 4 5

If for example we consider "c = −2", so if we change the direction of the linear transformation, we basically
obtain the same matrix (the variance is still "5") but the covariance has the opposite sign ("−4").
In order to generate random numbers we just need to copy the proposed lines of code:

binorm.sample <- mvrnorm(1000,meanvec,varmat)


y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)

where we take the binorm sample and then we set as "y1" and "y2" the two dimensions (the two components
of the vector). What we obtain is a sort of ellipse on the "positive" direction (which is right since we have a
positive ("+4") covariance):

3. Choose a value for "c" (different from "0" and "1"). Estimate the mean vector, covariance matrix and the
correlation matrix of your bivariate normal sample. Use the functions "mean", "cov", "cor".
If we use these functions on "y1" and "y2" we have estimates based on a sample size of "1000" so we expect to
have a quite precise estimation. We can then check if our computations are correct by calculating the mean,
variance, covariance and by comparing them with the theoretical model:

> mean(y1) #We expect a value close to "0" so we are good#


[1] -0.04640224

> mean(y2) #We expect a value close to "0" so we are good#


[1] -0.02895224

> var(y1) #We expect a value close to "5" so we are good#


[1] 4.925653

> var(y2) #We expect a value close to "5" so we are good#


[1] 5.078911

> cov(y1,y2) #We expect a value close to "4" so we are good#


[1] 3.949047

4. Also make histograms and normal QQ-plot for each variable. What does the normal QQ-plot show? Use the
functions "plot", "hist", "qqnorm", "qqline" .
Here we check that the components of our rotated vector "y" are again normally distributed. This means that
if we take a normality test for both "y1" and "y2" we should accept the normality hypothesis. We can also use
the QQ-plot function:

hist(y1)
hist(y2)
qqnorm(y1)
qqline(y1,col="red")
qqnorm(y2)
qqline(y2,col="red")

Both from the histograms, and more precisely from the QQ-plots, we can verify that both variables are normally
distributed:

5. What does it happens in the previous items when "c → ±∞"? And when "c = 1"?
Here we investigate what happens when we move to special cases. Special cases are indeed when:
• "c = 0" : in this case we obtain an identity matrix. It is a very special case because this matrix doesn’t
alter the vector so the "y" is equal to the "x".
• "c = ±∞" : in this case the correlation coefficient goes to "0" (almost no correlation)
• "c = 1" : this is the opposite case since the correlation coefficient is equal to "1". Graphically this means
we would have a perfect line.
These results can be demonstrated by considering again the "Σ" covariance matrix of the "y". The correlation
coefficient is the ratio between the covariance and the two standard deviations:

ρ = \frac{COV(Y_1, Y_2)}{σ_{Y_1} σ_{Y_2}} = \frac{2c}{\sqrt{c^2 + 1} \sqrt{c^2 + 1}} = \frac{2c}{c^2 + 1}
If we then consider the limit as "c → ∞" of the correlation coefficient we obtain the limit of two polynomials
(numerator and denominator). Notice that the denominator goes to infinity much faster than the numerator
so the limit goes to zero:
\lim_{c → ∞} \frac{2c}{c^2 + 1} = 0

We can also test these results empirically. For example, we can consider "c = 100" and
"c = 1":
###CASE "c=100"
varmat=matrix(c(10001,200,200,10001),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)

###CASE "c=1"
varmat=matrix(c(2,2,2,2),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)

For the first case we obtain a scatter plot with almost no correlation, whilst in the second case we obtain a
perfect linear correlation.

6. Equation [1] is the basis to define generic multivariate Gaussian random vectors. Construct the vector
"binorm.sample" by means of two independent Gaussian vector and then applying rotation:

A <- matrix(c(1,2,2,1),nrow=2)          #the matrix A with c = 2 (any value of c works)#
mu <- 0; sigma <- 1                     #the components of X are standard normal#
x1 <- rnorm(1000,mu,sigma)
x2 <- rnorm(1000,mu,sigma)
binorm.sample.ind=matrix(c(x1,x2),byrow=F,ncol=2)
binorm.sample=binorm.sample.ind%*%t(A)

Check with the analysis in items 3,4,5 that we obtain the same result as in the case of direct multivariate
generation.

[We didn’t compute this point]

COMPLETE EXERCISE

###2)
library(mvtnorm)
library(nortest)
library(MASS)

meanvec=c(0,0)
varmat=matrix(c(5,4,4,5),nrow=2)
#it is the equivalent of "varmat=matrix(c(c^2+1,2c,2c,c^2+1),nrow=2)"#
varmat

binorm.sample <- mvrnorm(1000,meanvec,varmat)


y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)

###3)
mean(y1)
mean(y2)
var(y1)
var(y2)
cov(y1,y2)

###4)
hist(y1)
hist(y2)
qqnorm(y1)
qqline(y1,col="red")
qqnorm(y2)
qqline(y2,col="red")

###5)
###CASE "c=100"
varmat=matrix(c(10001,200,200,10001),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)

###CASE "c=1"
varmat=matrix(c(2,2,2,2),nrow=2) #remember that we have "c^2+1" and "2c"#
binorm.sample <- mvrnorm(1000,meanvec,varmat)
y1 <- binorm.sample[,1]
y2 <- binorm.sample[,2]
plot(y1,y2)

8 Likelihood
We now move from probability to statistics (estimation). We saw that in general in the probability theory we have
a wide range of possible probability distributions, each of them defined with different parameters: this means that
we need a general method to estimate these parameters. The first method we consider is based on the "likelihood".
The idea behind this method is that we have an objective function, the "likelihood", and we want to maximize it
(we consider the first derivative).

8.1 Introduction
The notion of likelihood is the most important (and practical) way to analyze parametric statistical models. We
see this now for simple distribution functions, and we will investigate its use for regression models in the next
lectures. The idea of likelihood has a long history in statistics, but the formal, rigorous study of likelihood and
their properties as the foundation for inference is largely due to the work of R. A. Fisher in 1922.

8.2 Definitions
Remember that:

Parametric statistical model

A (parametric) statistical model is a set of probability distributions parameterized by a "θ". It is a triple:

(Ω, F, (Pθ )θ∈Θ )

or
(Ω, F, (Fθ )θ∈Θ )
where "F " denotes the distribution function of a random variable. Notice that the parameter "θ" ranges into
the set "Θ".

Likelihood

Let "f (x|θ)" denote the joint density of the random sample, the set of observations "(X1 , X2 , . . . , Xn )". The
joint density of the sample is the multivariate density of the random vector "(X1 , X2 , . . . , Xn )". Then, given
that sample "x" is observed, the likelihood function of "θ" is defined to be:

L(θ|x) = f (x|θ)
where the left-hand side is read as the likelihood and the right-hand side as the joint density.

So the likelihood function is exactly the joint distribution of the sample. The difference is in the meaning
of the two arguments: in probability theory if we take a density or a probability function we consider the
density or probability given the parameter "θ" (it is a known quantity and my unknowns are the results); in
statistics the data is given and the unknown is the function of "θ". In probability we consider the "x" given
the "θ", whilst in likelihood we work with the "θ" given the data.

It is crucial to keep in mind, however, that "f (x|θ)" is a function of "x" with "θ" known, while "L(θ|x)" is a function
of "θ" with "x" known.
Remark: notice that the symbol L(θ|x)" is simply a notation and not a conditional probability (we are not in
a probability space). Indeed the equation we find in the definition in some books is also written as:

Lx (θ) = fθ (x)

This is why independent random variables are particularly important in statistics: for independent samples it’s easy
to describe the likelihood.

Two further basic definitions on statistical models are the following:

Identifiable model
A statistical model:
(Ω, F, (Pθ )θ∈Θ )
is identifiable if the map:
θ −→ (Pθ )
is injective.
The map assigning to each "θ" its distribution (probability) function must be injective, because what we observe
is the data, which comes from a density. If the same density comes from two different values of "θ" we can’t
decide which is the right one (it means that "θ1 " and "θ2 " give the same density "f (x|θ)").
In this case we can’t estimate the true value between the two parameters "θ" because they converge on
the same density. The basic requirement to perform statistical inference is then having identifiable models,
meaning that two different parameters generate two different distributions (fortunately in our framework all
the models are identifiable).

Regular model

A statistical model:
(Ω, F, (Pθ )θ∈Θ )
is regular if all the densities (probability distributions) "(Pθ∈Θ )" have the same support [the support is the
set where the density is strictly positive].

For example:

• Identifiability is a very basic requirement. Examples of non-identifiable models refer to over-parametrization


(more parameters than observations). We will consider this issue later in the course.
• Regularity is not essential but we will see that non-regular models are not easy to handle.

Poisson model
Under the Poisson distribution, all the densities have the same support "N", independent of the value of the
parameter "λ", thus the model is regular.

Uniform model

Under the uniform distribution "U[0, θ]", the support of the densities clearly depends on the value of "θ",
thus the model is not regular.

8.2.1 Interpretation

While "f (x|θ)" measures how probable various values of "x" are for a given value of "θ", "L(θ|x)" measures how
likely the sample we observed was for various values of "θ".
So, if we compare two values and "L(θ1 |x) > L(θ2 |x)", this suggests that it is more plausible, in light of the data
we have gathered, that the true value of "θ" is "θ1 " than it is that "θ2 " is the true value. Of course, we need to
address the question of how meaningful a given difference in likelihoods is, but certainly it seems reasonable to ask
how likely it is that we would have collected the data we did for various values of the unknown parameter "θ".

8.3 Maximum likelihood estimation [MLE]
Perhaps the most basic question is: which value of "θ" maximizes "L(θ|x)"?
This is known as the maximum likelihood estimator, or MLE, and typically abbreviated as "θ̂" (or "θ̂M LE "
if there are multiple estimators we need to distinguish between). Provided that the likelihood function is differen-
tiable and with a unique maximum, we can obtain the MLE by taking the derivative of the likelihood and setting
it equal to "0" (see below the Bernoulli and Normal examples):

\frac{d}{dθ} L(θ|x) = 0

Example - Finite parameter space

Consider a coin, which can be fair (θ1 = 1/2) or unfair (θ2 = 1/3), where "θ" is the probability of Head. We
flip the coin 100 times, obtaining 43 Heads. What can we conclude?

In this case we don’t have to compute derivatives as we simply have to check if the likelihood in "θ1 " is greater
(lesser) than the likelihood in "θ2 ".
L(θ1 |x) > L(θ2 |x)
Here we have "Θ = {1/3, 1/2}". Using the likelihood approach we compute:

• for "θ = 21 "


 43  57
100 1 1

L(43|θ = 1/2) = = 0.03006864 = 3%
43 2 2

• for "θ = 1/3"


 43  57
100 1 2

L(43|θ = 1/3) = = 0.01065854 = 1%
43 3 3

Since "L(43|θ = 1/2) > L(43|θ = 1/3)" the maximum likelihood principle is in favor of "θ = 1/2". If we take
the parameter space equipped with a probability function, then one would compute:
• for "θ = 12 "
P(43|θ = 1/2)P(θ = 1/2)
P(θ = 1/2|43) =
P(43)

• for "θ = 13 "


P(43|θ = 1/3)P(θ = 1/3)
P(θ = 1/3|43) =
P(43)

Here the Bayes theorem is used and if the probability on "Θ" is uniform, then:

P(θ = 1/2|43) = 0.7383 P(θ = 1/3|43) = 0.2617

This paradigm is used in Bayesian statistics, a theoretical framework not covered in this class.
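
A quick numerical check of these values with R ("dbinom" gives the binomial likelihoods directly):

L_half  = dbinom(43, size=100, prob=1/2)    #0.03006864#
L_third = dbinom(43, size=100, prob=1/3)    #0.01065854#
L_half / (L_half + L_third)                 #0.7383 = P(theta=1/2 | 43) under a uniform prior on the two values#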

In general the value of "θ" ranges on some intervals or in the whole line: this means we need the derivatives. The
first example we are going to see it’s the MLE for the Bernoulli distribution in the general case.

8.3.1 MLE: example for a Bernoulli distribution

Suppose that "X" follows a Bernoulli distribution with probability of success given by "θ".

Bern(θ) 0<θ<1

Exercise - MLE Maximum Likelihood Estimator

Find the MLE of "θ". Given data "x = (x1 , . . . , xn )" for a sample of size "n", the likelihood is:
 
L(θ|x) = \binom{n}{s} θ^s (1 − θ)^{n−s}

where "s = \sum_{i=1}^{n} x_i" is the number of successes. Take the derivative of "L(θ|x)" with respect to "θ" and
set it equal to zero:

\binom{n}{s} \left( s θ^{s−1} (1 − θ)^{n−s} + (n − s) θ^s (1 − θ)^{n−s−1} (−1) \right) = 0

θ^{s−1} (1 − θ)^{n−s−1} \left( s (1 − θ) − (n − s) θ \right) = 0

thus:

θ̂ = \frac{s}{n}

The MLE estimator is therefore the well-known sample proportion:

θ̂ = \frac{S}{n} = \frac{\sum_{i=1}^{n} X_i}{n}

For the value of "θ̂" we obtain the maximum likelihood.
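
A numerical sanity check in R (the data vector below is just an illustrative sample): maximizing the Bernoulli log-likelihood numerically returns the sample proportion.

x = c(1, 0, 1, 1, 0, 1, 0, 1, 1, 1)
loglik = function(theta) sum(dbinom(x, size=1, prob=theta, log=TRUE))
optimize(loglik, interval=c(0.001, 0.999), maximum=TRUE)$maximum   #approximately 0.7#
mean(x)                                                            #0.7 = s/n#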

8.4 Log-likelihood
Note that, for independent and identically distributed (i.i.d.) samples (data), the likelihood is equal to the
product of the densities of the marginal distributions. We have:
L(θ|x) = \prod_{i=1}^{n} f(x_i|θ)

It is difficult to take its derivative, as it requires an extensive use of the product rule. An alternative that is almost
always simpler to work with is to maximize the log of the likelihood, or log-likelihood instead (this means that
we work with the sum and not with the product). The log-likelihood is usually denoted with:
ℓ(θ|x) = \sum_{i=1}^{n} \log f(x_i|θ)

Of course the results (the maxima) are the same as the logarithm is a monotone function. Note that the Bernoulli
example from before is a bit easier when working with the log-likelihood (try it). For other distributions, such as
the normal, it is much easier than working with the likelihood directly.

Example 1 - MLE for the Normal distribution

Find the MLE for the mean "θ" of the normal distribution with known variance "σ 2 ".

We then have to compute the MLE for a normal distribution "N (µ, σ 2 )" with "σ 2 " known. So the density
for the one dimensional case is:

f(x|µ) = \frac{1}{\sqrt{2π} σ} e^{−\frac{(x − µ)^2}{2σ^2}}

we then consider the logarithm:

\log f(x|µ) = \log\left( \frac{1}{\sqrt{2π} σ} \right) − \frac{(x − µ)^2}{2σ^2}

We now move from the one dimensional density to the sample "X1 , . . . , Xn " and we compute the log-
likelihood:

ℓ(µ|x) = n \log\left( \frac{1}{\sqrt{2π} σ} \right) − \frac{\sum_{i=1}^{n} (x_i − µ)^2}{2σ^2}

We now consider the derivative of the log-likelihood:

\frac{dℓ(µ|x)}{dµ} = −\frac{1}{2σ^2} \sum_{i=1}^{n} 2(x_i − µ)(−1) = \frac{1}{σ^2} \sum_{i=1}^{n} (x_i − µ) = 0

In this equation the constant "1/σ^2" can be cancelled, so we obtain:

\sum_{i=1}^{n} (x_i − µ) = 0

and so the estimator is the sample mean:

µ̂ = \frac{\sum_{i=1}^{n} x_i}{n}

Example 2 - MLE for the Normal distribution

If for example we consider again a normal distribution with both the mean "µ" and the variance "σ 2 ", the
density and the log-likelihood of the sample are the same:
ℓ(µ, σ^2|x) = n \log\left( \frac{1}{\sqrt{2π} σ} \right) − \frac{\sum_{i=1}^{n} (x_i − µ)^2}{2σ^2}

but now we can’t cancel the first item as we previously did since it isn’t a constant (there’s a variable inside).
In this case we have to use the partial derivatives with respect to the two parameters "µ" and "σ 2 ":

\frac{∂ℓ(µ, σ^2|x)}{∂µ} = . . .

\frac{∂ℓ(µ, σ^2|x)}{∂σ^2} = . . .

8.4.1 Properties of the MLE

MLE Invariance

If "θ̂" is the MLE of a parameter "θ" and "g" is a real function, then "g(θ̂)" is the MLE of "g(θ)".
So if we compute the MLE "θ̂" of a parameter then we can compute the MLE for every possible function of
the parameter "θ" simply by applying the same function to the MLE.

So for example if we consider "X ∼ N (µ, σ 2 )" with MLE "µ̂", then the MLE of "e^µ" is "e^{µ̂}".
If we take a function of our parameter then the MLE of the transformed parameter is the transformed MLE. The
point is that we don’t have to replicate the derivatives, likelihood and so on: we can just apply the same function
to the MLE. Again, if we take for example the MLE of "µ^2", then we obtain "µ̂^2".

Biasedness
Remark: note that the MLE is preserved under transformations, while unbiasedness is not, except in the
case of linear functions; so transformed MLEs are in general biased.
For example, if we take the example we previously considered, we see that the MLE for the parameter "µ" is
equal to the sample mean (it is unbiased), whilst the MLE for the parameter "σ 2 " is biased, as the denominator
is just "n" and not "n − 1":
 2
 Pn
 ∂ℓ(µ,σ |x) = 0 xi
µ̂ = i=1 = x̄ Unbiased

∂µ n
 ∂ℓ(µ,σ2 |x) = 0 σˆ2 = 1 Pn (x − x̄)2 Biased
∂σ 2 n i=1i

The point here is that the MLE usually needs a further step to correct this bias.

Asymptotic unbiasedness

Under mild regularity conditions on "L(θ|x)", if "θ̂" is the MLE of a parameter "θ", then, as "n" goes to
infinity (i.e., for large samples), it holds:

θ̂_{MLE} −→ θ

Asymptotic normality

Under mild regularity conditions on "L(θ|x)", if "θ̂" is the MLE of a parameter "θ", then, as "n" goes to
infinity (it is a sort of generalized Central Limit Theorem):

\frac{θ̂ − θ}{σ(θ̂)} −→ N(0, 1)

8.4.2 Sufficiency

Our problem now is to estimate a parameter and our idea is to use Likelihood through the data (the likelihood is
a function of the parameter "θ" given the data). For the data we considered the vector set "X1 , . . . , Xn " , so the
whole set of data. The question is: do we actually need the entire set of data (all the information) to compute our
estimate? The answer is no: we don’t need the whole information on the sample to compute the MLE (for example
we consider the mean of the normal distribution: it is estimated by using the sample mean, which means it doesn’t
need the "n" dimensional vector as it’s enough to have the sum of the data and all the discarded information is not
essential).
The point here is that we don’t need the whole "n" dimensional set of numbers and that we can reduce our
information to a relevant number (the sum of our vector).
Example: we consider our "n" dimensional vector (sample) "X1 , . . . , Xn ". When we write "L(µ|X)" we are
taking an "n" dimensional vector as we consider all the values of our experiment:
L(µ|X) = L(µ|S)        where        S = \sum_{i=1}^{n} X_i

so we can rewrite the likelihood function not as a function of "X" (we don’t store all the "n" numbers) but as a
function of the sum, which is just one value.
A rough definition is: a statistic "T (X)" is sufficient for a parameter "θ" if it contains all the information to write
the likelihood (and to compute the MLE). The formal definition of sufficiency is based on conditional distributions.

Sufficiency

A statistic "T (X)" is sufficient for a parameter "θ" if the conditional distribution of the data "X" given "T "
does not depend on "θ".

Neyman-Fisher factorization theorem

A statistic "T (X)" is sufficient for a parameter "θ" if and only if the likelihood function can be factorized as:

L(θ|x) = f (x)gθ (T (x))

for some non-negative functions "f " and "gθ ", where "f " does not depend on "θ".
So we can write the likelihood of "θ" given the "X" as a function of "θ" given "T ":

L(θ|X) = L(θ|T )

so we can describe the density of our random variable (remember that the joint density of the sample is
indeed the likelihood) as a function of "T ": we don’t need the whole sample.

Exercise - Estimation problem

The statistic "T = X̄" for the parameter "λ" of the Poisson distribution is a sufficient statistic.

So we have a sample with a Poisson distribution with unknown "λ" that we have to estimate:

X1 , . . . , Xn ∼ P (λ)

We consider the independence and compute the likelihood:


L(λ|X_1, . . . , X_n) = e^{−λ} \frac{λ^{x_1}}{x_1!} · · · e^{−λ} \frac{λ^{x_n}}{x_n!} = \frac{1}{x_1! x_2! · · · x_n!} e^{−nλ} λ^{\sum_{i=1}^{n} x_i}

where the exponent "\sum_{i=1}^{n} x_i" is our sufficient statistic "T ". Indeed if we compute the MLE we obtain:

\frac{d}{dλ} L(λ|X_1, . . . , X_n) = −n \frac{1}{x_1! · · · x_n!} e^{−nλ} λ^T + \frac{1}{x_1! · · · x_n!} e^{−nλ} T λ^{T−1}

and now we compute when it vanishes:

\frac{e^{−nλ}}{x_1! · · · x_n!} λ^{T−1} \left[ −nλ + T \right] = 0

Since the factors "e^{−nλ}/(x_1! · · · x_n!)" and "λ^{T−1}" are strictly positive, this gives:

−nλ + T = 0

λ̂ = \frac{T}{n} = X̄
so the MLE of the parameter "λ" of the Poisson distribution is the sample mean "X̄": this is not surprising
as the expected value of a Poisson distribution is indeed "λ". The information on "T " is enough to compute
the MLE: sufficient statistic.

8.5 Score function


The "score" is simply the name of the first derivative of the log-likelihood. Maximum likelihood estimation is
performed by:

1. Working with the log-likelihood


2. Taking derivatives

Thus, the derivative of the log-likelihood w.r.t. the parameter is given its own name: the score.

Score function
The score, commonly denoted with "U ", is:
U_X(θ) = \frac{d}{dθ} ℓ(θ|X)

Note that "U " is a random variable (like in the case of ML estimator), as it depends on "X", and is also
a function of "θ".

Remark: with i.i.d. data, the score of the entire sample is the sum of the scores for the individual observations:
U = \sum_{i} U_i

In view of this remark, in the following we will work in general with the score for samples of size "1".

8.6 Score and MLE
It is easy to observe that the MLE is found by setting the sum of the observed scores equal to "0":
\sum_{i} U_i(θ̂) = 0

Exercise - Score (Normal distribution)

Find the score for the normal distribution. More precisely find the score for the model "N (θ, σ 2 )", with "σ 2 "
known.
U_i = \frac{X_i − θ}{σ^2}

U = \frac{\sum_{i=1}^{n} X_i − nθ}{σ^2}

θ̂ = X̄

We take one random variable. Remember that the likelihood is:

L(θ|X) = \frac{1}{\sqrt{2π} σ} e^{−\frac{(x − θ)^2}{2σ^2}}

so the log-likelihood is:

ℓ(θ|X) = \log\left( \frac{1}{\sqrt{2π} σ} \right) − \frac{(x − θ)^2}{2σ^2}

and then the derivative is:

\frac{dℓ}{dθ} = 0 − \frac{2(x − θ)}{2σ^2} (−1)

so the score is:

U_X(θ) = \frac{dℓ}{dθ} = \frac{x − θ}{σ^2}
The score of the whole sample is the sum of the single score components. The computational improvement
of the score function is that we consider the derivatives at the stage where the sample size is equal to "1" and
we sum them up: previously we defined instead the likelihood of the sample and then we took the derivative
(it was a more difficult approach).

8.6.1 Role of score in MLE

Exercise - Score (Poisson distribution)

Find the score for the Poisson distribution:


U_i = \frac{X_i}{θ} − 1

. . .

θ̂ = X̄

8.6.2 Mean

We now turn our attention to the theoretical properties of the score. It is worth noting that there are some regularity
conditions that "f (x|θ)" must meet in order for these theorems to work. For the purposes of this class we will assume
that we are working with a distribution for which these hold.

Theorem - Expected value

The score has expected value equal to zero:


E(U ) = 0

8.6.3 Variance and information

The variance of "U " is given a special name in statistics: it is called the Fisher information or simply the
information. The Fisher information is the amount of information (contained in a sample) that an observable
random variable "X" carries about an unknown parameter "θ" of a distribution that models "X".

Exercise - Fisher information for the Normal distribution


Find the information for the Normal distribution. More precisely find the information for the model
"N (θ, σ 2 )" with "σ 2 " known.
I(θ) = \frac{1}{σ^2}

U_i = \frac{X_i − θ}{σ^2}

E(U_i) = 0

VAR(U_i) = VAR\left( \frac{X_i − θ}{σ^2} \right) = \frac{1}{σ^4} VAR(X_i) = \frac{σ^2}{σ^4} = \frac{1}{σ^2}

Remark: remember that the variance of a linear transformation is "VAR(aX + b) = a^2 VAR(X)".

So this is the information for a sample of size "1". If we consider the information of a sample of size "n" we just have to multiply:

I(θ) = \frac{n}{σ^2}
Remark: notice that the information is the reciprocal of the variance of the sample mean.

8.6.4 Information

In the case of the one-parameter normal model the information does not depend on "θ":
I(θ) = \frac{1}{σ^2}
This is not true in general, as shown below.

Exercise - Fisher information for the Poisson distribution

Find the information for the model "P(θ)".


I(θ) = \frac{1}{θ}

8.7 The Cramer-Rao lower bound
Under some regularity conditions (basically the same for which "E(U ) = 0"), the following theorem gives the clear
meaning of the word "information".

Cramer-Rao lower bound

Let "T (X)" be any statistic with finite variance for all "θ". Denote "E(T (X)) = ψ(θ)". Then, for all "θ", if
"ψ" is the identity function, then:
VAR(T(X)) ≥ \frac{1}{I(θ)}

so the variance of such an (unbiased) estimator is always greater than or equal to the reciprocal of the
information. If we consider the sample mean of a normal distribution we obtain the equality: we are in the
best case. In some sense the sample mean is the best possible estimator of the mean of a normal distribution:
every other possible estimator of the mean has higher variance (worse performance).

This theorem then tells us that the information is linked with the variance of the estimator and, as we
previously said, the variance of every estimator is always greater or equal to the reciprocal of the information.

8.7.1 Information

Note that for sampling we have:

I(θ) = \sum_{i} I_i(θ)

For i.i.d. data, once again we use additivity: the information for a sample is the sum of the information of the
components.
Remark: each observation contains the same amount of information, and they add up to the total
information in the sample.

Fisher’s Information - Normal case

For a sample from the normal distribution, "I(θ) = n/σ^2".
How can we read this in connection with the Cramer-Rao lower bound?

8.7.2 Another information identity

Another property of the score function which is often useful is the following:

I(θ) = −E[U ′]        where "U ′" denotes the derivative of the score with respect to "θ".

Information - Normal distribution

Find "U ′" for the Normal distribution:

U_i′ = −\frac{1}{σ^2}

Information - Poisson distribution

Find "U ′" for the Poisson distribution:

U_i′ = −\frac{X_i}{θ^2}

8.7.3 Asymptotic distribution

One final, very important theoretical result for the score may be obtained by applying the central limit theorem:

\frac{U − E[U]}{\sqrt{n}} \xrightarrow{d} N(0, I_i(θ))

or equivalently:

\frac{1}{\sqrt{n}} U \xrightarrow{d} N(0, I_i(θ))

where the expression "\xrightarrow{d}" means that the quantity on the left converges in distribution to the distribution on
the right as the sample size "n → ∞". We will write concisely "U ≈ N (0, I(θ))".

8.7.4 Multiple parameters

The preceding results all describe a situation in which we are interested in a single parameter "θ". It is often the
case (and always the case in regression modeling) that "f (x)" depends on multiple parameters. All of the preceding
results can be extended to the case where we are interested in a vector of parameters "θ = (θ1 , θ2 , . . . , θp )".
The score is now defined as
U(θ) = ∇ℓ(θ|x)
where "∇ℓ(θ|x)" is the gradient (a vector) of the log-likelihood (we consider all the partial derivatives):
 
∇ℓ(θ|x) = \left( \frac{∂ℓ(θ|x)}{∂θ_1}, \frac{∂ℓ(θ|x)}{∂θ_2}, . . . , \frac{∂ℓ(θ|x)}{∂θ_p} \right)

Note that the score is now a "p × 1" vector: to denote this we use the bold notation "U". The MLE is now found
by setting each component of the score vector equal to zero; i.e. solving the (sometimes linear) system of equations
"U= 0" (all the partial derivatives equal to zero), where "0" is a "p × 1" vector of zeros.
The score still has mean zero:
E(U) = 0
The variance of the score is now intended as the covariance matrix, and it is still the information:

COV (U) = I(θ)

although the information "I(θ)" is now a "p × p" matrix. We have that:

I(θ) = −E(JU )
where "J_U" is a "p × p" matrix of second derivatives with "(i, j)"-th element "\frac{∂^2 ℓ(θ|x)}{∂θ_i ∂θ_j}". This matrix is called the
Jacobian of the score or the Hessian matrix of the log-likelihood (remember that the Jacobian matrix is the
matrix containing all the cross partial derivatives [the second derivatives]).
It is still true that, for i.i.d. data, if we have a sample of size "n" we can sum the scores of the components:

U = \sum_{i} U_i

and:

I(θ) = \sum_{i} I_i(θ)

and also in this framework the asymptotic normality of the score holds:

U ≈ Np (0, Ii (θ))

Exercise - Normal model with "σ 2 " unknown

Write the score and the information matrix for a sample "X1 , . . . , Xn " of a normal distribution "N (θ, σ 2 )"
with both parameters unknown.

So we have a sample normally distributed with both parameters unknown:

X1 , . . . , Xn ∼ N (µ, σ 2 )

So we use the property of the score function: we consider a single component of the sample and then the score
of the sample is the sum of the scores and the information matrix is the sum of the information matrices.
We can then work with a single density:

f(x_i) = \frac{1}{\sqrt{2π} σ} e^{−\frac{(x_i − µ)^2}{2σ^2}}
and then we take its logarithm:

ℓ(µ, σ|x_i) = \log f(x_i) = \log\left( \frac{1}{\sqrt{2π}} \right) − \log(σ) − \frac{(x_i − µ)^2}{2σ^2}

and so we can compute the score (we have two partial derivatives):

U_i = \left( \frac{∂ℓ}{∂µ}, \frac{∂ℓ}{∂σ} \right) = \left( \frac{x_i − µ}{σ^2}, \; −\frac{1}{σ} + \frac{(x_i − µ)^2}{σ^3} \right)

For the information matrix we have to consider the variance of "U ". There are two methods to compute the
matrix:
• The first one, which is more difficult, is:

I(µ, σ 2 ) = V AR(Ui )

• The second one is easier and involves the Jacobian matrix of the score "U " (we take the second derivatives
of the score) [we will apply this]:
I(µ, σ 2 ) = −E(JU )

In the matrix we need to take the derivative of the first element with respect to "µ" and then "σ", and then the
derivative of the second element with respect to "µ" and then "σ". So in the first row we consider the derivative
of the first element "\frac{x_i − µ}{σ^2}" and we compute it with respect to the two parameters (partial derivatives): in the
first column of the matrix we compute the derivative with respect to "µ" [∂/∂µ], whilst in the second column
we compute it with respect to "σ" [∂/∂σ]. In the second row we again compute the derivatives of the second
element of the score "−\frac{1}{σ} + \frac{(x_i − µ)^2}{σ^3}" with respect to the two parameters:
I(µ, σ^2) = −E \begin{pmatrix} −\frac{1}{σ^2} & −\frac{2(x_i − µ)}{σ^3} \\ −\frac{2(x_i − µ)}{σ^3} & \frac{1}{σ^2} − \frac{3(x_i − µ)^2}{σ^4} \end{pmatrix} = \begin{pmatrix} \frac{1}{σ^2} & 0 \\ 0 & \frac{2}{σ^2} \end{pmatrix}

Notice that in the non-diagonal elements we have the "E(xi − µ) = 0" and that in the last element we have
that the expected value "E((xi − µ)2 ) = σ 2 " is equal to the variance. The information matrix we obtained is
a diagonal matrix: the "0" means that in the normal distribution the sample mean and the sample variance
are independent (it is a special property of the normal distribution). This means that the estimation of the
mean and the estimation of the variance are two independent processes (there’s no correlation between the
estimation of the parameter "µ" and the parameter "σ").

8.8 Exponential families
Now we introduce a special family of probability distributions, namely the exponential family, we examine some
special cases, and we see what it is about members of the exponential family that makes them so attractive to work
with. Since we work with i.i.d. samples, we write the likelihood for a single observation. It is easy to write down
the likelihood of a sample of size "n" by taking a product of n terms (or the sum when using the log-likelihood).
The exponential family is a set of probability distributions where it’s easy to find the likelihood and, more impor-
tantly, it’s easy to find the score and the information matrix (which is usually difficult to compute since we have to
calculate an expected value or a variance).

Exponential families

A statistical model belongs to the exponential family if the likelihood can be written as (remember that
"exp" just means "exponential"; it’s a way to write more clearly the exponential):

L(θ|x) = exp(−ψ(θ) + θT (x))

and so we have a linear expression on "θ" multiplied by the sufficient statistic, and a constant (doesn’t depend
on the sample) "−ψ(θ)" which is the deterministic function. The parameter "θ" is named as the canonical
parameter, and "T " is of course the sufficient statistic.
In the multi-parameter case we make use of the inner product:

L(θ|x) = exp(−ψ(θ) + θt T (x))

where we have that "θt T (x) = θ1 T1 (x) + θ2 T2 (x) + . . . ". Also the name now becomes more clear because
the likelihood is expressed in the form of the exponential function of this special argument. It’s also easy to
compute the log-likelihood since we just have to take the exponent.

Exercise - Geometric distribution (exponential family)

1) Write the Geometric distribution in the form of the exponential family.

We only compute the one dimensional case. The density of the geometric function is:

X ∼ Geom(p) f (x) = p(1 − p)x 0<p<1 x∈N

remember that the parameter "p" is between "0" and "1" since it’s the probability of the underlying Bernoulli’s
scheme. The likelihood is then the density (as we said we consider size "1"):

L(p|x) = f (x) = exp(log p + x log(1 − p))

where "log(1 − p)" is "θ". Remember that our goal is to obtain the form "L(p|x) = exp(−ψ(θ) + θT )". So we
have:
L(p|x) = exp(−ψ(θ) + θX)
So we have written this distribution as an exponential function. If we want to be more precise we can compute
the explicit expression of the function "ψ" (we need to compute "p" as a function of "θ"):

ψ(θ) = − log(p)

θ = log(1 − p)
e^θ = 1 − p
p = 1 − e^θ

L(p|x) = exp( log(1 − e^θ) + θX )        where "log(1 − e^θ) = log(p) = −ψ(θ)"

T(X) = X

Exercise - Normal distribution (exponential family)

Write the Normal distribution ("σ 2 " known) in the form of the exponential family.

So we have a one-parameter problem:

X ∼ N (µ, σ^2)        with "σ^2" known

Here we perform the same passages we saw in the previous exercise, computing the likelihood from the density:

L(µ|x) = f(x) = \frac{1}{\sqrt{2π} σ} e^{−\frac{(x − µ)^2}{2σ^2}}

        = exp\left( \log\left( \frac{1}{\sqrt{2π} σ} \right) − \frac{(x − µ)^2}{2σ^2} \right)

        = exp\left( \log\left( \frac{1}{\sqrt{2π} σ} \right) − \frac{x^2 − 2µx + µ^2}{2σ^2} \right)

        = exp( . . . )

T(X) = X        θ = \frac{µ}{σ^2}

Notice that the linear term in "x" is "\frac{2µx}{2σ^2} = \frac{µ}{σ^2} x", so we have again that "T(X) = X" is the identity.

Exercise - Normal distribution (exponential family)

(A bit difficult) Write the Normal distribution (both "µ" and "σ 2 " unknown) in the form of the exponential
family with two parameters.

So the idea is that it’s not difficult to write densities in the form of exponential families but in most cases we need
to perform a change of parameter.

8.8.1 Exponential families – in view of GLM

For applications in the theory of linear models we will use a slightly more general form of the exponential family,
introducing also a scale parameter or nuisance parameter "ϕ".

Exponential families (canonical parameter)

A distribution falls into the exponential family if its distribution function can be written as:
 
f(x|θ, ϕ) = exp\left( \frac{xθ − b(θ)}{ϕ} + c(x, ϕ) \right)

where the canonical or natural parameter "θ = h(µ)" depends on the expected value of "X", "ϕ" is a
positive scale parameter, and "b", "c" are arbitrary functions.
So the function is exactly the same:
• We have a linear part on the "θ" (the canonical parameter)

• "b(θ)" is a function of the parameter "θ" but not of the "x": is the deterministic function so there’s no
randomness in "b"

As we will see, if a distribution can be written in this manner, maximum likelihood estimation (MLE) and
inference are greatly simplified and can be handled in a unified framework.

Example 1 - Poisson distribution (exponential families)

To get a sense of how this expression of the exponential family works, let’s work out the representation of a
few common families, starting with the Poisson:

exp(−µ)µX
f (X|µ) =
X!
This can be rewritten as:
f (X|µ) = exp(X log µ − µ − log X!)
so we have that the canonical parameter is "θ = log µ". Observe that "Xθ = x log µ−b(θ)
ϕ " where "ϕ = 1" and
that "c(ϕX) = − log x!". Thus falling into the exponential family with "θ = log µ" and "b(θ) = eθ ". Note
that the Poisson does not have a scale parameter ("ϕ = 1"). For the Poisson distribution, the variance is
determined entirely by the mean.

Example 2 - Normal distribution (exponential families)

Other distributions such as the normal, however, require a scale parameter:

f(x|µ, σ^2) = \frac{1}{\sqrt{2πσ^2}} exp\left( −\frac{(x − µ)^2}{2σ^2} \right)

            = exp\left( \frac{xµ − \frac{1}{2}µ^2}{σ^2} − \frac{1}{2}\left( \frac{x^2}{σ^2} + \log(2πσ^2) \right) \right)

which is in the exponential family with "θ = µ", "b(θ) = \frac{1}{2}θ^2" and "ϕ = σ^2".

Example 3 - Bernoulli distribution (exponential families)

Finally, let’s consider the Bernoulli distribution:

f(x|µ) = µ^x (1 − µ)^{1−x}

       = exp\left( x \log\left( \frac{µ}{1 − µ} \right) + \log(1 − µ) \right)

which is in the exponential family with the canonical parameter, "b" and "ϕ" equal to:

θ = \log\left( \frac{µ}{1 − µ} \right)        b(θ) = \log(1 + e^θ)        ϕ = 1

note that "c(ϕ, x) = 0". Even in the simple case of the Bernoulli distribution the canonical parameter isn’t
equal to the expected value (the standard parameter).
Note that, like the Poisson, the Bernoulli distribution does not require a scale parameter. The more general
case of the binomial distribution with "n > 1" is also in the exponential family ("n" fixed).

8.9 Properties of the score statistic
8.9.1 Score statistic for exponential families

Why is it important to work with densities written in the form of exponential families? We use this form because we have
simple expressions for the score and the information matrix. As we have seen, maximum likelihood theory revolves
around the score, which is the derivative of the log-likelihood with respect to the parameter "θ". Consider,
then, the score for a distribution in the exponential family:

ℓ(θ, ϕ|x) = (θx − b(θ))/ϕ + c(ϕ, x)

which is:

U = (d/dθ) ℓ(θ, ϕ|x) = (x − b′(θ))/ϕ
and in the case of a sample "X1 , . . . , Xn " the "U " is just the sum of the scores:
U = ( Σᵢ xᵢ − n b′(θ) ) / ϕ
and we know that the expected value is:
E(U ) = 0
this means that the sample mean "Σᵢ xᵢ / n" is an unbiased estimator of "b′(θ)".

Property 1 - Unbiased estimator



The observation "x" is an unbiased estimator of "b′(θ)".

Recall from our previous lecture that the score has the following properties:

E(U) = 0        VAR(U) = −E(U′)

and the variance of "U " is referred to as the information.

Property 2 - Information variance

For distributions in the exponential family:

VAR(U) = b″(θ)/ϕ

8.9.2 Mean and variance for exponential families

Thus, for the exponential family the mean and variance of "X" can be computed through derivatives.

Property 3 (1+2) - Mean and variance

E(X) = b′(θ) = (d/dθ) b(θ)        VAR(X) = ϕ b″(θ)

so we need to differentiate with respect to the parameter "θ": if our parameter is not the canonical parameter we
first need to rewrite the expression in the canonical form (otherwise we cannot take the derivative).

Mean and variance
We consider the distribution "X ∼ P(µ)" with density "f(x) = e^{−µ} µ^x / x!". We have:

L(µ|x) = exp(xθ − b(θ))

So "θ = log µ" (where "µ = eθ ") and "b(µ) = µ". Then we have "b(θ) = µ(θ) = eθ ". We obtain:

L(µ|x) = exp(xθ − eθ )

So in the end we have:

b(θ) = e^θ        E(X) = b′(θ) = e^θ

This means that the variance is (recall that "ϕ = 1"):

VAR(X) = ϕ b″(θ) = e^θ

Since the mean and the variance of the Poisson distribution are the same we can write:

E(X) = µ V AR(X) = µ

Notice that if in the previous passage we had taken directly the derivative of "b(µ) = µ" without using the
"θ" we would have obtained "E(X) = 1" and "VAR(X) = 0", which are wrong.

Note that the variance of "X" depends on both the scale parameter (a constant) and on "b", the function which controls
the relationship between the mean and the variance (this is the connection between them). Letting "µ = b′(θ)" and writing "b″(θ)"
as a function of "µ" with "V(µ) = b″(θ)" [the variance function], we have:

VAR(X) = ϕ V(µ)        VAR(U) = ϕ⁻¹ V(µ)

This point is really important since it allows us to compute the moments of random variables (expected values of
powers or related functions of the random variable) by considering the derivatives:

• For the Normal distribution, "V (µ) = 1" since the mean and the variance are not related.
• For the Poisson distribution, "V (µ) = µ" since the variance increases with the mean.

• For the Binomial distribution, "V (µ) = µ(1 − µ)" since the variance is largest when "µ = 1/2" and
decreases as "µ" approaches "0" or "1".

8.10 Link functions


Link function
In general, a link function is any function of the expected value of "X".

g(µ) = g(E(X))

Since ML estimation is particularly simple when the distributions are expressed in the canonical form of the
exponential family, with natural parameter "θ = h(µ)", we state the following definition.

Canonical link function


The link function
g(µ) = θ = h(µ)
is the canonical link function for the distribution of "X". It transforms the expected value of our random
variable (the mean) into the canonical parameter "θ" of the exponential family.

The use of the canonical link function corresponds to a reparametrization of the distribution in terms of the
natural parameter of the exponential family.

8.10.1 Benefits of canonical links

We can see that this way of writing distributions in terms of the canonical parameter and the exponential family
simplifies the research of the MLE and so on. There is, therefore, a reason to prefer the canonical link when specifying
the model. Although one is not required to use the canonical link, it has nice properties, both statistically and in
terms of mathematical convenience:

• It simplifies the derivation of the MLE

• It ensures that the estimates are based on sufficient statistics

• When applied to a response variable in a regression context, it ensures that many properties of linear regression
still hold, such as the fact that "Σᵢ eᵢ = 0".
• It tends to ensure that "µ" stays within the range of the outcome variable

In linear regression for normal distributions we write the expected value of our response variable equal to:
E(Y) = β₀ + β₁x₁ + · · · + βₚxₚ
and we haven’t explicitly seen this part because for the normal distribution, when estimating the mean, the canonical
function is the identity. So what we have is:
Y ∼ N (µ, σ 2 )
E(Y ) = β0 + β1 x1 + · · · + βp xP
But if for example we take a Bernoulli distribution we know that our response variable is no more a quantitative
variable as it is a "0" and "1" variable:
Y ∼ Bern(µ)
µ = E(Y ) = β0 + β1 x1 + · · · + βp xP
But we know that "µ" is the parameter of the Bernoulli distribution is "0 < µ < 1", while the predictors in the
form "β0 + · · · + βp xP " are all over the real line (the regression line goes over "R"): this means that we can
 have

predictions of the probability outside the "[0,1]" range. So if we move from "µ" to the canonical "theta = log µ
1−µ "
we then obtain a parameter "θ" that covers all the real line "−∞ ≤ θ ≤ +∞" because the limits are:
lim θ = −∞ lim θ = +∞
µ→0 µ→1

By operating this way we have a flexible expression of our density to use in our regression problems.

[We will analyze the last two points in full detail in the second part]

We now anticipate some topics from the second part of this class. In a regression context, we will see that the
canonical link is the best way to link the expected value of a response variable to a linear predictor:
θi = g(µi ) = ηi = β0 + β1 xi1 + · · · + βp xip
Maybe you have not noticed this point in your previous regressions. Why? Because for the normal distribution the
canonical link function is the identity, so in practice you don’t need the notion of link function.

Exercise - Link for the Binomial distribution


As an example, consider the canonical link for the binomial distribution:


 
η = g(µ) = log( µ/(1 − µ) )        µ = g⁻¹(η) = e^η / (1 + e^η)

As "η → −∞", then "µ → 0", whilst as "η → ∞", then "µ → 1". On the other hand, if we had chosen, say,
the identity link, "µ" could lie below "0" or above "1", clearly impossible for the binomial distribution.
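In Software R the canonical link of the binomial and its inverse are available as "qlogis" (the logit) and "plogis" (the inverse logit); a minimal sketch (the grid of values is an arbitrary choice):

mu = seq(0.01, 0.99, by = 0.01)
eta = qlogis(mu)                     # logit: log(mu/(1-mu))
all.equal(eta, log(mu/(1 - mu)))     # TRUE
all.equal(plogis(eta), mu)           # the inverse link maps back into (0, 1)
plot(mu, eta, type = "l")            # eta covers the whole real line as mu moves in (0, 1)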

9 Simulation and the bootstrap
9.1 Motivation
Before the computer age, statistical analysis used probability theory to derive analytical expressions for standard errors
(or confidence intervals) and testing procedures. For instance, in the Student's t-test normality is assumed: "Let
"X₁, . . . , Xₙ" be a sample from "N(µ_X, σ²_X)" and let "Y₁, . . . , Yₘ" be a sample from "N(µ_Y, σ²_Y)"..."; or in a regression model:

Y = Xβ + ε

the normality of the error term is assumed: "Let "εi " be i.i.d. with distribution "N (0, σ 2 )"...".
Most formulas are approximations, based on large samples. With computers, simulations and resampling methods
can be used to produce (numerical) standard errors and testing procedures (without the use of formulas, but
with a simple algorithm).

9.2 Generating randomness


Here "random" means a sequence of numbers do not exhibit any discernible pattern, i.e. successively generated
numbers can not be predicted.

"A random sequence is a vague notion... in which each term is unpredictable to the uninitiated and whose digits
pass a certain number of tests traditional with statisticians.
Derrick Lehmer, quoted in Knuth, 1997

The first problem is then to show how computers generate "random" numbers. Computers are, in general, deterministic machines,
so the first question is "how can a machine generate randomness?". The answer is quite simple: computers and machines
in general do not actually generate truly random numbers but pseudo-random numbers (they follow a rule for the generation).
The goal of pseudo-random number generators is to produce a sequence of numbers in the interval "[0, 1]" that imitates
the ideal properties of random numbers (with a uniform distribution).

9.3 Linear Congruential Generator (LCG)

LCG algorithm

LCG produces a sequence of integers "X1 , X2 , . . . " between "0" and "m−1" following a recursive relationship:

Xi+1 = (aXi + b) mod m

and sets:
uᵢ = Xᵢ / m
So the idea is to take a linear equation: we define the "(i + 1)"-th element as a linear function of "Xᵢ" and
then we apply the "mod" operation, which is the remainder of the integer division. The sequence of "uᵢ" is a
sequence of pseudo-random numbers in "[0, 1]". They are pseudo-random because if you know the values of
the parameters "a", "b", "m" and the starting point (the seed) "x₀", then the sequence is deterministic and
we can recover it exactly.

Usually the parameters "m" and "a" are set to large numbers to have good performance: we obtain quite a good
approximation of the uniform [0, 1] distribution (e.g. "m = 2^32 − 1", "a = 16807 (= 7^5)" and "b = 0"). Also, "m" and
"a" must be coprime (why?).
In Software R the random number generation is done with the "runif " function. By default "runif" is based on
the Mersenne-Twister algorithm (a more sophisticated pseudo-random generator than the LCG) and it generates numbers from the "U[0, 1]" distribution.

Try this code several times:

runif(10)

Now, try this code several times

set.seed(12345) #or whatever you want#


runif(10)

So notice that in the first case we always obtain different values, because the algorithm starts at a different "x₀"
every time, whilst in the second case, where we set the starting point (the seed), we always obtain the same output.

9.4 Checking randomness


Heuristically,

1. Calls should provide a uniform sample:

   lim_{n→∞} (1/n) Σᵢ₌₁ⁿ 1(uᵢ ∈ (a, b)) = b − a

2. Calls should be independent:

   lim_{n→∞} (1/n) Σᵢ₌₁ⁿ 1(uᵢ ∈ (a, b), uᵢ₊ₖ ∈ (c, d)) = (b − a)(d − c)

   for all "k" (a quick empirical check is sketched below).

What we need to do is simulations, which means generating random numbers from a given probability distribution.

9.5 Sampling from a finite set

Sampling

Let us suppose we need to generate a sample of size 100 from a finite set. We have 3 possible outcomes:

S = {A, B, C}

where:
pA = P(A) = 0.13 pB = P(B) = 0.35 pC = P(C) = 0.52
First we generate random numbers "u" from the uniform "U[0, 1]" distribution:
• If "u ≤ 0.13" then we choose "A"
• If "0.13 < u ≤ 0.48" then we choose "B"

• Otherwise, so if "u > 0.48", we choose "C"

Try to do that in Software R with the "if " conditional statement (we sample a category variable):

u=runif(100) #We generate 100 uniform random numbers from the distribution#
catv=rep("C",100)
for (i in 1:100)
{
if (u[i]<=0.13){catv[i]="A"}
else if (u[i]<=0.48){catv[i]="B"}
}
barplot(table(catv),ylim=c(0,60))

or use the clever function "cut" (see the help):

u=runif(100)
catv=cut(u,c(0,0.13,0.48,1),labels=c("A","B","C"))
barplot(table(catv),ylim=c(0,60))

9.5.1 Function "sample()"

Software R has a built-in function for sampling from finite sets, the function "sample":

Exercise - Sample function

1. Use the "help" to learn how the function sample works

2. Use the "sample" function to generate "1000" random tosses of a dice

help(sample)
samp=sample(1:6,1000,replace=T)
#we can check the result by using the barplot#
barplot(table(samp))

3. Use the "sample" function to generate "1000" random tosses of two dice, taking the sum

samp1=sample(1:6,1000,replace=T)
samp2=sample(1:6,1000,replace=T)
samp=samp1+samp2
barplot(table(samp))

4. Use the "sample" function to generate a random permutation of the set "{1, . . . , 100}"

samp=sample(1:100,100,replace=F)

9.6 Sampling from a parametric (discrete) distribution
For several parametric families the sampling problem is solved through the functions "rbinom", "rgeom" and so
on. The first step is the generation of a "U[0, 1]" random variable. There are optimized algorithms for the inversion
of the cdf (in the discrete case the cdf is never invertible).

Exercise - Sampling from parametric discrete distributions

1. Check the "rgeom" function with the following code

x=rgeom(100,0.3) #sample size=100 and p=0.3#


plot(table(x)/length(x)) #we plot the empirical distribution#
points(0:20,dgeom(0:20,0.3),col="red",pch=22) #we add the theoretical plot#

Here we are sampling from a discrete distribution (geometric):

2. Try with different parameter values and different sample sizes.


3. Try with different distributions.

Exercise - Coupon collector’s problem

A guy collects coupons. He buys one coupon at a time and all coupons have the same probability. Approximate
by simulation the distribution of the time needed to complete the collection of "n" coupons.
The first coupon is a good one for sure. For the second coupon, the waiting time is a geometric random variable
with parameter "(n − 1)/n". So we are in the case of geometric distributions with a variable parameter (it changes
at each step).
We fix "n = 100".
1. The parameters are:

n=100
p=(n:1)/n

2. Generate 100 geometric random variables with the appropriate parameters as follows:

x=rgeom(n,p)

3. Remember the meaning of the geometric parameters in Software R and define the waiting time as:

t=sum(x)+n

Of course this is the structure of the generation of one waiting time. If we want to approximate the
distribution of the waiting time we just need to replicate this code a certain number of times.

4. Repeat "1000" times and get the approximate distribution of the waiting time:

time=rep(0,1000)
for (i in 1:1000)
{
x=rgeom(n,p)
time[i]=sum(x)+n
}
mean(time)
var(time)
hist(time)

So the key point here is that we use simulations to approximate unknown distribution functions: in this case, for
example, we do not know the exact distribution of the waiting time.

9.7 Sampling from a finite set - the multinomial distribution


The function "rmultinom" generates samples from the multinomial distribution. This is a multivariate distribution.
In the following example we take a sample from a multinomial distribution:

x=rmultinom(1,20,c(1,2,3))

means that we sample:

• 1 vector
• the vector has sum 20
• there are three outcomes with probabilities (proportional to) "(1, 2, 3)".

The goal of the exercise below is to show that the empirical correlations in the multinomial distribution converge to the
theoretical values given by the formulae:

Example - Sampling from a multinomial distribution

1. Generate "500" vectors of a multinomial distribution from our


rand=matrix(rep(0,3*500),ncol=3)
for (i in 1:500)
{
rand[i,]=rmultinom(1,20,c(1,2,3))
}

2. With the functions "cov" and "cor" compute the variance/covariance matrix and the correlation matrix
of the sampled data

mean(rand[,1])
var(rand[,1])
cov(rand)
cor(rand)

3. Compare with the theoretical covariances and correlations

9.8 Sampling from a (continuous) parametric distribution


Also for continuous random variables, several parametric families can be sampled through suitable functions such
as "rbeta", "rexp" and so on. The first step is the generation of a "U[0, 1]" random variable. There are optimized
algorithm for the inversion of the cdf. In most cases the cdf is strictly monotone, and nevertheless the problem is
not easy. For instance in the case of the "rnorm" function, the cdf itself has no closed-form expression.
When the cdf is easy, we can follow the steps above.

Example - Sampling from the exponential and uniform distributions

Generate a sample from the exponential distribution "Exp(0.2)" (rate "λ = 0.2").


1. Write the cdf and its inverse
2. Generate a random sample from the uniform "U[0, 1]" distribution with "runif "
3. Use the inverse cdf to define your sample

4. Plot the empirical cdf and compare with the theoretical cdf with

plot(ecdf(x))
x=sort(x)
lines(x,pexp(x,0.2))

where "x" is the vector containing the sample.

9.9 The empirical cumulative distribution function (cdf)


So far we have used the empirical cdf and now we take a closer look at this object.

Empirical cumulative distribution function

If we have a random sample "X₁, . . . , Xₙ" of size "n", then the empirical distribution function "F̂ₙ" is the cdf of
the distribution that puts mass "1/n" at each data point "xᵢ". Thus:

F̂ₙ(x) = (1/n) Σᵢ₌₁ⁿ 1(Xᵢ ≤ x)

where "1(·)" stands for the indicator function, i.e. the function which equals "1" if the statement in brackets
is true, and "0" otherwise.
In other words, "F̂n (x)" shows the fraction of observations with a value smaller than or equal to "x".

Theorem - Unbiased and consistent estimator
The behavior of this empirical distribution function is exactly what we expect as it converges to the correct
value. It is an unbiased estimator of the theoretical distribution function because its expected value is
equal to the theoretical distribution function and it is a consistent estimator because its variance goes to
"0" as the sample size grows to infinity.
For all "x ∈ R":
F (x)(1 − F (X))
E[F̂n (x)] = F (X) V AR[F̂n (x)] =
n
Thus for "n → ∞" we have that "F̂n (x) → F (x)". Actually a stronger result holds.

Glivenko-Cantelli theorem
If "X1 , . . . , Xn " is a random sample from a distribution with cdf "F ", then the distance between the empirical
distribution function and the theoretical distribution function goes to zero as "n" goes to infinity:

sup_{x∈R} |F̂ₙ(x) − F(x)| → 0

9.10 The bootstrap


We now investigate a non-parametric method of estimation: this means that we have a sample and we don’t assume
any information about the sample a priori, in particular we don’t assume any shape of the probability distribution.

Bootstrap

The bootstrap is a family of methods that resamples from the original data.

From the population (the first line of the picture), which is composed of balls with different colors, we consider a sample (the
second line), which is our data of "N" elements.
If we cannot assume any information about the probability distribution of the random variable over the
population (we cannot say, for example, whether it is a Poisson or a Normal distribution), then our only information is
in the sample. The main idea of the bootstrap is that the best approximation of the underlying distribution is the
empirical distribution of the sample itself. So if we have no information about the population (the first line), we
assume that the sample is the best possible information about the shape of the distribution of the population.
The motivation of this reasoning can be found exactly in the theorem we previously saw. So the idea is to use our
sample as a population.
We draw samples from the original sample, and this is why we call this procedure "resampling". We repeat this process
several times, we compute the statistics of interest over the bootstrap samples, and the values based on these
samples are approximations of the distribution in the population. Of course we consider samples with replacement,
because we draw samples of the same size as the original one.
With this procedure we can estimate several parameters:

• Assessment of the variability (i.e. estimation of the variance "σ 2 ")


• Computation of confidence intervals

• Statistical tests

Non parametric framework

The bootstrap is non-parametric in the sense that we can do the previous tasks without making any as-
sumptions on the distributions of the variables (we don’t make any assumptions about the shape of our
distribution).

The bootstrap is based on the idea that the observed sample is the best estimate of the underlying distribution.
Thus, the bootstrap is formed by two main ingredients:

• The plug-in principle: the sample we take and consider is taken as a population. This means that the em-
pirical estimate (for example for a parameter) on the sample is now the unknown parameter in the population.
This means that we need a further estimation step which is the following point.
• A Monte Carlo method to approximate the quantities of interest based on samples taken with replacement
from the original sample and with the same size. This means that we do computations via resampling and
simulations and not with the exact analytical argument.

9.10.1 A general picture for the bootstrap

If we consider the "real world" we follow a precise path: we take the population, we consider a random sample and
from it we estimate a function: we estimate some parameters using an estimator which is a function of the sample.
If we consider the "bootstrap world" we have an estimated population which is our original sample. Here the
estimation is performed by considering bootstrap datasets (bootstrap samples).

The real world is a good way of reasoning if operating in a parametric framework but can’t be applied in a non-
parametric framework: this obliges us to move to the bootstrap world. Notice that in the bootstrap "universe" we
denote the elements with the notation "∗ ".

9.10.2 How to do it

We can now investigate this technique applied to the simplest case which is the estimation of a parameter.
We take a sample "X1 , . . . , Xn " and suppose that "θ" is our parameter of interest, and let "θ̂" be its estimate on
the sample. In order to obtain an approximate distribution of the parameter "θ̂" we can use the bootstrap method
without making any assumption on the distribution of the sample. The steps are:

1. Take a bootstrap sample "X1∗ , . . . , Xn∗ " from "X1 , . . . , Xn " of the same sample size
2. Compute the estimate "θ∗ " on the bootstrap sample

3. We repeat the previous steps a large number of times in order to have enough values to obtain a good
approximation of the distribution.

How to approximate the distribution of the sample mean "X̄" in a non-parametric framework? With the
bootstrap.

1. Take a bootstrap sample "X1∗ , . . . , Xn∗ " from "X1 , . . . , Xn "

2. Compute the estimate "barx∗ " on the bootstrap sample.


3. Repeat items "1" and "2" to have enough values to obtain a good approximation of the distribution.

We now see an example:

Example - Boostrap for the mean

Given a sample of size "8":


12 14 15 15 20 21 30 47
compute the approximated distribution of "X̄".

A standard number for the bootstrap replication is "5000": we then define a vector which will contain the
values of the average of all the bootstrap samples and then, each bootstrap sample in the "for" cycle is defined
as a sample with replacement from our population with the same sample size of our original sample. In the
following code we sample from the vector "x" a sample of the same size "8" with replacement.
We then consider the mean of each bootstrap sample and we store each value inside the vector "bootx". In
the end we plot the graph to visualize the results:

x=c(12,14,15,15,20,21,30,47)
B=5000
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T) #This is the M.C. method#
bootx[i]=mean(boots)
}
plot(density(bootx),lwd=2)

Here is the plot:

Since it is an empirical function it is not perfectly estimated: we see it looks like a right tailed distribution.
Remark: due to the randomness of the simulations, different executions give rise to (slightly) different plots
(at each replication we obtain different random numbers).
We can then, of course for a much smaller number of iterations, look inside of our for cycle, printing the
samples and the corresponding mean. We consider for example "B = 10":

x=c(12,14,15,15,20,21,30,47)
B=10
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
print(boots)
bootx[i]=mean(boots)
print(bootx[i])
}

Doing so we can see that we obtain different bootstrap samples with different means: the histogram of the
density estimation of all these values generates the density estimation of the distribution of our sample mean.
In this case we don’t consider the plot of the density as "10" isn’t sufficiently large.

So the bootstrap methodology is especially useful when dealing with small samples (in the last example we just had
"8" elements) as we didn’t have any tool to deal with them.
Of course the bootstrap for large samples is still useful and interesting for different parameters when the Central
Limit Theorem doesn’t hold.

9.11 Estimating the bias of an estimator
Let "θ" be a parameter of interest and "θ̂" its sample estimate. We want to provide an estimate of the bias of "θ̂".

Boostrap bias estimation

The bootstrap estimate of the bias of an estimator based on "B" bootstrap samples is:

bboot (θ̂) = θ̄∗ − θ̂

where "θ̄∗ " is the average of the bootstrap estimates.

9.12 Estimating the standard deviation of an estimator


Let "θ" be a parameter of interest and "θ̂" its sample estimate. We want to provide an estimate of the standard
deviation of "θ̂".

Standard deviation estimation


The bootstrap estimate of the standard deviation based on "B" bootstrap samples is:
s_boot(θ̂) = √( (1/(B − 1)) Σᵢ₌₁ᴮ (θᵢ∗ − θ̄∗)² )

where "θi∗ " is the estimate on the i-th bootstrap sample and "θ̄∗ " is the average of the bootstrap estimates.

9.13 Advantages of the bootstrap


The bootstrap technique does not need any assumption about the distribution of the data. The bootstrap can be
used for (almost) any parameter of interest.

Exercise - Estimate the standard deviation


Based on the sample
12 14 15 15 20 21 30 47
estimate the standard deviation of the sample median.

x=c(12,14,15,15,20,21,30,47)
B=5000
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
bootx[i]=median(boots)
}
sd(bootx)

[1] 4.043891 #of course you may obtain slightly different results#

9.14 Confidence intervals: the basic method
There are several methods to define confidence intervals based on the bootstrap.

Confidence intervals - basic method

Based on "B" bootstrap samples, compute the distribution of "θ̂∗ ". A confidence interval for "θ" with
confidence level "1 − α" is:  
∗ ∗
θ̂α/2 , θ̂1−α/2

where "θ̂α∗/2 " and "θ̂1−



α/2 " are the " 2 " and "1 − 2 " quantiles of "θ̂ ".
α α ∗

This is the trivial way to define a confidence interval, just take the relevant percentiles of the empirical
bootstrap distribution "θ̂∗ ".

The bootstrap method is very flexible for every parameter we want to estimate: it’s an empirical distribution so
we don’t need any information a priori. For example now we try to compute the confidence interval for the median:

Example - Confidence interval for the median

x=c(12,14,15,15,20,21,30,47)
B=5000
bootx=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
bootx[i]=median(boots)
} #we compute the interval#
int=c(quantile(bootx,0.025),quantile(bootx,0.975)) #2.5% in both tails#

9.15 Confidence intervals: the percentile method


A slightly more refined way is the so-called "percentile method". We consider a distribution centered around the
"θ̂". Since a confidence interval for "θ" is of the form:

(θ̂ + δ1 , θ̂ + δ2 )

in the "bootstrap world" we approximate "δ = θ̂∗ − θ̂".

The percentile method

Based on "B" bootstrap samples, compute the distribution of "δ ∗ ". A confidence interval for "θ" with
confidence level "1 − α" is:
( θ̂ − δ∗_{1−α/2} , θ̂ − δ∗_{α/2} )

Note that in this expression the two quantiles are in a reverse order.

We can easily repeat the process using Software R:

Exercise - Confidence Interval for the median (percentile)

x=c(12,14,15,15,20,21,30,47)
B=5000
med=median(x)
bootdelta=rep(NA,B)
for (i in 1:B)
{
boots=sample(x,8,replace=T)
bootdelta[i]=median(boots)-med
}
int=c(med-quantile(bootdelta,0.975), med-quantile(bootdelta,0.025))

Every bootstrap procedure is based on an unconditional cycle where we consider the bootstrap sample with
replacement from the original sample (and with the same size). Then we compute the statistic of interest, we store
the values on a vector and then we can compute, for example, the standard deviation, the variance, the CI and so
on.

Exercise - Confidence Interval for the non-parametric skewness (percentile)

1. Generate a sample of size "15" from an exponential distribution with mean "1.4".
2. Use the bootstrap percentile method to define a confidence interval for the non-parametric skewness:

W = ( Me(X) − µ ) / σ

Remark: the non-parametric skewness "W " is a measure of asymmetry. If "W > 0" it means we have
a left-skewed distribution, while "W < 0" means we have a right-skewed distribution. A "W = 0" means we
have a symmetric distribution since the mean and the median coincide.
Since the exponential distribution is skewed to the right (its mean exceeds its median), with this definition we should
obtain a negative value for the parameter "W".
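A minimal sketch of a possible solution (assuming "B = 5000" replications; with the definition above, "W" is estimated on a sample by "(median(x) − mean(x))/sd(x)"):

x = rexp(15, rate = 1/1.4)      # sample of size 15 with mean 1.4
What = (median(x) - mean(x)) / sd(x)
B = 5000
bootdelta = rep(NA, B)
for (i in 1:B)
{
boots = sample(x, 15, replace = T)
Wstar = (median(boots) - mean(boots)) / sd(boots)
bootdelta[i] = Wstar - What
}
int = c(What - quantile(bootdelta, 0.975), What - quantile(bootdelta, 0.025))
int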

9.16 Multivariate and time series and more


The theory developed here has been focused on univariate i.i.d. samples. The bootstrap can be adapted to more
complex situations, but with care:

• For multivariate analyses, resampling must be for paired observations (we will consider this case in the
Lab)
• For time series, resampling must consider blocks of consecutive observations.

• In regressions, resampling is usually applied to the residuals.

9.17 Software R package


There is an R package for working with the bootstrap. It is the package "boot". However, it is not always easy to
use and for our purposes it is better to define the bootstrap resampling algorithm by ourselves.

10 Lab 4
This lab is an exercise about bootstrap: in particular it’s a bootstrap for the computation of confidence intervals.

10.1 Simulation - The bootstrap


We consider a dataset containing a random sample of "25" used Mustangs being offered for sale on a website. The
dataset contains information on price (in thousands of dollars), mileage (in thousands of miles), and age (in years)
of these cars. The data are in the "csv" file "mustang.csv".

1. Import the data set into Software R.


In order to import the data we can use the "import dataset" tool or alternatively the code:

library(readr)
mustang <- read_csv(". . ./mustang.csv")
View(mustang)

2. Compute the basic descriptive statistics for the three variables, including the correlation matrix. What is the
average price of a used Mustang in this sample?
We now take a look at the data and give a basic description: we consider the mean, the standard deviation
(variance) and we check for outliers with the boxplot. In this case we just have three quantitative variables
but for categorical variables we can also use "table" and "barplot" to check how many levels each variable has.
Here we also compute the correlation matrix.

dim(mustang) #check the dimensions#


attach(mustang)

summary(age)
sd(age) #sd to check the dispersion of the r.v.#
boxplot(age, horizontal=TRUE) #graphical summary of the values#

summary(miles)
sd(miles)
boxplot(miles, horizontal=TRUE)

summary(price)
sd(price)
boxplot(price, horizontal=TRUE)

mean(age)
mean(miles)
mean(price)
var(age)
var(miles)
var(price)

cor(mustang)

From the output of the correlation matrix we notice that the "price" is negatively correlated both with "age"
and "miles".
3. Using this sample we would like to construct a bootstrap confidence interval for the average price of all used
Mustangs sold on this website. To take for instance "1000" bootstrap samples and record the means use:

boot_means = rep(NA, 1000)


for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_means[i] = mean(boot_sample)
}

Now, the confidence limits are the appropriate percentiles of "boot_means".


In the past lectures we have seen that every bootstrap procedure is based on a "for" cycle where we generate
the bootstrap samples. On each of them we compute the parameter of interest (in this case we are
considering the mean) and we store the values in a suitable vector (in this case we obtain a vector with
"1000" entries since we have "1000" replications). So we run the code proposed by the exercise and then compute
the interval (the two quantiles) [in this case we considered a 95% confidence interval]. We also calculate the
"actual" mean of the original sample in order to see whether it lies near the center of the confidence interval.

boot_means = rep(NA, 1000) #we replicate 1000 times#


for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_means[i] = mean(boot_sample)
}

int=c(quantile(boot_means,0.025), quantile(boot_means,0.975)) #take the quantiles#


int #compute the interval#

mean(price)

Since we are operating a simulation we will obtain slightly different results every time we run the code. To
reduce the variability in this output we could compute an increased number of replications (for example 5000).
Remember that "1000" is the number of samples for which we estimate the density of the estimator: by
increasing its value we obtain a better density estimation.

4. Write your confidence interval based on "B = 1000" bootstrap samples, and compare it with the confidence
interval given by the "t.test" function where normality is assumed.
We compute the "t.test" function (the Student t Test) for the mean, which also calculates the confidence interval
(it computes the default 95% interval). So the idea here is to compare the bootstrap confidence interval with
the parametric confidence interval. Here the normality of the data is assumed.

t.test(price)

> One Sample t-test


data: price
t = 7.1894, df = 24, p-value = 1.981e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
11.39252 20.56748
sample estimates:
mean of x
15.98

The null hypothesis of the test checks if the mean is equal to zero whilst the alternative hypothesis checks that
the mean is different from zero. Here we are only interested in the part regarding the confidence interval: notice
that with the bootstrap simulation we obtained a slightly smaller confidence interval (it is a better result):

CIboot : [12.1396; 20.4009] CIt.test : [11.39252; 20.56748]

We can conclude that bootstrap is a good family of techniques which leads to an adaptation of the results to the
actual shape of the data. This means that usually the results are better with bootstrap than with the normality
assumption: there’s of course one exception which is when the data are exactly normal.

plot(density(price))

If we consider the density estimation we see that the distribution has a wide right tail, but in general the shape
recalls the normal distribution.

5. Plot the bootstrap distribution using the density estimation.
We now plot the density estimation of the sample mean based on the bootstrap replications:

plot(density(boot_means))

6. Compute a bootstrap confidence interval for the standard deviation.


We now move from the "mean" to the "standard deviation" parameter (we could also use the coefficient of
variation, the asymmetry coefficient or any other parameter we want). In order to do that we just need to
adapt the previous code we used for the mean [so if we need the confidence interval for other parameters we
just need to change the name into the vector]:

boot_sds = rep(NA, 1000)


for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_sds[i] = sd(boot_sample)
}
int=c(quantile(boot_sds,0.025), quantile(boot_sds,0.975))
sd(price)
int

The lower and upper bounds are respectively "(7.069764, 13.84862)": we can see that the standard deviation of the
sample is "11.11362", a value between the two bounds of our interval.
7. Try to compute a confidence interval for the minimum and compare with the actual minimum in the original
sample.
This last point is to check the case where bootstrap isn’t actually the correct answer for the estimation of a
parameter. If we try to compute the bootstrap estimation for the minimum (or the maximum as it is the
same for our analysis) we see that the parameter is really difficult to estimate. We then compute the actual
minimum (maximum) of our sample and then do a comparison:

boot_mins = rep(NA, 1000)


for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_mins[i] = min(boot_sample)
}

int=c(quantile(boot_mins,0.025), quantile(boot_mins,0.975))
min(price)
int

As we can see, the confidence interval for the minimum using the bootstrap is between "3" and "7", but the
actual minimum of our sample is "3". So, for example, if we consider our sample points we see that the sample
minimum would be the red point in the figure. Since it is an actual value of the distribution, the minimum in the
population must lie below the sample minimum: this means that we are certain that the minimum lies below the
red point. But if we resample from our sample and then compute the minimum, the minima of the bootstrap
samples all lie on the right-hand side.

The "min" and the "max" are then two special parameters of the distribution: even using things like normal
confidence intervals based on large samples (we assume for example a distribution like the green one and then
we consider the confidence interval) we obtain wrong results, because one half of the distribution lies on the
wrong part of the graph [the Central Limit Theorem does not hold].

In this case the right distribution would be the so called Gumbel distribution: we need an asymmetric probability
distribution with respect to the data in order to consider the confidence interval. The same holds of course
symmetrically for the maximum.

Notice that even in the parametric case using likelihood we can’t use the Central Limit theorem to approximate
the distribution of the estimator because the minimum (maximum) isn’t the center of the distribution. As we
just said before in order to perform this we need completely asymmetric distributions (on the left and on the
right).

COMPLETE EXERCISE

###1)
library(readr)
mustang <- read_csv(". . ./mustang.csv")
View(mustang)

###2)
dim(mustang) #check the dimensions#
attach(mustang)
summary(age)
sd(age) #sd to check the dispersion of the r.v.#
boxplot(age, horizontal=TRUE) #graphical summary of the values#
summary(miles)
sd(miles)
boxplot(miles, horizontal=TRUE)
summary(price)
sd(price)
boxplot(price, horizontal=TRUE)
mean(age)
mean(miles)
mean(price)
var(age)
var(miles)
var(price)
cor(mustang)

###3)
boot_means = rep(NA, 1000) #we replicate 1000 times#
for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_means[i] = mean(boot_sample)
}
int=c(quantile(boot_means,0.025), quantile(boot_means,0.975)) #take the quantiles#
int #compute the interval#
mean(price)

###4)
t.test(price)
plot(density(price))

###5)
plot(density(boot_means))

###6)
boot_sds = rep(NA, 1000)
for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)
boot_sds[i] = sd(boot_sample)
}
int=c(quantile(boot_sds,0.025), quantile(boot_sds,0.975))
sd(price)
int
###7)
boot_mins = rep(NA, 1000)
for(i in 1:1000){
boot_sample = sample(mustang$price, 25, replace = TRUE)

boot_mins[i] = min(boot_sample)
}
int=c(quantile(boot_mins,0.025), quantile(boot_mins,0.975))
min(price)
int

11 Regression: parametric and nonparametric approaches
11.1 Introduction
Linear regression is the first statistical model to investigate relationships, dependencies and causality. Linear
regression is designed for the case when the response variable is quantitative (and under some further assumptions
we will discuss later). For linear regression we need:

• one response variable "Y " (quantitative)


• one or more (non-random) predictors or explanatory variables or regressors "X1 , . . . , Xp ". They aren’t
random variables so we don’t have to fix any probability distribution for them (the predictors are considered
as fixed).

If "p = 1" (the number of linear predicotrs) we have a simple regression, if "p > 1" we have a multiple regression.

11.2 Linear regression model

Linear regression model

The linear regression model is:

Y = β₀ + β₁X₁ + β₂X₂ + · · · + βₚXₚ + ε

where we have that:


• "Y " is the response variable

• "β0 + · · · + βp Xp " are the linear predictors


• "ε" is a symmetric zero-mean random variable (the error term). In general this term is normally
distributed around "0".
The complexity of the model is of course determined by the number "p" of predictors.

The parameters "β0 , . . . , βp " are unknown and must be estimated from the data. In vector notation we write:

β = (β0 , . . . , βp )

Remark

In regression analysis the predictors are considered as deterministic (fixed values). The random variable is
only on the residuals, and thus on the response variable.

11.3 Estimation of "β"


To estimate the vector "β" we compute the predicted values:

ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + · · · + β̂p xip

and we choose the "β̂" that minimizes the sum of the squared residuals (least squares method). The residual sum
of squares (RSS) is the sum of all the observations (statistical units), of the square difference between the observed
value of the response variable and the predicted value of the response variable.
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

For the simple regression (only one regressor "X1 ") the expressions are simple:

Slope: β̂₁ = COV(X₁, Y) / VAR(X₁)        Intercept: β̂₀ = ȳ − β̂₁ x̄₁

while for multiple regression, the expression is somewhat more complicated and it is usually expressed in terms of
the design matrix. This matrix contains all the predictors column by column and then a column made up just
by "1" for the intercept. It is defined as:
X = [1, X1 , X2 , . . . , Xp ]
Also, let "y = (y1 , . . . , yn )t " the vector of the observed responses. Assume for example that we have:

p=2 n=3

Y = Xβ + ε
        | y1 |        | 1  x11  x12 |        | β0 |        | ε1 |
    Y = | y2 |    X = | 1  x21  x22 |    β = | β1 |    ε = | ε2 |
        | y3 |        | 1  x31  x32 |        | β2 |        | ε3 |
Then the minimization of the RSS is reached with:

β̂ = (X t X)−1 X t y

This is the least square (LS) estimate of the parameters.
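A minimal sketch in Software R comparing the explicit formula with the "lm" function (the data are simulated; sample size, coefficients and noise level are arbitrary choices):

set.seed(1)
n = 50
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + 2*x1 - 0.5*x2 + rnorm(n)        # true beta = (1, 2, -0.5)
X = cbind(1, x1, x2)                    # design matrix with the column of 1's
betahat = solve(t(X) %*% X) %*% t(X) %*% y
betahat
coef(lm(y ~ x1 + x2))                   # same estimates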

11.4 Estimation of "β" with Gaussian residuals


We now need to apply statistical models: we need to fix a random variable for the response variable. This is a
simple task in the regression framework because the linear predictor is non-random: in order to define a probability
distribution for the "Y " we just need to define a probability distribution for the error term (in this case we consider
a normal distribution). This means that to perform inference we need to fix the distribution of the residuals (and
thus of the response variable).

Gaussian linear model


The Gaussian linear model is:

Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βp Xip + εi = (Xβ)i + εi

with independent "εi ∼ N (0, σ 2 )". Notice that the variance is the same for all the statistical units.

Thus we have (in a more concise form) that each response variable is a normal random variable because it’s the
sum of a constant plus a normal random variable. So our response variable in the sample is normally distributed:

Yi ∼ N ((Xβ)i , σ 2 )

Incidentally, in this case the LS estimator is also the solution of the MLE problem (the equations and solutions
become the same). This is the first time we deal with a sample of independent random variables but not identically
distributed: this is because each response variable has its own "mean" (we don’t have the same distribution for
all the random variables). If we try to plot this in the simple regression case we have that for each value of the
predictor we have a normal distribution which moves along the regression line.

So our sample on the "Y " regression is a set of normally distributed random variables but the symmetry of all
the random variables follows the regression line: for each value of the predictors we have a different mean of the
response variable (the shape of the Gaussian distribution is the same since we have the same variance).

11.5 Properties of the LS estimator


Denote as usual with "B" the LS estimator of "β", i.e.:

B = (X t X)−1 X t Y

we can prove that:

• The LS estimator "B" is unbiased (the expected value of the estimator is equal to the parameter):

E(B) = β

Here is the demonstration:

E(B) = E((Xᵗ X)⁻¹ Xᵗ Y) = E((Xᵗ X)⁻¹ Xᵗ (Xβ + ε))
     = E((Xᵗ X)⁻¹ Xᵗ Xβ + (Xᵗ X)⁻¹ Xᵗ ε) = E((Xᵗ X)⁻¹ Xᵗ Xβ) + E((Xᵗ X)⁻¹ Xᵗ ε)

The first term is a constant (a number), since it contains neither "Y" nor "ε": indeed (Xᵗ X)⁻¹ Xᵗ Xβ = β, so its
expected value is "β". In the second term we have "ε", and so we obtain:

E(B) = β + (Xᵗ X)⁻¹ Xᵗ E(ε) = β

since the mean of the error is "0", so the last component disappears.
• If the error terms are uncorrelated with equal variance "σ 2 ", the variance/covariance matrix of "B" is:

COV (B) = σ 2 (X t X)−1

From this equation we can see that for uncorrelated predictors the components of the vector "B" are uncor-
related. This is important because if we have predictors with a large correlation we face serious problems:
if we have correlation on the "B" and we have an error in the component, there’s a high probability to have
an error in all the components. It is indeed very important in regression models to avoid highly correlated
predictors (this problem is called multicollinearity). This is why regression modeling is problematic in very
large datasets: if we have a large number of predictors there’s a high probability of having correlation between
the predictors (this may have an impact on the estimation and on the testing) [the last part of this course
will be about the model selection].

11.6 The Gauss-Markov theorem
This last point is about the selection of the consistency of the estimator.

Gauss-Markov Theorem
The Gauss-Markov theorem says that the LS estimator "B" is BLUE: under the assumption of error terms
uncorrelated and with equal variance "σ 2 ", the LS estimator:

B = (X t X)−1 X t Y

is BLUE (Best Linear Unbiased Estimator). This means in particular that, among all linear unbiased estimators
"B̃ = CY", the LS estimator has the smallest variance:

V AR(Bj ) ≤ V AR(B̃j )

11.7 Questions
1. Is at least one of the predictors "X1 , X2 , . . . , Xp " useful in predicting the response?
If the answer is no we can’t use our data since there’s no significant connection between the "Y " and the
predictors. This means that the model isn’t useful at all.

2. Do all the predictors help to explain "Y ", or is only a subset of the predictors useful?
Given that we have at least one useful predictor in our set of predictors, can we use a smaller subset of
predictors? Smaller sets of predictors are always more preferable.
3. How well does the model fit the data?
Here we compute the goodness of fit of the model to the data.
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
5. Can we incorporate categorical predictors and interactions between the predictors?

Advertising (ISLR)

Here we have a sample of "200" markets: our objective is to explore the relations among the "sales" of a prod-
uct and the advertising budget on "TV ", "radio", and "newspaper". The data is in the file "Advertising.csv".
• Response variable: "sales"
• Predictors: "TV ", "radio", "newspaper"
Here we perform a linear regression by using the "lm" function: notice that we have the response variable
"sales", then the "∼", and then all the regressors separated by "+".

model=lm(sales~TV+radio+newspaper,data=Advertising)
summary(model)
anova(model)

The output of the "summary" is:

Call:
lm(formula = sales ~ TV + radio + newspaper, data = Advertising)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

> anova(model)

Analysis of Variance Table


Response: sales
Df Sum Sq Mean Sq F value Pr(>F)
TV 1 3314.6 3314.6 1166.7308 <2e-16 ***
radio 1 1545.6 1545.6 544.0501 <2e-16 ***
newspaper 1 0.1 0.1 0.0312 0.8599
Residuals 196 556.8 2.8
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

11.8 F-Test
The first question is: "is at least one of the predictors "X1 , X2 , . . . , Xp " useful in predicting the response?".
Let us consider two nested models (nested means that we have a large model with a smaller model inside of it).
So we have:

• The complete model (large model) with predicted values:


ŷᵢ⁽ᶜ⁾ = b₀ + b₁xᵢ₁ + b₂xᵢ₂ + · · · + bₚxᵢₚ

• The empty or null model:


ŷᵢ⁽ʳ⁾ = b₀ = ȳ

Ideally we want to compare the residuals of the two models. So we define the residual sum of squares for both the
complete and the reduced model:
RSSC = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ⁽ᶜ⁾)²        RSSR = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ⁽ʳ⁾)² = SSY

Of course the complete model fits the data more closely than the reduced model, since it has a larger number
of parameters:
RSSR > RSSC
We again use the hypothesis test method in order to test the validity of the reduced model. If we accept the reduced
model it means that there’s no significant predictor in our model: we accept a model with only the intercept. We
define:
H0 : β1 = · · · = βp = 0 H1 : ∃βj ̸= 0

Fisher’s test statistic


The test statistic is:
F = ( (RSSR − RSSC)/p ) / ( RSSC/(n − p − 1) )

The test statistic is based exactly on the difference between the RSS of the empty and of the complete model.
If this difference is sufficiently small then we accept "H₀" (there are no significant predictors). Fortunately
we do not have to compute this value by hand, since we can use the p-value (the last line of our R output).

If the errors "εi " are i.i.d. normally distributed, then "F " has a known distribution under "Ho ", named as "the
Fisher’s distribution with "p" degrees of freedom in the numerator and "(n − p − 1)" degrees of freedom in the
denominator". The test is a one-tailed right test.

The F-test can be used in general for testing the nullity of a set of "q" coefficients. In such a case let us consider
two nested models:

• The complete model with predicted values:


ŷᵢ⁽ᶜ⁾ = b₀ + b₁xᵢ₁ + b₂xᵢ₂ + · · · + bₚxᵢₚ

• A reduced model obtained by deleting "q" predictors, with predicted values:


ŷᵢ⁽ʳ⁾ = b₀ + b_{q+1} x_{i,q+1} + b_{q+2} x_{i,q+2} + · · · + bₚ x_{ip}

To test the validity of the reduced model we define:

H0 : β1 = · · · = βq = 0 H1 : ∃βj ̸= 0

The test statistic is:


F = ( (RSSR − RSSC)/q ) / ( RSSC/(n − p − 1) )

with Fisher distribution with "q" degrees of freedom in the numerator and "(n − p − 1)" degrees of freedom in the
denominator.
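A minimal sketch with the Advertising data (assuming the "model" object fitted above; here the reduced model deletes the "newspaper" predictor, so "q = 1"; the name "model.red" is ours):

model.red=lm(sales~TV+radio,data=Advertising)
anova(model.red,model)   #F-test comparing the reduced and the complete model#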

11.9 The Student’s "t" test


The second question is: "do all the predictors help to explain "Y ", or is only a subset of the predictors useful?"
The choice of a reduced model is a delicate issue, especially when the number of predictors is large. We will
consider this topic later in the next lectures. Here we present a simple test for the significance of a single coefficient.
In the multiple regression, the coefficient "βj " can be read off as:
βⱼ = ∂Y/∂Xⱼ
so "βj " accounts for the variation of "Y " with respect to "Xj ", all other quantities being fixed. The parameter "βj "
is estimated with "Bj " and, from our previous theory, we know both the expected value and the variance of the
estimator "Bj ", so we can standardize and define the test statistic.
For a given "j = 1, . . . , p" we define the hypotheses on the single parameter:

H₀ : βⱼ = 0        H₁ : βⱼ ≠ 0

Student’s test statistic


The test statistic is:
Tⱼ = Bⱼ / √( COV(B)ⱼⱼ )
Under "H0 " the statistic "Tj " has a Student’s t distribution with "(n − p − 1)" degrees of freedom. The test
is (usually) a two-tailed test.

11.10 "R2 " coefficient (coefficiente di determinazione)


Third question: "how well does the model fit the data?"
The first numerical summary is the coefficient "R2 ":
R² = 1 − RSSC/SSY        0 ≤ R² ≤ 1
where the term "RSSC " is also called error variance and the term "SSY " is called total variance. If this ratio
is small it means that we have a good model since we have a small error with respect to the total variance. As in
the simple linear regression, "R2 " is the fraction of the variance of "Y " explained by the model.

There are also other possible choices:

• The adjusted "R2 ", which takes into accounts the number of predictors:
RSSC /(n − p − 1)
2
Radj =1− SSY /(n − 1)

• The residual standard error "RSE":


s
1
RSE = RSSC
(n − p − 1)

but we will discuss these numbers later in this class.

11.11 The plot of the residuals


The plot of the residuals is important because it prevents us from using the linear model in the cases where the
dependence is non-linear (if for example we have dependence in a quadratic form we can of course compute the
regression line [we always can] but it will be faulty). To assess the validity of the linear model, in the simple
regression you can plot the scatter-plot of "Y " and "X" and add the regression line (this is no longer possible in
the context of multiple regression). The solution is to plot the residuals vs the predicted values (the residuals are
indeed the difference between the observed and the predicted "Y "):

e = Y − Ŷ

So the idea is to check possible non-linear relationships between the "Y " and the predictors by using the residuals.

Independency

It can be shown that expected values and residuals are independent

The residuals, unlike the errors, do not all have the same variance: it decreases as the corresponding x-values
get farther from the mean.

Studentized residuals

The Studentized residuals are (we standardize them by dividing the residuals by their standard deviation):
ẽᵢ = eᵢ / ( RSE √(1 − hᵢᵢ) )
where "hii " is the i-th diagonal element of the hat matrix:

H = X(X t X)−1 X t

It is very important to use Studentized residuals instead of the standard residuals because with them we
can check if the residual is large or not: for Studentized residuals the standard bounds are "[−2, 2]" and so
all the values outside this interval are considered "unusual" (a sort of outlier of the residuals).

In a good regression model, the Studentized residuals should not present any pattern and, as we just said,
observations with Studentized residual outside the interval "[−2; 2]" should be regarded as outliers. This is our case:

here we have the "predictors" and the "Studentized predictors" on the axes. We take the predicted values "Ŷ " over
the abscissae axis and Studentized residuals "ẽ" over the ordinate axis. By looking at the graph we understand
we have a good regression when we don’t obtain any shape (we should obtain a regular cloud of points inside the
standard interval).
In our case we have two main problems: the first one is that we have a cloud of values which generates a particular
shape, and second we have several outliers (values outside the standard interval "[-2,2]") on the bottom part of the
graph.
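A minimal sketch of this diagnostic plot in Software R (assuming the "model" object fitted on the Advertising data; "rstandard" computes exactly the Studentized residuals defined above):

plot(fitted(model), rstandard(model),
     xlab="Predicted values", ylab="Studentized residuals")
abline(h=c(-2,2), lty=2)   #standard bounds: points outside are potential outliers#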

11.12 Prediction
In this last point we see how to use the regression for the prediction. This is the first part of "true statistical model":
we have data points, from the data points we compute an equation and then we use the equation to predict values
of the response variable, also for data for which we don’t have the response variable. While computing the predicted
value isn’t that hard (we just need to plug the "X" into the equation and obtain the prediction), in this framework
computing the variance of the prediction (computing the confidence interval) is a bit more difficult. Prediction for
new observations is indeed made through confidence intervals. There are two kinds of such intervals:

• CI’s for the mean predicted value [prediction of the random variable]

• CI’s for a specific new value (forecast which is the new observation) [it is a confidence interval for the outcome
of the random variable]

We use the Software R function "predict" with "interval="confidence"" for means and "interval="prediction""
for forecasts. For instance:

predict(model,newdata=data.frame(TV=150,radio=50,newspaper=50), interval="prediction")

> fit lwr upr


1 19.17821 15.81828 22.53815

where "model" is the output of the "lm" function containing the equation. The output is the prediction for a new
observation which invests "150" for "TV", "50" for "radio" and "50" for "newspaper". From the output we see that
the predicted value for this company is "19.18", with a prediction interval of "[15.82; 22.54]".

11.13 Working with qualitative predictors
In our previous example we only worked with quantitative variables, but predictors are not random variables (there is no randomness in them), so nothing forces them to be numerical. As predictors we can therefore use quantitative variables as well as qualitative ones (for example "the type of the company" or its "location"). In the latter case we have a predictor but no "number" attached to it, so it cannot enter the equation directly. When we need to use a qualitative predictor with "k" levels, we need:

• to create "k − 1" dummy variables (indicator variables with "1" on a given level, and "0" otherwise). Notice
that we can’t take all the "k" elements because we would obtain a "non full rank" matrix and we wouldn’t
compute the MLE and the LS estimator.
• to test the significance of the predictor with an F-test, since "(k − 1)" coefficients are involved simultaneously

This step is done by the "lm()" function, provided that the system correctly classifies the qualitative variable.

11.14 Interactions
To check for interactions between the predictors we need to add the product of two (or more) columns to our design matrix. For instance, to include the interaction "TV-radio" we replace the "+" between the two predictors with a "∗": in the R formula notation "TV*radio" stands for the two main effects plus their product (the interaction term):

model.wi=lm(sales~TV*radio+newspaper,data=data)
summary(model.wi)

> Call:
lm(formula = sales ~ TV * radio + newspaper, data = data)
Residuals:
Min 1Q Median 3Q Max
-6.2929 -0.3983 0.1811 0.5957 1.5009
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.728e+00 2.533e-01 26.561 < 2e-16 ***
TV 1.907e-02 1.509e-03 12.633 < 2e-16 ***
radio 2.799e-02 9.141e-03 3.062 0.00251 **
newspaper 1.444e-03 3.295e-03 0.438 0.66169
TV:radio 1.087e-03 5.256e-05 20.686 < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.9455 on 195 degrees of freedom
Multiple R-squared: 0.9678, Adjusted R-squared: 0.9672
F-statistic: 1466 on 4 and 195 DF, p-value: < 2.2e-16

This model has 4 predictors: "TV ", "radio", "newspaper" and the interaction "TV:radio". From the output we notice that all the estimates are positive numbers: each advertising expense is associated with an increase in the sales (the response variable). We then notice that "newspaper" is not significant and that there is a significant positive interaction (look at the "TV:radio" estimate): the effect is more than linear. This means that it is better, for example, to split the budget between these two channels (rather than spending it all on one of them), since their interaction generates additional sales (it has a positive effect due to their "association"; of course this is not the correlation of random variables, since the predictors are fixed).
Notice that we could also consider multiple interactions but also remember that the more interactions we add,
the less "useful" the model is. Usually when variables are non-significant we don’t consider them in the interaction
analysis: interaction is defined only for significant predictors.

139
11.15 Nonparametric approach: the k-NN method
Just as in parameter estimation we considered two approaches (the parametric approach based on the likelihood and the non-parametric one based on the bootstrap), we do the same in the regression framework. The regression line based on the LS estimation and the MLE is the so-called parametric estimation: we write a linear equation and the parameters are the coefficients inside our equation.
There’s a simple non-parametric counterpart for the regression which is called "k-NN". The "k-NN " (abbreviation
for k-Nearest Neighbors) algorithm is the easiest non-parametric solution for regression (and classification). This
method computes predicted values based on a similarity measure on the regressors. It is a "lazy learning" algorithm:
it does computations only when the computation of a predicted value is requested.

k-NN method

In this framework we don't use equations for the computation of the predicted values but, whenever we have a new data point to predict, we find its "k" nearest neighbors in the space of the predictors and use them to compute the predicted value.

In formulae, if we have to predict "yn+1 " we define:


$$\hat{y}_{n+1} = \frac{1}{k}\sum_{i \in N_{n+1}} y_i$$

where "Nn+1 " is the set of the "k" observed points nearest to the "(n + 1)"-th (the neighbors).

k-NN
We consider an example with two predictors "X1 " and "X2 ": our goal is to predict a response variable "Y " (suppose the observed values are the black triangles in the figure). We need to compute the predicted value for the values of the predictors given by the red point in the graph. In order to define the predicted value we just have to consider the nearest "k" points and then compute the mean of the response variable over these points. So if, for example, we consider "k = 2" or "k = 5", it means that, starting from the point, we consider its "2" or "5" nearest points.

140
For "k = 1", we simply divide our plane into several sectors. The predicted values form a piece-wise constant
function over polygons: such regions are called Voronoi cells.

Of course the predicted response variable over each set.

Regression line

Try to figure out what is the "regression line" in the simple linear regression (one predictor) when "k = 1".

In this case for each point we need to consider the nearest point: we have a step function. Here for each
value of the "x" we consider the nearest value on the "x" axis and then we take the corresponding value of
the response variable.

11.16 Distance
Another important practical problem is how to compute the distance between the points. Usually the distance is the Euclidean distance in "Rp " between two vectors (the square root of the sum of the squared differences):
$$d(x_i, x_j) = \sqrt{\sum_{h=1}^{p} (x_{i,h} - x_{j,h})^2}$$

In general, the Minkowski distance is a bit more refined:
$$d(x_i, x_j) = \left(\sum_{h=1}^{p} |x_{i,h} - x_{j,h}|^r\right)^{1/r}$$
where "r ≥ 1" is a real parameter.

141
• "r = 1" is know as the city-block distance, or Manhattan distance (used especially for categorical predictors)
• "r = 2" is the Euclidean distance
• "r → +∞" is the Supremum distance:

d(xi , xj ) = M AX |hi,h − xj,h |r


h=1,...,p

The problem here is the standardization of the variables because if we take multidimensional "X" (we have more
than one predictor) the distances over one axis may become irrelevant (for example on one axis we have a great
distance and on the other a very small one). The solution to this problem is the standardization of the variables
before starting the analysis. We replace each regressor (predictor) with its standardized version. We rescale the
data so that the mean is zero and the standard deviation is "1":
$$x_h \longrightarrow \frac{x_h - \bar{x}_h}{s_{x_h}}$$
where "$s_{x_h}$" is the sample standard deviation of "xh ".
In k-NN algorithm however the usual choice is:

$$x_h \longrightarrow \frac{x_h - \min(x_h)}{\max(x_h) - \min(x_h)}$$

11.16.1 k-NN in Software R

The k-NN regression can be performed with a number of different machine learning packages. Here we use the
"caret" package and the function "knnreg".

k-NN - Advertising (prediction)

We use the previous "advertising.csv" dataset, trying to predict the response for the last "5" rows (using the
first "195" examples). In order to check the goodness of fit of our model we adopt the standard regressions
strategy used when dealing with complicated regression models: the idea is to divide the dataset into two
parts:
• The first one (larger) to train the model [the algorithm]
• The second one to test the model (to see if we obtain a good performance)

In this case we have "200" obs: we use "195" rows for the training and "5" for the testing phase [usually the
standard split is 70% for the training set and 30% for the testing set]. If the last "5" predicted values are
close to the observed values then the model has a good fit.

First, we standardize the variables, defining the function:

stdize=function(x)return((x-min(x))/(max(x)-min(x)))
sc.Advertising=as.data.frame(lapply(Advertising[,2:4],stdize))

Note that the standardization is only applied to the predictors. Select the examples of the training set:

tr.set.x=sc.Advertising[1:195,]
tr.set.y=Advertising[1:195,5]
tr.set=data.frame(tr.set.x,tr.set.y)

and the predictors of the test set (given by the last rows of the dataframe):

test.set.x=sc.Advertising[196:200,]

142
Now execute:
knnmodel=knnreg(tr.set.x,tr.set.y)
pred_y = predict(knnmodel, test.set.x)

and check:
pred_y
Advertising$sales[196:200]

And so we obtain the set of 5 predicted values. We have quite a good prediction: for example the first observed value is "7.6" and we predicted "7.66" (and so on). If we want a numerical summary of our goodness of fit we can take the squared correlation between the two vectors. In this case the default value for "k" is "k = 5".
We can then also check the "R2 ":
Rsq=cor(pred_y,Advertising$sales[196:200])^2

11.17 The k-NN for classification


Due to its flexibility, the k-NN method can be used also for classification. In classification problems, the response
variable is categorical.

The k-NN rule for classification


The classification of a new observation is made with the majority criterion: each new instance is assigned to
the most common category amongst its k-Nearest Neighbors.
All our previous comments for k-NN regression are still valid (standardization, . . . ).

Response variable with 2 levels (coded with colors green and black):

143
11.18 The effect of "k"
In classification, it is simple to show the effect of "k" on the classification rule.

Increasing "k" yields smoother predictions, since we average over more data. When "k = 1" it yields "y=piecewise
constant labeling", "k = n" predicts "y=globally constant (majority) label".

11.18.1 With Software R

The Software R package for doing classification with k-nearest neighbors is "class". Here a "knn" function is available (similar to the "knnreg" function we analyzed earlier). Again, there is a large number of functions to perform k-NN classification. So, always use the documentation to understand the syntax of your function and the data structure for the input.

11.19 Pros and cons of k-NN


• PROS
⋄ Learning and implementation is extremely simple and intuitive
⋄ Flexible decision boundaries
• CONS
⋄ Irrelevant or correlated predictors have high impact and must be eliminated (not a good method for
model selection)
⋄ Typically difficult to handle in high dimensions
⋄ Computational costs
⋄ Not useful for model description (it does not produce any equations)

144
12 Lab 5
12.1 Multiple Linear Regression - Boston Housing
In this lab we review the basics of Multiple Linear Regression using a classical data set, the "Boston housing"
sample. To use the data, just load the package "MASS", and the dataset named "Boston" (for the description
of the variables type "help(Boston)"). The response variable in this data set is "medv", the median value of
owner-occupied homes (in thousands of dollars).

1. Do some basic descriptive statistics for all the variables (use first the "summary" function). Look at the
boxplots to detect outliers in the (univariate) distributions and to see the symmetry/asymmetry of the dis-
tributions.
The "Boston Housing" dataset was made to predict the price (the "value") of the houses in Boston, starting
from geographical and socio-economic predictors. The response variable is "medv" which is (we can also see
this from the "help" tab) the median value of owner-occupied homes in thousands of dollars. So in our dataset
we have a list of regressors and in the end the response variable.
Here we just import the libraries we need and then compute a basic description of our data (we explore the
random variables). Remember that in regression problems the most important random variable is the response variable because it is the only variable with a random component (the predictors are fixed).

library(MASS)
library(corrplot)
library(nortest)

###1)
data(Boston)
View(Boston)

#Explore the response variable (the most important)#


summary(Boston$medv)
boxplot(Boston$medv,horizontal=T)
plot(density(Boston$medv))

#Explore the quantitative predictors#


summary(Boston$crim)
boxplot(Boston$crim,horizontal=T)

#Explore the categorical predictors#


table(Boston$chas)

From the output ("summary") we notice that the response variable "medv" is highly skewed and asymmetric
(check the values of the quantiles and so on). We can of course see this more clearly by using the boxplot
function. From the plot we also notice that there are several outliers on the right side of the boxplot and the
same conclusion is also given by the plot for the density estimation.

145
As we can see we have a quite strange distribution which doesn’t look like a Gaussian distribution: this is not a
problem since in regression we don’t need to have normal distributions. The response variable is not normally
distributed but the conditional distribution of the response variable given the predictors is. In this case the
only normal distribution we actually need to have is the distribution of the residuals.
We can then explore also the "quantitative" and "categorical" predictors (in the output we just showed one
example for each case):
• The quantitative predictor "crime" is again highly skewed and highly asymmetric (we again use the
"summary" and the "boxplot" functions). We again notice that there are several outliers on the right
side of the box:

• For the categorical predictors we considered the "chas" predictor (it is a dummy variable where "0" means
that the geographical zone is not crossed by the river). Here we can just use the "barplot" or the "table"
functions in order to check the proportion in the division of the observations.

2. Compute the correlation matrix. If you would like to have a plot of the correlation matrix, use the function
"corrplot" in the package "corrplot".
In our dataset we have only numerical variables (even the categorical predictors are dummy variables with
"0" and "1" values) and so we can directly compute the correlation of the entire dataframe. By default the
correlation of the entire dataframe (we use the function "cor") is the correlation between all the possible pairs
of predictors.

#Explore correlations#
cor(Boston)
corrplot(cor(Boston))

Of course, since we have the response variable, 13 predictors and 506 observations, it’s not easy to directly
comprehend the output of the correlation: in order to obtain a clearer output we use indeed a graphical
representation of the correlation by using the "corrplot" function:

146
Here we have a picture representing the correlation between all the variables. The correlation spans in the
interval "[−1, 1]" and we have different colors for negative (red) and positive (blue) correlations (of course
notice that along the diagonal every variable has a perfect correlation with itself). We are then interested in
all the non-diagonal points: for example between "indus" and "tax" we have a strong positive correlation.
Another interesting example is indeed the strong positive correlation between the industry "indus" variable
and the air pollution "nox" variable. It’s also important to notice that in this plot of the correlation matrix
all the variables are represented, also the dummy variables of categorical variables (here we have the "chas"
dummy variable). This means that for this kind of variables the use of the correlation is not totally correct
from a statistical point of view (they are not quantitative).
3. Use the "lm" function to define a model with all the predictors. Look at the "R2 ", the plot of the residuals,
the Fisher and Student’s tests. Remove the non-significant predictors, and look again at the diagnostics. In
your final model, test the normality of the residuals.
Here we start to do the regression. The complete model is obtained with the "lm" function, which uses a particular formula: first of all we insert the response variable "medv" and then, after the tilde, all the predictors separated by a "+" sign. Remember that in regression it is not recommended to use the "attach" function because, as we will see, it creates problems when dealing with several dataframes which use the same variables.

model_c=lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat,data=Boston)
summary(model_c)

As we can see from the output the model is significant, since the p-value of the Fisher test is very small. The multiple R-squared in this case is sufficiently large and only a few predictors are not significant ("indus" and "age").

Call:
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age +
dis + rad + tax + ptratio + black + lstat, data = Boston)

Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***

147
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom


Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

If we want to create a reduced model starting from this complete model we can then remove these two non-significant predictors (we will just have 11 predictors instead of 13). This is still not a fully satisfactory model, since we look for small models and 11 predictors are still too many (we will address this later). In order to perform this we just use the same lines of code, but this time we remove the unwanted predictors:

model_r=lm(medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat,data=Boston)
summary(model_r)

Again we obtain a very similar output. Now that we have our "final" model we compute and test the normality
of the residuals:

plot(predict(model_r), rstudent(model_r),xlab="Predicted",ylab="St. residuals",pch=17)


ad.test(model_r$residuals)

Here we have the predicted values on the "x" axis and the Studentized residuals on the "y" axis. From the plot we notice some "shape" and so we cannot be really satisfied with the residuals. Also notice that there are several outliers outside the standard boundaries "[−2, 2]" (we have a very "large/wide" residual plot).
As we can see from the plot we can easily state that the residuals are not normally distributed but in order to
double-check the non-normality of the residuals we can then use the "ad.test" function:

Anderson-Darling normality test

data: model_r$residuals
A = 10.483, p-value < 2.2e-16

As we predicted the Anderson-Darling test rejects the normality of the residuals (the p-value is very small).

148
4. Based on the correlation matrix, select the four predictors most correlated with the response variable, and
find a possible reduced model as above. If a nonlinear relationship between the response and a predictor is
likely (e.g. if the response and the predictor have different shapes of the distribution), you can add a quadratic
term. If the regressor is "x" the quadratic term is "I(x2 )". As always look at the "R2 ", the plot of the residuals,
etc.
The idea behind this point is that the model-selection based on the p-value (the Student t test) didn’t give a
satisfactory answer because we reduced the model but not in a considerable way as we still have 11 predictors
(we are still in a large model). So we now try a different solution based on this remark: since in terms
of regression the most relevant correlations are the ones in the last row (column) of the matrix, we check
the correlation between the response variable and the predictors. So the idea is to take just the 4 largest
correlations (of course we consider their absolute value) and to define a model where we consider just those
particular predictors.
In order to perform this we don’t display again the whole matrix but we just consider the last column (row)
and check which are the "greatest" values:
cor(Boston)[,14]

>        crim         zn      indus       chas        nox         rm        age        dis
   -0.3883046  0.3604453 -0.4837252  0.1752602 -0.4273208  0.6953599 -0.3769546  0.2499287
          rad        tax    ptratio      black      lstat       medv
   -0.3816262 -0.4685359 -0.5077867  0.3334608 -0.7376627  1.0000000

As we can see from the output the best correlations (in absolute value) are (in order) "lstat", "rm", "ptratio"
and then "indus". And so now we define a model only based on these four predictors:
model_1=lm(medv~lstat+rm+ptratio+indus,data=Boston)
summary(model_1)

Again from the output of the summary we can check that all the predictors are significant except "indus": this
means that in our final model we will just remove it:
model_2=lm(medv~lstat+rm+ptratio,data=Boston)
summary(model_2)

Notice that in this passage from "11" to just "3" predictors we record a very small decrease in the value of
the multiple R-squared. So we obtained, at the cost of a slight loss of precision, a much leaner and simple
model. The idea of model-selection we will study in the last part of the course will be indeed about the trade-off
between these two extremes: the best (precise) prediction and the simplest model.
The question about "non-linear" relationship is not so relevant for this exercise but it’s important for the
theory of this course. If we find a "non-linear" relation between the response variable and the predictors (for
example the plot of the residuals have a parabolic shape) we can try to add a quadratic term to the predictors:
for example we don’t consider "lstat" but "lstat2 ". The important point here is that the model is still linear
because it is linear on the parameters, but the predictors can be transformed as we want (we can take the
"log", the "sin", the "cos" and so on). For example we now add some quadratic terms by using the function
"I(x2 )" to the last final model we computed:

model_3=lm(medv~lstat+I(lstat^2)+rm+I(rm^2)+ptratio+I(ptratio^2),data=Boston)
summary(model_3)

Here the output changes as the "ptratio" becomes non-significant even in its linear form.
5. Comment the results of your analysis.
As we previously said adding non-linear forms to the model doesn’t change the linearity of the model, as it is
linear in the "β", but we can "change" (with functions) the predictors and we don’t have to set any probability
distribution as they are fixed.
We also noticed that adding or removing predictors from the model can actually change the significance of the other predictors (as we saw before): this is due to the multicollinearity among the predictors.

149
COMPLETE EXERCISE

library(MASS)
library(corrplot)
library(nortest)

###1)
data(Boston)
View(Boston)

#Explore the response variable (the most important)#


summary(Boston$medv)
boxplot(Boston$medv,horizontal=T)
plot(density(Boston$medv))

#Explore the quantitative predictors#


summary(Boston$crim)
boxplot(Boston$crim,horizontal=T)

#Explore the categorical predictors#


table(Boston$chas)

###2)
#Explore correlations#
dim(Boston)
cor(Boston)
corrplot(cor(Boston))

###3)
#Multiple linear regression - Complete model#
model_c=lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat,data=Boston)
summary(model_c)

#Multiple linear regression - Reduced model#


model_r=lm(medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat,data=Boston)
summary(model_r)
plot(predict(model_r), rstudent(model_r),xlab="Predicted",ylab="St. residuals",pch=17)
ad.test(model_r$residuals)

#Multiple linear regression - Best 4 correlations#


cor(Boston)[,14]
model_1=lm(medv~lstat+rm+ptratio+indus,data=Boston)
summary(model_1)

#Multiple linear regression - Best 3 correlations#


model_2=lm(medv~lstat+rm+ptratio,data=Boston)
summary(model_2)

#Multiple linear regression - Quadratic terms#


model_3=lm(medv~lstat+I(lstat^2)+rm+I(rm^2)+ptratio+I(ptratio^2),data=Boston)
summary(model_3)

150
13 Generalized Linear Models
13.1 Introduction
In past lectures we recalled the basics about linear regression. Let us now take up the question of building models
where the standard assumptions of the linear model are impossible or not plausible. For instance:

• Outcomes with skewed distributions


• Outcomes with unequal variance
• Binary and categorical outcomes

• Discrete and count outcomes

13.2 A review of the Gaussian model


A Gaussian model is made of three components:

• A random component:
Yi ∼ N (µi , σ 2 )

• A systematic component:
ηi = xti β

• A link between the two parts


µi = ηi

13.3 Generalized linear models


The basic structure of a Generalized Linear Model (GLM) is as follows:

• The random component:


Yi ∼ some distribution with mean µi

• The systematic component:


ηi = xti β

• The link function "g" such that:


g(µi ) = ηi = xti β

13.4 The random component – overview


In principle, we could specify any distribution for the outcome variable. The basic requirement is to make a reasonable choice guided by the main features of the response variable.

[We will see later that the mathematics of generalized linear models work out nicely only for a special class of
distributions, namely the distributions in the exponential family.]

This is not as big a restriction as it sounds, however, as most common statistical distributions fall into this family,
such as the Normal, Binomial, Poisson, Gamma, and others.

151
13.5 The systematic component – overview
The quantity:
ηi = xti β
is referred to as the linear predictor for observation "i". The linear predictor is the way to use again the techniques
of linear models in GLMs. From:
E(Yi |xi ) = xti β
we need now to move to:
g(E(Yi |xi )) = xti β
We need to introduce the function "g" to take into account the special features of the response variable.

13.6 The link function – overview


In principle, "g" could be any function linking the linear predictor to the distribution of the outcome variable. In
practice, we also place the following restrictions on "g":

• "g" must be smooth (i.e. differentiable)

• "g" must be monotonic (i.e. invertible)

We will see later that again that we obtain some mathematical advantages when writing the distribution of the "Y "
in exponential form and then deriving from that the link function.

The number of doctor’s visits


Consider modeling the number of doctor’s visits per patient per year. We may assume that the number
of visits increases exponentially with age of the patient. Thus, if "µi " is the expected number of visits for
patient "i" with age "ai ", a model of the form:

µi = γ · exp(δai )

might be appropriate. If we take the logarithm of both sides:

log(µi ) = log(γ) + δai = β0 + β1 ai

Furthermore, since the outcome is a count, the Poisson distribution seems reasonable. Thus, this model fits into the GLM framework with a Poisson outcome distribution, a log link, and a linear predictor given by "β0 + β1 ai ".

152
Predator-prey model

The rate of capture of prey, "yi " , by a hunting animal increases as the density of prey, "xi " , increases, but
will eventually level off as the predator has as much food as it can eat. A suitable model is:
$$\mu_i = \frac{\alpha x_i}{h + x_i}$$
This model is not linear, but taking the reciprocal of both sides:
$$\frac{1}{\mu_i} = \frac{h + x_i}{\alpha x_i} = \beta_0 + \beta_1\,\frac{1}{x_i}$$
Because the variability in prey capture likely increases with the mean, we might use a GLM with a reciprocal
link and a gamma distribution.

13.7 How to extend the LM theory


Since the linear predictor in a GLM has the same expression as in Gaussian linear models, the first idea is to use
again the Least Square technique to estimate the "β" parameters. The problem here is that the inference in linear
models is based on the assumption:
εi i.i.d. εi ∼ N (0, σ 2 )
or equivalently:
Yi independent and V AR(Yi |xi ) = σ 2
While independence still holds in GLM’s, the homoskedasticity property does not. When homoskedasticity is
violated, the Gauss-Markov theorem is no longer valid, and ordinary LS (OLS) is not the right solution.
Take for instance the simplest case, a Bernoulli response variable:

Yi ∼ Bern(pi )

If we write:
E(Yi |xi ) = xti β        Yi |xi ∼ Bern(µi )
where for notational convenience we set "pi = µi = xti β", we can compute the estimate "β̂OLS ", but the estimator "BOLS " is not minimum-variance. In fact, we have:

V AR(Yi |xi ) = µi (1 − µi )

and, except for the trivial case of constant "µi ", there is heteroskedasticity: heteroskedasticity is an intrinsic property of GLM’s.

13.8 Weighted least squares


The basic idea is to replace the OLS with a more general solution, the Weighted Least Squares (WLS). From a sample "Y1 , . . . , Yn " with variances "V AR(Y1 ), . . . , V AR(Yn )", how do we modify the sample in order to have equal variance? We divide each observation by its standard deviation:
$$\frac{Y_1}{\sigma_{Y_1}}, \;\dots, \;\frac{Y_n}{\sigma_{Y_n}}$$
When considering the residuals in a regression, we add weights "w1 , . . . , wn ", and the WLS solution is now given
by:
$$\frac{\partial}{\partial\beta}\,\sum_{i=1}^{n} w_i\,(y_i - x_i^t\beta)^2 = 0$$
i.e.:
$$\sum_{i=1}^{n} x_i\, w_i\,(y_i - x_i^t\beta) = 0$$
i=1

153
WLS estimator

Define "W = V AR(Y )−1 ", a diagonal matrix with the reciprocal of the variances as diagonal elements.
Then, by solving:
$$\sum_{i=1}^{n} x_i\, w_i\,(y_i - x_i^t\beta) = 0$$

we obtain:
BW LS = (X t W X)−1 X t W Y
where the "BW LS " is the best linear unbiased estimator for the heteroskedastic regression problem and
the "W " is the matrix of the weights, a diagonal matrix with the reciprocal of the variance over the main
diagonal and "0" elsewhere:
$$W = \begin{pmatrix} \frac{1}{\sigma_1^2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_n^2} \end{pmatrix}$$

So in general this is the solution of the Least Squares estimation for heteroskedastic problems, i.e. when the variance is different for each observation.

The problem here is that we assume that the weights are known. In order to compute the estimator (the weighted
least square estimator for the "β") we need to know the vector of regressors "X", the vector of response variables
"Y " and also the weights. So the problem is that in general we have no information over the "σi ".
We now consider the simplest case, a Poisson response variable (it is the simplest because the variance equals the mean):
$$Y_i \sim \mathcal{P}(\mu_i) \qquad \sigma_i^2 = \mu_i \qquad \mu_i = g^{-1}(x_i^t\beta)$$
But the mean of the response variable is exactly what we want to estimate with our linear model (we don’t know
its value): in this framework we operate with unknown weights.

13.9 WLS – Known weights


Note that "β̂" is once again a linear combination of "{yi }", and therefore the estimator is normally distributed:

BW LS ∼ Np+1 (β, σ 2 (X t W X)−1 )

and recall that:


V AR(Y ) = σ 2 W −1
The subsequent derivation of confidence intervals and tests is similar to the linear regression.

13.10 WLS – Unknown weights


The preceding approach easily handles unequal variances for problems in which the relative variances of the outcomes
are known (note that it was not necessary to know the actual variance, just the relative ratios). However, what
about situations where the variance seems to increase with the mean? Here, we don't know the means (otherwise we wouldn't need to fit a model) and therefore the appropriate weights are unknown parameters, just as the means are.

154
Gaussian regression (VAR depends on the mean)

The cross-section data set "foodexpenditure" contains average weekly expenditure on food and average
weekly income in dollars for households that have three family members. "40" households have been randomly
selected from a larger sample. The data set is from William E. Griffiths, R. Carter Hill and George G. Judge,
Learning and Practicing Econometrics 1993, Wiley (Table 5.2, p. 182). The variables in the data set are:

• Response variable: weekly household expenditure on food


• Predictor: weekly household income
So here our objective is to explain the expenditure of food as a function of the income. This is a simple
regression so we can plot it:

reg=lm(foodexpenditure$foodexp~foodexpenditure$income,data=foodexpenditure)
plot(foodexpenditure$income,foodexpenditure$foodexp)
abline(reg, col="blue")

There are several interesting things going on in this data set:


• The first one is that as the income increases, the food expenditure increases as well
• The second one is that as the mean food expenditure goes up, so does variability (variance) [here we
have the so called "megaphone effect"]

So we have an increase of the variance linked to an increase of the expected value.

155
13.11 Iterative re-weighting
Roughly speaking, if we assume that the variance increases with the expected value we can then suppose that the
standard deviation "σ" is proportional to the mean and so that the variance is proportional to the square of the
mean [this is our assumption: it is given from the previous plot (the megaphone effect). The lines imply that the
"σ" is linear with respect to the expected value "µ". Since we noticed this behavior of our data points the standard
assumption is that the standard deviation is linearly dependent on the mean. This means that the variance is the
square of the expected value]:
σ∝µ V AR(Y ) ∝ µ2
Here we don’t know the mean but we know the function which links the variance to the mean.
So the idea is to use the WLS method iteratively: we start from an arbitrary set of weights (maybe constant)
[this means that the first step is basically an OLS]. With the "β" we can have the first estimate of the expected
values (we use the inverse "g"). Then, by using the expected values, we can refine the estimate of the weights. We
use the second estimate of the weights and we compute again the "β" and so on.
Once we fit the model, we have estimates for "{µi }" and therefore for "{wii }", where "$\hat{w}_{ii} = 1/\hat{\mu}_i^2$". However, once we change "{wii }", we change the fit, and thus we change "{µ̂i }", which changes "{wii }", and so on. Graphically we
have:
1 

1 ... ...

µ21
... ...
W1 =  ... ..  −→ B = (X T W X)−1 X T W Y −→ µ = g −1 (X T β) −→ W =   ... 1
.. 
. 
1 . 

1 1 1 1 2 2
 µ2 
... ... 1 1
. . . . . . µ2
n

when we compute "µ1 " we then compute the second matrix "W2 " because we have the first estimate of the expected
value (the weights are the reciprocal of the variance and the variance is the square of the mean in our assumption).
In the end we have:
B2 = (X T W2 X)−1 X T W2 Y
and we repeat this process until the convergence.

IRLS algorithm

Consider, then, the following approach to model fitting:


1. We start from an arbitrary weighting matrix (usually the identity matrix) [using the identity matrix
means that in the first step we are applying the standard regression, the OLS method]

2. Fit the model, we obtain the estimates "β̂" (estimate of the "β") and "µ̂" (estimate of the expected
value)
3. We use the expected value "µ̂" to recalculate (update) the weighting matrix "W "
4. We repeat these steps until the convergence

This approach is known as the Iteratively Reweighted Least Squares (IRLS) algorithm.

156
Gaussian regression (VAR depends on the mean)

One way to implement this algorithm is with the following code:

fit.w = lm(foodexp~income, data=foodexpenditure)


for (i in 1:20)
{
w = 1/fit.w$fitted.values^2 #the square is our assumption#
fit.w = lm(foodexp~income, data=foodexpenditure, weights=w)
print(fit.w$coefficients)
}
summary(fit.w)

We start from the standard OLS linear model (first step) where the response variable is "foodexp" and the predictor is "income". This implementation assumes that 20 iterations are enough to achieve convergence. This is fine for this case, but in general it is better to write a "repeat" or "while" loop which checks for convergence at each iteration and terminates when convergence is reached (we are done when the estimates stabilize and do not change anymore); a sketch of such a loop is given after this block. Notice also that the weights "w" are the reciprocal of the squared fitted values.

13.12 OLS vs IRLS


Here we have a comparison (using our dataset) of the prediction of the food expenditure based on the income. The two lines represent the two methods, OLS and IRLS: as we can notice, they are slightly different. We obtain the plot with the following script:

plot(foodexp~income,data=foodexpenditure,pch=19,ylab="food expenditure")
abline(fit,col="red")
abline(fit.w,col="blue")

The IRLS method is better because it reduces the weights of the data points with large dispersion. Also notice
that the standard error of the estimator drops from "0.05529" to "0.04643". Furthermore, the "R2" increases from
"0.3171" to "0.4671".

157
13.13 Choice of the weights
In general the choice of the weights is difficult. Here is a statistician (on the right) asking Delphi's Pythia (on the left) what kind of weights he should use.

13.14 Back to GLM’s


We now apply the previous concepts within the GLM framework. The function linking the variance to the expected value is
a known function from the exponential family theory. So we have that the "variance function" is exactly the link
between the variance of the response variable and its expected value:
V AR(Yi ) = ϕV (µi )
While for general regression problems we need to assign some arbitrary weights, fortunately in the Generalized
Linear Model GLM the density belonging to the exponential family comes with the variance function (we just apply
it). This means that in the GLM there is a function linking the variance to the mean, so the choice of
the weights is not a problem.

13.15 Random component – Likelihood


So we now investigate how the IRLS method fits into the GLM framework. In the OLS for Gaussian regression we usually start with the LS estimate and then we conclude that there is a close connection with the likelihood, in the sense that the estimates based on the LS algorithm are the same as those computed with the MLE: in classical Gaussian regression they give the same results when we consider homoskedastic errors.
When moving to GLM we need a different, more general distribution (not the Gaussian) for the response variable, and we want to know what the MLE for the parameter is in this case. To analyze the MLE for the GLM we need to define the likelihood for the random component, which is just the response variable (in a linear model). We do this by taking the distribution in the exponential family and we define the log-likelihood for the response variable (for every component "yi "). So in the exponential family, the log-likelihood of observation "i" is:
$$\ell(\theta_i, \phi \mid y_i) = \frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi)$$
where we separate the parameter of interest "θ" and the disturbance (nuisance) parameter "ϕ". Here we obtain a linear expression: notice that "θ" is the canonical parameter of the exponential family.

Remark
The log-likelihood of a sample is simply the sum of the log-likelihoods of the single observations, because of
the independence assumption.

The first derivative with respect to "θi " (i.e. the score function for "θi " ) is:

$$\frac{\partial}{\partial\theta_i}\,\ell(\theta_i, \phi \mid y_i) = U(\theta_i) = \frac{y_i - b'(\theta_i)}{\phi}$$

158
Remember also that the score function has expected value zero and variance equal to minus the expected value of its derivative:
$$E(U(\theta_i)) = 0 \qquad VAR(U(\theta_i)) = -E(U'(\theta_i))$$
the latter being the (i,i)-th component of the Fisher information matrix (the variance of the score function). Since the score has mean zero, we find that:
$$E(Y_i) = b'(\theta_i)$$
The second partial derivative of the log-likelihood (which is the first derivative of the score function) is:
$$\frac{\partial^2}{\partial\theta_i^2}\,\ell(\theta_i, \phi \mid y_i) = -\frac{b''(\theta_i)}{\phi}$$

which is (minus) the observed information for the canonical parameter "θi " pertaining the i-th observation.
Exponential families are important, as we already said, because the score function and the information matrix are
very easy to compute as we don’t need to compute the expectation or variance. We have the function "b": by taking
its first derivative we obtain the "mean" of the response variable and by taking its second derivative we directly
define the information matrix. The variance of the score is the information matrix, so it is the second derivative of
"b" divided by "ϕ". So we have that:

Yi − b (θi ) b” (θi )
   
Yi
V AR(U (θi )) = V AR = V AR =
ϕ ϕ ϕ

we also have:
V AR(Yi ) = ϕb” (θi )
The variance of "Yi " is therefore a function of both "θi " and "ϕ". Note that the canonical parameter is a function of
"µi ": ′ ′
µi = b (θi ) θi (µi ) = (b )−1 (µi )
so that we can write:
V AR(Yi ) = ϕb” (θ(µi )) = ϕV (µi )
where "V " is the variance function dictating the link between the variance and the mean.

13.16 The systematic component and the link function


So now we have some expressions for the response variable (the random component) and we need to link this random component ("µi ", the expected value of the response variable "Y ") to the systematic component (the linear predictor "xti β"). With the systematic component and the link function we introduce into the model the dependence of the response on the predictors. Specifically, the relationship between "µi " and "xi " is given by the
link function "g":
g(µi ) = ηi = xti β

Remark
Note that here we specify a transformation of the conditional mean "µi ", but we do not make any
transformation of the response (random) variable "Yi ". This means that we have:

g(µi ) = g(E(Yi )) ̸= E(g(Yi ))

which is different from the expected value of the transformation. Remember that expected values and
transformations can be switched only for linear transformations.

The inverse of the link function provides the specification of the model on the scale of "µi ":

µi = g −1 (xti β)

We can perform this because here we don’t have any probability distribution, so we don’t have likelihoods: the
predictors "xi " are not random (we have no random components).

159
Canonical link function
The canonical link is:
g(µi ) = θ(µi )
Of course we can use any function as "link" function but the computation becomes easy if we consider the
expression above, where we have the canonical parameter as a function of the expected value "µi ". This is
the most powerful choice because most of computations become easy (on the contrary in the general case
the computations may become difficult and in some cases the solution can’t be computed explicitly).

Here there are two advantages:

• The function derives from the canonical form of the distribution as a member of the exponential family

• The function is the inverse of "b "

Credit card payments during Black Friday (I)

Consider this very small data set:


Y 0 0 2 1 5 7 12
X 14.1 16.3 16.8 20.4 30.2 38.0 39.9
The variables are:
• "Y " [response variable]: number of credit card payments during Black Friday
• "X" [predictor]: annual income of the cardholder

y=c(0,0,2,1,5,7,12)
x=c(14.1,16.3,16.8,20.4,30.2,38.0,39.9)
plot(x,y,ylab="N° Credit Payments",xlab="Annual Income")

The response variable is a count and we may suppose that variance increases with the mean, so a Poisson
response variable appears as a suitable choice. So we write the Poisson distribution in the form of exponential
families. In this case we have a sample of "7" observations:

Yi ∼ P(λi ) i = 1, . . . , 7

We then compute the log-likelihood for each i-th element (remember that the density of a Poisson distribution is $f_Y(y) = e^{-\lambda}\frac{\lambda^y}{y!}$):
$$\ell(\lambda_i \mid y_i) = y_i\log(\lambda_i) - \lambda_i - \log(y_i!) = \frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i)$$

160
but in the Poisson distribution we know that the dispersion parameter is "ϕ = 1" and so the denominator disappears. We then have that the mean, the canonical parameter and the inverse link are:
$$\lambda_i = \mu_i \qquad \theta_i = g(\mu_i) = \log(\mu_i) \qquad \mu_i = g^{-1}(\theta_i) = \exp(\theta_i)$$

With the link function:


g(µi ) = log(µi )
we write the model in the form:

g(µi ) = β0 + β1 xi i = 1, . . . , 7

or in matrix form:
g(µ) = Xβ
where "X" is the design matrix with dimensions "7 × 2" (it contains one column for each quantitative
predictor and a column of all "1" for the intersect):

X = (1|x)

as in the multiple regression framework. In Software R we can use the following script:

y=c(0,0,2,1,5,7,12)
x=c(14.1,16.3,16.8,20.4,30.2,38.0,39.9)
X=cbind(rep(1,length(x)),x) #Here we bind the columns#

where, as we said, we have that our "X" matrix is:

> X

x
[1,] 1 14.1
[2,] 1 16.3
[3,] 1 16.8
[4,] 1 20.4
[5,] 1 30.2
[6,] 1 38.0
[7,] 1 39.9

13.17 Solving the likelihood equations


Since we are interested in estimating the "β" parameters, we need to solve the score equations:
$$\frac{\partial}{\partial\beta_j}\,\ell(\beta, \phi \mid y) = 0 \qquad j = 0, \dots, p$$

We can use the chain rule (derivative of composite functions) to obtain the expression for the i-th contribution to the score function for "βj ":
$$\frac{\partial}{\partial\beta_j}\,\ell(\beta, \phi \mid y_i) = \frac{\partial}{\partial\theta_i}\,\ell(\beta, \phi \mid y_i)\;\frac{\partial\theta_i}{\partial\mu_i}\;\frac{\partial\mu_i}{\partial\eta_i}\;\frac{\partial\eta_i}{\partial\beta_j}$$
but (remember the properties we showed above):
$$\frac{\partial}{\partial\theta_i}\,\ell(\beta, \phi \mid y_i) = \frac{y_i - \mu_i}{\phi} \qquad \frac{\partial\mu_i}{\partial\theta_i} = b''(\theta_i) = V(\mu_i) \qquad \frac{\partial\eta_i}{\partial\beta_j} = (x_i)_j = X_{ij}$$

If we plug these expressions into the previous derivative we obtain the so called "normal equations" (they derive
from the log-likelihood as they are the derivative with respect to the "β"), which is the last equation we need to

161
solve in order to obtain the MLE for the "β":
$$\frac{\partial}{\partial\beta_j}\,\ell(\beta, \phi \mid y) = \sum_{i=1}^{n} (y_i - \mu_i)\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{X_{ij}}{\phi\, V(\mu_i)} = 0$$

Remark: for linear regression we recover the LS solution:

β̂M LE = (X t X)−1 X t y

With similar arguments we can also obtain the MLE estimate "ϕ̂" of "ϕ".
We have seen that for the Gaussian distribution this set of equations reduces to the OLS equations: this happens because the variance function is constant and the derivative of "µ" with respect to "η" is "1". In general we have a WLS (weighted least squares) problem because weights appear.

Non linearity

The equations:
$$\frac{\partial}{\partial\beta_j}\,\ell(\beta, \phi \mid y) = \sum_{i=1}^{n} (y_i - \mu_i)\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{X_{ij}}{\phi\, V(\mu_i)} = 0$$
incorporate a non-linear part in the dependence "µi = µi (β)", and therefore also in the variance function "V (µi ) = V (µi (β))". In the sum the weight is represented by the expression $w_{ii} = \frac{\partial\mu_i}{\partial\eta_i}\,\frac{1}{\phi V(\mu_i)}$.

We need to use the IRLS method because the weight depends on the mean.

Vector notation
In vector notation:
$$U = X^t W (y - \mu) = 0$$
where "W " is a diagonal matrix with diagonal elements:
$$w_{ii} = \frac{\partial\mu_i}{\partial\eta_i}\,\frac{1}{\phi\, V(\mu_i)}$$

13.18 Newton-Raphson
To find the MLE of the parameter means to solve the likelihood equations:
$$\frac{\partial}{\partial\beta_j}\,\ell(\beta, \phi \mid y) = \sum_{i=1}^{n} (y_i - \mu_i)\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{X_{ij}}{\phi\, V(\mu_i)} = 0$$

This is tricky because "β" appears in several places. We use the a special version of the Newton-Raphson
algorithm, which is an iterative algorithm for finding (and approximating) the zero of a convex function, based on
the gradient. Here is a picture on how the Newton-Raphson method works in the univariate case.

162
The basic idea is to take a complicated function and simplify it by approximating it with a straight line:
$$f(x) \approx f(x^{(0)}) + f'(x^{(0)})\,(x - x^{(0)})$$
where "x(0) " is the point we are basing the approximation on. Thus an iteration moves from "x(0) " to:
$$x^{(1)} = x^{(0)} - \frac{f(x^{(0)})}{f'(x^{(0)})}$$

In the multi-variable setting:


f : Rp −→ Rp
one has:
f (x) ≈ f (x(0) ) + Jf (x(0) )(x − x(0) )
where "x(0) " is the point we are basing the approximation on. In this context, the Jacobian "Jf " is the multi-
dimensional extension of the derivative. Thus an iteration moves from "x(0) " to:

x(1) = x(0) − Jf−1 (x(0) )f (x(0) )

then replace "f (x)" with "U (β)" . . .

So now we need to use this general rule for a general "x" for the score function as a function of "β". Starting
from a parameter vector "β (0) ", we define a sequence:

β (0) , β (1) , . . . , β (r) , . . .

where the iterative rule is:


β (r+1) = β (r) − JU−1 (β (r) )U (β (r) )
The term "JU−1 (β (r) )" is called the observed information: the Jacobian of the score is the information matrix.

13.19 Fisher scoring


Fisher scoring is a modification of the previous iterative formula, obtained by replacing the observed information
with its expected value, i.e. the Fisher information. When computing the gradient of the score some analytical
machinery is needed because the "β" appears in several places. However, we get the Fisher information "I(β)" with
entries:
$$I(\beta)_{j,k} = \sum_{i=1}^{n} \left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2 \frac{X_{i,j}\, X_{i,k}}{\phi\, V(\mu_i)}$$
This information matrix can also be written in a concise form as:

I(β) = X t W̃ X

where matrix "W̃ " is a diagonal matrix of weights in some sense, with diagonal elements:
$$\tilde{w}_{ii} = \left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2 \frac{1}{\phi\, V(\mu_i)}$$

We can now insert our information matrix inside the iterative rule. So we have that the "β" at the "(r + 1)"-th iteration is equal to the "β" at the previous step updated with the rule:
$$\beta^{(r+1)} = \beta^{(r)} + (X^t\tilde{W}X)^{-1}\,U(\beta^{(r)})$$
Since "U (β (r) ) = X t W (y − µ(r) )", we can then write:
$$\beta^{(r+1)} = (X^t\tilde{W}X)^{-1}(X^t\tilde{W}X)\,\beta^{(r)} + (X^t\tilde{W}X)^{-1}(X^t W)(y - \mu^{(r)})$$

163
With this formula we can define a sequence of vectors of parameters which converge to the exact solution of our
problem, so the vector for which the score is null (we define the MLE of the parameter).
Now we have two weighting matrices (we took the old one from the equation at page 163) but they are close
relatives since they only differ by one exponent:
$$\tilde{w}_{ii} = \left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2 \frac{1}{\phi\, V(\mu_i)} \qquad\qquad w_{ii} = \frac{\partial\mu_i}{\partial\eta_i}\,\frac{1}{\phi\, V(\mu_i)}$$
and thus:
W = W̃ M
where "M " is a diagonal matrix with generic diagonal element:
$$m_{ii} = \frac{\partial\eta_i}{\partial\mu_i}$$
The equation:
$$\beta^{(r+1)} = (X^t\tilde{W}X)^{-1}(X^t\tilde{W}X)\,\beta^{(r)} + (X^t\tilde{W}X)^{-1}(X^t W)(y - \mu^{(r)})$$
can be rewritten in terms of the matrix "W̃ " only:
$$\beta^{(r+1)} = (X^t\tilde{W}X)^{-1}(X^t\tilde{W})X\beta^{(r)} + (X^t\tilde{W}X)^{-1}(X^t\tilde{W}M)(y - \mu^{(r)}) = (X^t\tilde{W}X)^{-1}(X^t\tilde{W})\{\eta^{(r)} + M(y - \mu^{(r)})\}$$
So we replaced "W = W̃ M " in the second factor, in order to have the expression in terms of just one weighting matrix. This is an IRLS problem with respect to the new response variable:
$$z^{(r)} = \eta^{(r)} + M(y - \mu^{(r)})$$
with:
$$z_i^{(r)} = \eta_i^{(r)} + (y_i - \mu_i^{(r)})\,\frac{\partial\eta_i}{\partial\mu_i}\Big|_{\eta_i^{(r)}}$$
where "z" is called the working response variable.

13.20 Fisher scoring with IRLS


This is basically the application of the IRLS with an additional point: we start from an arbitrary solution (usually
the first point is the standard linear regression), then we compute the first approximation of the expected value, then
we compute the weights and then we repeat the process computing another approximation. Now at each iteration
we also have to compute the working response variable: so we have the linear predictor, from it we compute the
expected value, from it we compute the weights and from them, before starting a new iteration, we compute the
new working response variable "z" (and then we iterate).

Fisher scoring with IRLS

Suppose the current estimate of "β" is "β̂ (r) ". Compute:


• $\eta_i^{(r)} = x_i^t\hat{\beta}^{(r)}$

• $\mu_i^{(r)} = g^{-1}(\eta_i^{(r)})$

• $\tilde{w}_i^{(r)} = \left(\frac{\partial\mu_i}{\partial\eta_i}\Big|_{\eta_i^{(r)}}\right)^2 \frac{1}{\phi\, V(\mu_i^{(r)})}$

• $z_i^{(r)} = \eta_i^{(r)} + (y_i - \mu_i^{(r)})\,\frac{\partial\eta_i}{\partial\mu_i}\Big|_{\eta_i^{(r)}}$

Now, apply the IRLS step. The updated value of "β̂" is obtained as the WLS estimate of the regression of "z" on "X":
$$\hat{\beta}^{(r+1)} = (X^t\tilde{W}^{(r)}X)^{-1} X^t\tilde{W}^{(r)} z^{(r)}$$
and then compute "$\eta_i^{(r+1)}$" and "$\mu_i^{(r+1)}$", . . . , until convergence.
In practice, convergence is reached when the difference between "β̂ (r+1) " and "β̂ (r) " is "small".

164
Credit card payments during Black Friday (II)

Let us consider again the Black Friday data and write explicitly the ingredients of the Fisher scoring
algorithm (remember that for the Poisson distribution "ϕ = 1"):
• $\eta_i^{(r)} = x_i^t\hat{\beta}^{(r)}$

• $\mu_i^{(r)} = g^{-1}(\eta_i^{(r)}) = \exp(\eta_i^{(r)})$

• $\tilde{w}_i^{(r)} = \left(\frac{\partial\mu_i}{\partial\eta_i}\Big|_{\eta_i^{(r)}}\right)^2 \frac{1}{V(\mu_i^{(r)})} = \exp(\eta_i^{(r)})^2\,\frac{1}{\exp(\eta_i^{(r)})} = \exp(\eta_i^{(r)})$

• $z_i^{(r)} = \eta_i^{(r)} + (y_i - \mu_i^{(r)})\,\frac{\partial\eta_i}{\partial\mu_i}\Big|_{\eta_i^{(r)}} = \eta_i^{(r)} + (y_i - \mu_i^{(r)})\,\exp(\eta_i^{(r)})^{-1}$

This is easily written in Software R with:

y=c(0,0,2,1,5,7,12)
x=c(14.1,16.3,16.8,20.4,30.2,38.0,39.9)
X=cbind(rep(1,length(x)),x)
beta=c(0,0)
#beta=c(log(mean(y)),0) with a clever choice of init values#
for (i in 1:20) #unconditional cycle with 20 iterations#
{
eta=X%*%beta #the "X" is the design matrix#
mu=exp(eta)
tw=exp(eta)
TW=diag(as.vector(tw))
z=eta+(y-mu)*exp(eta)^(-1)
beta=solve(t(X)%*%TW%*%X)%*%t(X)%*%TW%*%z #the WLS solution#
print(list(i,beta))
}

and so we printed a list with the number of the iteration and the "β": for each iteration we have the intercept and the slope. From the printed output we notice that after the 11-th iteration the fourth decimal places of both "β̂0 " and "β̂1 " are stabilized: this means that we have reached convergence (the solution). In order to define a criterion (a sort of "stopping rule" for the cycle) we can define the "error". The error is the maximum absolute difference between two consecutive approximations: if it is greater than a certain threshold we repeat the iterations, otherwise we stop the process because we have reached convergence.

beta=c(0,0)
#beta=c(log(mean(y)),0) #with a clever choice of init values#
err=+Inf
i=0
while(err>10^-4) #conditional cycle#
{
i=i+1
eta=X%*%beta
mu=exp(eta)
tw=exp(eta)
TW=diag(as.vector(tw))
z=eta+(y-mu)*exp(eta)^(-1)
beta1=solve(t(X)%*%TW%*%X)%*%t(X)%*%TW%*%z
err=max(abs(beta1-beta))
beta=beta1
}
print(i)
print(beta)

165
The output is:

> print(i)
[1] 11

> print(beta)
[,1]
-2.1347085
x 0.1137496

The best fit is given by the curve:
$$\log(\mu) = -2.1347 + 0.1137\,x \qquad\Longleftrightarrow\qquad \mu = e^{-2.1347 + 0.1137\,x}$$
The coefficients yield a curve of expected values, which we can plot on the scatter-plot of the data.

So now we will have to replicate the analysis we previously did for the Gaussian linear regression about the
significance of the parameters: we need to have a test for the global model and for each predictor.

13.21 Asymptotic distribution of the MLE


The asymptotic distribution of the estimator "B" comes from the likelihood theory we have developed in past
lectures.

Theorem
We know that the "β" is the ML estimate:
·
B ∼ Np+1 (β, I −1 (β))

or, more precisely: √


n(B − β) −→ Np+1 (0, I −1 (β))
·
where "∼" means "is approximately distributed as" and "−→" is the "convergence in distribution". So we
have that the ML estimate converges to a multivariate normal distribution.

166
Since the information matrix is estimated by plugging "β̂" and "ϕ̂" into "I(β)" obtaining "Î(β)", we have:

V AR(Bj ) = Î −1 (β)jj

The only difference between the Gaussian theory and the GLM theory is that in the Gaussian model we have exact distributional results (the Student-t and the Fisher distributions): this means that we can check the significance of the parameters also for small samples. In GLMs, on the other hand, we work within the asymptotic framework, so all the tests are valid only for large samples.

13.22 The Wald test


The Wald test is used to check the significance of a single coefficient (parameter). It is the counterpart for
the Student-t test we previously used:

H0 : βj = 0 H1 : βj ̸= 0

Wald test statistic


The Wald test statistic is:
$$W_j^2 = \frac{B_j^2}{VAR(B_j)}$$
with asymptotic "χ21 " distribution under "H0 ". Of course the "variance" is the asymptotic variance previously described, i.e. the (j,j)-th element of the inverse of the estimated information matrix:
$$VAR(B_j) = \hat{I}^{-1}(\beta)_{jj}$$
In most cases the test statistic is written as:
$$W_j = \frac{B_j}{\sqrt{VAR(B_j)}}$$

with asymptotic standard normal distribution under "H0 ". The test can be easily generalized to a multi-parameter
version.

13.23 The likelihood ratio test


The likelihood ratio test compares two nested models:

• Obtain the best fitting model in the large model "β̂M LE "
• Obtain the best fitting model in the reduced model "β̂0,M LE "

Likelihood ratio test statistic


Under "H0 ", the distribution of:
2(ℓ(β̂M LE , y) − ℓ(β̂0,M LE , y))
has an asymptotic distribution "χ2q " under "H0 ", where the number of degrees of freedom "q" is the number
of null parameters in the educed model.

If the p-value is larger than the chosen threshold, the likelihood ratio test accepts the null hypothesis "H0 " (which states that the reduced model is good enough to describe our data), otherwise we reject it.

167
13.24 Wald test vs Likelihood ratio test
The Wald test and the likelihood ratio test measure the discrepancy between two nested models by checking different
quantities. The following plot (sort of parabola) represents a classic likelihood function for a GLM:

So the log-likelihood ratio measures the difference on the "y" axis (log-likelihood values) between the complete and the reduced model, while the Wald test measures the difference over the "x" axis (parameter values).

13.25 Deviance
To actually compute the likelihoods it is useful to use the notion of deviance. For a given model compute:
$$\ell(\hat{\mu}, \hat{\phi}, y) = \sum_{i=1}^{n} \ell(\hat{\mu}_i, \hat{\phi}, y_i)$$

and contrast it with the likelihood of a saturated model (one parameter for each observation), i.e. the model where the perfect fit "µ̂i = yi " is reached:
$$\ell(y, \hat{\phi}, y) = \sum_{i=1}^{n} \ell(y_i, \hat{\phi}, y_i)$$

Deviance
The following quantity is the deviance of the model:
$$D(\hat{\mu}, y) = -2\,\frac{\ell(\hat{\mu}, \hat{\phi}, y) - \ell(y, \hat{\phi}, y)}{\hat{\phi}}$$
where "ℓ(y, ϕ̂, y)" is the log-likelihood of the saturated model and "ℓ(µ̂, ϕ̂, y)" is the log-likelihood of the current model; since the saturated model has the largest achievable likelihood, the deviance is nonnegative.
For normally distributed random variables (at least for the Gaussian distribution) the deviance is equal to the sum of squares of the residuals:
$$D(\hat{\mu}, y) = \sum_{i=1}^{n} (y_i - \hat{\mu}_i)^2$$

The deviance is a good measure because it is based on the log-likelihood, which is additive for all the
observations: this means that if we have a sample, the log-likelihood of the sample is equal to the sum of all
the single log-likelihood values.
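
As a quick numerical check of the Gaussian case (a sketch with generic, hypothetical numeric vectors "y" and "x",
not one of the course datasets), the deviance reported by a Gaussian GLM coincides with the residual sum of squares:

fit=glm(y~x,family=gaussian(link="identity"))   #Gaussian GLM = ordinary linear regression#
deviance(fit)                                   #deviance of the fitted model#
sum((y-fitted(fit))^2)                          #residual sum of squares: same value#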

Deviance for the Poisson distribution
Find the deviance for the Poisson distribution.

So we compute:
fλi (yi ) = e^{−λi} λi^{yi} / yi !

ℓi = log Li = −λi + yi log(λi ) − log(yi !)

So, with "ϕ̂ = 1", the contribution of the current (fitted) model is:

ℓ(yi , ϕ̂, λ̂i ) = −λ̂i + yi log(λ̂i ) − log(yi !)

The saturated model is instead:

ℓ(yi , ϕ̂, yi ) = −yi + yi log(yi ) − log(yi !)


D = 2 Σ_{i=1}^n [ℓ(yi , ϕ̂, yi ) − ℓ(yi , ϕ̂, λ̂i )] / ϕ̂

where "ϕ̂ = 1". Then we obtain:

D = 2 Σ_{i=1}^n [ (−yi + yi log(yi ) − log(yi !)) − (−λ̂i + yi log(λ̂i ) − log(yi !)) ]

where the two "log(yi !)" terms cancel out. So the solution is:

D = 2 Σ_{i=1}^n [ (λ̂i − yi ) + yi log( yi / λ̂i ) ]
where we have the sum of the single i-th contributions of the sample to the deviance.

13.26 Deviance residuals


The residuals "(yi − µi )" we studied in linear regression are not appropriate in general when a GLM is used. Usually
we consider the contributions to the deviance:

di = 2 [ℓ(yi , ϕ̂, yi ) − ℓ(µ̂i , ϕ̂, yi )] / ϕ̂

and define the deviance residuals as:

ri = sign(yi − µ̂i ) √di

Since the "di " are all non-negative, the "sign" factor makes the residuals positive for the observations lying above
the fitted curve (yi > µ̂i ) and negative for the ones lying below it.

Exercise (I)

From our computations we can recover:

• The estimates of the coefficients:


> print(beta)

[,1]
-2.1347085
x 0.1137496

• Their standard deviations:


> stdevs=sqrt(diag(solve(t(X)%*%TW%*%X)))
> print(stdevs)

x
0.93679732 0.02606235

• The log-likelihood:

> ll=sum(-mu+y*log(mu)-log(factorial(y)))
> print(ll)

[1] -10.59079

• The model deviance [just pay attention to the "0 · log(0)" problem]

##rdev=-2*sum(y-mu-y*log(y/mu)) ##attention to the zeros!

rdev=0
for (i in 1:length(y))
{
if(y[i]==0){rdev=rdev-2*(y[i]-mu[i])}
else {rdev=rdev-2*(y[i]-mu[i]-y[i]*log(y[i]/mu[i]))}
}

> print(rdev)

[1] 4.94302

• The null deviance:


munull=mean(y)
ndev=0
for (i in 1:length(y))
{
if(y[i]==0){ndev=ndev-2*(y[i]-munull)}
else {ndev=ndev-2*(y[i]-munull-y[i]*log(y[i]/munull))}
}

> print(ndev)
[1] 32.85143

• The deviance residuals
d.r=rep(NA,length(y))
for (i in 1:length(y))
{
if(y[i]==0)
{d.r[i]=sign(y[i]-mu[i])*sqrt(-2*(y[i]-mu[i]))}
else
{d.r[i]=sign(y[i]-mu[i])*sqrt(-2*(y[i]-mu[i]-y[i]*log(y[i]/mu[i])))}
}

> print(d.r) #fivenum(d.r) ##if there are too many residuals#


[1] -1.08454 -1.22909 1.12541 -0.19172 0.65692 -0.66682 0.27684

Remark: our worked example has only the purpose of doing computations with a very small data set. The results
here can not be used for inference, since the inference for GLM’s is based on asymptotic distributions, and therefore
it requires at least a moderate sample size.

Exercise (II)

Finally, we compare with the results of the "glm" function in Software R:

> mod=glm(y~x,family=poisson(link = "log"))


> summary(mod)

Call:
glm(formula = y ~ x, family = poisson(link = "log"))

Deviance Residuals:
1 2 3 4 5 6 7
-1.0845 -1.2291 1.1254 -0.1917 0.6569 -0.6668 0.2769

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.13471 0.93680 -2.279 0.0227 *
x 0.11375 0.02606 4.365 1.27e-05 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for poisson family taken to be 1)


Null deviance: 32.851 on 6 degrees of freedom
Residual deviance: 4.943 on 5 degrees of freedom
AIC: 25.182
Number of Fisher Scoring iterations: 5

Notice that now we don’t have the "t value" but the "z value", because in GLMs all the tests are asymptotic and
based on the normal distribution for large samples (we are no longer in the t-distribution framework).
While the computation of the estimates is reliable even for small samples, the "z value" and "Pr(>|z|)" columns
and the tests based on the "Null deviance" and "Residual deviance" are valid for large samples only.

13.27 Computational remarks
Recall that, for linear regression, a full-rank design matrix "X" implied that there was exactly one unique solution
"β̂" which minimizes the residual sum of squares. A similar result holds for generalized linear models: if "X" is not
full rank, then there is no unique solution which maximizes the likelihood. However, two additional issues arise in
generalized linear models:

• Although a unique solution exists, the Fisher scoring algorithm is not guaranteed to find it

• It is possible for the unique solution to be infinite, in which case the estimates are not particularly useful and
inference breaks down (a toy example of this situation is sketched below)
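
As an illustration of the second issue, here is a toy (hypothetical) example of perfect separation in a logistic
regression; the data below are not from the course material:

x=c(1,2,3,4,5,6)
y=c(0,0,0,1,1,1)                               #the predictor separates the two classes exactly#
sepmod=glm(y~x,family=binomial(link="logit"))
summary(sepmod)                                #typically huge estimates and standard errors, with warnings#

In such cases the fitted probabilities are pushed towards "0" and "1" and the Wald standard errors become useless.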

14 Logistic and multinomial regression
We now investigate two special examples of generalized linear models: the logistic and the multinomial regression.

• The Logistic regression is a regression model where the response variable is a binary variable, i.e. a
categorical variable with two levels "0" and "1".
• The Multinomial regression is a generalization of the logistic regression to response variables with
"k" levels, and more generally to categorical response variables. We start with the simplest case, which is
the logistic regression.

14.1 Logistic regression – Introduction


The first example of a GLM with discrete response variable is the logistic regression. In the logistic regression
the response variable is a binary variable:

• "0" or "1" in the Bernoulli scheme


• "Live" or "die"
• "Fail" or "succeed"
• . . . or any other categorical variable

For the purpose of using a Bernoulli distribution as the response variable, we conventionally use the two levels "0"
and "1".

Lead in blood dataset (I)

Let us consider a simple (it means we have just one response variable and one predictor) logistic regression
with:
• Response variable: "highbld" ("1" if high level, "0" otherwise)

• Predictor: "soil" (lead level in the backyard soil)


The data are in the file "soillead.txt". The first rows are:
highbld soil
1 1290
0 90
1 894
0 193
1 1410

So we now try to investigate if there’s a connection between the lead level in the blood and in the soil:

soillead <- read.csv(". . ./soillead.txt", sep="")


View(soillead)

plot(highbld~soil,data=soillead)
plot(jitter(highbld,factor=0.1)~soil,data=soillead,ylab="(jittered) highbld")

In the scatter-plot of the data we put the response variable on the "y" axis and the predictor on the "x" axis.
Since in this kind of graph we usually get many overlapping points, we also plot a slightly different
representation (the graph on the right side), which is a jittered version of the plot (a small amount of random
noise is added so that the points are drawn at slightly different levels).

From the plot we notice that when the lead level in the blood is normal (response variable equal to "0") the
values of the predictor tend to be small, while high blood levels (response equal to "1") correspond to higher
soil values: the data suggest a sort of positive "correlation".
Of course we can’t apply the standard linear regression, because it would define a straight line (for a positive
association, an increasing line going from "−∞" to "+∞") over the previous plot, which makes no sense. So when
a linear regression is applied, several unwanted things happen:

• The homoskedasticity property based on the standard Gaussian regression:

εi ∼ N (0, σ 2 )

is not reasonable (try to plot the regression line. . . ) because we have different values of the variance
for different values of the mean of the response variable
• Predicted values are not "0" or "1" and can also lie outside the interval "[0, 1]". So if for example we
mistakenly consider the linear regression, the regression line we obtain is:

wmod=lm(highbld~soil,data=soillead)
summary(wmod)
abline(wmod, col="blue")

and the output is:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3178371 0.0506531 6.275 4.29e-09 ***
soil 0.0002754 0.0000426 6.465 1.65e-09 ***

and for "x = 5000" one obtains "ŷ = 1.6948", which is a value outside our boundaries and so not
reasonable for our case.
14.2 Logistic regression parametrization
So we can use our knowledge about the GLM and apply it with the correct distribution for the response variable.
Here the correct distribution for the response variable is the Bernoulli distribution (a distribution with two
possible outcomes, "0" and "1"):
Yi ∼ Bern(pi )
where the parameter of the distribution "pi = P(Yi = 1)" is the expected value and also the probability of success
of the i-th trial. In other words:
pi = P(Yi = 1|Xi )
From the plots we see that "pi " depends on the value of "xi ". So we model each response variable with a Bernoulli
distribution of probability "pi ": it is the conditional probability of success for the i-th element of our sample given
the value of the predictor "xi ". Remember as we said that "pi " is the expected value of "Yi ":

pi = E(Yi )

Generalized Linear Model


We use a GLM of the form:
g(pi ) = β0 + β1 xi
where we have a linear regression (intercept and slope).

The canonical link function (Logit function)

Exploiting our examples about the canonical form of the exponential families, we already have the canonical
link function for the Bernoulli distribution:

g(pi ) = log( pi / (1 − pi ) )

The function above is named the Logit function:

logit(pi ) = log( pi / (1 − pi ) )

Thus:
logit(pi ) = ηi = β0 + β1 xi
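
As a small numerical aside (not part of the original example), the logit link and its inverse are directly available
in Software R as the "qlogis" and "plogis" functions:

p=0.3
eta=qlogis(p)        #logit(p)=log(p/(1-p))#
log(p/(1-p))         #same value#
plogis(eta)          #inverse logit: exp(eta)/(1+exp(eta)), gives back 0.3#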

We now use the "glm" function in Software R in order to compute the estimates of the "β"’s, to make inference.

Lead in blood dataset (II)

We store the results into an object called "mod" and then we can explore the results by using the "summary"
function:
mod=glm(highbld~soil,data=soillead,family=binomial(link="logit"))
summary(mod)

The result is:


Call:
glm(formula = highbld ~ soil, family = binomial(link = "logit"), data = soillead)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.26400 -0.79968 0.06168 0.85754 1.80346

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5160568 0.3380483 -4.485 7.30e-06 ***
soil 0.0027202 0.0005385 5.051 4.39e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 191.82 on 138 degrees of freedom
Residual deviance: 139.62 on 137 degrees of freedom
AIC: 143.62
Number of Fisher Scoring iterations: 6

In this output the coefficient of the predictor "soil" should not be read as a "slope" in the usual sense, since the
relation between the predictor and the success probability is not linear (linearity holds only on the logit scale).
Notice that we have no dispersion parameter since in the Bernoulli distribution we know that "ϕ = 1" in the
exponential family representation. We then have the values for the "null deviance" (the deviance of the model
with only the intercept) and the "residual deviance" (the deviance of the current
model). Remember that from these two values of deviance we can compute the value of the Likelihood Ratio
Test. The last line is about the convergence of the Fisher scoring: as we said in the previous part there are
cases where the Fisher scoring doesn’t converge even when the "X" matrix is non-singular. So in order to
be sure that the algorithm reached the convergence the output prints the number of iterations.

14.3 Read the output


With the output we can:

• See some basic information about the deviance residuals

• Read the parameter estimates, their standard errors, and the corresponding Wald tests (this test considers
one parameter and is used for testing the significance of a predictor)
• Read the value of the dispersion parameter (here equal to "1", since the variance is a function of the mean)
• Deviance to be used in the likelihood ratio test

• Information on the convergence of the Fisher scoring algorithm


• To perform the likelihood ratio test we just use the deviance as we compute the difference between the
"null deviance" and the "residual deviance". This difference has a "χ2 " distribution with "1" degree of freedom
(we have "1" because it is the difference between the degrees of freedom of the two models):

> 1-pchisq(191.82-139.62,1) #Likelihood Ratio Test using the Deviance#
[1] 5.012657e-13

The likelihood ratio test is a one-tailed right test: we need to compute the p-value which is the probability
on the right side (we have to compute "1" minus the cumulative distribution function).
We can perform the same by defining the empty model (the model where we just have the intercept [we have
no predictors]). In this case we compute the GLM with only the intercept and then we perform the "anova"
test with the "Chisq" option between the two models:
mod0=glm(highbld~1,data=soillead,family=binomial(link="logit"))
anova(mod0,mod,test="Chisq") #Likelihood Ratio Test using the empty model and Anova#

Of course we obtain the same results:


Analysis of Deviance Table

Model 1: highbld ~ 1
Model 2: highbld ~ soil
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 138 191.82
2 137 139.62 1 52.201 5.009e-13 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

So we have a significant likelihood ratio test as well as significant values for the Wald test: the model is
significant. There is a significant increase in the probability of a high blood lead level ("highbld" equal to "1") as
the predictor (lead level in the soil) increases.
We can then also compute the "predicted values", the "deviance" and the "deviance residuals" by running:
yhat=predict(mod,type="response")
dind=soillead$highbld*log(yhat)+(1-soillead$highbld)*log(1-yhat)
dev=-2*sum(dind)
dev
devres=sign(soillead$highbld-yhat)*sqrt(-2*dind) #deviance residuals#

14.4 Interpretation of the parameters


The "problem" now is how to read the values of the parameters. The interpretation of "β0 " and "β1 " is easy since
"β0 " is the value of the intercept and "β1 " is the increase of the response variable caused by an unitary increase of
the predictor. This means that, in standard linear regression, the interpretation of the "β1 " is very straightforward.
But now we are in a different framework and we also need to use the information contained in the link-function.
We have the linear predictor "ηi " on one side and the expected value of the response variable "µi " (or "pi " for the
Bernoulli distribution).
To plot the regression line (not a straight line actually) and to obtain the fitted values, we need to use the
function "g −1 " the inverse of the link function:
 
pi
g(pi ) = log
1 − pi
and therefore:
eηi 1
pi = g −1 (ηi ) = =
1+e ηi 1 + e−ηi
where "ηi = β0 + β1 xi " is the linear predictor for observation "i". For any given value of "x" (not necessarily one of
the "xi "’s in the data set), we compute the predicted value as:
eη̂ eβ̂0 +β̂1 x
p̂ = =
1+e η̂
1 + eβ̂0 +β̂1 xi

So we have the predicted expected value for each value of "x". This equation describes a curve called "Logistic
curve", whose plot is:

newdat = data.frame(soil=seq(min(soillead$soil), max(soillead$soil),len=300))


hatp=predict(mod, newdata=newdat, type="response")
plot(highbld~soil,data=soillead, col="red")
lines(hatp~soil, data=newdat, col="black", lwd=2)

Of course from the plot we cannot see the behaviour of the curve for values of the predictor below "0", but we know
that this function goes asymptotically to "0" as "x → −∞". So if we take the estimates "β̂0 " and "β̂1 " in the previous output and we plug
them into the logistic function we obtain this plot which represents the expected value of the "y" as a function of
the values of the predictor.
From:
pi = g^{−1}(ηi ) = e^{ηi} / (1 + e^{ηi}) = 1 / (1 + e^{−ηi})
and "ηi = β0 + β1 xi " we see that:

• When "β̂1 " is positive: as "X" gets larger, "p̂i " goes to "1", as "X" gets smaller, "p̂i " goes to "0"
• When "β̂1 " is negative: as "X" gets larger, "p̂i " goes to "0", as "X" gets smaller, "p̂i " goes to "1"

The intercept "β̂0 " is the estimate of the linear predictor when "xi = 0", and thus:

p̂i = e^{β̂0} / (1 + e^{β̂0})
is the estimate of the mean response when the predictor is set to "0". From this we don’t get exactly an interpretation
of the "β̂1 " but of its sign as we obtain two different plots: one for positive and one for negative values of the
parameter.

So we have two different interpretations:

• Positive sign of "β̂1 " means that the probability of success "1" increases when the "X" increases (the case of
the previous graph)

• Negative sign of "β̂1 " means that the probability of success decreases when the "X" increases (the opposite
case: the horizontal symmetric graph)

14.5 The odds ratio


While the interpretation of the sign of "β̂1 " is easy, this is no longer true for its absolute value.

Odds of an event
The odds of an event "E" is the ratio between the probability of the event and its own complement to "1":

odds(E) = P(E) / (1 − P(E))

In logistic regression we write (the "Yi " has a Bernoulli distribution with parameter "pi "):

odds(Yi = 1) = P(Yi = 1) / (1 − P(Yi = 1)) = P(Yi = 1) / P(Yi = 0) = Prob. of Success / Prob. of Failure

Odds ratio for one event


Let us suppose we have found that the probability of heart attack in non-smokers is "0.0018" and in smoker
is "0.0036". The two odds are:
odds(N.S.) = 0.0018 / (1 − 0.0018) = 0.001803        odds(S.) = 0.0036 / (1 − 0.0036) = 0.003613
From this very simple example we notice an important thing: for small probabilities of success the denomi-
nator is a value close to "1".

Odds ratio of two events


The odds ratio for two events "E1 ", "E2 " is the ratio between the odds of the two events:

O.R.(E1 , E2 ) = odds(E1 ) / odds(E2 )

Odds ratio for two events


With the two odds:
0.0018 / (1 − 0.0018) = 0.001803        0.0036 / (1 − 0.0036) = 0.003613

the odds ratio is:

O.R.(E1 , E2 ) = 0.001803 / 0.003613 = 0.4991

In logistic regression, for the i-th element of the sample, we write:

odds(Yi = 1) = P(Yi = 1) / (1 − P(Yi = 1)) = P(Yi = 1) / P(Yi = 0)

and, for two statistical units "i" and "j":

O.R.(Yi = 1, Yj = 1) = odds(Yi = 1) / odds(Yj = 1)

This means that:

• O.R. ≥ 0

• If "0.R. = 1", then "Yi " and "Yj " have the same expected value "pi = pj "
• If "O.R. > 1", then the mean of "Yi " is greater than that of "Yj ", so success is most probable on the observation
"i"
• If "O.R. < 1", then the mean of "Yi " is less than that of "Yj ", so success is less probable on the observation "i"

It’s important to move from the sign of "β1 " to the odds ratio because the sign of "β1 " can only be interpreted as
sign: this means we can’t say anything about its absolute value. On the other hand with the odds-ratio we can also
give an interpretation of the absolute value of the parameter.
Now we take two events (probabilities of success given a value "X") and we consider a unitary increase of "X":

• p0 = P(Y = 1|X = x)
• p1 = P(Y = 1|X = x + 1)

The logarithm of the odds ratio (the log-odds-ratio to be concise) is precisely:


log[ (p1 / (1 − p1 )) / (p0 / (1 − p0 )) ] = log[ ( e^{β0} e^{β1 (x+1)} ) / ( e^{β0} e^{β1 x} ) ] = log( e^{β1} ) = β1

So we can now give an interpretation of the "β1 " in terms of its absolute value:

• The "β1 " is the change in the log-odds-ratio when the predictor increases for "1" unit.
• The "eβ1 " is the odds-ratio comparing responses with "1" unit of difference.

Lead in blood dataset (III)

In our example on blood and soil lead, we have:

> summary(mod)

[1] Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5160568 0.3380483 -4.485 7.30e-06 ***
soil 0.0027202 0.0005385 5.051 4.39e-07 ***
---

From "β̂1 = 0.0027202" we get:


eβ̂1 = 1.0027
So a "1" unit increase in the "soil lead" increases the odds of having a high blood level by a factor of "1.0027".
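
A sketch of how the same quantity could be obtained directly from the fitted object "mod" of this example (the
confidence interval line is an assumption: "confint.default" gives Wald-type intervals):

exp(coef(mod))             #odds-ratios for a 1 unit increase of each predictor#
exp(confint.default(mod))  #exponentiated Wald confidence intervals#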

Difference between the expected value and odds-ratio

What is the difference between the probability "P(E)" and the odds?

In the previous example we have seen that for small values of "P(E)" the odds and the probability are very
similar (the denominator "1 − P(E)" is close to "1"): this means that the odds are a good approximation of the
probability, and the odds-ratio a good approximation of the ratio of the probabilities. In most applications of
logistic regression successes are rare and failures frequent, so the use of the odds-ratio is perfectly consistent
with the structure of the data.

14.6 The deviance
The deviance for the logistic regression is computed from the individual contributions:

di = −2(ℓ(yi , p̂i ) − ℓ(yi , yi )) =
−2((yi log(p̂i ) + (1 − yi ) log(1 − p̂i )) − (yi log(yi ) + (1 − yi ) log(1 − yi ))) =
−2(yi log(p̂i ) + (1 − yi ) log(1 − p̂i ))

where the last equality holds because the saturated contribution "yi log(yi ) + (1 − yi ) log(1 − yi )" vanishes when
"yi ∈ {0, 1}" (see the computational issue below). The deviance residuals are:

ri = sign(yi − p̂i ) √di
Computational issue

In the previous formulas we used the assumption:

0 · log(0) = 0
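
A minimal sketch, using the "soillead" model "mod" fitted above, of how the contributions "di " and the deviance
residuals could be computed and compared with the built-in extractor functions (the comparison lines rely on the
standard "deviance" and "residuals" functions and are given here as an assumption):

yobs=soillead$highbld
phat=fitted(mod)                                     #estimated probabilities p_i#
d.i=-2*(yobs*log(phat)+(1-yobs)*log(1-phat))         #individual contributions d_i#
dev.res=sign(yobs-phat)*sqrt(d.i)                    #deviance residuals#
sum(d.i)                                             #should match deviance(mod)#
head(cbind(dev.res,residuals(mod,type="deviance")))  #should coincide up to rounding#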

14.7 Multiple logistic regression


Let us now consider a multiple (multiple means that we have one response variable and several predictors) logistic
regression with two predictors. We consider the following dataset:

Incidence of bird on islands


• Response variable: "incidence": "1" if the island is occupied by birds, "0" otherwise
• Predictor: "area": area in km2

• Predictor: "isolation": distance from mainland in km

The data are in the file "islands.txt" and the first rows are here:
islands <- read.delim(". . . /islands.txt")
View(islands)

And so our data is organized in the following way:

incidence area isolation


1 7.928 3.317
0 1.925 7.554
1 2.045 5.883
0 4.781 5.932
0 1.536 5.308

So first of all we consider the scatter-plot with one predictor at a time (we consider the response variable
on the "y" axis and the predictor on the "x" axis):

plot(jitter(incidence,factor=0.1)~area,data=islands,ylab="(jittered) incidence")
plot(jitter(incidence,factor=0.1)~isolation,data=islands,ylab="(jittered) incidence")

The first graph suggests an "S" shaped logistic curve (the probability of success increases when "x" increases)
[positive association between the "incidence" response variable and the "area" predictor], whilst the second
graph suggests the horizontal-symmetric shape [negative association between the response variable and the
"isolation" predictor].

When using a model with two predictors, we expect a positive coefficient for "area" and a negative coefficient
for "isolation":
mod=glm(incidence~area+isolation,data=islands, family=binomial(link="logit"))
summary(mod)

We can also plot the logistic curve in both cases:

#model with just "area" predictor#


mod1=glm(incidence~area,data=islands,family=binomial(link="logit"))
#model with just "isolation" predictor#
mod2=glm(incidence~isolation,data=islands,family=binomial(link="logit"))

newdat1=data.frame(area=seq(min(islands$area),max(islands$area),len=300))
hatp=predict(mod1, newdata=newdat1, type="response")
plot(incidence~area,data=islands, col="red")
lines(hatp~area, data=newdat1, col="black", lwd=2)

newdat2=data.frame(isolation=seq(min(islands$isolation),max(islands$isolation),len=300))
hatp=predict(mod2, newdata=newdat2, type="response")
plot(incidence~isolation,data=islands, col="red")
lines(hatp~isolation, data=newdat2, col="black", lwd=2)

14.8 The output

Call:
glm(formula = incidence ~ area + isolation, family = binomial(link = "logit"), data = islands)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.8189 -0.3089 0.0490 0.3635 2.1192

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.6417 2.9218 2.273 0.02302 *
area 0.5807 0.2478 2.344 0.01909 *
isolation -1.3719 0.4769 -2.877 0.00401 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 68.029 on 49 degrees of freedom
Residual deviance: 28.402 on 47 degrees of freedom
AIC: 34.402
Number of Fisher Scoring iterations: 6

So here we find all the coefficients of the predictors and their corresponding Wald test (test for the significance of
the single coefficient). On the last part of the output we find the deviances which are used in the LR test for the
global significance of the model. Notice that here we have different degrees of freedom: the difference is "2" because
the complete model has "2" parameters (the two predictors) more than the null model. We can perform the LR test in two different ways:

#Likelihood Ratio Test using the Deviance#


1-pchisq(68.029-28.402,2)

#Likelihood ratio test using the anova test#


mod0=glm(incidence~1,data=islands,family=binomial(link="logit"))
anova(mod0,mod,test="Chisq")

The LR test yields a p-value of "2.484e − 09", so the model is globally significant. In general when several predictors
are involved we can use the LR test for any reduced model (not only the empty model). Both predictors are
significant (at the 5% level); the use of the Wald test remains unchanged.
The interpretation of the log-odds-ratios must be taken one predictor at a time, the others remaining
fixed (similarly to the mechanism of partial derivatives). For instance:

β̂isolation = −1.3719 eβ̂isolation = 0.2536

means that "1km" increase in isolation is associated with a decrease in the odds of seeing a bird by a factor of
"0.2536".

14.9 Model with interaction

We can add an interaction parameter by using the multiplicative notation "∗" in the Software R formula:

modwi=glm(incidence~area*isolation,data=islands,family=binomial(link="logit"))
summary(modwi)

so for example in the previous script we take the first parameter for the "area", the second for the "isolation" and
the third parameter for the interaction between the first two. This means that our output will have one
more line:

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.0313 7.1747 0.562 0.574
area 1.3807 2.1373 0.646 0.518
isolation -0.9422 1.1689 -0.806 0.420
area:isolation -0.1291 0.3389 -0.381 0.703

Notice that since we added the interaction between the predictors we have a strange behavior: the introduction of
the interaction led us to a completely non-significant model. The interaction and also the predictors (which were
significant before) are now non-significant [this clearly suggests that we should remove the interaction from the
model]. This is immediate from the Wald test (last line), but one can also try the LR test.

14.10 Multinomial logistic regression
When the response variable is categorical with three or more categories (levels) we need to use a multinomial
logistic regression model. We have two types of multinomial response:

• Unordered (for example the "voting preference" or the "cause of death")


• Ordered (for example the "attitude scales (strongly disagree, . . . , strongly agree)" or the "exam grades")

The difference between these two is that in ordered random variables the possible outcomes have "natural order":
this means that we can sort them between two extremes. On the contrary unordered random variables can’t be
sorted.
In this class we consider the case of unordered categories and we’ll give some insight on how to deal with ordered
categories.

British Election Study (I)

The British Election Study dataset is a cross-sectional study on the 2019 elections freely available upon
registration at:
https://www.britishelectionstudy.com/
The whole answers are in the file "bes_rps_2019_1.1.1.dta", a foreign data format from STATA (use
the R-Studio import menu to import the data). In the dataset there are several possible predictors but here
we just consider a simple regression:

• Response variable: "b01 " the party voted in the 2019 election
• Predictor: "Age" the age (here negative values are missing data)
The response variable "b01 " has indeed several levels (it’s pretty complicated):

Labels:
value label
-999 Not stated
-2 Prefer not to say/Refuse
-1 Don’t know
1 Labour Party
2 Conservative Party
3 Liberal Democrats
4 Scottish National Party
5 Plaid Cymru
6 Green Party
7 United Kingdom Independence Party (UKIP)
8 Brexit Party
9 Other
10 An independent candidate
11 Specified name- no party mentioned
12 Spoilt ballot paper
13 None

If we look at the data and consider the frequencies of the outcomes we notice that the greatest part of the
sample is concentrated in only 3 parties. So in order to simplify a bit this example (reduce its complexity)
we restrict it to the three main categories "1 - Labour Party", "2 - Conservative Party" and "3 - Liberal
Democrats". After removing the missing values we have a dataset with 2543 voters. We now want to study
the effect of the "age" on the voting preferences.

We now have to incorporate a dataset of this kind into the logistic regression.

14.11 The multinomial model
Let us consider a response variable "Yi " for the i-th statistical unit with "k" categories and probabilities:

pi,1 , . . . , pi,k

We choose a reference category and then we define a number of models (2 in our case) to compare each category to
the baseline category. So to extend the logistic regression model to the multinomial case, we need to fix a reference
or baseline category. By default we choose the last category "k" as the "reference category" and we write "k − 1"
logistic regressions:

log( pi,1 / pi,k ) = β0,1 + β1,1 x1,i + β2,1 x2,i + . . .

log( pi,2 / pi,k ) = β0,2 + β1,2 x1,i + β2,2 x2,i + . . .

. . .

log( pi,k−1 / pi,k ) = β0,k−1 + β1,k−1 x1,i + β2,k−1 x2,i + . . .
When "k ≥ 3" we have that the denominator is not the complement of the numerator. We also obtain different
coefficients for each model: this means that we obtain different "β0 " for each model we have.
The interpretation of the coefficients is given by the log-odds-ratios:

• "β1,1 " is the change in the log-odds of category "1" as opposed to (vs) category "k" for a "1" unit increase of
the predictor "x1 "
• ...

• "β1,k−1 " the change in the log-odds of category "k − 1" as opposed to category "k" for a "1" unit increase of
the predictor "x1 "

The probabilities for the individual "i" are:

• For category "1":

p̂i,1 = e^{β0,1 + β1,1 x1,i + β2,1 x2,i + ...} / ( 1 + e^{β0,1 + β1,1 x1,i + β2,1 x2,i + ...} + · · · + e^{β0,k−1 + β1,k−1 x1,i + β2,k−1 x2,i + ...} )

• ...

• For category "k − 1":

p̂i,k−1 = e^{β0,k−1 + β1,k−1 x1,i + β2,k−1 x2,i + ...} / ( 1 + e^{β0,1 + β1,1 x1,i + β2,1 x2,i + ...} + · · · + e^{β0,k−1 + β1,k−1 x1,i + β2,k−1 x2,i + ...} )

• For category "k":

p̂i,k = 1 − (p̂i,1 + · · · + p̂i,k−1 )

British Election Study (II)

We use the "multinom" function in the "nnet" Software R package. First, we need to choose the level of
our outcome that we wish to use as our baseline and specify this in the "relevel" function. Then, we run our
model using "multinom". The "multinom" function does not include p-value calculation for the regression
coefficients, so we calculate p-values using Wald tests (here z-tests).
So since we have 3 levels it means that we have to define 2 logistic regression equations: from the first one
we obtain the probability of the first level (in this case the "Labour Party"), from the second one we obtain
the probability of the second level (the "Conservative Party") and the last one is given by the difference.

library(haven)   #needed for read_dta#
library(nnet)    #needed for multinom#
bes_rps_2019_1_1_1 <- read_dta(". . ./bes_rps_2019_1.1.1.dta")


View(bes_rps_2019_1_1_1)
table(bes_rps_2019_1_1_1$b02)

bes2=bes_rps_2019_1_1_1[(bes_rps_2019_1_1_1$b02>=1)&(bes_rps_2019_1_1_1$b02<=3)
&(is.na(bes_rps_2019_1_1_1$b02)==F),]
table(bes2$b02)
bes3=bes2[bes2$Age>0,]
table(bes3$b02)

bes3$b02= factor(as.character(bes3$b02))
bes3$b02= relevel(bes3$b02, ref = 3)
mod= multinom(b02~Age, data=bes3)
summary(mod)

The output is:

Call:
multinom(formula = b02 ~ Age, data = bes3)

Coefficients:
(Intercept) Age
1 1.9009766 -0.02082697 #
2 0.3198647 0.01677515 ##

Std. Errors:
(Intercept) Age
1 0.1978886 0.003611568 #
2 0.1976460 0.003425222 ##

Residual Deviance: 4814.203


AIC: 4822.203

so for the "1" (#) we have:

log( pLabour,i / pLiberal,i ) = β0,Labour + β1,Labour · Agei

and for the "2" (##):

log( pCons,i / pLiberal,i ) = β0,Cons + β1,Cons · Agei
The output of this function is arranged differently from that of the "glm" function. In logistic regression the Wald
test is used to check the significance of each parameter, so we expect a counterpart for our four parameters here;
as we can see, however, the previous output doesn’t provide any p-value.

For the Wald tests we have:
z = summary(mod)$coefficients/summary(mod)$standard.errors
p = (1 - pnorm(abs(z), 0, 1)) * 2

> z #Test statistics#


(Intercept) Age
1 9.606296 -5.766740
2 1.618372 4.897538

> p #p-values#
(Intercept) Age
1 0.0000000 8.081969e-09
2 0.1055825 9.704498e-07

Remember that the test statistic for the Wald test is very simple as we just need to divide the estimate by
the standard error: this means that we can take all the values from the first output. If we perform this we
obtain 4 values for the test statistic and 4 p-values for "β0,Labour ", "β1,Labour ", "β0,Cons " and "β1,Cons ". From
this we can see for example that the predictor "Age" is significant for both the log-odds-ratio equations.
It’s important to underline that we introduced the Wald test for the logistic regression in order to check the
significance of the predictors. Now if we decide that the "age" is significant (non-significant) we must consider
(remove) it in both equations simultaneously. This means that it’s not so important to have the single p-value
for all the parameters because (for example) the whole parameter "age" is considered or removed. Problems
arise for example when our predictor is significant in just one of the equations and we don’t know if we have
to consider it or remove it. We have to change strategy: we consider the likelihood-ratio test.
So we consider the "residual deviance" and then we define the empty model as the model where we remove
the "Age" predictor. We compute the residual deviance of the null model and then perform the likelihood
ratio test. The number of degrees of freedom is the difference between the numbers of parameters, so in this case
it is equal to "4 − 2 = 2".
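
A possible sketch of this computation, assuming that the residual deviance of a "multinom" fit can be read from
its "deviance" component (as printed by "summary"):

mod0=multinom(b02~1,data=bes3)    #empty model: intercepts only#
lr=mod0$deviance-mod$deviance     #difference of the residual deviances#
1-pchisq(lr,2)                    #2 degrees of freedom = 4-2 parameters#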
Here are the graphs of the log-odds (with respect to the "Lib-Dem" baseline) as a function of age (red for "Labour", blue for "Conservative"):

age.d=seq(18,100,0.1)
beta0.1=summary(mod)$coefficients[1,1]
beta1.1=summary(mod)$coefficients[1,2]
beta0.2=summary(mod)$coefficients[2,1]
beta1.2=summary(mod)$coefficients[2,2]
plot(age.d,(beta0.1+beta1.1*age.d),type="l",col="red",ylim=c(-0.5,2.1),ylab="log-odds")
lines(age.d,(beta0.2+beta1.2*age.d),col="blue")

And here are the graphs of the predicted probabilities as a function of age (red for "Labour", blue for
"Conservative, yellow for "Lib-Dem").

p1=exp(beta0.1+beta1.1*age.d)/(1+exp(beta0.1+beta1.1*age.d)+exp(beta0.2+beta1.2*age.d))
p2=exp(beta0.2+beta1.2*age.d)/(1+exp(beta0.1+beta1.1*age.d)+exp(beta0.2+beta1.2*age.d))
p3=1-p1-p2
plot(age.d,p1,type="l",col="red",ylim=c(0,1),ylab="pred prob")
lines(age.d,p2,col="blue")
lines(age.d,p3,col="darkgoldenrod1")

where the predicted probability for the "Conservative" party as a function of age [blue curve] is:

p̂Cons (Age) = e^{β0,Cons + β1,Cons · Age} / ( 1 + e^{β0,Labour + β1,Labour · Age} + e^{β0,Cons + β1,Cons · Age} )

We have the analogous expression for "Labour" [red curve], and the yellow curve is obtained by difference:

p̂LibDem = 1 − p̂Labour − p̂Cons

14.12 Ordered multinomial model
We now investigate how to analyze multinomial regression models when the underlying distribution has ordered
categories: the idea here is to use the cumulative probabilities instead of simple probabilities.
When there is an underlying ordering to the categories, a convenient parameterization is to work with cumulative
probabilities. For instance suppose that the response variable has "k = 4" ordered levels "1, 2, 3, 4" ordered from
lowest to highest. For the i-th individual we define the cumulative probabilities:

γi,j = P(Yi ≤ j) = pi,1 + · · · + pi,j ,    j = 1, 2, 3, 4

As in the previous example, for "k" categories, only "k − 1" equations are needed, because the last cumulative
probability is always "γi,k = 1". We then write a model equation for each cumulative probability:

log( γi,1 / (1 − γi,1 ) ) = β0,1 + hi

. . .

log( γi,k−1 / (1 − γi,k−1 ) ) = β0,k−1 + hi
where "β0,k " is the intercept of each equation and "hi " is a common linear predictor:

hi = β1 xi,1 + · · · + βp xi,p

So this model lies somewhere in between the logistic regression and the regression for quantitative responses (for
example the Gaussian one): we have again only one coefficient for each predictor. This means, for example, that
the same "β1 " coefficient appears in all the equations: we can then perform the Wald test for that single parameter.
Common linear predictor means that the log-odds-ratios for threshold category membership are independent
of the predictors. For instance the log-odds-ratio for categories "1" and "2":

(β0,2 + hi ) − (β0,1 + hi ) = β0,2 − β0,1 = log( odds of "≤ 2" / odds of "≤ 1" )

is constant for all values of the predictors. Similarly for the other log-odds-ratios. This is called the proportional
odds assumption. Of course we can remove this by taking different slopes in the defining equations, but the
model becomes less parsimonious and less easy to apply.

14.13 Ordered multinomial model with Software R


With Software R you can use again the "multinom" function in the "nnet" package or other functions in the
"MASS" package. In all cases, it is essential to correctly specify the structure of the data (unordered or ordered
response variable) and of the model (proportional odds or not) for a correct interpretation of the results.
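
For instance, a proportional-odds model could be fitted with the "polr" function of the "MASS" package; the
following sketch uses a purely hypothetical data frame "dat" with an ordered response "grade" and a numeric
predictor "score" (none of these objects come from the course material):

library(MASS)
dat$grade=factor(dat$grade,ordered=TRUE)    #the response must be an ordered factor#
omod=polr(grade~score,data=dat,Hess=TRUE)   #Hess=TRUE stores the Hessian for the standard errors#
summary(omod)                               #one slope per predictor, one intercept per threshold#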

15 Lab 6
15.1 Logistic Regression
15.1.1 Exercise 1 - Credit card default

The data to be analyzed are in the file "default.txt" in the data folder on AulaWeb and the description of the
data is in the file "default info.txt" in the same folder. The data file contains "10′ 000" observations of credit card
holders with 4 variables: "default", "student", "balance", and "income".

1. Load the data in Software R. First, give a univariate analysis of all the variables.
First of all as always we load the data into the software:

default <- read.csv(". . ./default.txt", sep="")


View(default)

Here the response variable is "default", which is described by 3 predictors: "student" (a categorical variable),
"balance", and "income". The main thing to notice here (but we will deal with it later in the exercise) is the fact
that our response variable is a character variable with two possible outcomes, "Yes" and "No": this is a "problem"
since the "glm" function expects a response coded as "0" and "1" (this is not an issue for categorical predictors,
because in the linear predictor Software R automatically creates the dummy variables for the "k − 1" levels).
We can then analyze our variables by running the following code:

table(default$default) #Analyze the response variable#


table(default$student) #Analyze the categorical predictor#
summary(default$balance) #Analyze the numerical predictor#
boxplot(default$balance, main="Balance")
summary(default$income) #Analyze the numerical predictor#
boxplot(default$income, main="Income")

2. Fit a logistic regression model to classify the response variable default with respect to the explanatory variables
"student", "balance", and "income" (re-code the variables if needed). Evaluate the significance of the model
and of the regressors, and define an appropriate reduced model in the case of non-significant regressors.
Here we have a possible solution for the re-definition of the response variable:

default$def=rep(0,length(default$default))
for (i in 1:length(default$default))
{if (default$default[i]=="Yes"){default$def[i]=1}}

So we defined a vector of "0"s and then we check when the "default" variable is equal to "Yes": if so we set the
value equal to "1", otherwise we leave it at "0". Remember that there are also functions and packages which allow
us to re-code categorical variables automatically. The only important point is that the new variable must be
included in the dataframe, so that we can use it with the "glm" function.
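
For instance, the same recoding could presumably be obtained in a single line with the "ifelse" function (an
equivalent alternative to the loop above):

default$def=ifelse(default$default=="Yes",1,0)   #1 for "Yes", 0 for "No"#
table(default$def,default$default)               #quick check of the recoding#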
We can define the GLM to explain the response variable: now the response variable has been recalled with the
values of "0" and "1" since the GLM functions needs a response variable of this kind (it can’t handle character
variables). So the generalized linear model is:

mod=glm(def~student+balance+income,data=default,family=binomial(link="logit"))
summary(mod)

The output we obtain is the following:

Call:
glm(formula = def ~ student + balance + income, family = binomial(link = "logit"),
data = default)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2920.6 on 9999 degrees of freedom


Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

The first important part of our output is the table of coefficients where we have one description line
for each predictor (each explanatory variable). Since "student" is a categorical variable we have created a
Dummy variable (for the other two quantitative predictors it’s just a standard output). From the sign of the
estimates we can see that the "direction of the dependence" of our response variable "default" with respect to
the predictors. This means that the response variable is positively correlated with "income" and "balance",
whilst is negatively correlated with "student". In the last column we encounter the p-values of the Wald test:
we skip the intercept since it’s not interesting (we consider it even when it’s not significant) and so we just
consider the last 3 predictors. From the p-values we notice that they are all significant expect for the "income"
(this means that if we want to create a reduced model we will have to discard this last predictor).
The second part of the output is about the deviance: these values allow us to perform the Likelihood-ratio test.
This test is not automatically displayed in the output and so we have to compute it by taking the difference
between the "null deviance" and the "residual deviance" with a chi-squared distribution (in this case we have 3
degrees of freedom):

p=1-pchisq(2920.6-1571.5,3)

The p-value we obtain from the test is numerically "0", which means that the model is highly significant (far below
the "5%" threshold). Of course the reported "0" is an approximation: the p-value is not exactly zero, it is simply
smaller than the numerical precision. We can conclude that the predictors give a highly significant improvement
over the null (empty) model.
We now define the reduced model as the model where we discard the non-significant "income" predictor:

mod.red=glm(def~student+balance,data=default,family=binomial(link="logit"))
summary(mod.red)

The output is:

Call:
glm(formula = def ~ student + balance, family = binomial(link = "logit"),
data = default)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.4578 -0.1422 -0.0559 -0.0203 3.7435

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2920.6 on 9999 degrees of freedom


Residual deviance: 1571.7 on 9997 degrees of freedom
AIC: 1577.7

Number of Fisher Scoring iterations: 8

Here, as expected since "income" was non-significant, we obtain essentially the same value of the "residual
deviance" as in the complete model (1571.7 versus 1571.5).

3. To validate the model, compute the predicted classification, classifying "1" when the probability is greater
than "0.5" and "0" otherwise. Compute in particular the percentages of false-positives and false-negatives
observed. Try with different cutoffs, for example "0.2" and "0.8".
This point is about the validation of the model: when performing linear-regression you perform the tests and
then check the value of the R-squared (which is in some sense the correlation between the predicted values and
the observed values). Here we can’t perform this since the response variable has only two possible outputs which
are "0" and "1" while the prediction is computed on the mean of the response variable (so it’s a probability).
So the idea here is that we have to find a way to check if our model correctly predicts the outcomes of the
response variable. In order to do this we need to separate the probabilities: we set a threshold and if the
probabilities are higher than that particular value we assign the value of "1" or "0" otherwise.

So to recap: the observed response variable can assume just the values "0" and "1" but our fitted values don’t
as they follow a function of this kind (see the graph). We have to define a particular threshold to discriminate
the values:

We can for example define a particular rule (a cut-off) for values of "0.2", "0.5", "0.8" (and so on) and check
the results. For example if we want to set a cut-off of "0.5" (the simplest threshold) we have to write:

y.pred=(mod$fitted.values>0.5)
table(y.pred,default$default)

What we obtain is a logical vector which is "TRUE" when the fitted probability exceeds the cut-off (predicted
"Yes") and "FALSE" otherwise (predicted "No"). We then cross-classify the predicted values with the observed
values (this is done by the "table" function):

y.pred No Yes
FALSE 9627 228
TRUE 40 105

From this output we can then compute the "false positive rate" (the proportion of observations predicted "TRUE"
that are actually "No") and the "false negative rate" (the proportion of observations predicted "FALSE" that are
actually "Yes").

t=table(y.pred,default$default)
false.neg=t[1,2]/rowSums(t)[1] #228/(228+9627)#
false.pos=t[2,1]/rowSums(t)[2] #40/(40+105)#
false.neg
false.pos

We can also try to change the cut-off value to investigate if we can obtain a better performance in the sense of
reducing the proportion of "false positive" (it is the higher proportion in this case). Remember that when we
move the threshold we improve one classification (for example reduce the false positive rate) but we obtain a
worse classification of the other one (we increase the false negative rate).

The output we obtain is:

> false.neg
FALSE
0.02313546

> false.pos
TRUE
0.2758621
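
A possible sketch for comparing several cutoffs at once (here "0.2", "0.5" and "0.8", as suggested in the exercise;
it assumes that both predicted classes occur at every cutoff, otherwise the table would lose a row):

for (cut in c(0.2,0.5,0.8))
{
y.pred=(mod$fitted.values>cut)
t=table(y.pred,default$default)
cat("cutoff =",cut,
    " false.neg =",t[1,2]/rowSums(t)[1],
    " false.pos =",t[2,1]/rowSums(t)[2],"\n")
}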

4. This item is an introduction to cross-validation (we will develop this topic in the last section). Split the
dataset into two datasets (with "5′ 000" observations each) and repeat item 2 using the first dataset and
item 3 using the second one.
Cross-validation is a set of procedures we use to have a better evaluation of the statistical model. The best
method is to split the dataset into two parts:
• The first part is used to estimate the parameters
• The second part is used to check the goodness of fit of the model (!NOT ON THE SAME DATA!)

So the idea is to repeat the previous point but on different datasets. We then split our "10′ 000" observations
into two subsets (in this case we chose the 50/50 proportion): with the first set we estimate the parameters
and with the second one we compute the false positive and false negative rate. This means that the predicted
values are computed on the second subset of the dataset by using the coefficients computed in the first one. So
first of all we randomly sample "5′ 000" lines from the dataset and then split the main dataset into
two subsets:
selection=sample(1:10000,5000,F)
training.set=default[selection,]
test.set=default[-selection,]

Now we use the first set to compute the coefficients of the model and then we use the second one to perform
the tests (table of cross-classification). Remember that every run gives slightly different results!
Here we compute the parameters on the training set:

mod.tr=glm(def~student+income+balance,data=training.set,family=binomial(link="logit"))
summary(mod.tr)

and we obtain:
. . .

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.013e+01 6.568e-01 -15.429 <2e-16 ***
studentYes -6.001e-01 3.175e-01 -1.890 0.0587 .
income -7.254e-06 1.147e-05 -0.633 0.5270
balance 5.503e-03 3.117e-04 17.653 <2e-16 ***

. . .

We now use this set of coefficients to define the predicted values on the test subset:

coef=mod.tr$coefficients
test.set$prob=rep(NA,5000)
for (i in 1:5000)
{
lin.pred=coef[1]+coef[2]*(test.set$student[i]=="Yes")+coef[3]*test.set$income[i]+
coef[4]*test.set$balance[i]
test.set$prob[i]=exp(lin.pred)/(1+exp(lin.pred))
}
test.set$pred=(test.set$prob>0.5)

table(test.set$pred,test.set$default)

The only problem from the computational point of view is the fact that we have a categorical predictor. If we
only have quantitative predictors the machinery is easy since we just have to define the linear function and
then transform it with the inverse of the link function. This isn’t the case here, since "student" is a categorical
predictor and we can’t multiply a character variable by a coefficient: we therefore use the indicator
"(test.set$student[i]=="Yes")", which is "1" for students and "0" otherwise, inside the linear predictor.
By using the coefficients we found we can then compute the linear predictors for each line of our test-set.
From this we compute the inverse of the link function and then the estimate of the probability. We find the
fitted value, we define a threshold and then compute the prediction and the cross-classification.
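
As a remark, the same predicted probabilities on the test set could presumably be obtained directly with the
"predict" function, avoiding the explicit loop and the manual treatment of the categorical predictor:

prob.test=predict(mod.tr,newdata=test.set,type="response")  #probabilities on the test set#
pred.test=(prob.test>0.5)
table(pred.test,test.set$default)                           #cross-classification table#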

16 Regression for count data
16.1 Introduction
In the previous chapter our response variable was a finite random variable (binary random variable [with the theory
of the logistic regression] and then multinomial random variable [multinomial logistic regression]). We now move
to the framework of "count" data: our response variable is a non-negative integer variable without an upper bound.
When the response variable is a count, the first option is to model it with a Poisson distribution (we already saw
this when we dealt with the Fisher Scoring).

Remember
• The Poisson distribution belongs to the exponential family. If:

Y ∼ P(λ)

we have the following density function for each response variable:


f (y|λ) = e^{−λ} λ^y / y!
and so the log-likelihood can be written in the form:

ℓ(λ|y) = −λ + y log(λ) − log(y!)

• The special property of the Poisson distribution is that the variance is equal to the mean:

E(Y ) = VAR(Y ) = λ

This is a good property but also a limitation since we can only consider counts for which the variance
must be at least similar to the expected value (mean). This means that when the difference between
these two values becomes relevant we are forced to consider another distribution.

16.2 Poisson regression


From the log-likelihood:
ℓ(λ|y) = −λ + y log(λ) − log(y!)
we obtain that the canonical link function (the function which gives the canonical parameter in the exponential
family) for the Poisson distribution is:
g(λ) = log(λ)
Poisson regression model

Given a sample "(yi , xi )" with "i = 1, . . . , n", the model (the link function of the expected value):

g(E(Yi )) = log(λi ) = xi^t β = β0 + β1 xi,1 + β2 xi,2 + . . .

is equal to the linear predictor. This function is defined as Poisson regression model.

Some remarks:

• This function is simpler than the logit function of the logistic regression: instead of a "logit function" we have
now the "log" and still we have
λi > 0 =⇒ −∞ < log(λi ) < +∞
thus the linear predictor (the "log(λi )") can range over the whole real numbers.
• In Poisson regression we assume that "log(λi )" is linearly related to the predictors.

• The Poisson distribution is heteroskedastic by definition because the variance is equal to the expected value:
this means that if the expected value increases so does the variance.

Elephant dataset (I)

Let us consider a simple Poisson regression to model the number of matings of the elephants as a function
of their age.
• Response variable: "N umber_of _M atings"

• Predictor: "Age_in_Y ears"


The data are in the file "elephant.txt". In order to visualize the data import the dataset:

elephant <- read.csv(". . ./elephant.txt", sep="")


View(elephant)

The first rows are:


Age_in_Years Number_of_Matings
1 27 0
2 28 1
3 28 1
4 28 1
5 28 3

To visualize the scatterplot of the data we run the script:

plot(Number_of_Matings~Age_in_Years,data=elephant)

and from the plot we already observe that the number of matings (count response variable) seems to
increase with the age (also notice that the variance seems to increase with the expected value: again a
sort of megaphone effect). This means that a Poisson distribution may be a good representation (a good
probability model) for this dataset:
• Elephants with around 30 years have between 0 and 3 mates

• Elephants with around 45 years have between 0 and 9 mates

16.3 Poisson regression parametrization
Let us consider the response variables as:
Yi ∼ P(λi )
where "λi = E(Yi )" is the mean response for the i-th trial (here we are using the Poisson distribution inside the
GLM framework).

GLM model

We use a GLM (with the canonical link function) of the form:

g(λi ) = log(λi ) = β0 + β1 xi

Elephant dataset (II)

So we now use the "glm" function in Software R:

mod=glm(Number_of_Matings~Age_in_Years,data=elephant, family=poisson(link="log"))
summary(mod)

and visualize the output:

Call:
glm(formula = Number_of_Matings ~ Age_in_Years, family = poisson(link = "log"),
data = elephant)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.80798 -0.86137 -0.08629 0.60087 2.17777

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201 0.54462 -2.905 0.00368 **
Age_in_Years 0.06869 0.01375 4.997 5.81e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
(Dispersion parameter for poisson family taken to be 1)

Null deviance: 75.372 on 40 degrees of freedom


Residual deviance: 51.012 on 39 degrees of freedom
AIC: 156.46
Number of Fisher Scoring iterations: 5

As usual we have the Wald tests: in this case we have one test for the intercept (usually it is not relevant
and we skip the analysis of this since we never remove the intercept from the model even when it is non-
significant) and one for the predictor (which is significant).
From the table of the deviances we can also compute the likelihood ratio test. Since the dispersion parameter
is equal to "ϕ = 1", in order to compute the likelihood ratio test it’s enough to compute the difference between
the deviances and then compare them to the chi-square distribution with one degree of freedom (complete-
null).

From the output we have that:
• The dispersion parameter is not estimated, since Poisson doesn’t have an unknown dispersion parameter
(the dispersion parameter is equal to "1" since the variance is equal to the expected value)
• The likelihood ratio test gives:

> 1-pchisq(75.372-51.012,1)
[1] 7.991084e-07

(try also with the "anova" function).

• The Wald test here is equivalent to the likelihood ratio test, because we work with one predictor. From
this we have that "Age" is a significant predictor of the number of mates: we confirm the positive
association.

16.4 Predicted values


The predicted values are obtained through inversion of the canonical link function. So for a given "x", since
the link function is a logarithm, to compute the expected value we have to use the exponential function:

ŷ = eβ̂0 +β̂1 x
For instance, if "x = 30" we have:

> exp(-1.58201+0.06869*30)
[1] 1.613959

This means that for an age of "30" the mean number of mates is "1.61", and also the variance is "1.61". The plot
of the fitted curve (an increasing exponential, since "β̂1 > 0") is:

newdat=data.frame(Age_in_Years=seq(min(elephant$Age_in_Years),max(elephant$Age_in_Years),
len=300))
hatlambda=predict(mod, newdata=newdat, type="response")
plot(Number_of_Matings~Age_in_Years,data=elephant, col="red")
lines(hatlambda~Age_in_Years, data=newdat, col="black", lwd=2)

16.5 Interpretation of the parameters
In the logistic regression framework we have seen that the interpretation of the absolute value of the parameters
is obtained by means of the odds-ratio.
Here we are in a simpler case, as we don’t have to compute an odds-ratio: we just consider the expected value "λ"
for a given level of the predictor and the same quantity for the predictor increased by one unit. We then take the
ratio between these two values (notice that the predicted values are the exponential of the linear predictor):

λ(x + 1) / λ(x) = e^{β0 + β1 (x+1)} / e^{β0 + β1 x} = ( e^{β0} e^{β1 x} e^{β1} ) / ( e^{β0} e^{β1 x} ) = e^{β1}

This means that when the predictor is increased by one unit the expected value becomes:

λ(x + 1) = λ(x)eβ1

So the difference between the linear regression and the Poisson regression is that in linear regression the
slope "β1 " represents a linear increase of the response variable "y" given a unit increase of the "x". On the contrary
in the Poisson case we have a multiplicative effect: when we increase the predictor by a unit the response variable
is multiplied by a factor equal to "eβ1 ". From this we notice again that if "β1 > 0" then the factor "eβ1 > 1" and
then we have a positive association between the "x" and the "y". Otherwise if "β1 < 0" the factor "0 < eβ1 < 1" and
so we have an inverse association.
An increase of the "X" by "1" unit has a multiplicative effect on the mean by "eβ1 ".

• If "β1 > 0", the expected value of "Y " increases with "X"
• If "β1 < 0" the expected value of "Y " decreases as "X" increases

The "eβ0 " is the mean of "Y " when "X = 0". For the "elephant" data we have (we look again at the table of the
coefficients):

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201 0.54462 -2.905 0.00368 **
Age_in_Years 0.06869 0.01375 4.997 5.81e-07 ***

• "β1 > 0" so the number of mates increases with the age (from the Wald test this increase is significant). More
precisely, the increase of "1" year in age yields an increase of the number of mates by a multiplicative factor
of "e0.06869 "
• The "β0 " is not meaningful in this example since the age "0" makes no sense.

We now consider an example with several predictors:

Video games sales (I)

• Response variable: "Global_Sales": total sales across the globe


• Predictor: "Genre": genre of the game
• Predictor: "Publisher": publisher of the game
• Predictor: "User_Score": Score given by Metacritic’s subscribers

• Another response variable: "NA_Sales": total sales in North America

The data is in the file "Video_Games.csv": it is a quite big dataset and it also contains several other
variables. Note that the "User_Score" has a lot of missing values.
Remark: first of all we have to visualize the data and perform a data-cleaning. For instance take:

library(readr)
Video_Games <- read.csv(". . ./Video_Games.csv")
View(Video_Games)

sort(table(Video_Games$Publisher))

There are "581" different publishers (a categorical predictor): this means that we have to define "k − 1"
dummy variables. This is not actually difficult to implement but it’s quite difficult to interpret as the
number is too high. In order to overcome this issue we then consider only the 9 most represented (with at
least 3% of the games).

Video_Games_red=Video_Games[(Video_Games$Publisher=="Sega" |
Video_Games$Publisher=="Sony Computer Entertainment" |
Video_Games$Publisher=="Nintendo" |
Video_Games$Publisher=="THQ" |
Video_Games$Publisher=="Konami Digital Entertainment" |
Video_Games$Publisher=="Ubisoft" |
Video_Games$Publisher=="Namco Bandai Games" |
Video_Games$Publisher=="Activision" |
Video_Games$Publisher=="Electronic Arts"),]
table(Video_Games_red$Publisher)
barplot(table(Video_Games_red$Publisher),las=2)
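
An equivalent and more compact way to build the same subset is with the "%in%" operator (a sketch listing the same nine publishers):

top_pubs=c("Sega","Sony Computer Entertainment","Nintendo","THQ",
"Konami Digital Entertainment","Ubisoft","Namco Bandai Games",
"Activision","Electronic Arts")
Video_Games_red=Video_Games[Video_Games$Publisher %in% top_pubs,]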

Moreover we make the response variable "sales" an integer (we multiply by "1000"):

Video_Games_red$NA_Sales=1000*Video_Games_red$NA_Sales
Video_Games_red$Global_Sales=1000*Video_Games_red$Global_Sales

Now that we have a "clean" dataset we use the "glm" function to define the model. First we consider the
Poisson regression with two predictors, "Genre" and "Publisher":

mod=glm(Global_Sales~Genre+Publisher,data=Video_Games_red,family=poisson(link="log"))
summary(mod)

The output we obtain is:

Call:
glm(formula = Global_Sales ~ Genre + Publisher, family = poisson(link = log),
data = Video_Games_red)

Deviance Residuals:
Min 1Q Median 3Q Max
-90.65 -23.09 -13.05 2.04 633.99

Coefficients:
Estimate Std. Error z value
(Intercept) 6.386848 0.001482 4309.64
GenreAdventure -0.562623 0.003060 -183.88
GenreFighting 0.421894 0.002232 188.98
GenreMisc -0.040151 0.001754 -22.89
GenrePlatform 0.472822 0.001653 285.97
GenrePuzzle -0.291474 0.002796 -104.25
GenreRacing 0.327363 0.001790 182.85
GenreRole-Playing 0.232490 0.001812 128.33
GenreShooter 0.657660 0.001615 407.21
GenreSimulation 0.090501 0.002180 41.52
GenreSports 0.225532 0.001512 149.21
GenreStrategy -0.504285 0.003259 -154.75
PublisherElectronic Arts 0.088556 0.001552 57.05
PublisherKonami Digital Entertainment -0.703406 0.002247 -313.02
PublisherNamco Bandai Games -0.907799 0.002356 -385.28
PublisherNintendo 1.306718 0.001461 894.46
PublisherSega -0.513940 0.002271 -226.30
PublisherSony Computer Entertainment 0.205951 0.001760 117.03
PublisherTHQ -0.398753 0.002113 -188.72
PublisherUbisoft -0.309771 0.001887 -164.17

(Dispersion parameter for poisson family taken to be 1)


Null deviance: 12594787 on 7792 degrees of freedom
Residual deviance: 9369983 on 7773 degrees of freedom
AIC: 9427953
Number of Fisher Scoring iterations: 6

Since both predictors are categorical, we have a lot of parameters in the output (for each predictor we
have to define dummy variables).
Notice that in the output we didn't report the p-values: when we deal with categorical predictors, each predictor
gives rise to a whole set of parameters, and it is not interesting to assess the significance of each single
parameter. Remember that we perform the tests in order to obtain a reduced model
(to eliminate predictors from the complete model).

In this particular case we can't remove a single parameter (for example "PublisherElectronic Arts") while
keeping all the other parameters of the same categorical predictor ("Publisher"): if we want to remove something
we have to remove the entire predictor (in this case we have two "blocks" of parameters).

[we can't perform the Wald test on the single coefficients]


Remember that the dispersion parameter is "ϕ = 1" as we have fixed the Poisson distribution.
So when we need to check the simultaneous significance of several parameters, we use
a likelihood ratio test even to determine the significance of a single predictor. So to check the
significance of a single predictor (for example "Publisher"), we define the reduced model (the one without
the predictor "Publisher") and then we perform the likelihood ratio test between the two models. Since the
output is quite big we can simply use the "anova" function with a chi-square test:

modred=glm(Global_Sales~Genre,data = Video_Games_red,family = poisson(link = log))


anova(modred,mod,test="Chisq")

And so from the output we can see that the predictor "publisher" is highly significant (the p-value is less
than the machine precision):

Analysis of Deviance Table


Model 1: Global_Sales ~ Genre
Model 2: Global_Sales ~ Genre + Publisher

Resid. Df Resid. Dev Df Deviance Pr(>Chi)


1 7781 11943370
2 7773 9369983 8 2573387 < 2.2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
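
Equivalently, the p-value can be computed by hand from the deviance drop reported in the table (8 degrees of freedom, one for each dummy variable of "Publisher"):

1-pchisq(11943370-9369983,8)   #numerically 0, i.e. < 2.2e-16#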

If for example we want to investigate a model with the interaction between the predictors we just have to
replace the "+" with a "∗" inside the script:

mod.int=glm(Global_Sales~Genre*Publisher,data=Video_Games_red,family=poisson(link="log"))
summary(mod.int)

Here we don’t print the output since it is quite long. If we want to check the significance of the interaction
we again use the "anova" function, using this last model as the complete model and the model without
interaction as the reduced model:
anova(mod,mod.int,test="Chisq")

From the output we see that the interaction is highly significant but it’s not easy at all to give an interpre-
tation of the results as we need to check every possible level.

As an exercise, perform the other tests. What are the equations for this model, i.e. how do we use the coefficients to
obtain predicted values? What happens if you add the interaction term(s)?
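As a hint, the predicted values follow the same scheme as before: the expected count is the exponential of the linear predictor built from the dummy variables. A hedged sketch (the genre and publisher below are just an illustrative choice):

newobs=data.frame(Genre="Sports",Publisher="Nintendo")
predict(mod,newdata=newobs,type="response")   #equals exp(intercept + GenreSports + PublisherNintendo)#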

Video games sales (II)

Repeat the analysis by using also the "User_Score" predictor (be careful as there are several "missing
values").

16.6 Poisson regression for rates: the offset
In some applications, your count response variable is the numerator of a rate. This means that in this
case it's not the absolute count that matters, but its proportion with respect to some reference quantity
(possibly different for each observation).
Remark: remember that the Poisson distribution is a good approximation of the binomial distribution when "n"
is large and "p" is small.
In such a case the response "Y" has a Poisson distribution, but we want to model:

$$\frac{Y_i}{t_i}$$

where "t_i" is an index of time or space (the exposure). If we want to model the expected value of the rate "Y/t" by means of a
set of predictors "X_1, \dots, X_p", the link function must be the logarithm of the expected value of the rate "Y/t". The
Poisson GLM regression model for the expected rate of occurrence of the event is:

$$\log\left(\frac{\mu_i}{t_i}\right) = \eta_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}$$

This can be rewritten as:

$$\log(\mu_i) = \log(t_i) + \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}$$
The term "log(ti )" is referred to as an offset (we allow each observation in the dataset to have different intersect
in the linear predictor). It is an adjustment term and a group of observations may have the same offset, or each
individual may have a different value of "t". The term "log(t)" is an observation and it will change the value of
estimated counts:
µ̂i = elog(ti )+β̂0 +β̂1 xi,1 +···+β̂p xi,p = ti · eβ̂0 +β̂1 xi,1 +···+β̂p xi,p
This means that the mean count is proportional to "ti ". The interpretation of the parameter estimates "β̂" is the
same as for the model of counts. Only the expected values change, since we need to multiply each expected count
by the corresponding "ti ". So in this case we have the same structure of the Poisson distribution but the intercept
assumes different values for each observation).
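
In R the offset can be supplied either through the "offset" argument of "glm" or directly in the formula with the "offset()" function; a schematic sketch (here "y", "x", "t" and "dat" are placeholders):

glm(y~x,family=poisson(link="log"),offset=log(t),data=dat)
glm(y~x+offset(log(t)),family=poisson(link="log"),data=dat)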

COVID-19 mortality

We want to model the COVID-19 mortality rate in various European regions as a function of the air pollution.
All the relevant data are in the Excel file "COVID19_Europeandata_new.xlsx". The columns we use
here are:

• "COVID19_D" (number of deaths): this isn’t our real response variable as the number of deaths de-
pends on the number of inhabitant of a particular region (the correct response variable is the proportion
between the number of deaths and the population).
• "NO2_avg" (average NO2 concentration, as an indicator of the air pollution).

• "POPULATION " (our offset)


So our aim is to check if the "air pollution" is somehow associated with the number of deaths from Covid-19. First
of all we import and visualize the dataset:

library(readxl)
COVID19_European_data_new<-read_excel(".../COVID19_European data_new.xlsx",sheet="Data")
View(COVID19_European_data_new)

After removing rows with missing values of the response variable, we perform Poisson regression
with the offset:
dat2=COVID19_European_data_new[is.na(COVID19_European_data_new$COVID19_D)==F,]
dim(dat2)

So in order to obtain the model with the offset we just need to add a further information into the "glm"
function:
mod_wo=glm(COVID19_D~NO2_avg,data=dat2,family = poisson(link ="log"),
offset=log(POPULATION))
summary(mod_wo)

and so we obtain the following output:

Call:
glm(formula = COVID19_D ~ NO2_avg, family = poisson(link = "log"), data = dat2,
offset = log(POPULATION))

Deviance Residuals:
Min 1Q Median 3Q Max
-38.602 -8.289 -4.897 1.506 111.521

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.167130 0.006608 -1235.98 <2e-16 ***
NO2_avg 0.051509 0.001457 35.34 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

From the table of coefficients we see that the "pollution" predictor is significant: this means that we have a significant
(and positive) effect of the air pollution on the Covid-19 mortality.
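
As in the model for counts, the slope acts multiplicatively, here on the mortality rate; a quick check with the estimate above:

exp(0.051509)   #about 1.053: one extra unit of NO2_avg increases the expected mortality rate by roughly 5.3%#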
Note that using a linear model on the proportion (so we don’t consider the offset) gives a worse model:

mod_an=lm((COVID19_D/POPULATION)~NO2_avg,data=dat2)
summary(mod_an)

Call:
lm(formula = (COVID19_D/POPULATION) ~ NO2_avg, data = dat2)

Residuals:
Min 1Q Median 3Q Max
-0.0003046 -0.0002246 -0.0001161 0.0001264 0.0042280

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.248e-04 2.575e-05 8.730 <2e-16 ***
NO2_avg 1.407e-05 5.940e-06 2.368 0.0181 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.000331 on 816 degrees of freedom
(222 observations deleted due to missingness)
Multiple R-squared: 0.006827,Adjusted R-squared: 0.00561
F-statistic: 5.609 on 1 and 816 DF, p-value: 0.0181

16.7 Overdispersion
A strong limitation of the Poisson distribution is that the (conditional) variance is equal to the (conditional) mean.
When the variance exceeds the mean (expected value), we say that there is overdispersion (of course there’s also
the phenomenon of underdispersion but it is not that usual). There are at least two possible solutions for dealing
with overdispersion:

• To use a Negative Binomial distribution for the response variable (it has 2 parameters, so the expected
value and the variance can be set to different values):

$$E(Y) = \frac{r(1-p)}{p} \qquad VAR(Y) = \frac{r(1-p)}{p^2}$$

• To use a Zero-Inflated Poisson distribution for the response variable (we use this distribution when we
have an excess of zeros in our distribution):

Y = αδ0 + (1 − α)P (λ)

Remember that in both cases we move outside the framework of exponential families, so parameter estimation
and hypothesis testing are more complicated than the strategy we saw for the GLM
theory.

16.8 Negative Binomial regression


The density of a Negative Binomial random variable "Y" is rewritten as a function of two parameters "µ" and "θ"
(dispersion parameter) with:

$$E(Y) = \mu \qquad VAR(Y) = \mu + \frac{\mu^2}{\theta}$$
Dispersion parameter

The parameter "θ" is a positive parameter and has the role of a dispersion parameter, but it is not the
dispersion parameter of the exponential family.

The parameter "θ" is estimated separately and then used in the distribution of the Negative Binomial as a known
parameter. In this sense, the Negative Binomial is a one-parameter exponential family. The idea is to consider the
two values of separately: we estimate (fix) the "θ" and then estimate the GLM with only the mean "µ". From the
previous equation:
µ2
V AR(Y ) = µ +
θ
we see that:

• The variance is always greater than the mean because "θ" is a dispersion parameter (it is non-negative)
and the second term is positive (the Negative Binomial is used only for overdispersion, not for under-
dispersion)
• As the parameter goes to infinity "θ → ∞" the Negative Binomial distribution has a limit which is the Poisson
distribution (the second term goes to "0" and so the variance becomes equal to the expected value)
• As the parameter goes to zero "θ → 0" we have that the variance goes to infinity "V AR(Y ) → ∞"

Depending on where our estimate "θ̂" falls on the real line we can decide whether the Poisson distribution is enough
to describe the data or not (a large value of "θ̂" supports the Poisson model choice, since the overdispersion is small).

In practice, the Negative Binomial regression works like a Poisson regression, with link function "log(µ)".
The Negative Binomial regression is not in the standard Software R packages: one possible implementation is
the function "glm.nb" in the package "MASS".

Negative binomial regression

Let us reconsider the dataset on "Videogames" under the Negative Binomial regression. First of all we
import the data (the reduced dataset "Video_Games_red" is built as before) and then we fit the GLM:

library(readr)
Video_Games <- read_csv(". . ./Video_Games.csv")
View(Video_Games)

library(MASS)
mod.nb1=glm.nb(Global_Sales~Genre+Publisher,data=Video_Games_red)
summary(mod.nb1)

And so what we get is:

......
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6823 -1.1216 -0.5844 0.0623 6.4889
Coefficients:
Estimate Std. Error z value
(Intercept) 6.359054 0.043378 146.596
GenreAdventure -0.530008 0.067314 -7.874
......
(Dispersion parameter for Negative Binomial(0.758) family taken to be 1)
Null deviance: 12116.2 on 7792 degrees of freedom
Residual deviance: 9295.1 on 7773 degrees of freedom
AIC: 114705
Number of Fisher Scoring iterations: 1
Theta: 0.7580 #Dispersion parameter#
Std. Err.: 0.0105
2 x log-likelihood: -114663.3600

The important point here is the line concerning the dispersion parameter. It is estimated at "θ̂ = 0.7580"
showing strong overdispersion (it is small). This result tells us that it is better to keep the Negative
Binomial distribution instead of the Poisson distribution. In the same line we also notice that
"ϕ̂ = 1". In the last line we also obtain the "2 x log − likelihood" value used for the likelihood ratio test.
If we want to compare our model with any other model we have to compare the log-likelihood (we can use
the "logLik" function). For example we compare our model with the Poisson model:

l1=logLik(mod.nb1)
l2=logLik(mod)
chi2 = as.numeric(-2*(l2-l1))
pchisq(chi2,1,lower.tail=F)

From the output we see that the p-value of the chi-square test is (numerically) equal to "0": this means that the
reduced model (the Poisson) is not a good model and so we have to stick to the Negative Binomial distribution (we reject the Poisson).
If the p-value were greater than the threshold of "5%" we could assume the Poisson distribution.

Negative Binomial regression

Try with the following model:


Global_Sales ~ User_Score + Publisher

Negative Binomial regression

On the elephant data, check that the Poisson model is reasonable.

16.9 ZIP regression

ZIP regression

Let us consider the following data concerning the ear infections in swimmers. The relevant variables here
are:
• Response variable "Infections": number of infections

• Categorical predictor "Swimmer": "Occasional" or "Frequent"


• Categorical predictor "Location": "Beach"or "NonBeach"
The data are available in the file "earinf.txt". First of all we import and take a look at the dataset:

earinf <- read.delim(". . ./earinf.txt")


View(earinf)

Swimmer Location Age Sex Infections


1 Occas NonBeach 15-19 Male 0
2 Occas NonBeach 15-19 Male 0
3 Occas NonBeach 15-19 Male 0
4 Occas NonBeach 15-19 Male 0
5 Occas NonBeach 15-19 Male 0

So if we proceed with the standard Poisson model:

mod=glm(Infections~Swimmer+Location,data=earinf,family=poisson(link="log"))
summary(mod)
table(earinf$Swimmer,earinf$Location) #to count the observed counts vs the predicted#

The output we obtain is:

Call:
glm(formula = Infections ~ Swimmer + Location,
family = poisson(link = "log"), data = earinf)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1266 -1.5652 -1.2137 0.5128 6.2538
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3058 0.1059 -2.887 0.00389 **
SwimmerOccas 0.6130 0.1050 5.839 5.24e-09 ***
LocationNonBeach 0.5087 0.1028 4.948 7.49e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 824.51 on 286 degrees of freedom
Residual deviance: 764.65 on 284 degrees of freedom
AIC: 1143
Number of Fisher Scoring iterations: 6

and we notice that both predictors are significant. From the following output we can compute the expected
value of each Poisson. Since the two predictors are categorical with 2 levels we only have 4 Poisson
distributions for the response variable.

So we compute the parameter for each one:

Estimate Std. Error z value Pr(>|z|)


(Intercept) -0.3058 0.1059 -2.887 0.00389 **
SwimmerOccas 0.6130 0.1050 5.839 5.24e-09 ***
LocationNonBeach 0.5087 0.1028 4.948 7.49e-07 ***

The (four) predicted values are:


• For Frequent Beach swimmers: e−0.3058 = 0.7365

• For Frequent NonBeach swimmers: e−0.3058+0.5087 = 1.2249


• For Occasional Beach swimmers: e−0.3058+0.6130 = 1.3596
• For Occasional NonBeach swimmers: e−0.3058+0.6130+0.5087 = 2.2612
But if we take a look at the plots below (observed vs expected counts) we see that the observed and predicted
values aren't close at the count "0" (the observed number of zeros is always higher than the predicted one):

par(mfrow=c(2,2))
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,50),xlab="Number of infections",ylab="Counts",space=0,main="Freq+Beach")
points((0:17)+0.5,72*dpois(0:17,0.7365),pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,40),xlab="Number of infections",ylab="Counts",space=0,main="Freq+NonBeach")
points((0:17)+0.5,71*dpois(0:17,1.2249),pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,45),xlab="Number of infections",ylab="Counts",space=0,main="Occas+Beach")
points((0:17)+0.5,75*dpois(0:17,1.3596),pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,30),xlab="Number of infections",ylab="Counts",space=0,main="Occas+NonBeach")
points((0:17)+0.5,69*dpois(0:17,2.2612),pch=18,cex=1.5)
par(mfrow=c(1,1))

What we notice is that the count of "0 infections" is largely underestimated by the model (this means that there is
an excess of zeros).

So when there is an excess of observed zeros as (luckily) in this case, we may use a Zero Inflated Poisson (ZIP)
model.

Remember

A Zero Inflated Poisson distribution is the mixture of a constant (non-random) variable in "0" (with
a given probability) and a Poisson random variable with density "P(λ)".

ZIP model
The idea is that we have two nested GLMs: first a logistic regression tells us whether an observation is a
structural "0" or comes from a Poisson distribution. Then, for the elements coming from the Poisson distribution,
we run a Poisson regression to estimate the mean.
So there are two steps: the first one is a logistic regression to discriminate the observations (structural "0" or
Poisson), and then we estimate the "λ" with the standard Poisson regression [the problem here is
not with the positive counts, which of course come from the Poisson distribution, but with the "0"s, which may
either come from the Poisson or be structural zeros].
So the ZIP model is:

$$Y_i \mid x_i \sim \begin{cases} 0 & \text{with probability } p_i \\ \mathcal{P}(\lambda_i) & \text{with probability } 1 - p_i \end{cases}$$
The ZIP model with this equation is estimated with a two-step procedure:

1. First we estimate the probability "pi " with a logistic regression model:

logit(pi ) = xti γ

2. Second we estimate the parameter "λi " with a Poisson regression model:

log(λi ) = xti β

Zip regression model

To perform ZIP regression in Software R, we use the function "zeroinfl" in the package "pscl" (analog of
the "glm" function for the ZIP). We now use a ZIP regression model on the ear infection data:

library(pscl)
mod.zi=zeroinfl(Infections~Swimmer+Location,data=earinf,dist="poisson")
summary(mod.zi)

Note that we must specify "dist="poisson"", because the same function also models zero inflation in
negative binomials and other discrete distributions.
The output is:

Call:
zeroinfl(formula = Infections ~ Swimmer + Location, data = earinf, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-0.9896 -0.7199 -0.5957 0.3513 7.4226
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6079 0.1224 4.965 6.87e-07 ***
SwimmerOccas 0.5154 0.1229 4.194 2.74e-05 ***
LocationNonBeach 0.1221 0.1151 1.061 0.289
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3868 0.2441 1.585 0.11295
SwimmerOccas -0.1957 0.2748 -0.712 0.47640
LocationNonBeach -0.7543 0.2695 -2.799 0.00513 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Number of iterations in BFGS optimization: 13
Log-likelihood: -475.2 on 6 Df

So as we can see the output has exactly two sections: the first one is the estimation of the parameters
of the Poisson distribution (the standard Poisson regression), whilst the second one is the logistic regression
where we decide if a single observation must be considered a structural "zero" or a "0" coming from the
Poisson distribution. The parameter estimates are (for each of the 4 groups we define the probability
of a structural "0" and then the expected value of the Poisson distribution):
• For "Frequent/Beach":
e0.3868
⋄ p̂ = (1+e0.3868 ) = 0.5955
⋄ λ̂ = e0.6079 = 1.8366
• ...

[Compute all the parameter estimates for the ZIP regression.]
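
All four pairs can also be recovered from the fitted object (a sketch; "predict" for "zeroinfl" objects accepts type="zero" for the mixing probability and type="count" for the Poisson mean):

phat=predict(mod.zi,type="zero")     #estimated probability of a structural zero#
lamhat=predict(mod.zi,type="count")  #estimated Poisson mean#
unique(round(data.frame(p=phat,lambda=lamhat),4))   #the four (p, lambda) pairs#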

We can now plot the observed counts vs the predicted counts:

table(earinf$Swimmer,earinf$Location)

c1=mod.zi$coefficients[1]
c2=mod.zi$coefficients[2]

par(mfrow=c(2,2))
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,50),xlab="Number of infections",ylab="Counts",space=0,main="Freq+Beach")
points((0:17)+0.5,72*(0.5955*c(1,rep(0,17))+(1-0.5955)*dpois(0:17,1.8366)),
pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Freq"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,40),xlab="Number of infections",ylab="Counts",space=0,main="Freq+NonBeach")
points((0:17)+0.5,71*(0.4091*c(1,rep(0,17))+(1-0.4091)*dpois(0:17,2.0751)),
pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="Beach")],levels=0:17)),
ylim=c(0,45),xlab="Number of infections",ylab="Counts",space=0,main="Occas+Beach")
points((0:17)+0.5,75*(0.5476*c(1,rep(0,17))+(1-0.5476)*dpois(0:17,3.0750)),
pch=18,cex=1.5)
barplot(table(factor(earinf$Infections[(earinf$Swimmer=="Occas"
&earinf$Location=="NonBeach")],levels=0:17)),
ylim=c(0,30),xlab="Number of infections",ylab="Counts",space=0,main="Occas+NonBeach")
points((0:17)+0.5,69*(0.3628*c(1,rep(0,17))+(1-0.3628)*dpois(0:17,3.4743)),
pch=18,cex=1.5)
par(mfrow=c(1,1))

The fit is now improved, as the "0"s are now correctly estimated (the fitted values shown are the ZIP distributions
[Poisson distributions with an excess of zeros]).

17 Lab 7
17.1 Regression for count data
This is an exercise on the Negative Binomial regression.

17.1.1 Exercise 1 - AirBnB’s in Nanjing (China)

In this exercise we model the spatial distribution of AirBnB’s in Nanjing (China) using several possible predictors.
The city has been divided into "177" square zones (see the map in the original paper) and for each zone the data in
the file "nanjing.xlsx" records several variables.

Sun S, Zhang S, Wang X (2021) Characteristics and influencing factors of Airbnb spatial distribution in China’s
rapid urbanization process: A case study of Nanjing. PLoS ONE 16(3): e0248647

[https://doi.org/10.1371/journal.pone.0248647]

Here we consider only the following (selected) variables:

• Response variable - "Num_airbn" : the number of AirBnB’s (the number of red dots)

• Predictor - "Subway_Station": the distance from the nearest subway station


• Predictor - "Cultural_Attraction": the distance from the nearest cultural attraction
• Predictor - "Living_Facility": the number of living facilities

• Predictor - "Percentage_construction_area": the percentage of urban constructions

1. Import the data into a Software R data frame. Give some descriptive analysis of the variables.
What we do now is import the data and give a general description of the dataset (we use the "summary"
function for the quantitative variables, we plot the histogram or the boxplot [to check if there are outliers], and
so on):

library(readxl)
nanjing <- read_excel(". . ./nanjing.xlsx")
View(nanjing)

summary(nanjing$Num_airbn)
boxplot(nanjing$Num_airbn,main="Num_airbn")

summary(nanjing$Subway_Station)
boxplot(nanjing$Subway_Station,main="Subway_Station")

summary(nanjing$Living_Facility)
boxplot(nanjing$Living_Facility,main="Living_Facility")

summary(nanjing$Cultural_Attraction)
boxplot(nanjing$Cultural_Attraction,main="Cultural_Attraction")

summary(nanjing$Percentage_construction_area)
boxplot(nanjing$Percentage_construction_area,main="Percentage_construction_area")

2. Use a Negative Binomial "glm" to model the number of AirBnB’s as a function of the other four variables,
and define a reduced model if you think that one or more predictors can be removed.
We now fit the Negative Binomial GLM on the complete model [since it is a Negative
Binomial we use its specific function, which doesn't need any specification of the distribution] and we
compute the log-likelihood (we will use it later):

library(MASS)
mod=glm.nb(Num_airbn~Subway_Station+Living_Facility+Cultural_Attraction+
Percentage_construction_area,data=nanjing)
summary(mod)
llmod=logLik(mod)

From the output we see that there are 3 predictors with significant p-values, whilst the predictor "cultural_attraction"
is not: this means that we can define a reduced model removing this last predictor.

Call:
glm.nb(formula = Num_airbn ~ Subway_Station + Living_Facility +
Cultural_Attraction + Percentage_construction_area, data = nanjing,
init.theta = 0.7070913205, link = log)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.30795 -1.03389 -0.54729 0.05536 3.12145

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.526e+00 2.799e-01 12.596 < 2e-16 ***
Subway_Station -6.689e-04 8.232e-05 -8.126 4.44e-16 ***
Living_Facility 3.019e-03 8.122e-04 3.717 0.000201 ***
Cultural_Attraction -1.386e-04 9.058e-05 -1.529 0.126142
Percentage_construction_area 1.417e-02 3.260e-03 4.347 1.38e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.7071) family taken to be 1)
Null deviance: 425.37 on 176 degrees of freedom
Residual deviance: 208.30 on 172 degrees of freedom
AIC: 1562.5
Number of Fisher Scoring iterations: 1
Theta: 0.7071
Std. Err.: 0.0775
2 x log-likelihood: -1550.4550

The reduced model is indeed:


mod.r=glm.nb(Num_airbn~Subway_Station+Living_Facility+Percentage_construction_area,
data=nanjing)
summary(mod.r)
llmod.r=logLik(mod.r)

Call:
glm.nb(formula = Num_airbn ~ Subway_Station + Living_Facility +
Percentage_construction_area, data = nanjing, init.theta = 0.6962030633,
link = log)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.3144 -1.0097 -0.5880 0.0817 3.2812

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.255e+00 2.368e-01 13.748 < 2e-16 ***
Subway_Station -6.928e-04 8.147e-05 -8.504 < 2e-16 ***
Living_Facility 3.009e-03 8.091e-04 3.718 2e-04 ***
Percentage_construction_area 1.572e-02 3.281e-03 4.792 1.65e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.6962) family taken to be 1)
Null deviance: 419.50 on 176 degrees of freedom
Residual deviance: 207.99 on 173 degrees of freedom

AIC: 1562.8
Number of Fisher Scoring iterations: 1
Theta: 0.6962
Std. Err.: 0.0760
2 x log-likelihood: -1552.8060

As we can see from the output, in the reduced model all three predictors are significant.
3. For your reduced model, compute the log-likelihood of the model, the log-likelihood of the null model, and do
the likelihood ratio test.
We now perform the likelihood ratio test to check the significance of our reduced model: we need to compare
the reduced model with the null model (we need the log-likelihood). So we run the "glm.nb" function for the
null-model (which is the model with no predictors: "∼ 1"). We then compute the log-likelihood for this model
and we compare it with the first one (reduced model):

mod.null=glm.nb(Num_airbn~1,data=nanjing)
llmod.null=logLik(mod.null)

llmod.r
llmod.null
-2*(llmod.null-llmod.r) #is positive#
p=1-pchisq(as.numeric(-2*(llmod.null-llmod.r)),3)

The first log-likelihood (reduced model) is based on 5 degrees of freedom (3 slopes + the intercept + the
dispersion parameter "θ" of the Negative Binomial).
The second log-likelihood (null model) is based on only 2 degrees of freedom (the intercept and "θ"), since in
this case we don't have the predictors.
The likelihood ratio statistic "−2(ℓnull − ℓred)" is positive ("143.4091") and is compared to a chi-square with 3 degrees of freedom (5 − 2).
In the end the p-value we obtain is equal to "0".
4. Using the variables in your reduced model, compare with a Poisson model defining the appropriate likelihood
ratio test.
In this point we compare our reduced model with a Poisson distribution. We can foresee that the Poisson
distribution is not a good idea here by looking at the value of "θ" in our output. The value for the dispersion
parameter for the Negative Binomial is indeed "θ = 0.6962", which is a very small value "θ < 1". This is not
a good result since in our theory we studied that the Poisson distribution is the limit of the Negative Binomial
distribution as "θ → ∞": this means that the Poisson distribution is not a good approximation of our model.
We can conclude the same by performing the test: first of all we define the Poisson model (we use the standard
"glm" function):

mod.pois=glm(Num_airbn~Subway_Station+Living_Facility
+Percentage_construction_area,data=nanjing,family=poisson(link="log"))
llmod.pois=logLik(mod.pois)

llmod.r
llmod.pois
p=1-pchisq(as.numeric(-2*(llmod.pois-llmod.r)),1)

The p-value we obtain is "0" which confirms what we previously said about the Poisson distribution. So in our
example the Negative Binomial distribution is needed to model the overdispersion of the response variable.
Here the Negative Binomial distribution is the big model and the Poisson distribution is the reduced model:
since the p-value is small we reject the reduced model (Poisson) and we accept the large model (Negative
Binomial) with the two parameters of the Negative Binomial.

5. In the reduced model compute the predicted values and plot "predicted vs observed".
We can find the fitted values in the output of the "glm" function. We can plot the figure:

mod.r$fitted.values
nanjing$Num_airbn
plot(nanjing$Num_airbn,mod.r$fitted.values)
cor(nanjing$Num_airbn,mod.r$fitted.values)

From the graph we notice that there is some overdispersion, with several outliers to the right.
We also computed the correlation and we see that there is a high correlation between the observed and
the predicted values.

So the idea is to:

1. Define the model and check its significance with the likelihood ratio test

2. Check the significance of the predictors with the Wald test, and with the likelihood ratio test for categorical
predictors (as we have seen for the logistic regression and for count data)
3. Check the overdispersion by comparing (in this case) the Negative Binomial distribution and the Poisson
distribution

18 Cross validation and model selection
18.1 Introduction
Model selection is a statistical technique used to guide the choice of a reduced model in regression or classification problems where
the number of predictors is large.
For a small number of predictors in the complete model we can perform the Student-t test for the regression, or the
Wald test for the GLM, and then perform a model selection at the end. In general this is not possible when the
number of predictors is large.
So the idea is that we need to work in a sort of "tradeoff" of two opposite situations: we can select a very small
model (we have a very handy model but we lose information) or a very big model (the complete model has the
entire information but it’s not easy to handle).
When a lot of predictors are available, the great temptation is to use all of them in order to create a big model
which fits our data with great precision. But in practice a small model may be more appropriate for two reasons:

• A small model has a simple interpretation


• A big model is often inappropriate when we move from model fitting to forecast

Remember our early discussion on the figure below: in the first figure we have a lot of information but it's chaotic,
whilst the second one is clearer but has far less information.

We are now ready to approach this problem with a strong statistical background.

Single regression example

In this simple regression case we have one regressor "x" and one response variable "y". We now consider the
7 points in the figure below:

x=c(1,2,3,4,5,6,7)
y= c(-1.31,1.49,2.30,3.70,0.18,4.91,1.30)
plot(x,y,pch=20,cex=2)

From the mathematics we know that for "7" points we can define a polynomial with degree 6 in order to
obtain a perfect fit (a perfect interpolation) [since we have just one predictor we use the powers to define
multiple parameters]:

mod7=lm(y~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6))


summary(mod7)

coe=mod7$coefficients
x1=seq(0.5,7.5,0.01)
y1=coe[1]+coe[2]*x1+coe[3]*x1^2+coe[4]*x1^3+coe[5]*x1^4+coe[6]*x1^5+coe[7]*x1^6
plot(x,y,pch=20,cex=2)
lines(x1,y1,col="red",lwd=2)

This means that in this example the sum of the squared errors is zero and the "R²" is equal to 1:

$$\sum_i e_i^2 = 0 \qquad R^2 = 1$$

so we have the best possible model. However it’s not very practical to use such a polynomial to predict
values of the response variable: in this case a simple linear function is better (we consider a simpler model).
Of course the regression line seems to be a worse model since each data point presents some error but we
moved to a much simpler model.

mod1=lm(y~x)
plot(x,y,pch=20,cex=2)
lines(x1,y1,col="red",lwd=2)
abline(mod1,col="blue",lwd=2)

But if we add a few new points (the 3 new observations in green) it is no longer obvious that our red line is still
the best model: it seems that for these new observations the regression line is a better approximation.

xnew=c(1.5,3.5,6.5)
ynew=c(1.13,2.06,4.49)
plot(x,y,pch=20,cex=2)
lines(x1,y1,col="red",lwd=2)
abline(mod1,col="blue",lwd=2)
points(xnew,ynew,col="green",pch=17)

So for model building the red line is better as it has a perfect fit, whilst if we seek the best prediction
the blue line is better as it has a lower error.
Here we compute the test error for polynomials of different degrees:

mod1=lm(y~x) #Degree1#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew
MSE1=mean((ynew-yhattest)^2)

mod1=lm(y~x+I(x^2)) #Degree2#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2
MSE2=mean((ynew-yhattest)^2)

mod1=lm(y~x+I(x^2)+I(x^3)) #Degree3#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2+coe[4]*xnew^3
MSE3=mean((ynew-yhattest)^2)

mod1=lm(y~x + I(x^2) + I(x^3) + I(x^4)) #Degree4#
coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2+coe[4]*xnew^3+coe[5]*xnew^4
MSE4=mean((ynew-yhattest)^2)

mod1=lm(y~x + I(x^2) + I(x^3) + I(x^4)+I(x^5)) #Degree5#


coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2+coe[4]*xnew^3+coe[5]*xnew^4+coe[6]*xnew^5
MSE5=mean((ynew-yhattest)^2)

mod1=lm(y~x + I(x^2) + I(x^3) + I(x^4)+I(x^5)+I(x^6)) #Degree6#


coe=mod1$coefficients
yhattest=coe[1]+coe[2]*xnew+coe[3]*xnew^2+coe[4]*xnew^3+coe[5]*xnew^4+coe[6]*xnew^5+
  coe[7]*xnew^6
MSE6=mean((ynew-yhattest)^2)
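
The six fits above can be written more compactly with a loop (a sketch using raw polynomials, equivalent to the explicit powers written above):

MSEs_test=rep(NA,6)
for (d in 1:6)
{
mod1=lm(y~poly(x,d,raw=TRUE))
MSEs_test[d]=mean((ynew-predict(mod1,newdata=data.frame(x=xnew)))^2)
}
MSEs_test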

18.2 Overfitting
This is a general problem in the model selection framework and it is named overfitting: if we
consider a large number of predictors and parameters we of course have a good performance on the data used to
compute the parameters, but the same model becomes worse at forecasting new observations. The best way
to solve this problem is to perform the model-building and the model-testing on two different sets of data. This
means that we divide our dataset into two subsets:

• The first subset is used to perform model-building, so to compute the parameter estimation
• The second subset is used to perform model-testing, so to check the goodness of fit using the parameters
computed on the first subset

[We already saw one example of cross-validation in the lab of the Logistic regression]

Overfitting

Overfitting means that the fit is very good (and actually exact in our previous example) but the model
has bad performance in terms of prediction.
How to overcome this problem?

Two classes of methods:


• Use separate data for parameter estimation and for goodness of fit evaluation (validation-set ap-
proach and cross-validation)
• use a penalty function to favor small models ("classical" model selection)

18.3 Bias-variance trade-off


This name comes from the fact that we can decompose the error of each observation (the difference between
the observed and predicted value) into two parts:

• One can be considered as bias


• The other one considered as variance

In the validation-set approach the basic idea is to use two different data sets to estimate the parameters of
the model and to measure the fit. We divide the data into:

• A training set, used for parameter estimation: using this set we compute the estimates "β̂0 , β̂1 , . . . , β̂p "

• A test set, used for model fitting evaluation: using this set we compute the mean square error, i.e. the
mean of:

$$\left(Y - (\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p)\right)^2$$

We separate the two sets because, in order to avoid overfitting, we compute the parameter estimates on the training set
and then we use those estimates on the test set. The important point is that when we compute the predictions
"β̂0 + β̂1 x1 + · · · + β̂p xp" we operate on the test set, but we use the parameter values coming from the training set.

Square error on the test set

In the small introductory example, compute the mean square error (the mean of the squared residuals)
on the test set (green data points) for all possible polynomial models (the actual data are in the R script).
The mean square error on the test set captures both components, bias and variance: the rule here
is to consider as the "best model" the model with the lowest mean square error on the test set.

• For a low number of parameters "p", the model is simple, and so it cannot capture the full complexity of the
data: this is called bias.
• For a great number of parameters "p", the model is complex, and it tends to "overfit" to spurious properties
of the data: this is called variance.

Bias and Variance


The terms "bias" and "variance" here have a different meaning with respect to the usual way they are
used in Statistic. The notion of bias-variance trade-off comes from more advanced techniques of model
selection.

Best model
The rule is that the best model is the model with lowest MSE on the test set.

In most cases the bias-variance trade-off is thought of by drawing a picture like this (notice that the "error" in
prediction is the sum of the two other components: by minimizing this "error" we obtain a compromise between
the low and high complexity):

• The training error is the average error on the data points used to estimate the parameter (and it is what
we have computed in all our models until now).
• The test error is the average error in estimating the response variable of the test set using the model
parameters computed with the training set.

The two errors are often quite different (look at our 7 points example), and in particular the training error can
dramatically underestimate the test error.

Auto
We use the dataset "Auto" in the package "ISLR". The aim is to predict the fuel consumption "mpg"
(miles per gallon) as a function of the engine power.
• There are "392" observations and we split them into two parts (50/50 here) and we sample "196" rows
from the entire set (we use the sampling method since in most datasets the data is sorted in some way:
this method prevents us from considering observations non randomly):

library(ISLR)
View(Auto)
train=sample(1:392,196,replace=F)

• We use the training set to build the model [here we just used one predictor to have a simple model]:

lm.fit=lm(mpg~horsepower,data=Auto,subset=train)

• We use the "predict" function to estimate the response and then we compute the MSE (the Mean
Square Error on the test set is the mean squared difference between the observed response variable and the
predicted response variable [computed in the previous step]) on the "196" observations in the validation set:

mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)

With my random selection, the result is "24.33445" but of course you’ll have different values due to the
random split.
Since we just considered one predictor we can add some complexity to the model by adding powers: we now
consider a polynomial model of degree-2:

lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)

mean((Auto$mpg-predict(lm.fit2,Auto))[-train]^2)

Running this we notice a slight improvement, since we obtain a lower MSE of "21.97939" (a better model).
So we have an improvement, but what happens if we add further terms
to the polynomial model? Here for example we plot the MSE for different degrees (1:10):

MSEs=rep(NA,10)
for (i in 1:10)
{
lm.fit=lm(mpg~poly(horsepower ,i) ,data=Auto ,subset =train )
MSEs[i]=mean((Auto$mpg -predict(lm.fit ,Auto))[-train ]^2)
}
plot(1:10,MSEs,type="b",pch=20)

We can see the results in the following plot:

From the plot we see that the fit improves up to degree 2: beyond the second degree we
don't have any relevant improvement. This means that it's not useful to consider the higher degrees, since
they only complicate the model without improving it.
We can also try this procedure for different selections of the training set.

18.4 Drawbacks of the validation-set approach


This method presents at least two core problems:

• The validation estimate of the test error can be highly variable, depending on precisely which
observations are included in the training set and which observations are included in the validation set (it
depends on the choice of the sets)

• In the validation approach, only a subset of the observations (those that are included in the training
set rather than in the validation set) is used to fit the model.

This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire
data set. A possible solution is the K-fold cross validation process.

18.5 K-fold cross validation


The K-fold cross validation method is a procedure used to overcome the problems which arise when separating the
dataset into a training and a test set. This approach is widely used to estimate the test error. The estimates can be
used to select the best model, and to give an idea of the test error of the final chosen model. The idea is to randomly divide
the data into "K" equal-sized parts. We leave out part "k", fit the model to the other "K − 1" parts (combined),
and then obtain predictions for the left-out k-th part. This is done in turn for each part "k = 1, 2, . . . , K" and then
the results are combined.

We divide our dataset into "K" equally sized parts (approximately) and then use "K − 1" blocks for the training
phase and the last one for the validation part [we repeat this process "K" times in order to use the whole dataset
in both phases].

Let the "K" parts be "C1 , . . . , Ck " where "Ck " denotes the indices of the observations in part "k". There are "nk "
observations in part "k". If "n" is a multiple of "K", then "nk = n/K ".

K-fold computation

Compute the quantity:

$$CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n}\, MSE_k$$

where:

$$MSE_k = \frac{1}{n_k} \sum_{i \in C_k} (y_i - \hat{y}_i)^2$$

and "ŷi " is the fit for observation "i" computed from the data with part "k" removed.

The idea is that the cross validation is simply the (weighted) mean of the MSE in each of the configurations: indeed
we compute the MSE of each model in each possible configuration (in our previous example we had 5 configurations).
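
A minimal "by hand" implementation of this formula on the "Auto" example (a sketch with K = 5; the fold labels are assigned at random):

library(ISLR)
K=5
folds=sample(rep(1:K,length.out=nrow(Auto)))
MSEk=rep(NA,K)
for (k in 1:K)
{
fit=lm(mpg~poly(horsepower,2),data=Auto[folds!=k,])
pred=predict(fit,newdata=Auto[folds==k,])
MSEk[k]=mean((Auto$mpg[folds==k]-pred)^2)
}
sum(table(folds)/nrow(Auto)*MSEk)   #CV_(K): weighted mean of the fold MSEs#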

18.6 Special case: LOOCV


If we set "K = n" we obtain the leave-one-out cross validation "LOOCV" (it means that "n−1" observations are
used for the training phase and obly one observation is used for the validation phase) [it is very hard to perform,
especially for large datasets]. For quantitative response variables the formula of "CV (K)" simplifies to:
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_{i,i}}\right)^2$$

where "ŷi " is the predicted value from the original least square problem (with the full data set), and "hi,i " is the
i-th diagonal element of the hat matrix "H = X(X t X)−1 X t ".
LOOCV is simple to apply but typically the estimates from each fold are highly correlated and hence their average
can have a high variance. A good trade-off is to choose "K = 5" or "K = 10".
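
For a linear model the shortcut formula above can be computed directly from the hat values (a sketch on the "Auto" example):

fit=lm(mpg~poly(horsepower,2),data=Auto)
mean(((Auto$mpg-fitted(fit))/(1-hatvalues(fit)))^2)   #LOOCV error without refitting the model n times#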

18.6.1 Recap

In standard regression (least squares algorithm) the idea was to use the whole set of information (the whole dataset)
to perform model estimation: this can lead to the overfitting problem (as we have previously seen, complex models
give a perfect fit, but their forecasting performance on new observations is poor). So the idea is: instead of using
the whole dataset for both the training and the testing phase, we divide it into several (possibly just two) subsets
and perform these two operations separately.

18.7 Cross-validation in Software R
We use the function "train" in the package "caret":

library(caret)
train.control=trainControl(method = "cv", number = 10)
model=train(mpg~poly(horsepower ,2), data = Auto, method = "lm", trControl = train.control)
print(model)

The output is:

Linear Regression
392 samples
1 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 353, 353, 354, 353, 352, 353, ...
Resampling results:
RMSE Rsquared MAE
4.332105 0.6973377 3.257559
Tuning parameter ’intercept’ was held constant at a value of TRUE

For the LOOCV method we have:

train.control=trainControl(method = "loocv")
model=train(mpg~poly(horsepower ,2), data = Auto, method = "lm",
trControl = train.control)
print(model)

and the output is:

Linear Regression

392 samples
1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 391, 391, 391, 391, 391, 391, ...
Resampling results:

RMSE      Rsquared  MAE
3.272041  NaN       3.272041

Tuning parameter ’intercept’ was held constant at a value of TRUE

18.8 Cross-validation for classification
Cross-validation can be applied in the same way also for classification problems (i.e. in logistic regression). The
only difference is that the fit is evaluated through the misclassification proportion.

Classification problem

Compute:

$$CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n}\, Err_k$$

where:

$$Err_k = \frac{1}{n_k} \sum_{i \in C_k} \mathbb{1}(y_i \neq \hat{y}_i)$$

18.9 Cross-validation for classification in Software R


We use again the function "train" in the package "caret", but we need the additional package "e1071":

library(e1071)
train_control = trainControl(method = "cv", number = 10)
model = train(y~x,data=d,trControl = train_control,method = "glm",family=binomial())
print(model)

A classic example of an output is:

Generalized Linear Model

100 samples
1 predictor
2 classes: ’0’, ’1’

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 89, 91, 91, 90, 90, 89, ...
Resampling results:
Accuracy Kappa
0.7226263 0.4300388

18.10 Cp , AIC, BIC


The cross-validation technique is a method that estimates the test error by holding out a subset of the training
observations from the fitting process, and then applying the model prediction to those held out observations. There
are other ways to adjust the estimated error rate on the training set, such as the "Cp ", the "AIC" and the "BIC".
We will see the "AIC" in detail in the next lectures. These criteria are essentially based on mathematical corrections of the
error.

19 Model selection
19.1 Introduction
The problem

In model selection the idea is to find the smallest set of predictors which provide an adequate description of
the data.

When dozens, or hundreds, of predictors are available, we need to choose among the candidate predictors the
best subset to define a suitably small statistical model.

Computational issue

Model selection can be challenging: if we have "p" candidate predictors, there are potentially "2^p" models to
consider (i.e. each term being in or out of a given model) [so an exhaustive search over all models is possible only
for small "p"].

19.2 Backward and forward stepwise


This approach is called "stepwise": it means that we add or remove one predictor from the model at a time
(at each iteration of our algorithm). The stepwise approach can be performed in two "directions":

Backward stepwise algorithm

1. Start with the largest possible model (the complete model, i.e. the one containing all the
candidate predictors)

2. Choose a measure to quantify what makes a good model (a measure for the goodness of fit)
3. Remove the "worst predictor"
4. Continue to remove terms one at a time while the removal still provides a "better model" (a better fit)

5. When the removal of another predictor would give a worse model, we stop the algorithm

So for example if we have a complete model of this kind:

x1 , x2 , . . . xp

at the first step we consider all the sub-models with one predictor removed, so for example we have:

Model 1 - We remove "x1":  Y ∼ x2 + x3 + x4 + · · · + xp
Model 2 - We remove "x2":  Y ∼ x1 + x3 + x4 + · · · + xp
...
Model p - We remove "xp":  Y ∼ x1 + x2 + x3 + · · · + xp−1

and so we obtain "p" models: from this set of models we have to choose the best sub-model. If for example we
pick "Model 2" (since it is the best), this model becomes our current model, from which we again remove another
predictor (we obtain other sub-models) [we iterate the procedure until the goodness of fit stops improving].

The forward selection algorithm proceeds in the opposite direction:

Forward stepwise algorithm

1. Start with the smallest possible model (the empty model: only the intercept is in).
2. Choose a measure to quantify what makes a good model.
3. Add the "best predictor".

4. Continue to add terms one at a time while the addition still provides a "better model"
5. When the addition of the next predictor would give a worse model, then stop.

It is the same procedure we saw for the backward algorithm but in this case we proceed in the opposite direction.

19.3 How to measure a "good model"


The measure we need to define a "good model" must take into account two opposite requirements (the bias-variance
tradeoff):

• To have a good fit


• To have a reasonably small model (involving few predictors)

From linear regression we know a goodness of fit measure, the "R²":

$$R^2 = 1 - \frac{RSS_C}{VAR(Y)}$$

but "R²" is not a good choice because it chooses systematically the largest model (we need something that works
also for the GLM framework). Other possible choices are:

• The adjusted "R2 " based on the correlation:


RSSC /dfC
2
Radj =1− RSSR/dfR

where "dfC = n − p − 1" are the degrees of freedom of the complete model and "dfR = n − 1" are the degrees
of freedom of the empty model. We search for a large "Radj
2
".
• The cross-validation criterion. With LOOCV:

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_{i,i}}\right)^2$$

We search for a small "CV_(n)".


• The Akaike information criterion (AIC):

$$AIC = \underbrace{-2\ell(y, \hat{y})}_{\text{goodness of fit}} + \underbrace{2(p + 1)}_{\text{model complexity}}$$

For linear (Gaussian) regression this simplifies to:

$$AIC = n \log(\hat{\sigma}^2) + 2(p + 1)$$

Note that the term "−2ℓ(y, ŷ)" is in favor of large models, while "2(p + 1)" is in favor of small models. The AIC is a
correction of the deviance: a compromise between the first part (which measures the goodness
of fit) and the second one, which takes into account the model complexity (we get a curve similar to the one
we saw for the bias-variance tradeoff). So when we look for a "good model" from the likelihood point of view
we want a small value of the AIC.

• The Bayesian information criterion (BIC). It is very similar to the AIC but it also takes into account the
dimension of the dataset (it is a penalty factor):

BIC = −2ℓ(y, ŷ) + (p + 1) log(n)

For linear (Gaussian) regression this simplifies to:

BIC = n log(σ̂ 2 ) + (p + 1) log(n)

BIC tends to favor smaller models than AIC, since the penalty for large "p" is heavier. Of course we again
look for a small value of BIC.
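
Both criteria can be used with the "step" function shown in the next example: the default penalty "k = 2" gives the AIC, while "k = log(n)" gives the BIC (a sketch; "mod.c" denotes a fitted complete model, as defined below):

step(mod.c,direction="backward")                      #AIC-based selection (k = 2)#
step(mod.c,direction="backward",k=log(nobs(mod.c)))   #BIC-based selection#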

Crime rate dataset (model selection problem in multiple regression)

The dataset contains the crime rate "RATE" in 47 US States in 1960, together with 13 possible predictors
(this means that we have "2^13" possible sub-models):
• "Age": the number of males of age 14-24 per 1000 population
• "S": indicator variable for Southern states (0 = No, 1 = Yes)
• "Ed": Mean number of years of schooling (x10) for persons of age 25 or older

• "Ex0 ": 1960 per capita expenditure on police by state and local government
• "Ex1 ": 1959 per capita expenditure on police by state and local government
• "LF ": Labor force participation rate per 1000 civilian urban males age 14-24

• "M ": The number of males per 1000 females


• "N ": State population size in hundred thousands
• "NW ": The number of non-whites per 1000 population
• "U1 ": Unemployment rate of urban males per 1000 of age 14-24

• "U2 ": Unemployment rate of urban males per 1000 of age 35-39
• "W ": Median value of transferable goods and assets or family income in tens of dollars
• "Pov": The number of families per 1000 earning below 1⁄2 the median income

The data are in the file "uscrime.dat", where the response variable is "RATE" (the crime rate): we
consider a Gaussian response, so we use the standard linear regression. The first rows are here:

RATE Age S Ed Ex0 Ex1 LF M N NW U1 U2 W Pov


1 79.1 151 1 91 58 56 510 950 33 301 108 41 394 261
2 163.5 143 0 113 103 95 583 1012 13 102 96 36 557 194
3 57.8 142 1 89 45 44 533 969 18 219 94 33 318 250

For the model selection we use the "step" function. For the backward selection we define the complete model
"mod.c" and the direction:
library(readr)
uscrime <- read.delim(".../uscrime.dat")
uscrime=as.data.frame(uscrime)
View(uscrime)
mod.c=lm(RATE ~ . ,data=uscrime) #This is the complete model#
summary(mod.c)
AIC(mod.c)

So with the "backward" option we start from the complete model and then we compute all the possible
sub-models and we reduce the model (we iterate this until we obtain a worse result).

mod.sel=step(mod.c,direction="backward")

The output we obtain is:

Start: AIC=301.66
RATE~Age+S+Ed+Ex0+Ex1+LF+M+N+NW+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- NW 1 6.1 15885 299.68
- LF 1 34.4 15913 299.76
- N 1 48.9 15928 299.81
- S 1 149.4 16028 300.10
- Ex1 1 162.3 16041 300.14
- M 1 296.5 16175 300.53
<none> 15879 301.66
- W 1 810.6 16689 302.00
- U1 1 911.5 16790 302.29
- Ex0 1 1109.8 16989 302.84
- U2 1 2108.8 17988 305.52
- Age 1 2911.6 18790 307.57
- Ed 1 3700.5 19579 309.51
- Pov 1 5474.2 21353 313.58

From the output we read the column of "AIC" values: the row called "<none>" represents the current (complete)
model, whilst each of the other rows represents the sub-model obtained by removing that single predictor. The rows
are sorted in ascending order, so we can easily spot the model with the smallest "AIC" value (here, the model
without "NW") [this means that the first predictor we remove from the complete model is "NW"]. In the next step we
repeat the same procedure on the new sub-model:

Step: AIC=299.68
RATE~Age+S+Ed+Ex0+Ex1+LF+M+N+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- LF 1 28.7 15913 297.76
- N 1 48.6 15933 297.82
- Ex1 1 156.3 16041 298.14
- S 1 158.0 16043 298.14
- M 1 294.1 16179 298.54
<none> 15885 299.68
- W 1 820.2 16705 300.05
- U1 1 913.1 16798 300.31
- Ex0 1 1104.3 16989 300.84
- U2 1 2107.1 17992 303.53
- Age 1 3365.8 19251 306.71
- Ed 1 3757.1 19642 307.66
- Pov 1 5503.6 21388 311.66

As before we remove one more predictor, and in this case the best choice is to remove "LF" [and we continue in
this way].

Step: AIC=297.76
RATE~Age+S+Ed+Ex0+Ex1+M+N+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- N 1 62.2 15976 295.95
- S 1 129.4 16043 296.14
- Ex1 1 134.8 16048 296.16
- M 1 276.8 16190 296.57
<none> 15913 297.76
- W 1 801.9 16715 298.07
- U1 1 941.8 16855 298.47
- Ex0 1 1075.9 16989 298.84
- U2 1 2088.5 18002 301.56
- Age 1 3407.9 19321 304.88
- Ed 1 3895.3 19809 306.06
- Pov 1 5621.3 21535 309.98

Step: AIC=295.95
RATE~Age+S+Ed+Ex0+Ex1+M+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- S 1 104.4 16080 294.25
- Ex1 1 123.3 16099 294.31
- M 1 533.8 16509 295.49
<none> 15976 295.95
- W 1 748.7 16724 296.10
- U1 1 997.7 16973 296.80
- Ex0 1 1021.3 16997 296.86
- U2 1 2082.3 18058 299.71
- Age 1 3425.9 19402 303.08
- Ed 1 3887.6 19863 304.19
- Pov 1 5896.9 21873 308.71

Step: AIC=294.25
RATE~Age+Ed+Ex0+Ex1+M+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- Ex1 1 171.5 16252 292.75
- M 1 563.4 16643 293.87
<none> 16080 294.25
- W 1 734.7 16815 294.35
- U1 1 906.0 16986 294.83
- Ex0 1 1162.0 17242 295.53
- U2 1 1978.0 18058 297.71
- Age 1 3354.5 19434 301.16
- Ed 1 4139.1 20219 303.02
- Pov 1 6094.8 22175 307.36

Step: AIC=292.75
RATE~Age+Ed+Ex0+M+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- M 1 691.0 16943 292.71
<none> 16252 292.75
- W 1 759.0 17011 292.90
- U1 1 921.8 17173 293.35
- U2 1 2018.1 18270 296.25
- Age 1 3323.1 19575 299.50
- Ed 1 4005.1 20257 301.11
- Pov 1 6402.7 22654 306.36
- Ex0 1 11818.8 28070 316.44

Step: AIC=292.71
RATE~Age+Ed+Ex0+U1+U2+W+Pov
Df Sum of Sq RSS AIC
- U1 1 408.6 17351 291.83
<none> 16943 292.71
- W 1 1016.9 17959 293.45
- U2 1 1548.6 18491 294.82
- Age 1 4511.6 21454 301.81
- Ed 1 6430.6 23373 305.83
- Pov 1 8147.7 25090 309.16
- Ex0 1 12019.6 28962 315.91

Step: AIC=291.83
RATE~Age+Ed+Ex0+U2+W+Pov #FINAL MODEL#
Df Sum of Sq RSS AIC
<none> 17351 291.83
- W 1 1252.6 18604 293.11
- U2 1 1628.7 18980 294.05
- Age 1 4461.0 21812 300.58
- Ed 1 6214.7 23566 304.22
- Pov 1 8932.3 26283 309.35
- Ex0 1 15596.5 32948 319.97

In the last output the current model (the row "<none>", i.e. the model from which we do not remove any additional
predictor) is the best one. This means that, according to the AIC criterion, every sub-model is worse than the
current model, so we stop here: this is our final model.

Comparing the different outputs you may notice that for the same model you can obtain different values of "AIC":
this is because the log-likelihood is defined only up to additive constants, and different functions use different
conventions. This is not a problem as long as you consistently use the same function (and hence the same definition
of the likelihood): here we are not interested in the single value of "AIC", but only in the ranking of the models.
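
For instance (a small sketch, assuming the objects "mod.c" and "mod.sel" created above are still in the workspace),
the value printed by "step" is the one returned by "extractAIC", which drops the additive constants of the Gaussian
log-likelihood, while the "AIC" function keeps them; the two differ by a constant which is the same for every
sub-model, so the ranking of the models is unaffected:

extractAIC(mod.sel)[2]                  #the value printed by step, about 291.83#
AIC(mod.sel)                            #a larger value: full Gaussian log-likelihood#
AIC(mod.c) - extractAIC(mod.c)[2]       #constant shift...#
AIC(mod.sel) - extractAIC(mod.sel)[2]   #...identical for every model with the same n#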

Example 1

Use the R "help" to see how to:

• Use the "BIC" criterion


mod.sel.bic=step(mod.c,k=log(47)) #n=47 observations, so k=log(n) gives the BIC penalty#

• Use the forward selection algorithm or the mixed one (both addition and removal are checked at
each step)

mod0=lm(RATE~1,data=uscrime)
mod.sel=step(mod0,scope=formula(mod.c),direction = "forward")

In the forward selection we start from the null model (the model with only the intercept) and we need to
define the "scope", which is the formula of the largest possible model.
Compare the outputs in our example: the forward and backward directions can lead to different final models!

19.3.1 Mixed selection

Notice that there’s also a third possible procedure which is a mixture of the previous two: in this case at each
iteration we try to remove or add one predictor.

mod0=lm(RATE~1,data=uscrime)
mod.sel=step(mod0,scope=formula(mod.c),direction = "both")

The main difference to underline is that here, once a predictor has been removed from the model, it is not "lost"
forever, since it can be re-introduced at a later step, whilst in the pure backward method a removed predictor
never re-enters the model.

19.4 Forward vs Backward


Starting with the full model (backward selection) is generally more reliable, but it requires the complete model to
be estimable. Forward selection methods are therefore important in the context of big data, where datasets with
more variables than observations are common.
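
The following toy sketch (simulated data; all the names are invented for the illustration) shows why: with "p > n"
the complete model cannot be estimated, so backward selection has no valid starting point, while forward selection
can still start from the intercept-only model.

set.seed(1)
n <- 30; p <- 40                       #more predictors than observations#
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
dat <- data.frame(y = 2 * X[, 1] + rnorm(n), X)

full <- lm(y ~ ., data = dat)
sum(is.na(coef(full)))                 #11 coefficients cannot be estimated:#
                                       #backward selection cannot start from here#
mod0 <- lm(y ~ 1, data = dat)          #forward selection would instead start from mod0, e.g.#
                                       #step(mod0, scope = formula(full), direction = "forward")#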

20 Lab 8
20.1 Model selection
Since stepwise model selection is quite simple, we work with two different datasets: one example under the
standard linear model with a Gaussian response variable and the other one based on logistic regression.

20.1.1 Exercise 1 - Credit card balance

For this exercise we use the dataset "Credit" in the "ISLR" package. Load the package "ISLR" and the dataset
is automatically available. The aim here is to predict the "Balance" by means of several quantitative predictors:
"Age", "Cards" (number of credit cards), "Education" (years of education), "Income" (in thousands of dollars),
"Limit" (credit limit), "Rating" (credit rating). We also have four categorical predictors: "Gender", "Student"
(student status), "Married" (marital status), and "Ethnicity" (Caucasian, African American or Asian).

1. Give short descriptive statistics for all variables.


As we did multiple times before, the easiest way to give a short description of the variables is to draw a
"boxplot" (together with "summary") for each quantitative variable and to compute a frequency "table" for each
categorical variable. A more compact version of the same commands is sketched after the code below.

library(ISLR)
View(Credit)

##Description for Quantitative Variables##


summary(Credit$Income)
boxplot(Credit$Income,main="Income")

summary(Credit$Limit)
boxplot(Credit$Limit,main="Limit")

summary(Credit$Rating)
boxplot(Credit$Rating,main="Rating")

summary(Credit$Cards)
boxplot(Credit$Cards,main="Cards")

summary(Credit$Age)
boxplot(Credit$Age,main="Age")

summary(Credit$Education)
boxplot(Credit$Education,main="Education")

##Description for Categorical Variables##


table(Credit$Gender)

table(Credit$Student)

table(Credit$Married)

table(Credit$Ethnicity)
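
The same descriptive statistics can be obtained more compactly by looping over the variables (just a sketch; the
column names are those of the "Credit" data frame in the "ISLR" package):

quant <- c("Income", "Limit", "Rating", "Cards", "Age", "Education")
summary(Credit[, quant])               #summaries of the quantitative variables#
par(mfrow = c(2, 3))
for (v in quant) boxplot(Credit[[v]], main = v)
par(mfrow = c(1, 1))
lapply(Credit[, c("Gender", "Student", "Married", "Ethnicity")], table)  #frequency tables#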

Here we have the boxplots of the quantitative variables:

2. Fit a linear regression model with all the predictors. Note that the first column is simply a row-label, so
it must be excluded from the analysis.
The first step is to fit the complete model: since we have a linear (Gaussian) model we use the "lm" function,
listing all the predictors explicitly so that the row-label column is left out.

mod=lm(Balance~Income+Limit+Rating+Age+Cards+Education+Gender+Student+Married+Ethnicity,
data=Credit)
summary(mod)

And this is the "summary" output:

Call:
lm(formula = Balance ~ Income + Limit + Rating + Age + Cards +
Education + Gender + Student + Married + Ethnicity,
data = Credit)

. . .

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -479.20787 35.77394 -13.395 < 2e-16 ***
Income -7.80310 0.23423 -33.314 < 2e-16 ***
Limit 0.19091 0.03278 5.824 1.21e-08 ***
Rating 1.13653 0.49089 2.315 0.0211 *
Age -0.61391 0.29399 -2.088 0.0374 *
Cards 17.72448 4.34103 4.083 5.40e-05 ***
Education -1.09886 1.59795 -0.688 0.4921
GenderFemale -10.65325 9.91400 -1.075 0.2832
StudentYes 425.74736 16.72258 25.459 < 2e-16 ***
MarriedYes -8.53390 10.36287 -0.824 0.4107
EthnicityAsian 16.80418 14.11906 1.190 0.2347
EthnicityCaucasian 10.10703 12.20992 0.828 0.4083
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.79 on 388 degrees of freedom
Multiple R-squared: 0.9551, Adjusted R-squared: 0.9538
F-statistic: 750.3 on 11 and 388 DF, p-value: < 2.2e-16

From the coefficient table we obtain one coefficient for each quantitative predictor ("Income", "Limit", "Rating",
"Age", "Cards" and "Education"), one dummy variable for each binary categorical predictor ("Gender", "Student"
and "Married") and two dummy variables for the categorical predictor "Ethnicity", which has 3 levels. Here we are
not really interested in the p-values and in the significance of the individual predictors, since we are not going
to perform a "by-hand" selection: we are going to use the stepwise algorithm.

3. Apply the stepwise algorithm to find reasonable reduced models. Use both the forward and the backward
directions, and try with both the "AIC" and the "BIC" criteria. Compare the results.
To do this we use the "step" function, with the complete model as input, and we let it iterate. First of all we
run the backward stepwise algorithm with the default "AIC" penalty ("k = 2"); remember that when no "scope" is
given, "backward" is the default direction of the function:

mod.bA=step(mod)
summary(mod.bA)

After several iterations the final model we obtain is the following where we have 6 predictors:

Call:
lm(formula = Balance ~ Income + Limit + Rating + Age + Cards +
Student, data = Credit)

Residuals:
Min 1Q Median 3Q Max
-170.00 -77.85 -11.84 56.87 313.52

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -493.73419 24.82476 -19.889 < 2e-16 ***
Income -7.79508 0.23342 -33.395 < 2e-16 ***
Limit 0.19369 0.03238 5.981 4.98e-09 ***
Rating 1.09119 0.48480 2.251 0.0250 *
Age -0.62406 0.29182 -2.139 0.0331 *
Cards 18.21190 4.31865 4.217 3.08e-05 ***
StudentYes 425.60994 16.50956 25.780 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 98.61 on 393 degrees of freedom


Multiple R-squared: 0.9547, Adjusted R-squared: 0.954
F-statistic: 1380 on 6 and 393 DF, p-value: < 2.2e-16

We can do the same (iterating in the same direction) using the "BIC" criterion. We store the number of rows of
our data frame in "n" and then we set the penalty with "k = log(n)" (the default setting is "k = 2", which
corresponds to the Akaike information criterion; note that the output of "step" labels the criterion "AIC" even
when "k = log(n)"):

n=dim(Credit)[1]
mod.bB=step(mod,k=log(n))
summary(mod.bB)

And so the output we obtain is the following:

Call:
lm(formula = Balance ~ Income + Limit + Cards + Student, data = Credit)

...

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.997e+02 1.589e+01 -31.449 < 2e-16 ***
Income -7.839e+00 2.321e-01 -33.780 < 2e-16 ***
Limit 2.666e-01 3.542e-03 75.271 < 2e-16 ***
Cards 2.318e+01 3.639e+00 6.368 5.32e-10 ***
StudentYes 4.296e+02 1.661e+01 25.862 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 99.56 on 395 degrees of freedom


Multiple R-squared: 0.9536, Adjusted R-squared: 0.9531
F-statistic: 2029 on 4 and 395 DF, p-value: < 2.2e-16

What we obtain is a smaller model than in the previous output: the BIC has a heavier penalty for large models,
so it tends to select smaller models.
If instead we want to perform the forward stepwise method, we need to define the starting point (the null model,
i.e. the model which only contains the intercept) and the scope (the largest possible formula, i.e. the complete
model). First of all we run the algorithm using the "AIC":

mod.null=lm(Balance~1,data=Credit)
mod.fA=step(mod.null,formula(mod),direction="forward")
summary(mod.fA)

The final model we obtain is the following:

Step: AIC=3679.89
Balance ~ Rating + Income + Student + Limit + Cards + Age

Df Sum of Sq RSS AIC


<none> 3821620 3679.9
+ Gender 1 10860.9 3810759 3680.7
+ Married 1 5450.6 3816169 3681.3
+ Education 1 5241.7 3816378 3681.3
+ Ethnicity 2 11517.3 3810102 3682.7

So as we can see from the "summary" in our final model we have "6" predictors:

Call:
lm(formula = Balance ~ Rating + Income + Student + Limit + Cards +
Age, data = Credit)

. . .

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -493.73419 24.82476 -19.889 < 2e-16 ***
Rating 1.09119 0.48480 2.251 0.0250 *
Income -7.79508 0.23342 -33.395 < 2e-16 ***

StudentYes 425.60994 16.50956 25.780 < 2e-16 ***
Limit 0.19369 0.03238 5.981 4.98e-09 ***
Cards 18.21190 4.31865 4.217 3.08e-05 ***
Age -0.62406 0.29182 -2.139 0.0331 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 98.61 on 393 degrees of freedom


Multiple R-squared: 0.9547, Adjusted R-squared: 0.954
F-statistic: 1380 on 6 and 393 DF, p-value: < 2.2e-16

We can of course perform the same by using also the "BIC" criterion:

mod.null=lm(Balance~1,data=Credit)
mod.fB=step(mod.null,formula(mod),direction="forward",k=log(n))
summary(mod.fB)

We obtain:
Step: AIC=3706.46
Balance ~ Rating + Income + Student + Limit + Cards

Df Sum of Sq RSS AIC


<none> 3866091 3706.5
+ Age 1 44472 3821620 3707.8
+ Gender 1 11350 3854741 3711.3
+ Education 1 5672 3860419 3711.9
+ Married 1 3121 3862970 3712.1
+ Ethnicity 2 14756 3851335 3716.9

and the "summary" we see that in this case we obtained a final model with "5" predictors:

Call:
lm(formula = Balance ~ Rating + Income + Student + Limit + Cards,
data = Credit)

. . .

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -526.15552 19.74661 -26.645 < 2e-16 ***
Rating 1.08790 0.48700 2.234 0.026 *
Income -7.87492 0.23145 -34.024 < 2e-16 ***
StudentYes 426.85015 16.57403 25.754 < 2e-16 ***
Limit 0.19441 0.03253 5.977 5.10e-09 ***
Cards 17.85173 4.33489 4.118 4.66e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 99.06 on 394 degrees of freedom


Multiple R-squared: 0.9542, Adjusted R-squared: 0.9536
F-statistic: 1640 on 5 and 394 DF, p-value: < 2.2e-16

So, depending on the criterion (the penalty applied) and on the direction of the algorithm, we obtain different
solutions (with a different number of predictors) for our dataset.
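
As a compact comparison (a sketch, assuming the four fitted objects "mod.bA", "mod.bB", "mod.fA" and "mod.fB"
created above are still in the workspace) we can put the selected models side by side. Note that the "AIC" and
"BIC" functions use the full Gaussian log-likelihood, so the values differ from those printed by "step", but the
ranking is what matters:

models <- list(backward_AIC = mod.bA, backward_BIC = mod.bB,
               forward_AIC  = mod.fA, forward_BIC  = mod.fB)
data.frame(n_coef = sapply(models, function(m) length(coef(m))),  #including intercept and dummies#
           AIC    = sapply(models, AIC),
           BIC    = sapply(models, BIC))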

20.1.2 Exercise 2 - Cancer remission

In this exercise we use the data in the file "cancer_remission.txt" available in the Aulaweb folder. The data
come from an experiment to predict the "cancer_remission" (response variable with values "1" for remission and
"0" otherwise) on the basis of six possible predictors (the other columns in the data file).

1. Give short descriptive statistics for all variables.


The predictors in this dataset are all quantitative (so we will not need dummy variables). On the contrary, our
response variable is binary: this means that we have to use logistic regression to define the complete model
before starting the stepwise procedure. First of all we import the dataset and we describe the variables:

cancer_remission <- read.csv(".../cancer_remission.txt", sep="")


View(cancer_remission)

table(cancer_remission$remiss)

summary(cancer_remission$cell)
boxplot(cancer_remission$cell)

summary(cancer_remission$smear)
boxplot(cancer_remission$smear)

summary(cancer_remission$infil)
boxplot(cancer_remission$infil)

summary(cancer_remission$li)
boxplot(cancer_remission$li)

summary(cancer_remission$blast)
boxplot(cancer_remission$blast)

summary(cancer_remission$temp)
boxplot(cancer_remission$temp)

2. Fit a logistic regression model with all the predictors.
For our regression we apply the function "glm":

mod=glm(remiss~cell+smear+infil+li+blast+temp,data=cancer_remission,
family=binomial(link="logit"))
summary(mod)

The output we obtain for the complete model is the following:

Call:
glm(formula = remiss ~ cell + smear + infil + li + blast + temp,
family = binomial(link = "logit"), data = cancer_remission)

. . .

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 58.0385 71.2364 0.815 0.4152
cell 24.6615 47.8377 0.516 0.6062
smear 19.2936 57.9500 0.333 0.7392
infil -19.6013 61.6815 -0.318 0.7507
li 3.8960 2.3371 1.667 0.0955 .
blast 0.1511 2.2786 0.066 0.9471
temp -87.4339 67.5735 -1.294 0.1957
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 21.751 on 20 degrees of freedom
AIC: 35.751

Number of Fisher Scoring iterations: 8
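
Before starting the selection, a quick overall check (a sketch, using the deviances reported in the output above):
the drop from the null deviance to the residual deviance can be compared with a chi-squared distribution with 6
degrees of freedom (one for each predictor), giving a rough test of the complete model against the null model.

with(mod, pchisq(null.deviance - deviance, df.null - df.residual,
                 lower.tail = FALSE))   #34.372-21.751=12.62 on 6 df, p-value close to 0.05#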

3. Apply the stepwise algorithm to find reasonable reduced models. Use both the forward and the backward
directions, and try with both the "AIC" and the "BIC" criteria. Compare the results.
As we did in the previous exercise we now perform the stepwise algorithm to find the reduced models (we use
both "AIC" and "BIC" criteria). We start with the backward stepwise with AIC:

mod.bA=step(mod)
summary(mod.bA)

The last reduced model we obtain is the following (we have 3 predictors):

Step: AIC=29.95
remiss ~ cell + li + temp

Df Deviance AIC
<none> 21.953 29.953
- temp 1 24.341 30.341
- cell 1 24.648 30.648
- li 1 30.829 36.829

and its summary is:

Call:
glm(formula = remiss ~ cell + li + temp, family = binomial(link = "logit"),

data = cancer_remission)

. . .

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 67.634 56.888 1.189 0.2345
cell 9.652 7.751 1.245 0.2130
li 3.867 1.778 2.175 0.0297 *
temp -82.074 61.712 -1.330 0.1835
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 34.372 on 26 degrees of freedom


Residual deviance: 21.953 on 23 degrees of freedom
AIC: 29.953

Number of Fisher Scoring iterations: 7

We now perform the backward stepwise with BIC:

n=dim(cancer_remission)[1]
mod.bB=step(mod,k=log(n))
summary(mod.bB)

The last iteration we obtain is the following. In this case we obtained a reduced model with just one predictor:

Step: AIC=32.66
remiss ~ li

Df Deviance AIC
<none> 26.073 32.665
- li 1 34.372 37.668

and the summary is:

Call:
glm(formula = remiss ~ li, family = binomial(link = "logit"),
data = cancer_remission)

. . .

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073
Number of Fisher Scoring iterations: 4

We now perform the forward stepwise method. First of all we consider the "AIC":

mod.null=glm(remiss~1,data=cancer_remission,family=binomial(link="logit"))
mod.fA=step(mod.null,formula(mod),direction="forward")
summary(mod.fA)

From the last iteration we have:


Step: AIC=30.07
remiss ~ li

Df Deviance AIC
<none> 26.073 30.073
+ cell 1 24.341 30.341
+ temp 1 24.648 30.648
+ infil 1 25.491 31.490
+ smear 1 25.937 31.937
+ blast 1 25.981 31.981

and the summary is:

Call:
glm(formula = remiss ~ li, family = binomial(link = "logit"),
data = cancer_remission)

. . .

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073
Number of Fisher Scoring iterations: 4

Finally we consider the forward stepwise with the "BIC":

mod.null=glm(remiss~1,data=cancer_remission,
family=binomial(link = "logit"))
mod.fB=step(mod.null,formula(mod),direction="forward",k=log(n))
summary(mod.fB)

The last iteration is:
Step: AIC=32.66
remiss ~ li

Df Deviance AIC
<none> 26.073 32.665
+ cell 1 24.341 34.228
+ temp 1 24.648 34.535
+ infil 1 25.491 35.378
+ smear 1 25.937 35.825
+ blast 1 25.981 35.869

And from the summary we see again that we obtained a model with one predictor:

Call:
glm(formula = remiss ~ li, family = binomial(link = "logit"),
data = cancer_remission)

. . .

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.777 1.379 -2.740 0.00615 **
li 2.897 1.187 2.441 0.01464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073
Number of Fisher Scoring iterations: 4
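
As a short follow-up (a sketch reusing the selected object "mod.fB" from above), the final model "remiss ~ li" can
be read on the odds scale, and we can compute the predicted remission probability for a chosen (hypothetical)
value of "li":

exp(coef(mod.fB))     #exp(coefficient) = multiplicative change in the odds of remission per unit of li#
predict(mod.fB, newdata = data.frame(li = 1.5), type = "response")   #hypothetical patient with li = 1.5#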

21 Last farewell . . .
First of all I would like to thank professor Rapallo for his magnificent lectures during this global pandemic, and
then I thank you, the students who enjoyed these notes. It has been a pleasure and an honor to offer you this
manuscript, and I hope it will prove profitable.
So remember that, as Socrates said,

"Knowledge will set you free, ignorance will gaol you"

and, if I may add:

"knowledge is free and as such it shall be."

One last warm farewell from your dearest comrade G.M., aka Peer2PeerLoverz.

