4.5 Bootstrap Variations
Contents

4.5.0 Bootstrap
Estimate for F
Parametric Estimate
Non-Parametric Bootstrap
Overview
• We redefine our measure of inaccuracy to include multiple samples and test sets.
# samples <- combn(popSharks, 5)
# N_s <- ncol(samples)
N_s <- 10^4
n = 6
set.seed(341)
samples <- sapply(1:N_s, FUN = function(b) sample(popSharks, n, replace = TRUE))
4.5.0 Bootstrap
• So far, to bootstrap we have been sampling with replacement from the sample S,
– because the sample S was viewed as an estimate of the population P.
– This sampling scheme is equivalent to sampling from the empirical distribution function $\hat{F}$ (illustrated in the sketch below).
• In other words, we would like to sample from the distribution F, but instead
– we obtain a sample using an estimate $\hat{F}$.
• What other possible estimates are there for the cumulative distribution function F?
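The equivalence between resampling with replacement and sampling from $\hat{F}$ can be checked directly. A minimal sketch (not from the original notes; s is a toy sample):

set.seed(341)
s = c(2, 5, 7, 7, 9)  # a toy sample S
u = runif(5)
# Inverse-CDF sampling from Fhat: type = 1 inverts the usual empirical CDF
as.numeric(quantile(s, probs = u, type = 1))
# ... which has the same distribution as sampling with replacement from S
sample(s, 5, replace = TRUE)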
Estimate for F
• We can vary the empirical distribution function using the argument type in the quantile function.
– Generate 10 observations from G(5, 1)
set.seed(341)
x = rnorm(10, mean = 5)
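The code that produced the nine panels below is not in the extracted notes; a minimal sketch that draws one curve per quantile type (types 1 through 9), consistent with the figure, is:

pseq = seq(0, 1, length.out = 1000)
par(mfrow = c(3, 3), mar = 2.5 * c(1, 1, 1, 0.1))
for (t in 1:9) {
    # Each quantile type induces a slightly different estimate of F
    plot(quantile(x, probs = pseq, type = t), pseq, type = "l",
        xlim = c(4, 6.5), ylim = c(0, 1), xlab = "", ylab = "proportion",
        main = paste("type =", t))
}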
[Figure: nine empirical distribution function estimates, one per quantile type; y-axis "proportion", x-axis roughly 4.0 to 6.5]
• Refer to the help documentation of the quantile function for details on the argument type.
• All of the quantile-type distribution functions on one plot:
[Figure: "Different Empirical Distribution Functions", all nine quantile-type curves overlaid; y-axis "proportion"]
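A sketch of the overlay (again an assumption, reusing pseq from the sketch above):

plot(NA, xlim = c(4, 6.5), ylim = c(0, 1), xlab = "", ylab = "proportion",
    main = "Different Empirical Distribution Functions")
for (t in 1:9) lines(quantile(x, probs = pseq, type = t), pseq, col = t)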
Parametric Estimate
• We can estimate the distribution function F(x) using a parametric model F(x; θ), indexed by some parameters θ.
• Generate 100 observations from G(µ = 5, σ = 1)
set.seed(341)
x = rnorm(100, mean = 5)
c(mean(x), sd(x))
xseq = seq(extendrange(x)[1], extendrange(x)[2], length.out = 200)  # grid for the fitted curve (definition lost in extraction)
plot(ecdf(x), xlim = extendrange(x), ylim = c(0, 1), ylab = "proportion", xlab = "",
    main = "Empirical Distribution Function")
lines(xseq, pnorm(xseq, mean(x), sd = sd(x)), col = 2)
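The figure also has a second panel comparing a density histogram to the fitted normal density; the corresponding code is missing from the extraction, but presumably resembled:

hist(x, prob = TRUE, xlab = "", main = "Histogram")
lines(xseq, dnorm(xseq, mean(x), sd = sd(x)), col = 2)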
[Figure: two panels over x in (2, 7): the empirical CDF with the fitted normal CDF overlaid, and a density histogram with the fitted normal density]
Non-Parametric Bootstrap
• For a given sample S and a non-parametric method:
– Obtain an estimate $\hat{F}(x)$ using the sample; this estimate is the empirical CDF.
• Note: alternatively, we could estimate the density function with some $\hat{f}$, and do the same thing.
– Here, we would generate samples from the model, NOT through sampling with replacement from the sample (see the sketch below).
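As a concrete instance of the note above, a "smoothed" bootstrap samples from a kernel density estimate $\hat{f}$ rather than resampling directly. A sketch (not from the notes; for a Gaussian kernel, sampling from the density estimate amounts to resampling the data and adding kernel noise):

set.seed(341)
fhat = density(x)  # kernel density estimate of f (Gaussian kernel by default)
# One bootstrap sample drawn from fhat rather than from the ECDF:
xstar = sample(x, length(x), replace = TRUE) + rnorm(length(x), sd = fhat$bw)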
Agricultural census (USA): Parametric Example
• Consider the West region: suppose we obtain a sample of size 50 from the 422 farms in the Western region and measure the number of acres in 1987.
agpop <- read.csv("../../../Data/agpop_data.csv", header = TRUE)
set.seed(341)
acres87 = agpop[agpop$region == "W", "acres87"]
N = length(acres87)
n = 50
acres87Sam = acres87[sample(1:N, n)]
• From the histogram and empirical distribution function, it seems an exponential distribution with rate equal to $1/\bar{x}$ fits the data well.
– Note that $1/\bar{x}$ is the maximum likelihood estimate of the rate parameter of the exponential distribution.
par(mfrow = c(1, 2))
hist(acres87Sam, breaks = seq(0, max(acres87Sam, na.rm = TRUE), length.out = 15),
prob = TRUE, xlab = "Acres from 1987", main = "Acres from W region in 1987")
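The second panel (the empirical CDF visible in the figure) is missing from the extracted code; a plausible reconstruction, with the fitted exponential CDF overlaid as described below:

plot(ecdf(acres87Sam), xlab = "Acres from 1987", main = "ECDF")
xseq87 = seq(0, max(acres87Sam, na.rm = TRUE), length.out = 200)  # hypothetical grid
lines(xseq87, pexp(xseq87, rate = 1/mean(acres87Sam)), col = 2)  # fitted exponential CDF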
[Figure: two panels: the density histogram of acres87Sam with the fitted exponential density, and the empirical CDF (Fn(x)) with the fitted exponential CDF]
• The smooth curve is the fitted exponential model overlaid on the histogram/empirical CDF of the data.
• Based on these graphs, the exponential distribution seems to be a good fit for the data.
• This means that, to repeat the experiment and obtain more samples each of size n, one can simply generate n observations from the $Exp(\theta = 1/\bar{x})$ distribution.
theta = 1/mean(acres87Sam)
B = 10^4
Sstar <- sapply(1:B, FUN = function(b) rexp(n, rate = theta))
bootAvg = apply(Sstar, 2, mean)
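The comparison objects Sstar0 and bootAvg0 (the ordinary bootstrap: sampling with replacement from the observed sample) are used below, but their construction was lost in extraction; a sketch consistent with that use:

# Assumed reconstruction: non-parametric bootstrap from the observed sample
Sstar0 <- sapply(1:B, FUN = function(b) sample(acres87Sam, n, replace = TRUE))
bootAvg0 = apply(Sstar0, 2, mean)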
The summary statistics for the sample averages of the parametric and with-replacement bootstrap samples are

sd(bootAvg)

## [1] 111808.4

sd(bootAvg0)

## [1] 105484.4

summary(bootAvg)
summary(bootAvg0)
pseq = seq(0, 1, length.out = 1000)
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))
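The histogram commands themselves were also lost; given the Frequency axes in the figure and the titles used for the median below, presumably something like:

hist(bootAvg, main = "Parametric Bootstrap", xlab = "Sample average")
hist(bootAvg0, main = "Bootstrap Sampling with \n Replacement", xlab = "Sample average")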
[Figure: two frequency histograms comparing the bootstrap distributions of the sample average (parametric vs. sampling with replacement)]
Example - Median
• Now consider the median for the same data, i.e.
– the West region in the US agriculture data; suppose we obtain a sample of size 50 from the 422 farms in the Western region and measure the number of acres in 1987.
bootMed = apply(Sstar, 2, median)
bootMed0 = apply(Sstar0, 2, median)
par(mfrow = c(1, 2))
# hhPopMed: common histogram breaks (defined earlier in the notes for the population median)
hist(bootMed, main = "Parametric Bootstrap", breaks = hhPopMed)
hist(bootMed0, main = "Bootstrap Sampling with \n Replacement", breaks = hhPopMed)
[Figure: frequency histograms of bootMed (parametric bootstrap) and bootMed0 (sampling with replacement); x-axis roughly 200000 to 1400000]
• What do you observe?
• Try other attributes like IQR, min, max, mid-hinge, CV, etc.
Summary
• We introduced the Parametric bootstrap and illustrated how it can estimate the sampling distribution.
• For this section, for simplicity we will treat each data-set as a sample.
Iris Data
• Here we will explore the bootstrap in analyzing the famous Iris data-set.
– Iris is a data set with 150 cases and 5 variables named Sepal.Length, Sepal.Width, Petal.Length,
Petal.Width, and Species.
– We will limit ourselves to the setosa flowers, where Sepal.Length is our x-covariate and Sepal.Width is our response.
[Figure: scatterplot of Sepal.Width vs. Sepal.Length for the setosa flowers]
• Here we assume that
– Sepal.Length is our x-covariate and
– Sepal.Width is our response.
data(iris)
iris.s = iris[iris[, 5] == "setosa", -c(3, 4, 5)]
# head(iris.s)
plot(iris.s, col = adjustcolor("firebrick", 0.5), pch = 19)
x = iris.s$Sepal.Length
y = iris.s$Sepal.Width
n = length(y)
beta.hat = lm(y ~ I(x - mean(x)))$coef
abline(beta.hat + c(-beta.hat[2] * mean(x), 0))
lm(y ~ x)$coef  # the same line in the original (uncentred) parameterization

## (Intercept)           x
##  -0.5694327   0.7985283
[Figure: scatterplot of Sepal.Width vs. Sepal.Length with the fitted least-squares line]
• The sample is
$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
– We might sample with replacement from the pairs of observations, i.e. sample the pairs $(x_i, y_i)$, as in the sketch below.
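The code that produced beta.boot (the B bootstrap coefficient pairs used throughout what follows) is not in the extracted notes; a minimal sketch, assuming each resampled set of pairs is refitted with the model centred at the original mean(x), to match how beta.boot is used later:

B = 1000
set.seed(341)
beta.boot = t(sapply(1:B, function(b) {
    idx = sample(n, n, replace = TRUE)  # resample the pairs (x_i, y_i)
    lm(y[idx] ~ I(x[idx] - mean(x)))$coef  # (alpha*, beta*) for this bootstrap sample
}))
# Each row of beta.boot is one bootstrap pair (alpha*, beta*)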
[Figure: histograms of the bootstrap estimates alpha* (roughly 3.35 to 3.55) and beta* (roughly 0.6 to 1.2), and a scatterplot of beta* vs. alpha*]
• Notice that the estimates seem to be generated from a bivariate normal (how so?)
– which indicates that the bootstrap and the “theoretical” confidence intervals based on the errors
being Gaussian should agree.
[Figure: histograms and scatterplot of the bootstrap intercept and slope estimates from the uncentred model]
• Note the strong dependence between the estimates of the two parameters (negative correlation), which does not exist when using the centred model, i.e. $Y_i = \alpha + \beta(x_i - \bar{x}) + R_i$.
• The fitted line and the bootstrap lines, shown in the figure below, are
$$y = \hat{\alpha} + \hat{\beta}\,(x - \bar{x})$$
$$y = \hat{\alpha}^*_b + \hat{\beta}^*_b\,(x - \bar{x})$$
[Figure: scatterplot of Sepal.Width vs. Sepal.Length with the fitted line and the bootstrap regression lines]
• How can we construct a confidence interval for the regression line?
– From each bootstrap coefficient pair we compute
$$\hat{\mu}^*_b(x_0) = \hat{\alpha}^*_b + \hat{\beta}^*_b\,(x_0 - \bar{x}), \qquad b = 1, \ldots, B$$
so we have $\{\hat{\mu}^*_1(x_0), \ldots, \hat{\mu}^*_B(x_0)\}$.
• A 95% bootstrap confidence interval is given by the 2.5 and 97.5 percent quantiles of the bootstrap replicates (the percentile interval):
$$\hat{\mu}_{\text{lower}}(x_0) = Q_{\hat{\mu}^*(x_0)}(0.025) \quad \text{and} \quad \hat{\mu}_{\text{upper}}(x_0) = Q_{\hat{\mu}^*(x_0)}(0.975)$$
x0 = 4.5
mu0.star.hat = apply(beta.boot, 1, function(z, a) {
    sum(z * a)
}, a = c(1, x0 - mean(x)))
[Figure: histogram of mu0.star.hat, the bootstrap replicates of the line at x0 = 4.5, alongside the Sepal.Width vs. Sepal.Length scatterplot with the fitted line]
x.seq = c(4.5, 5.0, 5.5)  # assumed x0 values; the definition was lost, but these reproduce the output below
boot.ci = matrix(0, nrow = length(x.seq), ncol = 2)
for (i in 1:length(x.seq)) {
    y.hat = apply(beta.boot, 1, function(z, a) {
        sum(z * a)
    }, a = c(1, x.seq[i] - mean(x)))
    boot.ci[i, ] = quantile(y.hat, prob = c(0.025, 0.975))
}
round(boot.ci, 2)
## [,1] [,2]
## [1,] 2.88 3.14
## [2,] 3.35 3.49
## [3,] 3.76 4.05
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))

[Figure: two panels of Sepal.Width vs. Sepal.Length showing the fitted line with the bootstrap percentile intervals at the selected x0 values]
• We can add more x0 values and then compare the bootstrap percentile interval to a confidence interval
that uses the assumption of Gaussian errors.
x.seq = seq(min(x), max(x), length.out = 100)
boot.ci = matrix(0, nrow = length(x.seq), ncol = 2)
for (i in 1:length(x.seq)) {
    y.hat = apply(beta.boot, 1, function(z, a) {
        sum(z * a)
    }, a = c(1, x.seq[i] - mean(x)))
    boot.ci[i, ] = quantile(y.hat, prob = c(0.025, 0.975))
}
# ci (the Gaussian-theory interval used below) is not defined in the extracted
# code; a standard construction would be:
ci = predict(lm(y ~ x), newdata = data.frame(x = x.seq), interval = "confidence")
par(mfrow = c(1, 2))
plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5), main = "Bootstrap Confidence Interval")
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(x.seq, boot.ci[, 1], col = 4, lwd = 2)
lines(x.seq, boot.ci[, 2], col = 4, lwd = 2)
plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5), main = "Gaussian Confidence Interval")
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(x.seq, ci[, 2], col = 3, lwd = 2)
lines(x.seq, ci[, 3], col = 3, lwd = 2)
[Figure: side-by-side comparison of the bootstrap percentile interval and the Gaussian-theory confidence interval for the regression line]
• The confidence interval using the assumption of Gaussian errors and the confidence interval using the bootstrap percentile method match.
• Important question: what does the coverage probability of 95% for a regression line mean?
– note that a proper definition of a regression line is E(Y | X = x) = α + βx
Parametric Bootstrap
• How would we apply the parametric bootstrap in the context of regression?
• The assumed regression model is
$$Y_i = \alpha + \beta\,(x_i - \bar{x}) + R_i, \qquad R_i \overset{\text{i.i.d.}}{\sim} G(0, \sigma)$$
• We fit the model, then generate bootstrap responses by simulating new errors:
$$y_i^* = \hat{\alpha} + \hat{\beta}\,(x_i - \bar{x}) + R_i^*, \qquad R_i^* \sim G(0, \hat{\sigma})$$
Illustration
• Obtain the bootstrap samples using $\hat{\alpha} = 3.428$, $\hat{\beta} = 0.7985$, and $\hat{\sigma} = 0.2565$.
B = 1000
par.boot.sam = Map(function(b) {
    Rstar = rnorm(n, mean = 0, sd = 0.2565)  # simulated errors from G(0, sigma-hat)
    y = 3.428 + 0.7985 * (x - mean(x)) + Rstar
    data.frame(x = x, y = y)
}, 1:B)
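To turn these bootstrap data sets into bootstrap coefficient estimates, each one would be refitted; a sketch (the name beta.boot.par is ours, not from the notes):

# Refit the centred model to each parametric bootstrap data set
beta.boot.par = t(sapply(par.boot.sam, function(d) lm(y ~ I(x - mean(x)), data = d)$coef))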
[Figure: two panels of Sepal.Width vs. Sepal.Length comparing the bootstrap regression lines: "Resampling Pairs: non-parametric" and "Parametric Bootstrap"]
• We notice that the results are very similar in this example.
• The parametric bootstrap motivates another way to re-sample data, i.e. sampling the errors.
• We can use the sample of residuals to estimate F using the empirical cdf.
$$\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} I(\hat{r}_i \le t)$$
• We perform the bootstrap using $\hat{F}$:
– we generate a bootstrap sample of errors $R_i^*$ by resampling from the residuals $\hat{r}_1, \ldots, \hat{r}_n$ and obtain
$$y_i^* = \hat{\alpha} + \hat{\beta}\,(x_i - \bar{x}) + R_i^*$$
Illustration
• The sample of residuals and its empirical cdf:
[Figure: ecdf(R), the empirical CDF of the residuals; y-axis Fn(x)]
• Obtain the bootstrap samples
B = 1000
R = lm(y ~ I(x - mean(x)))$residuals  # residuals from the fitted model (definition lost in extraction)
nonpar.boot.sam = Map(function(b) {
    Rstar = R[sample(n, n, replace = TRUE)]
    y = 3.428 + 0.7985 * (x - mean(x)) + Rstar
    data.frame(x = x, y = y)
}, 1:B)
[Figure: three panels of Sepal.Width vs. Sepal.Length comparing the bootstrap lines from "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors"]
• All three methods agree.
library(robustbase)
data(Animals2)
# Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y = log(Animals2$brain)
n = length(y)
plot(x, y, pch = 19, col = adjustcolor("grey", 0.5))
beta.hat = lm(y ~ I(x - mean(x)))$coef
sd.hat = sqrt(sum(lm(y ~ I(x - mean(x)))$residuals^2)/(n - 2))
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0), col = adjustcolor("firebrick", 1))
[Figure: scatterplot of y = log(brain) vs. x = log(body) for Animals2, with the fitted least-squares line]
• Least-squares confidence intervals:
[Figure: three panels of y vs. x comparing the least-squares confidence bands from "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors"]
library(MASS)  # for rlm
# Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y = log(Animals2$brain)
n = length(y)
plot(x, y, pch = 19, col = adjustcolor("grey", 0.5))
beta.hat = rlm(y ~ I(x - mean(x)), psi = "psi.huber")$coef
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0), col = adjustcolor("firebrick", 1))  # presumably drawn, as in the LS version
[Figure: scatterplot of y vs. x with the robust (Huber) regression line]
• Robust Regression Confidence Intervals:
[Figure: three panels of y vs. x comparing the robust-regression confidence bands from "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors"]
• What do you learn from these plots? Do they all agree?

The bootstrap "can blow the head off any problem if the statistician can stand the resulting mess."
– John Tukey
Summary
• We saw there are three ways to apply the bootstrap in the context of regression:
– resampling pairs (x is random)
– resampling errors (x is fixed)
– parametric bootstrap (x is fixed, has most assumptions)