
4.5 Bootstrap Variations

Contents

4.5.0 Bootstrap
    Estimate for F
    Parametric Estimate
    Non-Parametric Bootstrap

4.5.1 Parametric Bootstrap
    Agricultural census (USA): Parametric Example
    Agricultural census (USA): Non-parametric Example
    Comparing parametric and non-parametric results
    Example - Median
    Summary

4.5.2 Bootstrap in Regression
    Iris Data
    Resampling the Pairs: non-parametric bootstrap
        Bootstrap the Regression Coefficients
        Aside: Comparing Regression Models
        Bootstrap Regression lines
    Bootstrap Confidence Interval (Percentile Method)
    Parametric Bootstrap
        Illustration
    Resampling the Errors
        Illustration
    Some other Examples
        Animals Data and LS
        Animals Data and Robust Regression
    Summary

Overview
• We redefine our measure of inaccuracy to include multiple samples and test sets.

sharks <- read.csv("../../../Data/sharks.csv")


popSharks <- rownames(sharks)

# samples <- combn(popSharks, 5); N_s <- ncol(samples)
N_s <- 10^4
n = 6
set.seed(341)
samples <- sapply(1:N_s, FUN = function(b) sample(popSharks, n, replace = TRUE))

avePop <- mean(sharks[, "Length"])

avesSamp <- apply(samples, MARGIN = 2, FUN = function(s) {
    mean(sharks[s, "Length"])
})
sampleErrors <- avesSamp - avePop

tmpAve <- mean(avesSamp)
tmpSD <- sd(avesSamp)

sdsSamp <- apply(samples, MARGIN = 2, FUN = function(s) {
    sd(sharks[s, "Length"])
})
agpop <- read.csv("../../../Data/agpop_data.csv", header = TRUE)

missing92 <- agpop[, "acres92"] == -99
rowNumsMissing <- which(agpop[, "acres92"] == -99)
agpop[missing92, "acres92"] <- NA
agpop[agpop[, "acres87"] == -99, "acres87"] <- NA
agpop[agpop[, "acres82"] == -99, "acres82"] <- NA

4.5.0 Bootstrap
• So far, to bootstrap we have been sampling with replacement from the sample S.
– Because the sample S was viewed as an estimate of the population P.
– This sampling scheme is equivalent to sampling from the empirical distribution function, $\hat{F}$.

• In other words, we would like to sample from the distribution F, but instead,
– we obtain a sample using an estimate $\hat{F}$.

• What other possible estimates are there for the cumulative distribution function F?

Estimate for F
• Vary the empirical distribution function using the argument type in the quantile function.
– Generate 10 observations from G(5, 1)

set.seed(341)
x = rnorm(10, mean = 5)

pseq = seq(0, 1, length.out = 1000)

par(mfrow = c(3, 3), mar = 2.5 * c(1, 1, 1, 0.1))
for (i in 1:9) {
    plot(quantile(x, probs = pseq, type = i), pseq, type = "l", xlim = extendrange(x),
        ylim = c(0, 1), ylab = "proportion", xlab = "", main = paste("Type=", i))
    segments(x0 = c(-5, 10), y0 = c(0, 1), x1 = c(min(x), max(x)), y1 = c(0, 1))
}
[Figure: nine panels, "Type= 1" through "Type= 9", each showing the empirical quantile function of x for that quantile type, with "proportion" (0 to 1) plotted against x (roughly 4.0 to 6.5).]

• Refer to the help documentation of the quantile function for details on the argument type.

3
• All of the quantile functions on one plot:
[Figure: "Different Empirical Distribution Functions": all nine quantile-type curves overlaid on a single plot.]

Parametric Estimate
• We can estimate the distribution function F (x) using a parametric model F (x; θ) which is indexed by
some parameters.
• Generate 100 observations from G(µ = 5, σ = 1)
set.seed(341)
x = rnorm(100, mean = 5)
c(mean(x), sd(x))

## [1] 4.994535 1.070854


• Overlay the fitted Gaussian distribution $G(\mu = \hat\mu, \sigma = \hat\sigma)$ on the empirical CDF.
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))
xseq = seq(-10, 10, length.out = 1000)
hist(x, breaks = "FD", xlim = extendrange(x), ylab = "proportion", xlab = "",
main = "Empirical Distribution Function", prob = TRUE)
lines(xseq, dnorm(xseq, mean(x), sd = sd(x)), col = 2)

plot(ecdf(x), xlim = extendrange(x), ylim = c(0, 1), ylab = "proportion", xlab = "",
main = "Empirical Distribution Function")

lines(xseq, pnorm(xseq, mean(x), sd = sd(x)), col = 2)

[Figure: left panel, histogram of x with the fitted Gaussian density overlaid; right panel, empirical CDF of x with the fitted Gaussian CDF overlaid.]

Non-Parametric Bootstrap
• For a given sample S and non-parametric method
– Obtain an estimate $\hat{F}(x)$ using the sample: this estimate is the empirical CDF.

• Generate B bootstrap samples $S_1^*, \ldots, S_B^*$ using $\hat{F}(x)$.
– When you sample with replacement from S, you are generating samples $S_1^*, \ldots, S_B^*$ using $\hat{F}(x)$; a minimal sketch is given below.

• Note: Alternatively, we could estimate the density function with some $\hat{f}$, and do the same thing.
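A minimal sketch of this scheme in R, assuming a numeric sample x like the one generated above:

set.seed(341)
x <- rnorm(10, mean = 5)  # the sample S
B <- 1000

# Sampling with replacement from S is exactly sampling from the empirical CDF
Sstar <- sapply(1:B, FUN = function(b) sample(x, length(x), replace = TRUE))

# Bootstrap replicates of the sample mean and their standard error
bootMeans <- apply(Sstar, MARGIN = 2, FUN = mean)
sd(bootMeans)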

4.5.1 Parametric Bootstrap


• For a given sample S and parametric model F(x; θ)
– Obtain an estimate $\hat\theta$ using the sample.

• Generate B bootstrap samples $S_1^*, \ldots, S_B^*$ using $F(x; \hat\theta)$.
– Here, we will generate samples from the model, NOT through sampling with replacement from the sample.

Agricultural census (USA): Parametric Example
• Consider the West region and suppose we obtain a sample of size 50 from the 422 farms in the Western region and measure the number of acres in 1987.
agpop <- read.csv("../../../Data/agpop_data.csv", header = TRUE)

missing92 <- agpop[, "acres92"] == -99
rowNumsMissing <- which(agpop[, "acres92"] == -99)
agpop[missing92, "acres92"] <- NA
agpop[agpop[, "acres87"] == -99, "acres87"] <- NA
agpop[agpop[, "acres82"] == -99, "acres82"] <- NA
agpop = na.omit(agpop)

set.seed(341)
acres87 = agpop[agpop$region == "W", "acres87"]
N = length(acres87)
n = 50
acres87Sam = acres87[sample(1:N, n)]

• From the histogram and empirical distribution function, it seems an exponential distribution with rate equal to $1/\bar{x}$ fits the data well.
– Note that $1/\bar{x}$ is the maximum likelihood estimate of the rate parameter of the exponential distribution.
par(mfrow = c(1, 2))
hist(acres87Sam, breaks = seq(0, max(acres87Sam, na.rm = TRUE), length.out = 15),
    prob = TRUE, xlab = "Acres from 1987", main = "Acres from W region in 1987")

zseq = seq(0, max(acres87Sam, na.rm = TRUE), length.out = 100)
lines(zseq, dexp(zseq, rate = 1/mean(acres87Sam)))

plot(ecdf(acres87Sam), main = "ECDF of Acres", xlab = "Acres")
lines(zseq, pexp(zseq, rate = 1/mean(acres87Sam)), col = 2)

[Figure: left panel, histogram "Acres from W region in 1987" with the fitted exponential density curve; right panel, "ECDF of Acres" with the fitted exponential CDF overlaid in red.]

• The smooth curve is the fitted exponential model overlaid on the histogram/empirical CDF of the data.
• Based on these graphs, the exponential distribution seems to be a good fit for the data.
• This means that to repeat the experiment to get more samples, each of size n, one can simply generate n random observations from the Exp(rate = $1/\bar{x}$) distribution.
theta = 1/mean(acres87Sam)
B = 10^4
Sstar <- sapply(1:B, FUN = function(b) rexp(n, rate = theta))
bootAvg = apply(Sstar, 2, mean)

The summary statistics for the sample averages of the parametric bootstrap sample are
sd(bootAvg)

## [1] 111808.4
summary(bootAvg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  442437  721441  792075  798314  870757 1342822

Agricultural census (USA): Non-parametric Example


• Consider the West region and suppose we obtain a sample of size 50 from the 422 farms in the Western region and measure the number of acres in 1987.
• From the sample of size 50 we take B samples with replacement, each of size n,
– we calculate the average of each of the B samples,
– then we look at the summary statistics of these averages.
The summary statistics for the sample averages of the non-parametric bootstrap sample are
Sstar0 <- sapply(1:B, FUN = function(b) acres87Sam[sample(n, n, replace = TRUE)])
bootAvg0 = apply(Sstar0, 2, mean)
sd(bootAvg0)

## [1] 105484.4
summary(bootAvg0)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  426363  727142  796295  799360  869553 1235615

Comparing parametric and non-parametric results


• We can estimate the standard error of the sample average, or of other attributes, using either the parametric or the non-parametric bootstrap.

pseq = seq(0, 1, length.out = 1000)
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))

hhPopAve <- hist(extendrange(c(bootAvg, bootAvg0)), breaks = 50, plot = FALSE)$breaks

hist(bootAvg, main = "Parametric Bootstrap", breaks = hhPopAve)
hist(bootAvg0, main = "Bootstrap Sampling with \n Replacement", breaks = hhPopAve)

[Figure: side-by-side histograms of bootAvg ("Parametric Bootstrap") and bootAvg0 ("Bootstrap Sampling with Replacement"), both ranging from about 400,000 to 1,200,000.]

• The histograms show very similar distributions for the sample average across the two methods.
• Compare the summary statistics of the parametric and non-parametric samples to see how close the findings of the two methods are.
– This is because the exponential distribution was a good fit to the data.
– It may not work as well for statistics other than the sample average, though (see the median example below).

Example - Median
• Now consider the median for the same data, i.e.
– the West region in the US agriculture data, and suppose we obtain a sample of size 50 from the 422 farms in the Western region and measure the number of acres in 1987.
bootMed = apply(Sstar, 2, median)
bootMed0 = apply(Sstar0, 2, median)

pseq = seq(0, 1, length.out = 1000)
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))

hhPopMed <- hist(extendrange(c(bootMed, bootMed0)), breaks = 50, plot = FALSE)$breaks

hist(bootMed, main = "Parametric Bootstrap", breaks = hhPopMed)
hist(bootMed0, main = "Bootstrap Sampling with \n Replacement", breaks = hhPopMed)

[Figure: side-by-side histograms of bootMed ("Parametric Bootstrap") and bootMed0 ("Bootstrap Sampling with Replacement"), on shared breaks from about 200,000 to 1,400,000.]

• What do you observe?
• Try other attributes like IQR, min, max, mid-hinge, CV, etc., as sketched below.
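A minimal sketch for one such attribute, the IQR, reusing the bootstrap samples Sstar (parametric) and Sstar0 (non-parametric) from above:

bootIQR <- apply(Sstar, 2, IQR)    # parametric bootstrap replicates of the IQR
bootIQR0 <- apply(Sstar0, 2, IQR)  # non-parametric bootstrap replicates

# Compare the two estimated sampling distributions
c(parametric = sd(bootIQR), nonparametric = sd(bootIQR0))
summary(bootIQR)
summary(bootIQR0)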

Summary
• We introduced the parametric bootstrap and illustrated how it can be used to estimate the sampling distribution.

4.5.2 Bootstrap in Regression


Overview
• We consider how to apply the bootstrap in the context of regression.

• For this section, for simplicity we will treat each data-set as a sample.

Iris Data
• Here we will explore the bootstrap in analyzing the famous Iris data-set.
– Iris is a data set with 150 cases and 5 variables named Sepal.Length, Sepal.Width, Petal.Length,
Petal.Width, and Species.
– We will limit ourselves to the setosa flowers, with Sepal.Length as our x-covariate and Sepal.Width as our response.

Figure 1: Iris Flower

The Iris data is built-in to R and a part of it is shown below:


data(iris)
head(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris.s = iris[iris[, 5] == "setosa", -c(3, 4, 5)]
# head(iris.s)

The sepal width and sepal length seem to be correlated:


plot(iris.s, col = adjustcolor("firebrick", 0.5), pch = 19)

[Figure: scatterplot of Sepal.Width against Sepal.Length for the setosa flowers.]
• Here we assume that
– Sepal.Length is our x-covariate and
– Sepal.Width is our response.
x = iris.s$Sepal.Length
y = iris.s$Sepal.Width
n = length(y)

• The assumed regression model is

$Y_i = \alpha + \beta (x_i - \bar{x}) + R_i$

where $R_i \sim N(0, \sigma^2)$, and the least squares estimate of $(\hat\alpha, \hat\beta)$ is
lm(y ~ x)$coef

## (Intercept) x
## -0.5694327 0.7985283
data(iris)
iris.s = iris[iris[, 5] == "setosa", -c(3, 4, 5)]
# head(iris.s)
plot(iris.s, col = adjustcolor("firebrick", 0.5), pch = 19)
beta.hat = lm(y ~ I(x - mean(x)))$coef
abline(beta.hat + c(-beta.hat[2] * mean(x), 0))

[Figure: the setosa scatterplot with the fitted least-squares line overlaid.]

Resampling the Pairs: non-parametric bootstrap


• How can one assess the sampling variability of the regression line?
– How can we assess the standard error of the regression line?

• The sample is

$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$

– We might sample with replacement from the pairs of observations, i.e. sample the pairs $(x_i, y_i)$.

• For each bootstrap sample $S_b^*$, we estimate the LS line to obtain $(\hat\alpha_b^*, \hat\beta_b^*)$, for $b = 1, \ldots, B$.
B = 1000
beta.boot = t(sapply(1:B, FUN = function(b) lm(y ~ I(x - mean(x)),
    subset = sample(n, n, replace = TRUE))$coef))

Bootstrap the Regression Coefficients


Plots of the bootstrap estimates:

[Figure: histograms of the bootstrap estimates of alpha (about 3.35 to 3.55) and beta (about 0.6 to 1.2), plus a scatterplot of beta against alpha.]

• Notice that the estimates seem to be generated from a bivariate normal distribution (how so? A quick check is sketched below),
– which indicates that the bootstrap and the "theoretical" confidence intervals based on the errors being Gaussian should agree.
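A quick numerical sketch of this check, reusing beta.boot from above:

# Centers and spreads of the bootstrap replicates
colMeans(beta.boot)      # bootstrap means of alpha-hat* and beta-hat*
apply(beta.boot, 2, sd)  # bootstrap standard errors

# Marginal normality of beta-hat* (each margin of a bivariate normal is normal)
qqnorm(beta.boot[, 2], main = "Normal Q-Q plot of beta-hat*")
qqline(beta.boot[, 2])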

Aside: Comparing Regression Models


• Suppose we consider the alternative parameterization of the line:

$Y_i = \alpha + \beta (x_i - \bar{x}) + R_i$   and   $Y_i = \lambda + \beta x_i + R_i$
• The bootstrap replicates for this model are

beta.boot2 = t(sapply(1:B, FUN = function(b) lm(y ~ x,
    subset = sample(n, n, replace = TRUE))$coef))

[Figure: histograms of the bootstrap estimates of lambda (about -2 to 1) and beta (about 0.4 to 1.2), plus a scatterplot of beta against lambda showing a strong negative relationship.]

• Note the strong dependence between the estimates of the two parameters (negative correlation), which does not exist when using the centralized model, i.e. $Y_i = \alpha + \beta(x_i - \bar{x}) + R_i$. A numerical comparison is sketched below.
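A one-line sketch of that comparison, reusing the replicates from both parameterizations:

cor(beta.boot2[, 1], beta.boot2[, 2])  # lambda-hat* vs beta-hat*: strongly negative
cor(beta.boot[, 1], beta.boot[, 2])    # alpha-hat* vs beta-hat*: near zero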

Bootstrap Regression lines


We can plot the data with the fitted model

$y = \hat\alpha + \hat\beta (x - \bar{x})$

and all of the bootstrap regression lines

$y = \hat\alpha_b^* + \hat\beta_b^* (x - \bar{x})$

[Figure: the setosa scatterplot with the fitted line and all of the bootstrap regression lines overlaid.]
• How can we construct a confidence interval for the regression line?

Bootstrap Confidence Interval (Percentile Method)


Suppose we want a confidence interval for the fitted line at the value $x_0$.

• The fitted value is

$\hat\mu(x_0) = \hat\alpha + \hat\beta (x_0 - \bar{x})$

• The bootstrap replicates for this fitted value are

$\hat\mu_b^*(x_0) = \hat\alpha_b^* + \hat\beta_b^* (x_0 - \bar{x})$, for $b = 1, \ldots, B$,

so we have $\{\hat\mu_1^*(x_0), \ldots, \hat\mu_B^*(x_0)\}$.

• A 95% bootstrap confidence interval is given by the 2.5 and 97.5 percentiles of the bootstrap replicates (the percentile interval):

$\hat\mu_{\text{lower}}(x_0) = Q_{\hat\mu^*(x_0)}(0.025)$ and $\hat\mu_{\text{upper}}(x_0) = Q_{\hat\mu^*(x_0)}(0.975)$

x0 = 4.5

mu0.hat = sum(beta.hat * c(1, x0 - mean(x)))

mu0.star.hat = apply(beta.boot, 1, function(z, a) {
    sum(z * a)
}, a = c(1, x0 - mean(x)))

boot.ci0 = quantile(mu0.star.hat, prob = c(0.025, 0.975))

• Using $x_0 = 4.5$, the fitted value $\hat\mu(x_0)$ is 3.02, and the bootstrap confidence interval using the percentile method is (2.88, 3.14).
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))

hist(mu0.star.hat, freq = FALSE, breaks = "FD", col = adjustcolor("grey", 0.5),
    main = "Histogram of \n mu.star.hat(x0)")
abline(v = c(mu0.hat, boot.ci0), lty = c(1, 2, 2))

plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5))
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(c(x0, x0), boot.ci0, col = 4, lwd = 2)

[Figure: left panel, "Histogram of mu.star.hat(x0)" with the fitted value (solid line) and percentile limits (dashed lines) marked; right panel, the setosa scatterplot with the fitted line and the interval at x0 = 4.5 drawn as a vertical segment.]

• To obtain a confidence interval for the whole regression line,
– we vary x0, perform the bootstrap again, and
– connect the lower and upper confidence limits.
x.seq = c(4.5, 5, 5.6)

Using x0 = 4.5, 5, 5.6, we obtain the following confidence intervals.

boot.ci = matrix(0, nrow = length(x.seq), 2)

for (i in 1:length(x.seq)) {
y.hat = apply(beta.boot, 1, function(z, a) {
sum(z * a)
}, a = c(1, x.seq[i] - mean(x)))
boot.ci[i, ] = quantile(y.hat, prob = c(0.025, 0.975))
}

round(boot.ci, 2)

## [,1] [,2]
## [1,] 2.88 3.14
## [2,] 3.35 3.49
## [3,] 3.76 4.05
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))

plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5))
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
for (i in 1:length(x.seq)) lines(rep(x.seq[i], 2), boot.ci[i, ], col = 4, lwd = 2)

plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5))
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(x.seq, boot.ci[, 1], col = 4, lwd = 2)
lines(x.seq, boot.ci[, 2], col = 4, lwd = 2)
[Figure: left panel, the percentile intervals at x0 = 4.5, 5, and 5.6 drawn as vertical segments on the setosa scatterplot; right panel, the lower and upper limits connected across the three x0 values.]

• We can add more x0 values and then compare the bootstrap percentile interval to a confidence interval
that uses the assumption of Gaussian errors.
x.seq = seq(min(x), max(x), length.out = 100)
boot.ci = matrix(0, nrow = length(x.seq), 2)

for (i in 1:length(x.seq)) {
y.hat = apply(beta.boot, 1, function(z, a) {
sum(z * a)
}, a = c(1, x.seq[i] - mean(x)))
boot.ci[i, ] = quantile(y.hat, prob = c(0.025, 0.975))
}

## A CI using the assumption of Gaussian Errors
ci = predict(lm(y ~ x), newdata = data.frame(x = x.seq), interval = "confidence")

par(mfrow = c(1, 2))

plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5), main = "Bootstrap Confidence Interval")
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(x.seq, boot.ci[, 1], col = 4, lwd = 2)
lines(x.seq, boot.ci[, 2], col = 4, lwd = 2)

plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5), main = "Gaussian Confidence Interval")
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(x.seq, ci[, 2], col = 3, lwd = 2)
lines(x.seq, ci[, 3], col = 3, lwd = 2)

[Figure: left panel, "Bootstrap Confidence Interval"; right panel, "Gaussian Confidence Interval". The two bands around the fitted line are nearly indistinguishable.]
• A confidence interval using the assumption of Gaussian Errors and a confidence interval using the
bootstrap and the percentile method match.

• Important question: what does a coverage probability of 95% for a regression line mean? A small simulation sketch is given below.
– Note that a proper definition of a regression line is $E(Y \mid X = x) = \alpha + \beta x$.
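A hedged simulation sketch of pointwise coverage at a single x0, generating data from the model fitted above (so the "true" line is known by construction):

# Coverage of the 95% Gaussian CI at x0, under the assumed model
set.seed(341)
x0 <- 4.5
M <- 1000
covered <- replicate(M, {
    ystar <- 3.428 + 0.7985 * (x - mean(x)) + rnorm(n, sd = 0.2565)
    ci <- predict(lm(ystar ~ x), newdata = data.frame(x = x0),
        interval = "confidence")
    mu0 <- 3.428 + 0.7985 * (x0 - mean(x))  # true mean response at x0
    ci[1, "lwr"] <= mu0 & mu0 <= ci[1, "upr"]
})
mean(covered)  # should be close to 0.95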

Parametric Bootstrap
• How would we apply the parametric bootstrap in the context of regression?
• The assumed regression model is

$Y_i = \alpha + \beta (x_i - \bar{x}) + R_i$  with  $R_i \sim_{i.i.d.} G(0, \sigma)$

• We fit the model to obtain the estimates $\hat\alpha$, $\hat\beta$, and $\hat\sigma$.
– For the Iris data above, these are $\hat\alpha = 3.428$, $\hat\beta = 0.7985$, and $\hat\sigma = 0.2565$.
– Hence the fitted model is $Y_i = 3.428 + 0.7985 (x_i - 5.006) + R_i$ with $R_i \sim_{i.i.d.} G(0, 0.2565)$.

• To obtain a bootstrap sample, we generate $R_i^*$ from $G(0, \hat\sigma)$ and set

$y_i^* = \hat\alpha + \hat\beta (x_i - \bar{x}) + R_i^*$

and then the bootstrap sample is

$S_b^* = \{(x_1, y_1^*), (x_2, y_2^*), \ldots, (x_n, y_n^*)\}$

• For each bootstrap sample $S_b^*$ we estimate the parameters to get the bootstrap replicates $\hat\alpha_b^*$, $\hat\beta_b^*$, and $\hat\sigma_b^*$.

Illustration
• Obtain the bootstrap samples using $\hat\alpha = 3.428$, $\hat\beta = 0.7985$, and $\hat\sigma = 0.2565$.
B = 1000
par.boot.sam = Map(function(b) {
Rstar = rnorm(n, mean = 0, sd = 0.2565)
y = 3.428 + 0.7985 * (x - mean(x)) + Rstar
data.frame(x = x, y = y)
}, 1:B)

par.boot.coef = Map(function(sam) lm(y ~ I(x - mean(x)), data = sam)$coef, par.boot.sam)
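To compare with resampling pairs, a short sketch computing the bootstrap standard errors of the coefficients under both schemes (reusing beta.boot and par.boot.coef):

# Stack the list of parametric-bootstrap coefficient vectors into a B x 2 matrix
par.boot.mat <- do.call(rbind, par.boot.coef)

# Bootstrap standard errors: resampling pairs vs parametric bootstrap
apply(beta.boot, 2, sd)
apply(par.boot.mat, 2, sd)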

[Figure: two panels, "Resampling Pairs: non-parametric" and "Parametric Bootstrap", each showing the setosa scatterplot with the bootstrap regression lines overlaid.]

• We notice that the results are very similar in this example.
• The parametric bootstrap motivates another way to re-sample data, i.e. sampling the errors.

Resampling the Errors


• Suppose the regression model is

$Y_i = \alpha + \beta (x_i - \bar{x}) + R_i$

where $R_i \sim F$, i.e. the errors come from some unknown distribution F.
– How might we estimate F?

• If we had α and β, then $r_i = y_i - [\alpha + \beta (x_i - \bar{x})]$, for $i = 1, \ldots, n$, would be a sample from F.

• We fit the model to find $\hat\alpha$ and $\hat\beta$,
– then obtain the residuals $\hat{r}_i = y_i - \hat{y}_i = y_i - [\hat\alpha + \hat\beta (x_i - \bar{x})]$, and
– the sample of residuals, or estimates of the errors, is $\hat{R} = \{\hat{r}_1, \ldots, \hat{r}_n\}$.

• We can use the sample of residuals to estimate F using the empirical CDF:

$\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} I(\hat{r}_i \le t)$

• We perform the bootstrap using $\hat{F}$:
– we generate a bootstrap sample of errors $R_i^*$ by resampling from $\hat{R}$ and obtain

$y_i^* = \hat\alpha + \hat\beta (x_i - \bar{x}) + R_i^*$

– and then the bootstrap sample is

$S_b^* = \{(x_1, y_1^*), (x_2, y_2^*), \ldots, (x_n, y_n^*)\}$

Illustration
• The sample of residuals and their empirical CDF:

[Figure: "ecdf(R)", the empirical CDF of the residuals, which range from about -0.8 to 0.4.]
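The residual vector R used below is not defined in the code shown; a sketch of the assumed definition:

# Residuals from the centered least-squares fit (assumed definition of R)
fit <- lm(y ~ I(x - mean(x)))
R <- fit$residuals
plot(ecdf(R), main = "ecdf(R)")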
• Obtain the bootstrap samples
B = 1000
nonpar.boot.sam = Map(function(b) {
Rstar = R[sample(n, n, replace = TRUE)]
y = 3.428 + 0.7985 * (x - mean(x)) + Rstar
data.frame(x = x, y = y)
}, 1:B)

nonpar.boot.coef = Map(function(sam) lm(y ~ I(x - mean(x)), data = sam)$coef,
    nonpar.boot.sam)

[Figure: three panels, "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors", each showing the setosa scatterplot with the bootstrap regression lines overlaid.]

• All three methods agree.

Some other Examples


Animals Data and LS

library(robustbase)
data(Animals2)
# Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y = log(Animals2$brain)
n = length(y)
plot(x, y, pch = 19, col = adjustcolor("grey", 0.5))
beta.hat = lm(y ~ I(x - mean(x)))$coef
sd.hat = sqrt(sum(lm(y ~ I(x - mean(x)))$residuals^2)/(n - 2))
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0), col = adjustcolor("firebrick", 1))

[Figure: scatterplot of y = log(brain) against x = log(body) for the Animals2 data, with the LS line overlaid.]
• LS confidence intervals:
[Figure: three panels, "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors", each showing the Animals2 data with bootstrap confidence bands around the LS line.]
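A hedged sketch of how one of these bands (resampling pairs with the percentile method) could be produced for this data:

B <- 1000
# Resample (x, y) pairs and refit the centered LS line
coef.boot <- t(sapply(1:B, FUN = function(b) {
    idx <- sample(n, n, replace = TRUE)
    lm(y ~ I(x - mean(x)), subset = idx)$coef
}))

# Percentile band for the fitted line over a grid of x values
x.seq <- seq(min(x), max(x), length.out = 100)
band <- sapply(x.seq, function(x0)
    quantile(coef.boot %*% c(1, x0 - mean(x)), probs = c(0.025, 0.975)))
lines(x.seq, band[1, ], col = 4)  # assumes the scatterplot above is the active plot
lines(x.seq, band[2, ], col = 4)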

Animals Data and Robust Regression


• Let's fit the robust regression line using the Huber function.
library(robustbase)
library(MASS)
data(Animals2)

# Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y = log(Animals2$brain)
n = length(y)
plot(x, y, pch = 19, col = adjustcolor("grey", 0.5))
beta.hat = rlm(y ~ I(x - mean(x)), psi = "psi.huber")$coef
sd.hat = sqrt(sum(rlm(y ~ I(x - mean(x)), psi = "psi.huber")$residuals^2)/(n - 2))
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0), col = adjustcolor("firebrick", 1))
[Figure: the Animals2 scatterplot with the robust (Huber) regression line overlaid.]
• Robust Regression Confidence Intervals:
[Figure: three panels, "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors", each showing the Animals2 data with bootstrap confidence bands around the robust regression line.]

• What do you learn from these plots? Do they all agree?

The bootstrap "can blow the head off any problem if the statistician can stand the resulting mess."
– John Tukey

Summary
• We saw that there are three ways to apply the bootstrap in the context of regression:
– resampling pairs (x is random)
– resampling errors (x is fixed)
– parametric bootstrap (x is fixed; makes the most assumptions)

