4.5 Bootstrap Variations
Contents

4.5.0 Bootstrap
Estimate for F
Parametric Estimate
Non-Parametric Bootstrap
Overview
• We redefine our measure of inaccuracy to include multiple samples and test sets.
# samples <- combn(popSharks, 5)
# N_s <- ncol(samples)
N_s <- 10^4
n = 6
set.seed(341)
samples <- sapply(1:N_s, FUN = function(b) sample(popSharks, n, replace = TRUE))
4.5.0 Bootstrap
• So far, to bootstrap we have been sampling with replacement from the sample S,
– because the sample S was viewed as an estimate of the population P.
– This sampling scheme is equivalent to sampling from the empirical distribution function $\hat{F}$ (illustrated in the sketch below).
• In other words, we would like to sample from the distribution F, but instead
– we obtain a sample using an estimate $\hat{F}$.
• What other possible estimates are there for the cumulative distribution function F?
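The equivalence between resampling with replacement and sampling from $\hat{F}$ can be checked directly. A minimal sketch (not from the original notes; s is a toy sample):

set.seed(341)
s = c(2, 5, 7, 7, 9)  # a toy sample S
u = runif(5)
# Inverse-CDF sampling from Fhat: type = 1 inverts the usual empirical CDF
as.numeric(quantile(s, probs = u, type = 1))
# ... which has the same distribution as sampling with replacement from S
sample(s, 5, replace = TRUE)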
Estimate for F
• We can vary the empirical distribution function using the argument type in the quantile function.
– Generate 10 observations from G(5, 1)
set.seed(341)
x = rnorm(10, mean = 5)
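The code that produced the nine panels below is not in the extracted notes; a minimal sketch that draws one curve per quantile type (types 1 through 9), consistent with the figure, is:

pseq = seq(0, 1, length.out = 1000)
par(mfrow = c(3, 3), mar = 2.5 * c(1, 1, 1, 0.1))
for (t in 1:9) {
    # Each quantile type induces a slightly different estimate of F
    plot(quantile(x, probs = pseq, type = t), pseq, type = "l",
        xlim = c(4, 6.5), ylim = c(0, 1), xlab = "", ylab = "proportion",
        main = paste("type =", t))
}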
[Figure: nine empirical distribution function estimates, one per quantile type; y-axis "proportion", x-axis roughly 4.0 to 6.5]
• Refer to the help documentation of the quantile function for details on the argument type.
• All of the quantile-type distribution functions on one plot:
[Figure: "Different Empirical Distribution Functions", all nine quantile-type curves overlaid; y-axis "proportion"]
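A sketch of the overlay (again an assumption, reusing pseq from the sketch above):

plot(NA, xlim = c(4, 6.5), ylim = c(0, 1), xlab = "", ylab = "proportion",
    main = "Different Empirical Distribution Functions")
for (t in 1:9) lines(quantile(x, probs = pseq, type = t), pseq, col = t)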
Parametric Estimate
• We can estimate the distribution function F(x) using a parametric model F(x; θ), indexed by some parameters θ.
• Generate 100 observations from G(µ = 5, σ = 1)
set.seed(341)
x = rnorm(100, mean = 5)
c(mean(x), sd(x))
xseq = seq(extendrange(x)[1], extendrange(x)[2], length.out = 200)  # grid for the fitted curve (definition lost in extraction)
plot(ecdf(x), xlim = extendrange(x), ylim = c(0, 1), ylab = "proportion", xlab = "",
    main = "Empirical Distribution Function")
lines(xseq, pnorm(xseq, mean(x), sd = sd(x)), col = 2)
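The figure also has a second panel comparing a density histogram to the fitted normal density; the corresponding code is missing from the extraction, but presumably resembled:

hist(x, prob = TRUE, xlab = "", main = "Histogram")
lines(xseq, dnorm(xseq, mean(x), sd = sd(x)), col = 2)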
[Figure: two panels over x in (2, 7): the empirical CDF with the fitted normal CDF overlaid, and a density histogram with the fitted normal density]
Non-Parametric Bootstrap
• For a given sample S and a non-parametric method:
– Obtain an estimate $\hat{F}(x)$ using the sample; this estimate is the empirical CDF.
• Note: alternatively, we could estimate the density function with some $\hat{f}$, and do the same thing.
– Here, we would generate samples from the model, NOT through sampling with replacement from the sample (see the sketch below).
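As a concrete instance of the note above, a "smoothed" bootstrap samples from a kernel density estimate $\hat{f}$ rather than resampling directly. A sketch (not from the notes; for a Gaussian kernel, sampling from the density estimate amounts to resampling the data and adding kernel noise):

set.seed(341)
fhat = density(x)  # kernel density estimate of f (Gaussian kernel by default)
# One bootstrap sample drawn from fhat rather than from the ECDF:
xstar = sample(x, length(x), replace = TRUE) + rnorm(length(x), sd = fhat$bw)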
Agricultural census (USA): Parametric Example
• Consider the West region: suppose we obtain a sample of size 50 from the 422 farms in the Western region and measure the number of acres in 1987.
agpop <- read.csv("../../../Data/agpop_data.csv", header = TRUE)
set.seed(341)
acres87 = agpop[agpop$region == "W", "acres87"]
N = length(acres87)
n = 50
acres87Sam = acres87[sample(1:N, n)]
• From the histogram and empirical distribution function, it seems an exponential distribution with rate equal to $1/\bar{x}$ fits the data well.
– Note that $1/\bar{x}$ is the maximum likelihood estimate of the rate parameter of the exponential distribution.
par(mfrow = c(1, 2))
hist(acres87Sam, breaks = seq(0, max(acres87Sam, na.rm = TRUE), length.out = 15),
prob = TRUE, xlab = "Acres from 1987", main = "Acres from W region in 1987")
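The second panel (the empirical CDF visible in the figure) is missing from the extracted code; a plausible reconstruction, with the fitted exponential CDF overlaid as described below:

plot(ecdf(acres87Sam), xlab = "Acres from 1987", main = "ECDF")
xseq87 = seq(0, max(acres87Sam, na.rm = TRUE), length.out = 200)  # hypothetical grid
lines(xseq87, pexp(xseq87, rate = 1/mean(acres87Sam)), col = 2)  # fitted exponential CDF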
[Figure: two panels: the density histogram of acres87Sam with the fitted exponential density, and the empirical CDF (Fn(x)) with the fitted exponential CDF]
• The smooth curve is the fitted exponential model overlaid on the histogram/empirical CDF of the data.
• Based on these graphs, the exponential distribution seems to be a good fit for the data.
• This means that, to repeat the experiment and obtain more samples each of size n, one can simply generate n observations from the $Exp(\theta = 1/\bar{x})$ distribution.
theta = 1/mean(acres87Sam)
B = 10^4
Sstar <- sapply(1:B, FUN = function(b) rexp(n, rate = theta))
bootAvg = apply(Sstar, 2, mean)
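The comparison objects Sstar0 and bootAvg0 (the ordinary bootstrap: sampling with replacement from the observed sample) are used below, but their construction was lost in extraction; a sketch consistent with that use:

# Assumed reconstruction: non-parametric bootstrap from the observed sample
Sstar0 <- sapply(1:B, FUN = function(b) sample(acres87Sam, n, replace = TRUE))
bootAvg0 = apply(Sstar0, 2, mean)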
The summary statistics for the sample averages of the parametric and with-replacement bootstrap samples are

sd(bootAvg)

## [1] 111808.4

sd(bootAvg0)

## [1] 105484.4

summary(bootAvg)
summary(bootAvg0)
pseq = seq(0, 1, length.out = 1000)
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))
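The histogram commands themselves were also lost; given the Frequency axes in the figure and the titles used for the median below, presumably something like:

hist(bootAvg, main = "Parametric Bootstrap", xlab = "Sample average")
hist(bootAvg0, main = "Bootstrap Sampling with \n Replacement", xlab = "Sample average")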
[Figure: two frequency histograms comparing the bootstrap distributions of the sample average (parametric vs. sampling with replacement)]
Example - Median
• Now consider the median for the same data, i.e.
– the West region in the US agriculture data; suppose we obtain a sample of size 50 from the 422 farms in the Western region and measure the number of acres in 1987.
bootMed = apply(Sstar, 2, median)
bootMed0 = apply(Sstar0, 2, median)
par(mfrow = c(1, 2))
# hhPopMed: common histogram breaks (defined earlier in the notes for the population median)
hist(bootMed, main = "Parametric Bootstrap", breaks = hhPopMed)
hist(bootMed0, main = "Bootstrap Sampling with \n Replacement", breaks = hhPopMed)
[Figure: frequency histograms of bootMed (parametric bootstrap) and bootMed0 (sampling with replacement); x-axis roughly 200000 to 1400000]
• What do you observe?
• Try other attributes like IQR, min, max, mid-hinge, CV, etc.
Summary
• We introduced the Parametric bootstrap and illustrated how it can estimate the sampling distribution.
• For this section, for simplicity we will treat each data-set as a sample.
Iris Data
• Here we will explore the bootstrap in analyzing the famous Iris data-set.
– Iris is a data set with 150 cases and 5 variables named Sepal.Length, Sepal.Width, Petal.Length,
Petal.Width, and Species.
– We will limit ourselves to the setosa flowers, where Sepal.Length is our x-covariate and Sepal.Width is our response.
[Figure: scatterplot of Sepal.Width vs. Sepal.Length for the setosa flowers]
• Here we assume that
– Sepal.Length is our x-covariate and
– Sepal.Width is our response.
data(iris)
iris.s = iris[iris[, 5] == "setosa", -c(3, 4, 5)]
# head(iris.s)
plot(iris.s, col = adjustcolor("firebrick", 0.5), pch = 19)
x = iris.s$Sepal.Length
y = iris.s$Sepal.Width
n = length(y)
beta.hat = lm(y ~ I(x - mean(x)))$coef
abline(beta.hat + c(-beta.hat[2] * mean(x), 0))
lm(y ~ x)$coef  # the same line in the original (uncentred) parameterization

## (Intercept)           x
##  -0.5694327   0.7985283
[Figure: scatterplot of Sepal.Width vs. Sepal.Length with the fitted least-squares line]
• The sample is
$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
– We might sample with replacement from the pairs of observations, i.e. sample the pairs $(x_i, y_i)$, as in the sketch below.
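The code that produced beta.boot (the B bootstrap coefficient pairs used throughout what follows) is not in the extracted notes; a minimal sketch, assuming each resampled set of pairs is refitted with the model centred at the original mean(x), to match how beta.boot is used later:

B = 1000
set.seed(341)
beta.boot = t(sapply(1:B, function(b) {
    idx = sample(n, n, replace = TRUE)  # resample the pairs (x_i, y_i)
    lm(y[idx] ~ I(x[idx] - mean(x)))$coef  # (alpha*, beta*) for this bootstrap sample
}))
# Each row of beta.boot is one bootstrap pair (alpha*, beta*)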
[Figure: histograms of the bootstrap estimates alpha* (roughly 3.35 to 3.55) and beta* (roughly 0.6 to 1.2), and a scatterplot of beta* vs. alpha*]
• Notice that the estimates seem to be generated from a bivariate normal (how so?)
– which indicates that the bootstrap and the “theoretical” confidence intervals based on the errors
being Gaussian should agree.
[Figure: histograms and scatterplot of the bootstrap intercept and slope estimates from the uncentred model]
• Note the strong dependence between the estimates of the two parameters (negative correlation), which does not exist when using the centred model, i.e. $Y_i = \alpha + \beta(x_i - \bar{x}) + R_i$.
• The fitted line and the bootstrap lines, shown in the figure below, are
$$y = \hat{\alpha} + \hat{\beta}\,(x - \bar{x})$$
$$y = \hat{\alpha}^*_b + \hat{\beta}^*_b\,(x - \bar{x})$$
[Figure: scatterplot of Sepal.Width vs. Sepal.Length with the fitted line and the bootstrap regression lines]
• How can we construct a confidence interval for the regression line?
– From each bootstrap coefficient pair we compute
$$\hat{\mu}^*_b(x_0) = \hat{\alpha}^*_b + \hat{\beta}^*_b\,(x_0 - \bar{x}), \qquad b = 1, \ldots, B$$
so we have $\{\hat{\mu}^*_1(x_0), \ldots, \hat{\mu}^*_B(x_0)\}$.
• A 95% bootstrap confidence interval is given by the 2.5 and 97.5 percent quantiles of the bootstrap replicates (the percentile interval):
$$\hat{\mu}_{\text{lower}}(x_0) = Q_{\hat{\mu}^*(x_0)}(0.025) \quad \text{and} \quad \hat{\mu}_{\text{upper}}(x_0) = Q_{\hat{\mu}^*(x_0)}(0.975)$$
x0 = 4.5
mu0.star.hat = apply(beta.boot, 1, function(z, a) {
    sum(z * a)
}, a = c(1, x0 - mean(x)))
[Figure: histogram of mu0.star.hat, the bootstrap replicates of the line at x0 = 4.5, alongside the Sepal.Width vs. Sepal.Length scatterplot with the fitted line]
x.seq = c(4.5, 5.0, 5.5)  # assumed x0 values; the definition was lost, but these reproduce the output below
boot.ci = matrix(0, nrow = length(x.seq), ncol = 2)
for (i in 1:length(x.seq)) {
    y.hat = apply(beta.boot, 1, function(z, a) {
        sum(z * a)
    }, a = c(1, x.seq[i] - mean(x)))
    boot.ci[i, ] = quantile(y.hat, prob = c(0.025, 0.975))
}
round(boot.ci, 2)
## [,1] [,2]
## [1,] 2.88 3.14
## [2,] 3.35 3.49
## [3,] 3.76 4.05
par(mfrow = c(1, 2), mar = 2.5 * c(1, 1, 1, 0.1))

[Figure: two panels of Sepal.Width vs. Sepal.Length showing the fitted line with the bootstrap percentile intervals at the selected x0 values]
• We can add more x0 values and then compare the bootstrap percentile interval to a confidence interval
that uses the assumption of Gaussian errors.
x.seq = seq(min(x), max(x), length.out = 100)
boot.ci = matrix(0, nrow = length(x.seq), ncol = 2)
for (i in 1:length(x.seq)) {
    y.hat = apply(beta.boot, 1, function(z, a) {
        sum(z * a)
    }, a = c(1, x.seq[i] - mean(x)))
    boot.ci[i, ] = quantile(y.hat, prob = c(0.025, 0.975))
}
# ci (the Gaussian-theory interval used below) is not defined in the extracted
# code; a standard construction would be:
ci = predict(lm(y ~ x), newdata = data.frame(x = x.seq), interval = "confidence")
par(mfrow = c(1, 2))
plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5), main = "Bootstrap Confidence Interval")
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(x.seq, boot.ci[, 1], col = 4, lwd = 2)
lines(x.seq, boot.ci[, 2], col = 4, lwd = 2)
plot(iris.s, pch = 19, col = adjustcolor("firebrick", 0.5), main = "Gaussian Confidence Interval")
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0))
lines(x.seq, ci[, 2], col = 3, lwd = 2)
lines(x.seq, ci[, 3], col = 3, lwd = 2)
[Figure: side-by-side comparison of the bootstrap percentile interval and the Gaussian-theory confidence interval for the regression line]
• The confidence interval using the assumption of Gaussian errors and the confidence interval using the bootstrap percentile method match.
• Important question: what does the coverage probability of 95% for a regression line mean?
– note that a proper definition of a regression line is E(Y | X = x) = α + βx
Parametric Bootstrap
• How would we apply the parametric bootstrap in the context of regression?
• The assumed regression model is
$$Y_i = \alpha + \beta\,(x_i - \bar{x}) + R_i, \qquad R_i \overset{\text{i.i.d.}}{\sim} G(0, \sigma)$$
• We fit the model, then generate bootstrap responses by simulating new errors:
$$y_i^* = \hat{\alpha} + \hat{\beta}\,(x_i - \bar{x}) + R_i^*, \qquad R_i^* \sim G(0, \hat{\sigma})$$
Illustration
• Obtain the bootstrap samples using $\hat{\alpha} = 3.428$, $\hat{\beta} = 0.7985$, and $\hat{\sigma} = 0.2565$.
B = 1000
par.boot.sam = Map(function(b) {
    Rstar = rnorm(n, mean = 0, sd = 0.2565)  # simulated errors from G(0, sigma-hat)
    y = 3.428 + 0.7985 * (x - mean(x)) + Rstar
    data.frame(x = x, y = y)
}, 1:B)
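To turn these bootstrap data sets into bootstrap coefficient estimates, each one would be refitted; a sketch (the name beta.boot.par is ours, not from the notes):

# Refit the centred model to each parametric bootstrap data set
beta.boot.par = t(sapply(par.boot.sam, function(d) lm(y ~ I(x - mean(x)), data = d)$coef))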
[Figure: two panels of Sepal.Width vs. Sepal.Length comparing the bootstrap regression lines: "Resampling Pairs: non-parametric" and "Parametric Bootstrap"]
• We notice that the results are very similar in this example.
• The parametric bootstrap motivates another way to re-sample data, i.e. sampling the errors.
• We can use the sample of residuals to estimate F using the empirical cdf.
$$\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} I(\hat{r}_i \le t)$$
• We perform the bootstrap using $\hat{F}$:
– we generate a bootstrap sample of errors $R_i^*$ by resampling from the residuals $\hat{r}_1, \ldots, \hat{r}_n$ and obtain
$$y_i^* = \hat{\alpha} + \hat{\beta}\,(x_i - \bar{x}) + R_i^*$$
Illustration
• The sample of residuals and its empirical cdf:
[Figure: ecdf(R), the empirical CDF of the residuals; y-axis Fn(x)]
• Obtain the bootstrap samples
B = 1000
R = lm(y ~ I(x - mean(x)))$residuals  # residuals from the fitted model (definition lost in extraction)
nonpar.boot.sam = Map(function(b) {
    Rstar = R[sample(n, n, replace = TRUE)]
    y = 3.428 + 0.7985 * (x - mean(x)) + Rstar
    data.frame(x = x, y = y)
}, 1:B)
[Figure: three panels of Sepal.Width vs. Sepal.Length comparing the bootstrap lines from "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors"]
• All three methods agree.
library(robustbase)
data(Animals2)
# Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y = log(Animals2$brain)
n = length(y)
plot(x, y, pch = 19, col = adjustcolor("grey", 0.5))
beta.hat = lm(y ~ I(x - mean(x)))$coef
sd.hat = sqrt(sum(lm(y ~ I(x - mean(x)))$residuals^2)/(n - 2))
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0), col = adjustcolor("firebrick", 1))
[Figure: scatterplot of y = log(brain) vs. x = log(body) for Animals2, with the fitted least-squares line]
• Least-squares confidence intervals:
[Figure: three panels of y vs. x comparing the least-squares confidence bands from "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors"]
library(MASS)  # for rlm
# Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y = log(Animals2$brain)
n = length(y)
plot(x, y, pch = 19, col = adjustcolor("grey", 0.5))
beta.hat = rlm(y ~ I(x - mean(x)), psi = "psi.huber")$coef
abline(coef = beta.hat + c(-beta.hat[2] * mean(x), 0), col = adjustcolor("firebrick", 1))  # presumably drawn, as in the LS version
[Figure: scatterplot of y vs. x with the robust (Huber) regression line]
• Robust Regression Confidence Intervals:
[Figure: three panels of y vs. x comparing the robust-regression confidence bands from "Resampling Pairs: non-parametric", "Parametric Bootstrap", and "Resampling Errors"]
• What do you learn from these plots? Do they all agree?

The bootstrap "can blow the head off any problem if the statistician can stand the resulting mess."
– John Tukey
Summary
• We saw there are three ways to apply the bootstrap in the context of regression:
– resampling pairs (x is random)
– resampling errors (x is fixed)
– parametric bootstrap (x is fixed, has most assumptions)