
HW 9 Bootstrap, Jackknife, and Permutation Tests
STAT 5400
Due: Nov 13, 2023 11:59 PM

Problems
Submit your solutions as an .Rmd file and accompanying .pdf file.

1. Use echo=TRUE so that all code is shown, and use include=FALSE (or results='hide') to suppress unimportant output, so that only the important output is included. Write your homework in the form of a neat report and do not pile up redundant or irrelevant output.
2. Interpret your results whenever necessary. Make sure the interpretation can be understood by readers with a moderate level of statistical knowledge.

Reading assignments.
Here is an undergraduate-level introduction to the bootstrap:
https://statweb.stanford.edu/~tibs/stat315a/Supplements/bootstrap.pdf

Problems
1. Bootstrap and jackknife
Consider the air-conditioning data listed below:

3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487.

Suppose the mean of the underlying distribution is μ and our interest is to estimate log(μ). To estimate it, we use the log of the sample mean, i.e., log(X̄), as an estimator.

a. Carry out a nonparametric bootstrap analysis to estimate the bias of log(X̄).
b. Based on the bootstrap analysis, is the bias of log(X̄) positive or negative? (In other words, does log(X̄) overestimate or underestimate log(μ)?) Can you explain the observation? (Hint: Jensen's inequality.)
c. Also run a nonparametric bootstrap to estimate the standard error of the log of the sample mean. In terms of the mean squared error of the estimator, do you think the bias is large given the standard error?
d. Carry out a parametric bootstrap analysis to estimate the bias of the log of the sample mean. Assume that the population distribution of failure times of air-conditioning equipment is exponential.
e. Plot the histograms of the bootstrap replications from both the nonparametric and parametric bootstrap.
f. Produce 95% confidence intervals by the standard normal, basic, percentile, and BCa methods. (You may need to attend the lecture on Oct 17 for this question.)
g. Use the jackknife to estimate the standard error and bias of the log of the sample mean.

The bias is negative, so log(X̄) underestimates log(μ). This is expected because log is concave, so by Jensen's inequality E[log(X̄)] ≤ log(E[X̄]) = log(μ).
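Part (a) can be sketched with the boot package (a minimal sketch; the seed and the number of replications R = 2000 are arbitrary choices):

```r
library(boot)

# Air-conditioning failure times from the problem statement
x <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)

# Statistic: log of the mean of the resample
log_mean <- function(data, indices) log(mean(data[indices]))

set.seed(5400)
np_boot <- boot(x, statistic = log_mean, R = 2000)

# Bootstrap bias estimate: mean of the replicates minus the observed statistic
np_bias <- mean(np_boot$t) - log(mean(x))
np_bias  # negative, consistent with Jensen's inequality
```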


library(boot)

# Air-conditioning failure times
data <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)

# Parametric bootstrap statistic: simulate a sample from the fitted
# exponential (rate = 1 / sample mean) and take the log of its mean
log_mean_exp <- function(x, indices) {
  log(mean(rexp(length(x), rate = 1 / mean(x))))
}

# Parametric bootstrap
param_bootstrap_results <- boot(data, statistic = log_mean_exp, R = 1000)
# Bias estimate: mean of the bootstrap replicates minus the observed statistic
param_bias_estimate <- mean(param_bootstrap_results$t) - log(mean(data))
param_bias_estimate

## [1] -4.715413

param_bootstrap_results

##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = log_mean_exp, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.026407 -0.006103388 0.3065978

The bias estimate, -4.7, is quite large compared to the standard error of 0.307.

# Plot histograms
par(mfrow = c(1, 2))
hist(bootstrap_results$t, main = "Nonparametric Bootstrap", xlab = "Log Mean")
hist(param_bootstrap_results$t, main = "Parametric Bootstrap", xlab = "Log Mean")
# Function to compute and print a bootstrap confidence interval
calculate_ci <- function(results, method) {
  ci_output <- boot.ci(results, type = method)
  print(ci_output)
  invisible(ci_output)
}

# Confidence intervals for the log of the sample mean

# Statistic: log of the sample mean of the resample
log_mean <- function(x, indices) {
  log(mean(x[indices]))
}

# Nonparametric bootstrap for the log mean
set.seed(5400)
bootstrap_results <- boot(data, statistic = log_mean, R = 1000)

# Standard Normal CI
standard_normal_ci <- calculate_ci(bootstrap_results, "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = method)
##
## Intervals :
## Level Normal
## 95% ( 4.031, 5.464 )
## Calculations and Intervals on Original Scale

# Basic CI
basic_ci <- calculate_ci(bootstrap_results, "basic")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS


## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = method)
##
## Intervals :
## Level Basic
## 95% ( 4.132, 5.548 )
## Calculations and Intervals on Original Scale

# Percentile CI
percentile_ci <- calculate_ci(bootstrap_results, "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS


## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = method)
##
## Intervals :
## Level Percentile
## 95% ( 3.818, 5.234 )
## Calculations and Intervals on Original Scale

# BCa CI
bca_ci <- calculate_ci(bootstrap_results, "bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = method)
##
## Intervals :
## Level BCa
## 95% ( 4.022, 5.391 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

jackknife_se_estimate = 0.2558932
jackknife_bias_estimate = 0.01983011

2. Failure of bootstrap

The bootstrap is not foolproof. To see this, consider analysis of a binomial model with n trials. You observe 0 successes. Discuss what would happen if you were to use the standard, non-parametric bootstrap in constructing a 95% C.I. for the binomial parameter p.
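The failure can be demonstrated directly (a minimal sketch; the sample size n = 20 is a hypothetical choice):

```r
# Observed data: 0 successes in n = 20 Bernoulli trials
obs <- rep(0, 20)

# Every resample drawn from an all-zero sample is itself all zeros,
# so every bootstrap replicate of p-hat is exactly 0
set.seed(5400)
boot_phat <- replicate(1000, mean(sample(obs, replace = TRUE)))
range(boot_phat)  # both endpoints 0: any percentile-style 95% CI collapses to (0, 0)
```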

When you observe 0 successes, every nonparametric bootstrap resample is also all zeros, so every bootstrap replicate of p̂ equals 0 and the resulting 95% confidence interval degenerates to (0, 0). The interval has zero width and completely fails to reflect the uncertainty about p.

3. Bootstrap estimate of the standard error of the trimmed mean

Consider an artificial data set consisting of eight observations:

1, 3, 4.5, 6, 6, 6.9, 13, 19.2.

Let θ̂ be the 25% trimmed mean, which is computed by deleting the two smallest and the two largest numbers and then taking the average of the remaining four numbers.

a. Calculate ŝe_B for B = 25, 100, 200, 500, 1000, 2000. From these results estimate the ideal bootstrap estimate ŝe_∞.
b. Repeat part (a) using twenty different random number seeds. Comment on the trend of the variability of each ŝe_B.
# Given data
data <- c(1, 3, 4.5, 6, 6, 6.9, 13, 19.2)

# Function to calculate the 25% trimmed mean
trimmed_mean <- function(x) {
  sorted_data <- sort(x)
  trimmed_data <- sorted_data[3:6]  # drop the two smallest and two largest values
  mean(trimmed_data)
}

# Function to perform the bootstrap and return the standard error estimate
bootstrap_se <- function(B, seed) {
  set.seed(seed)
  bootstrap_results <- boot(
    data,
    statistic = function(x, indices) trimmed_mean(x[indices]),
    R = B
  )
  sd(bootstrap_results$t)  # standard deviation of the bootstrap replicates
}

# Values of B
B_values <- c(25, 100, 200, 500, 1000, 2000)

# Bootstrap SE for each B (fixed seed for reproducibility)
results <- numeric(length(B_values))
for (i in seq_along(B_values)) {
  results[i] <- bootstrap_se(B_values[i], seed = 5400)
}

# Ideal bootstrap estimate (B = Inf): approximate with a very large B
ideal_bootstrap_estimate <- bootstrap_se(50000, seed = 5400)

# Print results
cat("Bootstrap Standard Errors for Different B values:\n")

## Bootstrap Standard Errors for Different B values:

print(results)

## [1] 2.238965 2.310789 2.132144 2.061035 2.104525 2.152344

cat("\nIdeal Bootstrap Estimate (B = Inf):\n")

##
## Ideal Bootstrap Estimate (B = Inf):
print(ideal_bootstrap_estimate)

The estimated standard errors fluctuate for small B (2.24 and 2.31 at B = 25 and B = 100) and settle near 2.1 for larger B: with more bootstrap samples the estimates stabilize and become more precise, and their variability across repeated runs decreases.

As B increases, the bootstrap procedure therefore provides more consistent and reliable estimates of the standard error. In practice, it is common to choose a sufficiently large B (for standard errors, several hundred to a few thousand replications) to achieve stable results.

4. Permutation distribution of the difference between sample means

Load the chickwts data in R. Get a quick graphical summary using the following code:

> attach(chickwts)
> boxplot(formula(chickwts))

Use a two-sample t-test to see whether the population mean weights for the soybean and linseed diets are the same. Comment on the assumptions of the t-test.

Design permutation tests to answer the previous question.
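The two analyses above can be sketched as follows (a minimal sketch; the seed and the 2000 permutations are arbitrary choices):

```r
# Compare soybean and linseed weights in chickwts
data(chickwts)
x <- chickwts$weight[chickwts$feed == "soybean"]
y <- chickwts$weight[chickwts$feed == "linseed"]

# Classical two-sample t-test (assumes independent samples and approximately
# normal populations, or sample sizes large enough for the CLT)
t_p <- t.test(x, y)$p.value

# Permutation test: reshuffle the pooled observations and recompute the
# difference in means under the null of identical distributions
obs_diff <- mean(x) - mean(y)
pooled <- c(x, y)
set.seed(5400)
perm_diff <- replicate(2000, {
  idx <- sample(length(pooled), length(x))
  mean(pooled[idx]) - mean(pooled[-idx])
})
perm_p <- mean(abs(c(obs_diff, perm_diff)) >= abs(obs_diff))
```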

5. Permutation distribution of the difference between two distributions

Instead of only comparing population means, design permutation tests based on the Kolmogorov-Smirnov (K-S) statistic to check whether the distributions of soybean weights and linseed weights are the same.

Below is the definition of the K-S statistic:

D = sup_{1 ≤ i ≤ n+m} |Fn(zi) − Gm(zi)|,

where z1, …, z(n+m) is the pooled sample, Fn is the empirical CDF of X1, …, Xn, and Gm is the empirical CDF of Y1, …, Ym. You may want to try ks.test in R.

In fact, the K-S statistic is primarily used for univariate distributions. Try an alternative to the K-S statistic: perform permutation tests based on the Cramér-von Mises statistic:

W = (mn / (m + n)^2) [ Σ_{i=1}^{n} (Fn(xi) − Gm(xi))^2 + Σ_{j=1}^{m} (Fn(yj) − Gm(yj))^2 ].
