Attachment 1
Problem 1.
a. Let X1 and X2 be two independent random variables from N(0.5,1). Define X = (X1 +
X2)/2. What is the distribution of X? Based on the distribution of X, find p and q such
that p = P(X > 2.3) and P(X > q) = 0.176 using appropriate R functions.
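For reference, a minimal sketch of the exact calculation, using the fact that X is normal with mean 0.5 and variance 1/2 (so sd = sqrt(0.5)):
    # X = (X1 + X2)/2 with X1, X2 iid N(0.5, 1), so X ~ N(0.5, 1/2)
    p <- pnorm(2.3, mean = 0.5, sd = sqrt(0.5), lower.tail = FALSE)   # P(X > 2.3)
    q <- qnorm(0.176, mean = 0.5, sd = sqrt(0.5), lower.tail = FALSE) # P(X > q) = 0.176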
b. Following part (a), we seek to approximate p and q using the Monte Carlo method. Run
the R code set.seed(0) first. Generate a 1000 by 2 matrix (call it samp) from N(0.5,1). Then
the row means of samp (call them sampx) form a sample of X of size 1000. Approximate p
and q based on the sample sampx.
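A minimal sketch of this Monte Carlo approximation (the names p.hat and q.hat are illustrative):
    set.seed(0)
    samp  <- matrix(rnorm(1000 * 2, mean = 0.5, sd = 1), nrow = 1000, ncol = 2)
    sampx <- rowMeans(samp)              # sample of X of size 1000
    p.hat <- mean(sampx > 2.3)           # approximates p = P(X > 2.3)
    q.hat <- quantile(sampx, 1 - 0.176)  # approximates q, since P(X > q) = 0.176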
c. Using similar ideas as in part (b), now approximate p = P(X1 + ··· + X10 > 5), where
X1,···,X10 are i.i.d. random variables from Beta(1,2).
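One possible sketch following the same pattern (the sample size of 1000 and the reuse of set.seed(0) are assumptions, not requirements stated above):
    set.seed(0)
    sampb <- matrix(rbeta(1000 * 10, shape1 = 1, shape2 = 2), nrow = 1000, ncol = 10)
    p.hat <- mean(rowSums(sampb) > 5)  # approximates P(X1 + ... + X10 > 5)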
Problem 2. Assume you have a function check.prime that can be used to check whether a
given natural number is a prime. [Modify your own code or one of mine posted in the
solution of homework 3].
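If you need a starting point, here is a minimal trial-division sketch of check.prime (your homework 3 version may differ):
    check.prime <- function(n) {
      if (n < 2) return(FALSE)
      if (n < 4) return(TRUE)          # 2 and 3 are prime
      all(n %% 2:floor(sqrt(n)) != 0)  # no divisor up to sqrt(n)
    }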
a. Based on the function check.prime, create a new function that can count the number
of primes smaller than or equal to the given natural number. Name this new function
count.prime.
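A minimal sketch of count.prime, assuming check.prime behaves as described above:
    count.prime <- function(n) {
      if (n < 2) return(0)
      sum(sapply(2:n, check.prime))  # number of primes in 2, ..., n
    }
    count.prime(10)  # 4, since 2, 3, 5, 7 are the primes <= 10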
c. Plot y versus x. On the same figure, add a curve of z versus x. Distinguish the two curves
with different line types and colors. Add a legend, too. In a separate figure, plot y/z
versus x. Comment on the relationship between y and z when x is large.
Search “Prime number theorem” for a reference to help make your comments.
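A plotting sketch, assuming x, y, and z were defined earlier (e.g., y = count.prime evaluated over a grid x and z = x/log(x), in line with the prime number theorem reference); adjust the names to match your own variables:
    plot(x, y, type = "l", lty = 1, col = "red", xlab = "x", ylab = "count")
    lines(x, z, lty = 2, col = "blue")
    legend("topleft", legend = c("y", "z"), lty = 1:2, col = c("red", "blue"))
    plot(x, y / z, type = "l", xlab = "x", ylab = "y/z")  # separate figure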
Problem 3. Interval-censored data are a special type of survival data. The main feature of
such data is that the failure time T of interest (response variable) is not observed exactly
but is known to fall within some interval (L,R]. For example, HIV status is only detectable by
some laboratory examinations when patients visit clinics. Thus, the HIV infection time is
only known to fall between the last visit time with negative examination result and the first
visit time with positive result. For a specific subject i, if Li = 0, we say Ti is left-censored; if Ri
= ∞, Ti is right-censored; otherwise, Ti is interval-censored. So interval-censored data
actually contains a mixture of left-, interval-, and right-censored observations.
In this problem, we consider a breast-cancer data set, in which the failure time is
defined as the time of the occurrence of breast retraction among early breast cancer
patients. Download the file DataBreastCancer.txt and answer the following questions. The
file contains three columns with observations from 94 patients. The first two columns show
the observed interval for the failure time, measured in months. The third column shows
the group indicator taking 1 or 0 (representing two treatments) for each patient.
a. Read the data into R and save the data as databc. [Pay attention to the missing values].
Make sure that databc is numerical.
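A reading sketch; the header and missing-value code used below are assumptions, so adjust them to the actual file:
    # Assumes whitespace-separated columns with "NA" marking missing values;
    # change header/na.strings if the file differs.
    databc <- read.table("DataBreastCancer.txt", header = FALSE, na.strings = "NA")
    databc <- as.matrix(databc)  # ensure databc is numerical
    str(databc)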
c. All of the missing values in the original file appear in the second column and
actually represent ∞, so the observations with missing values are right-censored
observations. Create a categorical variable named delta, which takes 0, 1, and 2 for
left-, interval-, and right-censored observations, respectively. Calculate the numbers
and percentages of left-, interval-, and right-censored observations in this data set.
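One possible sketch, using the conventions above (L = 0 for left-censored, missing R for right-censored):
    # 0 = left-censored (L = 0), 2 = right-censored (R missing, i.e., infinity),
    # 1 = interval-censored otherwise
    delta <- ifelse(databc[, 1] == 0, 0, ifelse(is.na(databc[, 2]), 2, 1))
    table(delta)                       # counts by censoring type
    100 * table(delta) / nrow(databc)  # percentages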
d. Split the first two columns of databc into two parts (call them data1 and data0) based
on the group indicator (the third column). Find the numbers of observations in data1
and data0, respectively.
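A splitting sketch:
    data1 <- databc[databc[, 3] == 1, 1:2]  # group 1
    data0 <- databc[databc[, 3] == 0, 1:2]  # group 0
    nrow(data1); nrow(data0)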
e. Download the function turnbull.r and use it to estimate the cumulative distribution
function (CDF) of interval-censored data. Read the explanations (sentences following
#) of the input and output. To obtain the CDF estimates for the two groups, use the
following commands: ss1=turnbull(data1); ss0=turnbull(data0). Display ss1 and ss0.
f. Plot the estimated CDFs for both groups with type="s" (plotting as step functions) on
the same figure. Distinguish the two curves by using different line types (solid and
dashed) and colors (red and blue). Also add a legend to indicate the group difference
(“group=1” and “group=0”) for the two curves. Comment on which group shows
better treatment performance, judged by which has the lower CDF curve over time.
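A plotting sketch under the assumption that ss1 and ss0 hold time points in their first column and CDF estimates in their second; check the # comments in turnbull.r for the actual output format:
    plot(ss1[, 1], ss1[, 2], type = "s", lty = 1, col = "red",
         xlab = "time (months)", ylab = "estimated CDF")
    lines(ss0[, 1], ss0[, 2], type = "s", lty = 2, col = "blue")
    legend("bottomright", legend = c("group=1", "group=0"),
           lty = 1:2, col = c("red", "blue"))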
Problem 4. In common practice of statistics, study investigators use 0.05 as the significance
level for hypothesis testing. If the p-value is smaller than 0.05, then one rejects the null
hypothesis; otherwise one fails to reject the null hypothesis. The probability of type 1 error
(i.e., the probability of rejecting the null hypothesis when it is true) is 0.05 in this case.
In many situations, such as gene expression experiments, one tests a number (say k) of
hypotheses simultaneously, H0i vs. H1i, for i = 1,···,k. In such
multiple testing problems, one will have inflated overall type 1 error if using 0.05 for each
individual hypothesis test. To see this point, consider the independent case:
P(reject at least one null hypothesis | all k null hypotheses are true) = 1 − (1 − 0.05)^k,
which becomes large when k is large (e.g., 1 − 0.95^20 ≈ 0.64 for k = 20). Suppose the
p-value for each single hypothesis test is
obtained. A widely used approach in such situations is the Benjamini and Hochberg (1995)
algorithm, which controls the false discovery rate at 0.05. The procedure is as follows: First
label the p-values in ascending order such that p(1) ≤ p(2) ≤ ··· ≤ p(k) and denote by H(i) the null
hypothesis corresponding to p(i). Second define i0 to be the largest i for which p(i) ≤ 0.05i/k.
The decision rule is to reject all H(i) for i = 1,···,i0.
The purpose of this problem is to write a function named bh.adjust using the
Benjamini and Hochberg algorithm. The arguments of this function include a numerical
vector pvs of p-values and a significance level α (set its default value equal to 0.05). The
output will be a logical vector of the same length as pvs, using FALSE (F) to denote
rejecting a specific null hypothesis. For example, (T, F, T) means that the algorithm rejects
the second null hypothesis but fails to reject the first and the third null hypotheses.
Note that the output should be the final decisions for the series of hypotheses in the original
order based on the Benjamini and Hochberg (1995) algorithm.
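A minimal sketch consistent with the specification above (TRUE = fail to reject, FALSE = reject):
    bh.adjust <- function(pvs, alpha = 0.05) {
      k   <- length(pvs)
      ord <- order(pvs)                      # positions of the ascending p-values
      ok  <- sort(pvs) <= alpha * (1:k) / k  # p(i) <= alpha * i / k
      out <- rep(TRUE, k)                    # TRUE = fail to reject
      if (any(ok)) {
        i0 <- max(which(ok))                 # largest i with p(i) <= alpha * i / k
        out[ord[1:i0]] <- FALSE              # reject H(i) for i = 1, ..., i0
      }
      out                                    # decisions in the original order
    }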
Apply your function bh.adjust to each of the following vectors as pvs. Interpret your
output in words regarding which null hypotheses are rejected and which are not rejected.
• pvs=c(0.04, 0.02)