2024spring 340 Final
MC1-3 (/6) MC4-7 (/8) MC8-10 (/6) SA1 (/4) SA2 (/4) SA3 (/4) SA4 (/4) SA5 (/4) Total (/40)
Lecture section:
Rules:
You must show work for all computations (unless otherwise specified) to receive full credit.
You do NOT need to simplify any expressions you write down.
Note that some of the multiple choice questions are choose ONE and some are choose ALL that apply. Please pay attention to
the instructions and select the appropriate number of responses!
Multiple Choice 2pts each
NOTE: For each multiple choice question below, choose ONE means there is EXACTLY one right answer, choose ALL
means there is AT LEAST one right answer.
MC1
Let X̄ and S² be the mean and variance of a sample Xi drawn from some distribution X. The expression θ̂ = X̄/S² would
make a good estimator for the parameter of the distribution if X followed which of the following distributions? Choose ALL
that apply!
a. Geometric
b. Poisson
c. Exponential
d. Binomial (with known n)
e. None of the above
MC2
X1, …, Xn are an independent random sample from a population. As sample size increases, which of the following
statistics will tend to decrease? Choose ALL that apply!
a. S²
b. X(1), the sample minimum
c. SE(X̄)
d. Width of a 95% confidence interval for μ
e. The width of the sample range
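A small simulation can help build intuition for how these statistics behave as n grows. The sketch below is only an illustration under stated assumptions: the population is a hypothetical N(0, 1), the seed is arbitrary, the 95% interval uses the normal critical value 1.96, and the small sample is taken as the first 10 draws of the large one so the two are directly comparable.

```python
import math
import random
import statistics

random.seed(340)  # arbitrary seed, used only for reproducibility
# Hypothetical population: standard normal draws (an assumption, not from the exam)
draws = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def summarize(sample):
    """Return the statistics listed in the answer choices for one sample."""
    n = len(sample)
    s2 = statistics.variance(sample)   # sample variance S^2
    se = math.sqrt(s2 / n)             # estimated standard error of the mean
    return {
        "n": n,
        "S2": s2,
        "min": min(sample),
        "SE": se,
        "ci_width": 2 * 1.96 * se,     # normal-approximation 95% interval width
        "range": max(sample) - min(sample),
    }

small = summarize(draws[:10])       # first 10 observations
large = summarize(draws[:10_000])   # the same observations plus 9,990 more

for row in (small, large):
    print({k: round(v, 4) for k, v in row.items()})
```

Because the large sample contains the small one, its minimum can only stay the same or drop and its range can only stay the same or grow, while S² simply settles near the population variance rather than trending in one direction.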
MC3
Let A, B be events for some random variable X. Suppose that A ⊂ B, in other words A is a subset of B (every outcome in
A is also in B) but A ≠ B. The Venn diagram would look something like this (see diagram to right)
[Venn diagram: region A contained inside region B]
Which of the following is DEFINITELY true? Choose ONE!
MC4
Which of the following is MOST problematic for a multiple regression model with response Y and predictors X1, X2?
Choose ONE!
a. Non-normality in Y
b. Non-normality in X1 or X2
c. Correlation between Y and X1 or between Y and X2
d. Correlation between X1 and X2
e. High variance in X1 or X2
MC5
A regression model is fit to a data set predicting Y from three predictors X1, X2, and X3. The following residual diagnostic
plots are produced afterwards:
According to the plots above, which of the following assumptions of linear regression show evidence of NOT being
satisfied? Choose ALL that apply!
a. Normality of errors
b. Zero-mean of errors
c. Homoscedasticity (constant variance) of errors
d. Independence of errors
e. ALL the assumptions above show evidence of NOT being satisfied
MC6
The following is the output of a multiple linear regression fit. Let α = 0.05 as usual. Which of the following statements are
true? Choose ALL that apply!
MC7
In a logistic regression model, if the fitted probability Ŷi > T, we will predict the classification as 1. If we increase T, what
will DEFINITELY be true? Choose ALL that apply!
Answer each of the following with either Y1, Y2, Y3, Y4, or N/S for EITHER Not enough information to determine OR too
Similar to tell based on the plot alone.
MC9
Which of the following are valid ways of performing model selection if you wish to compare multiple different possible
models? Choose ALL that apply!
MC10
Which of the following techniques AUTOMATICALLY does variable selection for you? Choose ALL that apply!
a. Write out an expression (plug in the necessary numbers, but do NOT simplify) for the probability that the land is
contaminated given a positive test result.
b. Write out an expression for the probability that the land is contaminated given a negative test result.
SA2
On a recent data collection trip you were sampling soil pH in a plot of land that you are considering purchasing. You would
like to estimate the mean soil pH, and model soil pH as a normally distributed random variable. You collect the following
data (summarized in a histogram)
After returning home, you realize the measurement tool was calibrated incorrectly and only recorded a maximum
measurement of 7.3! Any soil pH above 7.3 was simply recorded as 7.3. You would like to still be able to use the data but of
the 205 data measurements there are 15 “7.3”s and you know those are not reliable.
You are still comfortable assuming that (ignoring the calibration error) the rest of the soil pH measurements can be modeled
as being normally distributed, but you have to consider the following questions:
Y_i^Z = (Y_i − Ȳ) / S_Y and X_i^Z = (X_i − X̄) / S_X
This is analogous to applying the z-score formula: we subtract the sample mean and divide by
the sample standard deviation. Note that this is a linear transformation of each variable, and it results in X_i^Z and Y_i^Z both having
mean 0 and standard deviation 1. Then, the standardized regression model is
Y_i^Z = β0 + β1 X_i^Z + ϵ_i
For each of the following, decide if the statement is TRUE or FALSE and explain why. Note: you MUST give sufficient
justification for full credit!
a. β̂0 = 0
b. The standard error of the residuals would stay the same compared to the unstandardized model
c. β̂1 = r_xy
d. R² may decrease due to standardizing the data
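The statements above can be explored numerically by fitting a simple regression before and after standardizing. The following is a minimal sketch, not an official solution: the data are simulated from a hypothetical linear relationship (the intercept 3, slope 2, and noise level 4 are arbitrary assumptions), and the helper names `standardize`, `ols`, and `pearson_r` are made up for this illustration.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed for reproducibility
# Hypothetical data: y depends linearly on x plus normal noise (assumed values)
x = [random.uniform(0, 10) for _ in range(200)]
y = [3.0 + 2.0 * xi + random.gauss(0, 4) for xi in x]

def standardize(v):
    """Subtract the sample mean and divide by the sample standard deviation."""
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(vi - m) / s for vi in v]

def ols(xs, ys):
    """Least-squares fit of ys on xs; returns intercept, slope, R^2, residual SE."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(xs, ys))
    sst = sum((b - my) ** 2 for b in ys)
    return {"b0": b0, "b1": b1, "r2": 1 - sse / sst, "rse": math.sqrt(sse / (n - 2))}

def pearson_r(xs, ys):
    """Sample correlation coefficient r_xy."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

raw = ols(x, y)
std = ols(standardize(x), standardize(y))
r = pearson_r(x, y)

print("raw fit:         ", raw)
print("standardized fit:", std)
print("r_xy:            ", r)
```

Comparing the two printed fits side by side shows which quantities change with the rescaling and which do not; a written justification is still needed for full credit.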
SA4
The following output shows the result of a logistic regression fit for the Titanic passengers dataset, where we fit the
probability of survival for a given passenger on the predictors Fare, Age, Sex, and an interaction term (Fare and Age are
numeric variables, while Sex is a categorical variable that can be “Female” or “Male”)
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.315636   0.319388   0.988  0.32303
## Fare         0.012480   0.002708   4.608 4.06e-06 ***
## Age          0.012747   0.010832   1.177  0.23925
## Sexmale     -1.306316   0.413719  -3.157  0.00159 **
## Age:Sexmale -0.037645   0.013740  -2.740  0.00615 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a. What’s the most statistically significant predictor (not counting the intercept)? Interpret it in context.
b. Write down an expression for a 95% confidence interval for the Sex coefficient and interpret it in context (you can
keep it as a log odds).
c. Holding all else constant, for a female passenger, by what factor would the ODDS of survival change if the
passenger’s age was increased by 10 years?
d. What would you predict to be the PROBABILITY of survival for a 35-year old male passenger who paid $8 for fare?
Note: You do NOT need to simplify any expressions! Also, you may use intermediate variables if you wish!
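The plug-in arithmetic for parts b to d can be sketched in Python using only the estimates printed in the output above. This is a check on the computations under stated assumptions, not an official solution: the variable names are illustrative, and 1.96 is assumed as the normal critical value for a 95% interval.

```python
import math

# Estimates copied from the logistic regression output
coef = {
    "(Intercept)": 0.315636,
    "Fare": 0.012480,
    "Age": 0.012747,
    "Sexmale": -1.306316,
    "Age:Sexmale": -0.037645,
}
se_sexmale = 0.413719  # standard error of the Sexmale coefficient

# b. 95% Wald interval for the Sexmale coefficient, on the log-odds scale
z = 1.96  # assumed normal critical value
ci_sexmale = (coef["Sexmale"] - z * se_sexmale,
              coef["Sexmale"] + z * se_sexmale)

# c. Female passenger: Sexmale = 0, so the interaction term drops out and
#    a 10-year increase in age multiplies the odds by exp(10 * beta_Age)
odds_factor_10yr = math.exp(10 * coef["Age"])

# d. 35-year-old male passenger who paid $8
fare, age, male = 8, 35, 1
log_odds = (coef["(Intercept)"]
            + coef["Fare"] * fare
            + coef["Age"] * age
            + coef["Sexmale"] * male
            + coef["Age:Sexmale"] * age * male)
prob = 1 / (1 + math.exp(-log_odds))  # inverse logit

print("95% CI for Sexmale (log odds):", ci_sexmale)
print("Odds factor, +10 years, female:", odds_factor_10yr)
print("Log odds, 35-year-old male, $8 fare:", log_odds)
print("Predicted survival probability:", prob)
```

Note that for a male passenger the age effect is the sum of the Age and Age:Sexmale coefficients, which is why both appear in the log odds for part d.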
SA5
Suppose k ≤ n (where n is the sample size). For each of the following, decide if the statement is TRUE or FALSE and
explain why. Note: you MUST give sufficient justification for full credit!