2024spring 340 Final

The document outlines the structure and rules for the SP24 STAT340 Final exam, including sections for multiple choice and short answer questions. It covers various statistical concepts such as regression analysis, probability, and model selection. Students are required to show work for computations and adhere to specific instructions for answering questions.

SP24 STAT340 Final

MC1-3 (/6) MC4-7 (/8) MC8-10 (/6) SA1 (/4) SA2 (/4) SA3 (/4) SA4 (/4) SA5 (/4) Total (/40)

First (given) name:

Write here: ______________________________________________________________________

Last (family) name:

Write here: ______________________________________________________________________

Lecture section:

Circle one: Bi’s section Brian’s section Yongyi’s section

Rules:
You must show work for all computations (unless otherwise specified) to receive full credit.
You do NOT need to simplify any expressions you write down.
Note that some multiple choice questions are choose ONE and some are choose ALL that apply. Please pay attention to
the instructions and select the appropriate number of responses!
Multiple Choice 2pts each
NOTE: For each multiple choice question below, choose ONE means there is EXACTLY one right answer, choose ALL
means there is AT LEAST one right answer.

MC1
Let $\bar{X}$ and $S^2$ be the mean and variance of a sample $X_i$ drawn from some distribution $X$. The expression $\hat{\theta} = \bar{X}/S^2$ would
make a good estimator for the parameter of the distribution if $X$ followed which of the following distributions? Choose ALL
that apply!

a. Geometric
b. Poisson
c. Exponential
d. Binomial (with known n)
e. None of the above

MC2
$X_1, \ldots, X_n$ are an independent random sample from a population. As sample size increases, which of the following
statistics will tend to decrease? Choose ALL that apply!

a. $S^2$
b. $X_{(1)}$, the sample minimum
c. $SE(\bar{X})$
d. Width of a 95% confidence interval for μ
e. The width of the sample range

MC3
Let A, B be events for some random variable X. Suppose that A ⊂ B, in other words A is a subset of B (every outcome in
A is also in B) but A ≠ B. The Venn diagram would look something like this (see diagram to right):

[Venn diagram: region A drawn inside region B]

Which of the following is DEFINITELY true? Choose ONE!

a. P(A | B) > P(B | A)
b. P(B | A) > P(A | B)
c. P(A | B) = P(A)
d. P(A | B) = P(B)
e. None of the above

MC4
Which of the following is MOST problematic for a multiple regression model with response $Y$ and predictors $X_1$, $X_2$?
Choose ONE!

a. Non-normality in $Y$
b. Non-normality in $X_1$ or $X_2$
c. Correlation between $Y$ and $X_1$ or between $Y$ and $X_2$
d. Correlation between $X_1$ and $X_2$
e. High variance in $X_1$ or $X_2$
MC5
A regression model is fit to a data set predicting $Y$ from three predictors $X_1$, $X_2$ and $X_3$. The following residual diagnostic
plots are produced afterwards:

According to the plots above, which of the following assumptions of linear regression show evidence of NOT being
satisfied? Choose ALL that apply!

a. Normality of errors
b. Zero-mean of errors
c. Homoscedasticity (constant variance) of errors
d. Independence of errors
e. ALL the assumptions above show evidence of NOT being satisfied

MC6
The following is the output of a multiple linear regression fit. Let α = 0.05 as usual. Which of the following statements are
true? Choose ALL that apply!

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    6.532      3.273   1.996   0.0499 *
## x1             2.030      1.090   1.862   0.0668 .
## x2             1.722      1.124   1.532   0.1300
##
## Residual standard error: 10.24 on 69 degrees of freedom
## Multiple R-squared: 0.08399, Adjusted R-squared: 0.05744
## F-statistic: 3.163 on 2 and 69 DF, p-value: 0.04848

a. The slope of predictor x1, $\beta_1$, is significantly different from 0
b. The slope of predictor x2, $\beta_2$, is significantly different from 0
c. The intercept $\beta_0$ is significantly different from 0
d. The overall model is significantly better than a null model
e. None of the above

MC7
In a logistic regression model, if the fitted probability $\hat{Y}_i > T$ we will predict the classification as 1. If we increase $T$, what
will DEFINITELY be true? Choose ALL that apply!

a. Type I error rate will increase
b. Type II error rate will increase
c. Power of the test will increase
d. AIC of the model will increase
e. $R^2$ will increase
MC8
The following figure shows 4 different binary-response datasets $(X_1, Y_1)$, $(X_2, Y_2)$, $(X_3, Y_3)$, $(X_4, Y_4)$ with the same sample
size, fitted with 4 different logistic regressions.

Answer each of the following with either Y1, Y2, Y3, Y4, or N/S for EITHER Not enough information to determine OR too
Similar to tell based on the plot alone.

a. Which fit shows the most significant relationship? _______
b. Which fit shows the least significant relationship? _______
c. Which fit gives the maximum value for $\hat{\beta}_1$? _______
d. Which fit gives the maximum value for $\hat{\beta}_0$? _______

MC9
Which of the following are valid ways of performing model selection if you wish to compare multiple different possible
models? Choose ALL that apply!

a. Try to minimize the RSS (i.e. the loss function)
b. Try to maximize the $R^2$
c. Try to minimize the RSE (i.e. the residual standard error)
d. Try to maximize the correlation coefficient
e. Try to minimize the AIC (i.e. the Akaike information criterion)

MC10
Which of the following techniques AUTOMATICALLY does variable selection for you? Choose ALL that apply!

a. Stepwise model fitting
b. K-fold cross validation
c. Ridge regression
d. LASSO regression
e. None of the above
Short Answer 4pts each
SA1
A simple soil test can reveal the presence of lead. Land is considered contaminated when the lead concentration exceeds 100 ppm. We assume that the land is not
contaminated unless the soil test shows evidence of lead. The soil test has a power of 90% and a Type I error rate of 5%.
Suppose that in this region of Wisconsin there is a 1% chance that land is contaminated with lead.

a. Write out an expression (plug in the necessary numbers, but do NOT simplify) for the probability that the land is
contaminated given a positive test result.
b. Write out an expression for the probability that the land is contaminated given a negative test result.
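
For readers reviewing this question, the setup can be sketched with Bayes' rule. This is an illustrative sketch and not an official answer key. It assumes "power" means P(positive | contaminated), "Type I error rate" means P(positive | not contaminated), and the 1% figure is the prior probability of contamination. The function name `posterior_contaminated` is hypothetical.

```python
def posterior_contaminated(prior, power, type1_rate, positive):
    """P(contaminated | test result) via Bayes' rule.

    prior      : P(contaminated)
    power      : P(positive | contaminated)
    type1_rate : P(positive | not contaminated)
    positive   : True for a positive test result, False for a negative one
    """
    if positive:
        numerator = power * prior
        denominator = power * prior + type1_rate * (1 - prior)
    else:
        numerator = (1 - power) * prior
        denominator = (1 - power) * prior + (1 - type1_rate) * (1 - prior)
    return numerator / denominator


# Numbers taken from the question text.
p_given_positive = posterior_contaminated(0.01, 0.90, 0.05, positive=True)
p_given_negative = posterior_contaminated(0.01, 0.90, 0.05, positive=False)
print(p_given_positive, p_given_negative)
```

Because contamination is rare, the posterior after a positive result stays well below the test's power, which is the usual base-rate effect.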
SA2
On a recent data collection trip you were sampling soil pH in a plot of land that you are considering purchasing. You would
like to estimate the mean soil pH, and model soil pH as a normally distributed random variable. You collect the following
data (summarized in a histogram)

After returning home, you realize the measurement tool was calibrated incorrectly and only recorded a maximum
measurement of 7.3! Any soil pH above 7.3 was simply recorded as 7.3. You would like to still be able to use the data but of
the 205 data measurements there are 15 “7.3”s and you know those are not reliable.

You are still comfortable assuming that (ignoring the calibration error) the rest of the soil pH measurements can be modeled
as being normally distributed, but you have to consider the following questions:

a. What would be the problem with using X̄ to estimate μ?
b. What other point estimator could you use for μ? Justify your answer.
c. What would be the problem with using s, the sample standard deviation, to estimate σ?
d. What other estimator could be used to estimate σ? Hint: You can use the Empirical Rule to find at least one, but there
are other acceptable responses.
SA3
Consider a simple linear regression of two variables $Y_i$ and $X_i$. Suppose you decide to standardize your data before fitting
the linear model:

$$Y_i^Z = \frac{Y_i - \bar{Y}}{S_Y} \quad \text{and} \quad X_i^Z = \frac{X_i - \bar{X}}{S_X}$$

This is analogous to applying the z-score formula, where we simply subtract the sample mean and divide by
the sample standard deviation. Note that this is a linear transformation of each variable, and results in $X_i^Z$ and $Y_i^Z$ both having
mean 0 and standard deviation 1. Then, the standardized regression model is

$$Y_i^Z = \beta_0 + \beta_1 X_i^Z + \epsilon_i$$

For each of the following, decide if the statement is TRUE or FALSE and explain why. Note: you MUST give sufficient
justification for full credit!

a. $\hat{\beta}_0 = 0$
b. The standard error of the residuals would stay the same compared to the unstandardized model
c. $\hat{\beta}_1 = r_{xy}$
d. $R^2$ may decrease due to standardizing data
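
The statements above can be explored numerically. The following is a minimal sketch, not an official solution: it simulates one data set (the seed, sample size, and true coefficients are arbitrary assumptions), fits the least-squares line before and after standardizing, and compares the intercept, slope, residual standard error, and R². The helper names `fit_summary`, `zscore`, and `correlation` are hypothetical.

```python
import math
import random
import statistics as st


def fit_summary(x, y):
    """Least-squares fit of y on x: (intercept, slope, residual std. error, R^2)."""
    xbar, ybar = st.fmean(x), st.fmean(y)
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    tss = sum((b - ybar) ** 2 for b in y)
    rse = math.sqrt(rss / (len(x) - 2))
    return intercept, slope, rse, 1 - rss / tss


def zscore(v):
    """Subtract the sample mean and divide by the sample standard deviation."""
    m, s = st.fmean(v), st.stdev(v)
    return [(a - m) / s for a in v]


def correlation(x, y):
    """Sample Pearson correlation r_xy."""
    xbar, ybar = st.fmean(x), st.fmean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(340)  # arbitrary seed, for reproducibility only
x = [random.gauss(10, 3) for _ in range(200)]
y = [2 + 0.5 * a + random.gauss(0, 2) for a in x]

raw = fit_summary(x, y)
std = fit_summary(zscore(x), zscore(y))
r_xy = correlation(x, y)

print("intercept (standardized):", std[0])
print("slope (standardized) vs r_xy:", std[1], r_xy)
print("RSE raw vs standardized:", raw[2], std[2])
print("R^2 raw vs standardized:", raw[3], std[3])
```

A single simulation only illustrates the pattern; the exam still asks for a justification from the formulas for the least-squares estimates.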
SA4
The following output shows the result of a logistic regression fit for the Titanic passengers dataset, where we model the
probability of survival for a given passenger using the predictors Fare, Age, Sex, and an Age-by-Sex interaction term (Fare and Age are
numeric variables, while Sex is a categorical variable that can be “Female” or “Male”).

## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.315636   0.319388   0.988  0.32303
## Fare         0.012480   0.002708   4.608 4.06e-06 ***
## Age          0.012747   0.010832   1.177  0.23925
## Sexmale     -1.306316   0.413719  -3.157  0.00159 **
## Age:Sexmale -0.037645   0.013740  -2.740  0.00615 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a. What’s the most statistically significant predictor (not counting the intercept)? Interpret it in context.
b. Write down an expression for a 95% confidence interval for the Sex coefficient and interpret it in context (you can
keep it as a log odds).
c. Holding all else constant, for a female passenger, by what factor would the ODDS of survival change if the
passenger’s age was increased by 10 years?
d. What would you predict to be the PROBABILITY of survival for a 35-year-old male passenger who paid a fare of $8?
Note: You do NOT need to simplify any expressions! Also, you may use intermediate variables if you wish!
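
The arithmetic behind parts (b) through (d) can be sketched as follows. This is an illustrative sketch rather than an official solution. It assumes a Wald interval with the normal critical value 1.96, and it assumes the model's linear predictor is the sum of the listed terms with Sexmale coded 1 for male and 0 for female. The function name `log_odds` is hypothetical.

```python
import math

# Coefficients copied from the logistic regression output above.
coef = {
    "(Intercept)": 0.315636,
    "Fare": 0.012480,
    "Age": 0.012747,
    "Sexmale": -1.306316,
    "Age:Sexmale": -0.037645,
}
se_sexmale = 0.413719

# (b) Wald 95% interval for the Sexmale coefficient on the log-odds scale.
ci_sexmale = (
    coef["Sexmale"] - 1.96 * se_sexmale,
    coef["Sexmale"] + 1.96 * se_sexmale,
)

# (c) For a female passenger the interaction term is zero, so ten extra
# years of age multiply the odds by exp(10 * Age coefficient).
odds_factor_female_10y = math.exp(10 * coef["Age"])


def log_odds(fare, age, male):
    """Linear predictor (log odds of survival) for one passenger."""
    m = 1 if male else 0
    return (
        coef["(Intercept)"]
        + coef["Fare"] * fare
        + coef["Age"] * age
        + coef["Sexmale"] * m
        + coef["Age:Sexmale"] * age * m
    )


# (d) Fitted probability for a 35-year-old male with Fare = 8.
eta = log_odds(fare=8, age=35, male=True)
prob_survival = 1 / (1 + math.exp(-eta))

print(ci_sexmale, odds_factor_female_10y, eta, prob_survival)
```

Because the model includes an Age:Sexmale interaction, the Sexmale coefficient on its own describes the male versus female difference in log odds at Age = 0, which matters for the interpretation asked for in part (b).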
SA5
Suppose k ≤ n (where n is the sample size). For each of the following, decide if the statement is TRUE or FALSE and
explain why. Note: you MUST give sufficient justification for full credit!

a. The goal of cross-validation is to estimate the training error.
b. Repeated application of leave-one-out cross-validation will always produce the same estimate of the error.
c. For any k, repeated application of k-fold cross-validation will always produce the same estimate of the error.
d. Leave-one-out cross-validation is equivalent to k-fold cross-validation when k is equal to the total number of
observations.
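
The behavior of repeated cross-validation can be explored with a small simulation. This is a hedged sketch, not an official solution: it uses a deliberately simple mean-only model on simulated data (both are assumptions made for illustration), and it assumes folds are formed by randomly shuffling the observations. The helper names `random_folds` and `cv_mse` are hypothetical.

```python
import random


def random_folds(n, k, rng):
    """Randomly partition the indices 0..n-1 into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mse(y, folds):
    """Cross-validated mean squared error of a mean-only model."""
    total = 0.0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [v for i, v in enumerate(y) if i not in held_out]
        prediction = sum(train) / len(train)  # fit on the training folds only
        total += sum((y[i] - prediction) ** 2 for i in test_idx)
    return total / len(y)


data_rng = random.Random(340)  # arbitrary seed for the simulated data
y = [data_rng.gauss(0, 1) for _ in range(20)]
n = len(y)

# Repeat each procedure three times with different random shuffles.
loocv_runs = [cv_mse(y, random_folds(n, n, random.Random(s))) for s in (1, 2, 3)]
kfold_runs = [cv_mse(y, random_folds(n, 5, random.Random(s))) for s in (1, 2, 3)]

print("LOOCV (k = n):", loocv_runs)
print("5-fold:       ", kfold_runs)
```

With k = n every fold is a single observation, so the shuffle only changes the order in which folds are visited. With k = 5 the shuffle changes which observations share a fold.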
