February 5, 2025
[Figure: the training observations plotted in the (x1, x2)-plane, labelled by class (A or B) of the response y.]
We want to use this data set to make a prediction for Y when X1 = 1, X2 = 2 using the K-nearest neighbors
classification method.
a)
Calculate the Euclidean distance between each observation and the test point, X1 = 1, X2 = 2.
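A minimal R sketch of this calculation, assuming the training observations from the table are stored in a hypothetical data frame train_df with columns x1 and x2:
x0 <- c(1, 2)  # the test point (X1 = 1, X2 = 2)
# Euclidean distance from each training observation to the test point
dists <- sqrt((train_df$x1 - x0[1])^2 + (train_df$x2 - x0[2])^2)
dists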
b)
Use $P(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)$ to predict the class of Y when K = 1, K = 4 and K = 7.
Why is K = 7 a bad choice?
c)
If the Bayes decision boundary in this problem is highly non-linear, would we expect the best value for K to
be large or small? Why?
a)
Assume that the true covariance matrices for the genuine and the fake bank notes are the same. How would you estimate the common covariance matrix?
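A hedged R sketch of the pooled estimate, assuming the two predictor columns for each group are stored in hypothetical data frames genuine and fake:
n_g <- nrow(genuine)
n_f <- nrow(fake)
S_g <- cov(genuine)  # sample covariance matrix of the genuine notes
S_f <- cov(fake)     # sample covariance matrix of the fake notes
# pooled (common) covariance estimate
Sigma_hat <- ((n_g - 1) * S_g + (n_f - 1) * S_f) / (n_g + n_f - 2)
Sigma_hat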
b)
Explain the assumptions made to use linear discriminant analysis to classify a new observation to be a genuine
or a fake bank note. Write down the classification rule for a new observation (make any assumptions you
need to make).
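For reference, one standard form of the rule: under the assumption that each class is multivariate normal with its own mean $\mu_k$, a common covariance matrix $\Sigma$, and prior probability $\pi_k$, a new observation $x$ is assigned to the class with the largest discriminant score
$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k .$$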
c)
Use the method in b) to determine if a bank note with length 214.0 and diagonal 140.4 is genuine or fake.
You can use R to perform the matrix calculations.
R-hints:
# inv(A)
solve(A)
# transpose of vector
t(v)
# determinant of A
det(A)
# multiply vector and matrix / matrix and matrix
v %*% A
B %*% A
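A hedged sketch of the matrix calculations, assuming the estimated class means (mu_g, mu_f), priors (pi_g, pi_f) and the pooled covariance Sigma_hat from a) and b) are already available (hypothetical object names):
x0 <- c(214.0, 140.4)          # new bank note (length, diagonal)
Sigma_inv <- solve(Sigma_hat)  # inverse of the common covariance matrix
# discriminant scores for the genuine (g) and fake (f) classes
delta_g <- t(x0) %*% Sigma_inv %*% mu_g - 0.5 * t(mu_g) %*% Sigma_inv %*% mu_g + log(pi_g)
delta_f <- t(x0) %*% Sigma_inv %*% mu_f - 0.5 * t(mu_f) %*% Sigma_inv %*% mu_f + log(pi_f)
delta_g > delta_f              # TRUE: classify as genuine; FALSE: classify as fake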
d)
What is the difference between LDA and QDA? Use the classification rule for QDA to determine the bank
note from c). Do you obtain the same result? You can use R to perform the matrix calculations.
Hint: the following formulas might be useful.
$$A^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
$$|A| = \det(A) = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$$
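For comparison with the LDA rule in b), one standard form of the QDA discriminant (each class now has its own covariance matrix $\Sigma_k$) is
$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) + \log \pi_k ,$$
and the new observation is again assigned to the class with the largest $\delta_k(x)$.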
a)
On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in
fact default?
b)
Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds
that she will default?
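Both parts use the definitional relationship between odds and probability:
$$\text{odds} = \frac{p}{1 - p} \qquad\Longleftrightarrow\qquad p = \frac{\text{odds}}{1 + \text{odds}} .$$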
a)
Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A
in the class.
b)
How many hours would the student in part a) need to study to have an estimated 50% probability of getting
an A in the class?
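Both parts use the fitted logistic regression model from the problem statement; with estimated coefficients $\hat\beta_0$, $\hat\beta_1$ (hours studied, $x_1$) and $\hat\beta_2$ (undergrad GPA, $x_2$), the estimated probability is
$$\hat{p}(x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2}},$$
and $\hat{p}(x) = 0.5$ exactly when $\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 = 0$.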
a)
We use the rule p(x) > 0.5 to classify an observation as diseased. Define the sensitivity and the specificity of the test.
b)
Explain how you can construct a receiver operating characteristic (ROC) curve for your setting, and why that is a useful
thing to do. In particular, why do we want to investigate different cut-offs of the probability of disease?
c)
Assume that we have a competing method q(x) that also produces a probability of disease for a covariate x.
We get the information that the AUC of the p(x)-method is 0.6 and the AUC of the q(x)-method is 0.7.
What is the definition and interpretation of the AUC? Would you prefer the p(x) or the q(x) method for
classification?
a)
Produce numerical and graphical summaries of the Weekly data. Do there appear to be any patterns? R-hint:
Load the data as follows:
data("Weekly")
b)
Use the full data set to perform a logistic regression with Direction as the response and the five lag
variables plus Volume as predictors. Use the summary() function to print the results. Which of the predictors
appears to be associated with Direction? R-hints: You should use the glm() function with the argument
family="binomial" to make a logistic regression model.
c)
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix
is telling you about the types of mistakes made by your logistic regression model. R-hints: insert the name
of your model for yourGlmModel in the code below to get the predicted probabilities for “Up”, the classified
direction and the confusion matrix.
glm.probs_Weekly <- predict(yourGlmModel, type = "response")
glm.preds_Weekly <- ifelse(glm.probs_Weekly > 0.5, "Up", "Down")
table(glm.preds_Weekly, Weekly$Direction)
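The overall fraction of correct predictions can then be computed from the same objects:
mean(glm.preds_Weekly == Weekly$Direction)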
d)
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only
predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data
(that is, the data from 2009 and 2010).
R-hints: use the following code to divide into test and train set. For predicting the direction of the test set,
use newdata = Weekly_test in the predict() function.
Weekly_trainID <- (Weekly$Year < 2009)
Weekly_train <- Weekly[Weekly_trainID, ]
Weekly_test <- Weekly[!Weekly_trainID, ]
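A hedged sketch of the rest of d); glm.Weekly2, glm.probs_test and glm.preds_test are hypothetical names in the style of the hints above:
glm.Weekly2 <- glm(Direction ~ Lag2, data = Weekly_train, family = "binomial")
glm.probs_test <- predict(glm.Weekly2, newdata = Weekly_test, type = "response")
glm.preds_test <- ifelse(glm.probs_test > 0.5, "Up", "Down")
table(glm.preds_test, Weekly_test$Direction)
mean(glm.preds_test == Weekly_test$Direction)  # overall fraction correct on 2009-2010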
e)
Repeat d) using LDA.
f)
Repeat d) using QDA.
R-hints: plug in your variables in the following code to perform lda (and similarly for qda).
library(MASS)
lda.Weekly <- lda(Response ~ pred1, data = yourTrainData)
lda.Weekly_pred <- predict(yourModel, newdata = YourTestData)$class
lda.Weekly_prob <- predict(yourModel, newdata = YourTestData)$posterior
table(lda.Weekly_pred, YourTestData$Direction)
g)
Repeat d) using KNN with K = 1.
R-hints: plug in your variables in the following code to perform KNN. The argument prob=TRUE will provide
the probabilities for the classified direction (which you will need later). When there are ties (the same number of Up and Down votes among the nearest neighbors), the knn function picks a class at random. We use the set.seed() function so that we get the same answer each time we run the code.
library(class)
knn.train <- as.matrix(YourTrainData$Lag2)
knn.test <- as.matrix(YourTestData$Lag2)
set.seed(123)
yourKNNmodel <- knn(train = knn.train,
test = knn.test,
cl = YourTrainData$Direction,
k = YourValueOfK,
prob = TRUE)
table(yourKNNmodel, YourTestData$Direction)
h)
Use the following code to find the best value of K. Report the confusion matrix and overall fraction of correct
predictions for this value of K.
# knn error:
library(ggplot2)  # needed for the ggplot call below
K <- 30
knn.error <- rep(NA, K)
set.seed(234)
for (k in 1:K) {
knn.pred <- knn(train = knn.train,
test = knn.test,
cl = Weekly_train$Direction,
k = k)
knn.error[k] <- mean(knn.pred != Weekly_test$Direction)
}
knn.error.df <- data.frame(k = 1:K, error = knn.error)
ggplot(knn.error.df, aes(x = k, y = error)) +
geom_point(col = "blue") +
geom_line(linetype = "dotted")
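A hedged sketch for reporting the results at the best K (ties in the error curve are ignored here; which.min picks the first minimum):
best.k <- which.min(knn.error)
set.seed(123)
knn.best <- knn(train = knn.train, test = knn.test,
                cl = Weekly_train$Direction, k = best.k, prob = TRUE)
table(knn.best, Weekly_test$Direction)
mean(knn.best == Weekly_test$Direction)  # overall fraction correct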
i)
Which of these methods appear to provide the best results on this data?
j)
Plot the ROC curves and calculate the AUC for the four methods (using your best choice of K for KNN).
What can you say about the fit of these models?
R-hints:
• For KNN you can use knn(...,prob=TRUE) to get the probability for the classified direction. Note
that we want P(Direction = Up) when plotting the ROC curve, so we need to modify the probabilities
returned from the knn function.
#get the probabilities for the classified class
yourKNNProbs <- attributes(yourKNNmodel)$prob
# since we want the probability for Up, we need to take 1-p for the elements
# that gives probability for Down
down <- which(yourKNNmodel == "Down")
yourKNNProbs[down] <- 1 - yourKNNProbs[down]
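• One possible way (among several) to obtain the ROC curves and AUC is via the pROC package; glm.probs_test and yourKNNProbs are the test-set probabilities of "Up" from the sketches above, and the LDA/QDA posterior columns for "Up" can be handled in the same way.
library(pROC)
roc.glm <- roc(response = Weekly_test$Direction, predictor = glm.probs_test)
roc.knn <- roc(response = Weekly_test$Direction, predictor = yourKNNProbs)
plot(roc.glm)
plot(roc.knn, add = TRUE, col = "red")
auc(roc.glm)
auc(roc.knn)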