
Module 4: Recommended Exercises

TMA4268 Statistical Learning V2025

Sara Martino, Stefanie Muff, Kenneth Aase


Department of Mathematical Sciences, NTNU

February 5, 2025

Problem 1: KNN (Exercise 2.4.7 in ISL textbook, slightly modified)


The table and plot below provide a training data set consisting of seven observations, two predictors, and
one qualitative response variable.
## x1 x2 y
## 1 3 3 A
## 2 2 0 A
## 3 1 1 A
## 4 0 1 B
## 5 -1 0 B
## 6 2 1 B
## 7 1 0 B
[Plot: the seven training observations in the (x1, x2) plane, labeled by class y = A or B.]

We want to use this data set to make a prediction for Y when X1 = 1, X2 = 2 using the K-nearest neighbors
classification method.

a)
Calculate the Euclidean distance between each observation and the test point, X1 = 1, X2 = 2.
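R-hint: a minimal sketch of the distance calculation (variable names such as train and x0 are just suggestions):
# the training data from the table above
train <- data.frame(x1 = c(3, 2, 1, 0, -1, 2, 1),
                    x2 = c(3, 0, 1, 1, 0, 1, 0),
                    y  = c("A", "A", "A", "B", "B", "B", "B"))
x0 <- c(1, 2)  # the test point (X1 = 1, X2 = 2)
# Euclidean distance from each observation to the test point
sqrt((train$x1 - x0[1])^2 + (train$x2 - x0[2])^2)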

b)
Use
$$P(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)$$
to predict the class of Y when K = 1, K = 4 and K = 7.
Why is K = 7 a bad choice?

c)
If the Bayes decision boundary in this problem is highly non-linear, would we expect the best value for K to
be large or small? Why?

Problem 2: Bank notes and LDA (with calculations)


To distinguish between genuine and fake bank notes, measurements of the length and the diagonal of part of the
bank notes have been made. For 1000 bank notes (500 genuine and 500 fake) this gave the following values
for the mean and the covariance matrix (using unbiased estimators), where the first value is the length of the
bank note and the second is the diagonal.
Genuine bank notes:
$$\bar{x}_G = \begin{pmatrix} 214.97 \\ 141.52 \end{pmatrix} \quad \text{and} \quad \hat{\Sigma}_G = \begin{pmatrix} 0.1502 & 0.0055 \\ 0.0055 & 0.1998 \end{pmatrix}$$

Fake bank notes:


$$\bar{x}_F = \begin{pmatrix} 214.82 \\ 139.45 \end{pmatrix} \quad \text{and} \quad \hat{\Sigma}_F = \begin{pmatrix} 0.1240 & 0.0116 \\ 0.0116 & 0.3112 \end{pmatrix}$$

a)
Assume the true covariance matrices for the genuine and the fake bank notes are the same. How would you
estimate the common covariance matrix?
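R-hint: a minimal sketch of the pooled (unbiased) estimate, assuming equal true covariance matrices and using n_G = n_F = 500:
SigmaG <- matrix(c(0.1502, 0.0055, 0.0055, 0.1998), nrow = 2)
SigmaF <- matrix(c(0.1240, 0.0116, 0.0116, 0.3112), nrow = 2)
nG <- 500
nF <- 500
# pooled estimate of the common covariance matrix
SigmaPooled <- ((nG - 1) * SigmaG + (nF - 1) * SigmaF) / (nG + nF - 2)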

b)
Explain the assumptions made to use linear discriminant analysis to classify a new observation as a genuine
or a fake bank note. Write down the classification rule for a new observation (make any assumptions you
need).

c)
Use the method in b) to determine if a bank note with length 214.0 and diagonal 140.4 is genuine or fake.
You can use R to perform the matrix calculations.
R-hints:
# inverse of A
solve(A)
# transpose of vector v
t(v)
# determinant of A
det(A)
# multiply vector and matrix / matrix and matrix
v %*% A
B %*% A
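For example, a sketch of evaluating the linear discriminant scores δ_k(x) = x^T Σ̂^{-1} x̄_k − ½ x̄_k^T Σ̂^{-1} x̄_k + log π_k for the new observation, assuming equal priors (π_G = π_F = 0.5, so the log-prior terms cancel) and reusing SigmaPooled from a):
x0 <- c(214.0, 140.4)
xbarG <- c(214.97, 141.52)
xbarF <- c(214.82, 139.45)
SigmaInv <- solve(SigmaPooled)
deltaG <- t(x0) %*% SigmaInv %*% xbarG - 0.5 * t(xbarG) %*% SigmaInv %*% xbarG
deltaF <- t(x0) %*% SigmaInv %*% xbarF - 0.5 * t(xbarF) %*% SigmaInv %*% xbarF
# classify as genuine if deltaG > deltaF, as fake otherwise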

d)
What is the difference between LDA and QDA? Use the classification rule for QDA to classify the bank
note from c). Do you obtain the same result? You can use R to perform the matrix calculations.
Hint: the following formulas might be useful.
$$A^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$

$$|A| = \det(A) = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$$

Problem 3: Odds (Exercise 4.7.9 in ISL textbook)


This problem is about odds.
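As a reminder, the odds and the probability p of an event are related by
$$\text{odds} = \frac{p}{1 - p} \quad \Longleftrightarrow \quad p = \frac{\text{odds}}{1 + \text{odds}}.$$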

a)
On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in
fact default?

b)
Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds
that she will default?

Problem 4: Logistic regression (Exercise 4.7.6 in ISL textbook)


Suppose we collect data for a group of students in a statistics class with variables x1 = hours studied, x2
= undergrad grade point average (GPA), and Y = I(student gets an A). We fit a logistic regression and
produce the estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.
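Recall that under this model the estimated probability of getting an A is
$$\hat{p}(x_1, x_2) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2}}.$$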

a)
Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A
in the class.

b)
How many hours would the student in part a) need to study to have an estimated 50% probability of getting
an A in the class?

Problem 5: Sensitivity, specificity, ROC and AUC


We have a two-class problem, with classes 0 = non-disease and 1 = disease, and a method p(x) that produces
the probability of disease as a function of a covariate x. In a population we have investigated n individuals, and
we know the predicted probability of disease p(x) and the true disease status for each of these n individuals.

a)
We choose the rule p(x) > 0.5 to classify an individual as diseased. Define the sensitivity and the specificity of the test.

b)
Explain how you can construct a receiver operating characteristic (ROC) curve for your setting, and why that is a
useful thing to do. In particular, why do we want to investigate different cut-offs for the probability of disease?

c)
Assume that we have a competing method q(x) that also produces probability of disease for a covariate x.
We get the information that the AUC of the p(x)-method is 0.6 and the AUC of the q(x)-method is 0.7.
What is the definition and interpretation of the AUC? Would you prefer the p(x) or the q(x) method for
classification?

Data analysis with R


For the following problems, you should check out and learn how to use the following R functions: glm()
(stats package), lda() and qda() (MASS package), knn() (class package), and roc() and auc() (pROC package).

Problem 6 (Exercise 4.7.10 in ISL textbook - modified)


This question should be answered using the Weekly data set, which is part of the ISLR package. This data is
similar to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years,
from the beginning of 1990 to the end of 2010.

a)
Produce numerical and graphical summaries of the Weekly data. Do there appear to be any patterns? R-hint:
Load the data as follows:
data("Weekly")

b)
Use the full data set to perform a logistic regression with Direction as the response and the five lag
variables plus Volume as predictors. Use the summary() function to print the results. Which of the predictors
appears to be associated with Direction? R-hints: You should use the glm() function with the argument
family="binomial" to make a logistic regression model.

c)
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix
is telling you about the types of mistakes made by your logistic regression model. R-hints: insert the name
of your model for yourGlmModel in the code below to get the predicted probabilities for “Up”, the classified
direction and the confusion matrix.
glm.probs_Weekly <- predict(yourGlmModel, type = "response")
glm.preds_Weekly <- ifelse(glm.probs_Weekly > 0.5, "Up", "Down")
table(glm.preds_Weekly, Weekly$Direction)

d)
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only
predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data
(that is, the data from 2009 and 2010).
R-hints: use the following code to divide into test and train set. For predicting the direction of the test set,
use newdata = Weekly_test in the predict() function.

Weekly_trainID <- (Weekly$Year < 2009)
Weekly_train <- Weekly[Weekly_trainID, ]
Weekly_test <- Weekly[!Weekly_trainID, ]
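For example, a sketch of the fit and its evaluation on the held-out data (names are suggestions):
glm.Weekly2 <- glm(Direction ~ Lag2, data = Weekly_train, family = "binomial")
glm.probs_test <- predict(glm.Weekly2, newdata = Weekly_test, type = "response")
glm.preds_test <- ifelse(glm.probs_test > 0.5, "Up", "Down")
table(glm.preds_test, Weekly_test$Direction)
mean(glm.preds_test == Weekly_test$Direction)  # overall fraction correct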

e)
Repeat d) using LDA.

f)
Repeat d) using QDA.
R-hints: plug in your variables in the following code to perform lda (and similarly for qda).
library(MASS)
lda.Weekly <- lda(Response ~ pred1, data = YourTrainData)
lda.Weekly_pred <- predict(lda.Weekly, newdata = YourTestData)$class
lda.Weekly_prob <- predict(lda.Weekly, newdata = YourTestData)$posterior
table(lda.Weekly_pred, YourTestData$Direction)
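For f), the QDA version is analogous (a sketch with the same placeholder names):
qda.Weekly <- qda(Response ~ pred1, data = YourTrainData)
qda.Weekly_pred <- predict(qda.Weekly, newdata = YourTestData)$class
table(qda.Weekly_pred, YourTestData$Direction)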

g)
Repeat d) using KNN with K = 1.
R-hints: plug in your variables in the following code to perform KNN. The argument prob = TRUE will provide
the probabilities for the classified direction (which you will need later). When there are ties (the same number of
Up and Down among the nearest neighbors), the knn function picks a class at random. We use the set.seed()
function so that we get the same answer each time we run the code.
library(class)
knn.train <- as.matrix(YourTrainData$Lag2)
knn.test <- as.matrix(YourTestData$Lag2)

set.seed(123)
yourKNNmodel <- knn(train = knn.train,
                    test = knn.test,
                    cl = YourTrainData$Direction,
                    k = YourValueOfK,
                    prob = TRUE)
table(yourKNNmodel, YourTestData$Direction)

h)
Use the following code to find the best value of K. Report the confusion matrix and overall fraction of correct
predictions for this value of K.
# KNN error for K = 1, ..., 30:
library(ggplot2)

K <- 30
knn.error <- rep(NA, K)

set.seed(234)
for (k in 1:K) {
  knn.pred <- knn(train = knn.train,
                  test = knn.test,
                  cl = Weekly_train$Direction,
                  k = k)
  knn.error[k] <- mean(knn.pred != Weekly_test$Direction)
}
knn.error.df <- data.frame(k = 1:K, error = knn.error)
ggplot(knn.error.df, aes(x = k, y = error)) +
  geom_point(col = "blue") +
  geom_line(linetype = "dotted")
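The best value of K can then be read off the error vector, for example:
best.k <- which.min(knn.error)  # K with the smallest test error
best.k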

i)
Which of these methods appears to provide the best results on this data?

j)
Plot the ROC curves and calculate the AUC for the four methods (using your best choice of K for KNN).
What can you say about the fit of these models?
R-hints:
• For KNN you can use knn(..., prob = TRUE) to get the probability for the classified direction. Note
that we want P(Direction = Up) when plotting the ROC curve, so we need to modify the probabilities
returned from the knn function.
# get the probabilities for the classified class
yourKNNProbs <- attributes(yourKNNmodel)$prob

# since we want the probability for Up, we take 1 - p for the elements
# that give the probability for Down
down <- which(yourKNNmodel == "Down")
yourKNNProbs[down] <- 1 - yourKNNProbs[down]

• Use the following code to produce ROC curves:
# install.packages("pROC")
# install.packages("plotROC")
library(pROC)
library(plotROC)

yourRoc <- roc(response = Weekly_test$Direction,
               predictor = yourModelsPredictedProb,
               direction = "<")
# you can use this function for all your methods and plot them using plot(yourRoc)

# or use ggplot2
dat <- data.frame(Direction = Weekly_test$Direction,
                  glm = yourGlmProbs,
                  lda = yourLDAProbs[, 2],
                  qda = yourQDAProbs[, 2],
                  knn = yourKNNProbs)
dat_long <- melt_roc(dat, "Direction", c("glm", "lda", "qda", "knn"))
ggplot(dat_long, aes(d = D, m = M, color = name)) +
  geom_roc(n.cuts = FALSE) +
  xlab("1-Specificity") +
  ylab("Sensitivity")
# glm is very similar to lda, so the ROC curve for glm is not shown.

# AUC:
yourAUC <- auc(yourRoc)
