Data8 Su24 Final
Summer 2024
INSTRUCTIONS
You have 1 hour and 50 minutes to complete the exam.
• The exam is closed book, closed notes, closed computer/calculator, except for the provided reference
sheet.
• Mark your answers on the exam itself in the spaces provided. We will not grade answers written on
scratch paper or outside the designated answer spaces.
• If you need to use the restroom, bring your phone, exam, reference sheet, and student ID to the front
of the room.
For questions with circular bubbles, you should fill in exactly one choice.
⃝ You must choose either this option
⃝ Or this one, but not both!
For questions with square checkboxes, you may fill in multiple choices.
□ You could select this choice.
□ You could select this one too!
**Important**: Please fill in circles and squares to indicate answers and clearly cross out or erase mistakes.
Preliminaries
You can complete these questions before the exam starts.
(iii) Who is sitting to your left? (Write no one if no one is next to you.)
(iv) Who is sitting to your right? (Write no one if no one is next to you.)
Data 8 Summer 2024 Final Exam Initials:
c. (2 points) When bootstrapping, we sample from our original sample without replacement to avoid
sampling the same row multiple times.
⃝ True
⃝ False
d. (2 points) When creating regression lines, minimizing the root mean squared error will always result in
the same line as minimizing the mean squared error, assuming both lines are created using the same
data.
⃝ True
⃝ False
e. (2 points) As a part of evaluating the accuracy of the k-nearest neighbors classifier, each point in the
testing set is classified by finding its k-nearest neighbors in the testing set and picking the majority class
among these neighbors.
training
⃝ True
⃝ False
percentage
f. (2 points) A histogram should be constructed such that the area of each bar is equal to the number of
entries in the bin.
⃝ True
⃝ False
g. (2 points) Joseph has 8 hats: 4 green, 3 blue, and 1 white. Every day of the week, he picks a hat to
wear at random with replacement. The hats are chosen independently of any other day. Which one of
the following is the probability that on two consecutive days, he picks the same color hat?
⃝ (2 ∗ 4/8) + (2 ∗ 3/8) + (2 ∗ 1/8)
⃝ (4/8 ∗ 4/8) + (3/8 ∗ 3/8) + (1/8 ∗ 1/8)
⃝ (4/8 ∗ 4/8) ∗ (3/8 ∗ 3/8) ∗ (1/8 ∗ 1/8)
⃝ (4/8 ∗ 3/8 ∗ 1/8) ∗ (4/8 ∗ 3/8 ∗ 1/8)
⃝ 1 − (4/8 ∗ 3/8 ∗ 1/8)^2
⃝ None of the above
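As a sanity check on the option list above, the chance of matching colors on two given consecutive days can be computed exactly. This is a sketch, not part of the exam; the names `hat_counts` and `p_same_color` are illustrative.

```python
from fractions import Fraction

# Hat counts from the question: 4 green, 3 blue, 1 white (8 hats total).
hat_counts = {"green": 4, "blue": 3, "white": 1}
total = sum(hat_counts.values())

# Draws are independent and with replacement, so
# P(same color on both days) = sum over colors of P(color) * P(color).
p_same_color = sum(Fraction(n, total) ** 2 for n in hat_counts.values())

print(p_same_color)  # 13/32
```

Exact fractions are used so the result can be compared directly against the unsimplified expressions in the choices.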
h. (2 points) Mia has a bag with 10 marbles. Each of the 10 marbles has a unique color and an equal
probability of being chosen. Mia draws from the bag 10 times with replacement and observes 9 draws
of the red marble and 1 draw of the purple marble. After this, Mia believes that each marble does not
actually have an equal probability of being chosen. She wants to run a hypothesis test. Which one of
the following test statistics should she use?
⃝ Absolute difference between the number of red and purple marbles
⃝ Difference between the number of red and purple marbles
⃝ Number of red marbles
⃝ Number of purple marbles
TVD
⃝ Number of marbles
⃝ None of the above
i. (2 points) Ashley has told Data 8 staff that her headshot rate in the game Valorant is 45%. Fiona
wants to test this claim, running a hypothesis test. Her alternative hypothesis is that Ashley has less
than a 45% headshot rate (differences from this in a sample are not due to chance). Ashley plays
one competitive match, and Fiona observes a headshot rate of 35%. Given the test statistic of (sample
headshot rate - expected headshot rate), Fiona claims that larger, more positive values of the test statistic
are in favor of the alternative hypothesis. Is she correct?
⃝ True
⃝ False
j. (2 points) What conditions need to be met in order to invoke the Central Limit Theorem to create a
confidence interval for a population parameter? Select all that apply.
□ The original population is normally distributed
□ The statistic of interest is the mean or sum
□ The data collected comes from a large and random sample with replacement
□ All of the above
□ None of the above
Assume that the crosswords table has one column named Time with the time (in minutes) that it
took to solve each crossword. Also, assume that the function bootstrap_crosswords() will perform
one bootstrap resample on the crosswords table.
make_array()
20000
np.average(resampled_cw.column("Time"))
np.append(cw_times, stat)
percentile(1.5, cw_times)
percentile(98.5, cw_times)
np.array([left, right])
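The fragments above outline a percentile bootstrap for the mean. A minimal sketch of the same idea in plain NumPy follows. The data are made up, the helper name `bootstrap_mean_ci` is hypothetical, and `np.percentile` stands in for the course's `percentile` function, which may treat ties and interpolation differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the crosswords "Time" column (not real data).
times = rng.normal(loc=203, scale=15, size=400)

def bootstrap_mean_ci(sample, level=97, repetitions=2000):
    """Percentile bootstrap interval for the mean of `sample`."""
    tail = (100 - level) / 2  # 1.5 on each side for a 97% interval
    means = np.empty(repetitions)
    for i in range(repetitions):
        # Resample WITH replacement, same size as the original sample.
        resample = rng.choice(sample, size=len(sample), replace=True)
        means[i] = np.mean(resample)
    left = np.percentile(means, tail)
    right = np.percentile(means, 100 - tail)
    return np.array([left, right])

interval = bootstrap_mean_ci(times)
print(interval)
```

The repetition count is kept small here so the sketch runs quickly; a larger count gives a more stable interval.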
c. (2 points) After calling the function, Marissa and Mia generate a 97% confidence interval of [202, 204].
Which one of the following is an appropriate estimate of the probability that the true population mean
is in their interval?
⃝ 0%
⃝ 3%
⃝ 97%
⃝ 98.5%
⃝ 100%
⃝ None of the above
d. (2 points) Which of the following can be concluded from Marissa’s and Mia’s confidence interval
in part (c)? Select all that apply.
□ Marissa’s and Mia’s mean completion time was exactly 203 seconds.
□ The mean completion time in their original sample was exactly 203 seconds.
□ 95% of the completion times in the population are between 202 and 204 seconds.
□ 95% of the completion times in the original sample are between 202 and 204 seconds.
□ If another data scientist independently repeats the bootstrap process 1000 times, exactly 950
of the intervals created will contain the true population mean time.
□ None of the above.
e. (2 points) Marissa and Mia are considering changing their confidence level from 97% to 99% for their
confidence interval calculation. How would this change affect the width of their confidence interval for
the average completion time? Select all that apply.
□ The new confidence interval’s width would increase.
□ The new confidence interval’s width would decrease.
□ The new confidence interval’s width would remain the same.
□ The new confidence interval would contain 202 seconds.
□ The new confidence interval would contain 206 seconds.
□ The effect on the new confidence interval width cannot be determined from the given
information.
f. (2 points) Now, Marissa and Mia want to figure out the proportion of times that the Data 8 staff complete
the mini crossword in under one minute, but only find it practical to take a large, random sample of the
staff’s past times. They want to construct a 95% confidence interval for this proportion with a total
width of only 2 percent. What is the smallest sample size they should use? Please show all of your
work and put a box around your final answer.
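One common way to set up this calculation is sketched below. It assumes a 95% interval spans roughly 2 SDs of the sample proportion on each side of the center, and that the SD of a 0/1 population is at most 0.5. The variable names are illustrative.

```python
from fractions import Fraction
import math

width = Fraction(2, 100)   # desired total width: 2 percentage points
max_sd = Fraction(1, 2)    # the SD of a 0/1 population is at most 0.5

# Total width is about 4 * SD / sqrt(n), so we need
# 4 * SD / sqrt(n) <= width, i.e. sqrt(n) >= 4 * SD / width.
sqrt_n = 4 * max_sd / width
n = math.ceil(sqrt_n ** 2)

print(n)  # 10000
```

Exact fractions avoid floating-point rounding at the boundary before the ceiling is taken.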
• Score: (int) The number of points Conan had in the game. Players earn points in various ways,
including scoring a goal, making a save, taking a shot on net, etc.
• Touches: (int) The number of times Conan’s car touched the ball.
• Boost: (int) The amount of boost Conan’s car used in the game.
• Won: (Boolean) A value indicating whether Conan’s team won (True) or lost (False) the game.
a. (2 points) Conan wants to examine how the distribution of touches varies by whether Conan’s team won
the game. Write one line of code that creates the most appropriate visualization for these data.
rocket. A ( B )
b. (4 points) An important part of the game involves collecting boost that is scattered around the field,
as boost fuels the rocket-powered cars, allowing them to go faster and propel them into the air. Conan
wants to increase the number of ball touches he has in a given game, as touching the ball frequently can
increase offensive pressure, ultimately increasing his chances of winning the game. He suspects that the
amount of boost he uses in a game affects the number of ball touches he makes. In order to visualize his
data, he creates the following scatter plot.
["Touches", "Boost"]
np.std
col_name + "_su"
(arr - avg) / sd
(ii) (3 points) While you help Conan, he creates a function named weird_multiply which takes in a
row object and returns the product of the elements in index 0 and index 1.
Use the function above to calculate the correlation coefficient between Touches and Boost.
columns
B. (1 point) Fill in blank (B)
np.mean
apply(weird_multiply)
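The pieces above compute the correlation coefficient as the mean of the products of the two variables in standard units. A minimal NumPy sketch of that calculation follows. The arrays are made-up values rather than Conan's data, and `standard_units` and `correlation` are illustrative helper names.

```python
import numpy as np

def standard_units(arr):
    """Convert an array to standard units: (value - mean) / SD."""
    return (arr - np.mean(arr)) / np.std(arr)

def correlation(x, y):
    """r is the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up values for illustration only.
boost = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2400.0])
touches = np.array([18.0, 25.0, 24.0, 35.0, 38.0])

r = correlation(boost, touches)
print(r)
```

Because `np.std` uses the population SD by default, the result agrees with `np.corrcoef(boost, touches)[0, 1]` up to rounding.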
d. (8 points) We find the correlation coefficient between Touches and Boost to be approximately 0.705.
We also find that across the 50 games:
• The average number of Touches was 28.54 with a standard deviation of 9.51.
• The average of Boost used was 1773.4 with a standard deviation of 471.7.
(i) (2 points) A correlation coefficient of 0.705 suggests that there must be a linear relationship between
Touches and Boost.
⃝ True
⃝ False
(ii) (2 points) If we fit a regression line to the scatter plot, the sum of the residuals will be A
and the sum of the squared residuals will be B .
⃝ A: zero, B: zero
⃝ A: non-zero, B: zero
⃝ A: zero, B: non-zero
⃝ A: non-zero, B: non-zero
⃝ We do not have enough information to answer this.
(iv) (2 points) The regression line that minimizes the RMSE for predicting Touches from Boost is always
the same as the regression line that minimizes the RMSE for predicting Boost from Touches.
⃝ True
⃝ False
e. (7 points) Conan wants to construct a regression line. However, he refuses to use the regression equations
because he does not believe that they produce a good regression line. Instead, he decides to use his
computer to find a regression line that minimizes the RMSE.
(i) (3 points) First, help Conan define the function rmse that takes in a slope and intercept, and
returns the RMSE between the predictions made by the regression line and the actual y values.
x = rocket.column("Boost")
y = rocket.column("Touches")
def rmse(slope, intercept):
    pred = A
    residuals = B
    squared_resid = residuals ** 2
    return C
slope * x + intercept
pred - y
np.mean(squared_resid) ** 0.5
(ii) (2 points) Write a line of code that assigns best_params to a two-element array containing the
slope and intercept for the regression line that minimizes the RMSE.
best_params = A ( B )
minimize
rmse
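To see the RMSE function and its minimizer working together without the `datascience` library, the sketch below swaps in `np.polyfit` for the course's `minimize`. The least-squares line it returns minimizes mean squared error and therefore RMSE as well. The data are made up for illustration.

```python
import numpy as np

# Made-up data standing in for the Boost (x) and Touches (y) columns.
x = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2400.0])
y = np.array([18.0, 25.0, 24.0, 35.0, 38.0])

def rmse(slope, intercept):
    """Root mean squared error of the line's predictions for y."""
    pred = slope * x + intercept
    residuals = pred - y
    squared_resid = residuals ** 2
    return np.mean(squared_resid) ** 0.5

# Degree-1 polyfit returns the least-squares [slope, intercept].
best_slope, best_intercept = np.polyfit(x, y, 1)
best = rmse(best_slope, best_intercept)

# Nudging either parameter away from the fit should not lower the RMSE.
for d_slope, d_int in [(0.001, 0), (-0.001, 0), (0, 0.5), (0, -0.5)]:
    assert rmse(best_slope + d_slope, best_intercept + d_int) >= best

print(best_slope, best_intercept, best)
```

Since the square root is increasing, the line that minimizes the mean squared error is also the line that minimizes the RMSE.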
(iii) (2 points) What is your best estimate of best_params.item(0)? If you believe you lack the
information needed, write “Not enough information”. You may incorporate any numbers and statistics
introduced in previous subparts. You do not need to simplify your answer.
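One way to reason about this is that the RMSE-minimizing line coincides with the regression line, whose slope is r times the ratio of the SDs (y over x). The sketch below plugs in the summary statistics from part (d), taking Boost as x and Touches as y as in the rmse setup. The variable names are illustrative.

```python
r = 0.705             # correlation between Touches and Boost
mean_touches = 28.54  # average of y (Touches)
sd_touches = 9.51     # SD of y (Touches)
mean_boost = 1773.4   # average of x (Boost)
sd_boost = 471.7      # SD of x (Boost)

# Regression line for predicting y from x.
slope = r * sd_touches / sd_boost
intercept = mean_touches - slope * mean_boost

print(round(slope, 4), round(intercept, 2))
```

The slope is small because Boost is measured on a much larger scale than Touches.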
f. (12 points) Mia has watched Conan prioritize grabbing boost over hitting the ball on multiple occasions.
Therefore, she believes that there is no linear relationship between Boost and Touches. On the other
hand, Conan believes that there is a linear relationship between Boost and Touches. In order to settle
this dispute, they decide to conduct a hypothesis test.
(i) (2 points) Formulate a valid null hypothesis.
(iii) (1 point) Choose all valid test statistics for the hypothesis test. Select all that apply.
(iv) (1 point) Choose all valid simulation methods for the hypothesis test. Select all that apply.
(v) (3 points) Conan and Mia decide to use a p-value cutoff of 10% for their hypothesis test. They
perform the first part of their hypothesis test, and obtain 10,000 simulated statistics stored in an
array called simulated_stats. Fill in the blanks so that the code prints the correct conclusion for
their hypothesis test.
95
(vi) (3 points) After running the code from part (v), Conan and Mia find that left and right values are
0.578 and 0.814, respectively. Select all that apply.
□ Using a p-value cutoff of 10%, Conan and Mia should reject the null hypothesis.
□ Using a p-value cutoff of 5%, Conan and Mia should reject the null hypothesis.
□ There is approximately a 10% chance that the next simulated statistic Conan and Mia
calculate is between 0.578 and 0.814.
□ There is approximately a 90% chance that the next simulated statistic Conan and Mia
calculate is between 0.578 and 0.814.
□ There is a 10% chance that Conan and Mia falsely reject the null hypothesis when it is
actually true.
□ None of the above.
Confronted with a mixed army of Good and Evil Lilas, Cynthia develops a Lila scanner to identify the Evil
clones. If the Lila clone is Good, the scanner will return an accurate result 96% of the time. If the Lila clone
is Evil, the scanner will return an accurate result 93% of the time.
For this section, you may leave any of your answers unsimplified or as mathematical
expressions. Please also put a box around your final answer for each of the questions below.
a. (0 points) SCRATCH WORK: You can use this space to write any extra calculations or diagrams that
may be helpful. Anything written in this box will not be graded. Alternatively, use this space to draw
your interpretation of an Evil Lila (she is your typical small fluffy white dog)!
b. (2 points) What is the probability that 6 Good Lilas are created in a row?
0.83 ^ 6
c. (3 points) If Cynthia scans a Lila clone at random, what is the probability that the scanner says the
Lila clone is Good?
d. (3 points) Suppose the scanner says a Lila clone is Evil. What is the probability that the Lila clone is
actually Evil?
e. (3 points) Aha! Cynthia stumbles upon clone #8, who seems to be asleep in the perfect position to be
scanned. Prior to scanning, the position of the clone #8 leads Cynthia to believe that there’s a 30%
chance the clone is Evil. Upon scanning, the scanner reads “Good”.
Given the information in this question and assuming the conditional probabilities in the problem
statement are still valid, what is Cynthia’s subjective probability that the clone is actually Good?
⃝ 0.30 ∗ 0.07 + 0.70 ∗ 0.96
⃝ 0.70 ∗ 0.04 + 0.30 ∗ 0.93
⃝ (0.83 ∗ 0.96) / (0.83 ∗ 0.96 + 0.17 ∗ 0.07)
⃝ (0.70 ∗ 0.96) / (0.70 ∗ 0.96 + 0.30 ∗ 0.07)
⃝ (0.30 ∗ 0.70) / (0.70 ∗ 0.96 + 0.30 ∗ 0.07)
⃝ None of the above
p(good | pred=good) = p(pred=good | good) * p(good) / p(pred=good) = 0.96 * 0.83 / (0.83 * 0.96 + 0.17 * 0.07)
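For a question of this shape, Bayes' rule can be written as a small function. The sketch below is one reading of part (e): it uses the 30% Evil prior stated there together with the scanner accuracies from the problem statement. The function and parameter names are hypothetical.

```python
def posterior_good(prior_good, p_reads_good_if_good, p_reads_good_if_evil):
    """P(clone is Good | scanner reads Good) by Bayes' rule."""
    numerator = prior_good * p_reads_good_if_good
    denominator = numerator + (1 - prior_good) * p_reads_good_if_evil
    return numerator / denominator

# Prior from part (e): 30% Evil, so 70% Good.
# A Good clone is read correctly 96% of the time.
# An Evil clone is read correctly 93% of the time, so it is misread
# as Good 7% of the time.
p = posterior_good(
    prior_good=0.70,
    p_reads_good_if_good=0.96,
    p_reads_good_if_evil=1 - 0.93,
)

print(round(p, 4))  # 0.9697
```

Changing `prior_good` shows how strongly the answer depends on the prior chosen before scanning.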
Answer the following questions based on this dataset using the k-Nearest Neighbors algorithm.
import numpy as np
from datascience import *
labels
standardized_olympics.column(label)
def distance(row):
return A
np.sum(row ** 2) ** 0.5
take(np.arange(0, k))
sort("count", descending=True)
item(0)
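The fragments above follow the usual k-nearest neighbors recipe: compute distances, keep the k closest training rows, and take the most common label. A self-contained NumPy sketch of that recipe follows. It does not use the `datascience` Table API, and the training data and the `classify` helper are made up for illustration.

```python
import numpy as np
from collections import Counter

def distance(row_diff):
    """Euclidean length of a vector of feature differences."""
    return np.sum(row_diff ** 2) ** 0.5

def classify(train_features, train_labels, point, k):
    """Majority label among the k training rows closest to `point`."""
    dists = np.array([distance(row - point) for row in train_features])
    nearest = np.argsort(dists)[:k]  # indices of the k closest rows
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up, already-standardized training data: 3 rows of "A", 2 of "B".
train_features = np.array(
    [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]]
)
train_labels = np.array(["A", "A", "A", "B", "B"])

print(classify(train_features, train_labels, np.array([0.05, 0.05]), k=3))  # A
print(classify(train_features, train_labels, np.array([2.0, 2.0]), k=3))    # B
```

An odd k is used so that a two-class vote cannot tie.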
(ii) (2 points) When using k-nearest neighbors with the same dataset, a prediction model based on
standard units will always produce the same results as a prediction model based on non-standardized
units.
standard units is better
⃝ True
⃝ False
(iii) (2 points) If we want to build a model to predict 3 classes instead of 2, we can use the same method
of picking k that we discussed in class to avoid ties.
⃝ True
⃝ False
48 * 2 + 1
(ii) (1 point) If n is the total number of rows, w is the number of elements in the majority class, and m
is the number of elements of the minority class (in a binary classification task), what is the general
formula for k that we can pick that will always give us the same label regardless of our input
row? You must use k, a comparison operator (such as >, >=, =, <=, <), and a mathematical
expression in terms of n and/or m and/or w.
m * 2 + 1 <= k <= n
6 Optional [0 points]
a. (0 points) Assumptions
If there was any question on the exam that you thought was ambiguous and required clarification to be
answerable, please identify the question (including the title of the section, e.g., Experiments) and state
your assumptions. Be warned: We only plan to consider this information if we agree that the question
was erroneous or ambiguous and we consider your assumption reasonable. We will only consider
assumptions that are written inside the box below.