Thomas Watson - Scatterplot practice
Residuals: The error (vertical distance) between a linear model’s PREDICTION and the ACTUAL data point.
Residual = actual − predicted
Residual = y − ŷ
b) The student gets 44 questions correct on his exam. Find, draw, and interpret our model’s prediction
error (residual) for this student.
The University of California report found a positive correlation between family income and SAT scores.
For this reason, the report expressed concern that using the SAT in admissions could disadvantage the
kids who couldn’t afford tutors / test prep.
Discussion Question: To level the playing field, many colleges “consider student backgrounds” when
evaluating their SAT scores.
a) How could a college use this data to consider income when evaluating SAT scores? Hint: Think
about how they could use the residual plot.
They'll see the over/under predictions and could just give out free ACT/SAT
books, because they're literally like 200 dollars for a book, which is like 3 tests.
Sounds like market capitalization.
b) Is the method you proposed in part (a) “fair” to both wealthy and low-income students?
Explain your thinking.
If it's free and distributed to schools to give out equally, then there
should be an increase.
1. “Report of the UC Academic Council Standardized Testing Task Force,” University of California Academic Senate (2020):
https://senate.universityofcalifornia.edu/_files/committees/sttf/sttf-report.pdf
1) An avid bird watcher counted the number of geese on a pond each morning and wondered if the
temperature could be used to predict the number of geese she would see. A linear regression was
performed, resulting in the least-squares regression line ŷ = −9.89 + 0.25x, where x = temperature in
degrees Fahrenheit and y = the number of geese on the pond.
c) On a morning that was 61℉, there were also 4 geese on the pond. Calculate and interpret the
model’s residual for this day, and then draw the residual on the scatterplot.
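As a worked check of part (c), the residual can be computed directly from the regression line stated in the problem. This is a minimal sketch; the function names `predict_geese` and `residual` are made up for illustration.

```python
def predict_geese(temp_f):
    """Predicted number of geese from the LSRL in the problem: y-hat = -9.89 + 0.25x."""
    return -9.89 + 0.25 * temp_f


def residual(actual, predicted):
    """Residual = actual - predicted."""
    return actual - predicted


predicted = predict_geese(61)   # model's prediction at 61 degrees F
res = residual(4, predicted)    # 4 geese were actually observed

print(f"predicted = {predicted:.2f} geese")  # about 5.36
print(f"residual  = {res:.2f} geese")        # about -1.36
```

A negative residual means the actual point sits below the line: the model overpredicted by about 1.36 geese on that morning.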
d) The residual plot for the number of geese vs. temperature data set is given below. Does the linear
model provide a good fit for the data? Why or why not?
2) The size of a home or apartment is often described using square footage of the floor plan. Suppose
we want to predict the monthly rent for an apartment from its square footage. The least squares
regression line for x = square footage and y = monthly rent was calculated to be ŷ = −995.89 + 2.33x
for a sample of small 1-2 bedroom apartments in a city. Using the residual plot below, is this linear
model a good fit for the data? Why or why not?
Which do you think would be more reliable using the LSRL: a prediction using x = 1.5 or a prediction
using x = 8.5? Justify your answer.
x = 1.5, since the residuals there are scattered slightly better and show less of a
pattern than at x = 8.5 in the least-squares regression line.
Colleges use high school GPA and SAT/ACT scores to predict how well applicants will perform in college
classes. Above are the relationships between high school GPA, SAT score, and college GPA from our
dataset. The x-axis on the rightmost graph represents the “composite” - the percent of points earned for
high school GPA (out of 4.0) and SAT score (out of 2400), where GPA & SAT are evenly weighted.
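One plausible reading of the “composite” is the average of the two percentages, so that each counts for half. The sketch below assumes that reading; the worksheet does not give the exact formula, and the function name `composite_percent` and the sample student are hypothetical.

```python
def composite_percent(hs_gpa, sat_score):
    """Evenly weighted composite (assumed formula, not taken from the worksheet)."""
    gpa_fraction = hs_gpa / 4.0      # share of GPA points earned
    sat_fraction = sat_score / 2400  # share of SAT points earned
    return 100 * (gpa_fraction + sat_fraction) / 2


# Hypothetical student: 3.5 GPA and an 1800 SAT score
example = composite_percent(3.5, 1800)
print(f"composite = {example:.2f}%")  # 87.5% and 75% average to 81.25%
```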
a) Use the first LSRL model to predict the first year college GPA of a student who has a 3.5 GPA in high
school. You can visually approximate using the graph (show your thinking by drawing on the graph).
b) Which of the explanatory variables above (high school GPA, SAT score, or composite of GPA/SAT)
looks like the best predictor of college students’ GPAs? Explain your thinking.
Standard deviation of the residuals (s): _________________ error between data points and their LSRL.
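To make the definition concrete, here is a minimal sketch that fits a least-squares line and computes s with the usual formula, the square root of (sum of squared residuals) / (n − 2). The four data points are invented for illustration and are not the college GPA data.

```python
import math


def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_std(xs, ys):
    """Standard deviation of the residuals: sqrt(SSE / (n - 2))."""
    intercept, slope = fit_lsrl(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)
    return math.sqrt(sse / (len(xs) - 2))


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
s = residual_std(xs, ys)
print(f"s = {s:.3f}")  # about 0.592
```

A smaller s means the points sit closer to the line, so predictions from that model are typically off by less.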
Above, the three predictors of college GPA (high school GPA, SAT scores, and a composite of both) are
displayed, along with some residuals. The standard deviation of the residuals is also shown.
2) Which of the predictors above (high school GPA, SAT score, or composite of GPA/SAT) is the best
predictor of college GPA? Why might this explain why some colleges are requiring the SAT again?
r² = 1.00 = 100%
The linear model ______________ explains the data’s pattern.

r² = 0.72 = 72%
The linear model explains _________ of the data’s pattern, but not all of it. There is some error.

Stem for interpreting r²: r²% of the variation in [response variable] can be explained by the linear relationship with [explanatory variable].
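The r² value can be computed as 1 − SSE/SST, where SSE is the sum of squared residuals from the LSRL and SST is the total squared variation of y around its mean. The sketch below uses four invented data points, not the worksheet's graphs, and the function name `r_squared` is hypothetical.

```python
def r_squared(xs, ys):
    """Return r^2 = 1 - SSE/SST for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical data, for illustration only
r2 = r_squared([1, 2, 3, 4], [2, 4, 5, 8])
print(f"r^2 = {r2:.3f}")  # about 0.963

# Points that fall exactly on a line give r^2 = 1.00
print(r_squared([1, 2, 3], [3, 5, 7]))
```

Using the stem, the first result would read: about 96.3% of the variation in y can be explained by the linear relationship with x.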
3) Interpret the r² value for the relationship between high school GPA and college GPA.
4) Does including SAT scores substantially improve the strength of predictions? Justify your answer.
[Figure: scatterplots A, B, and C]
5) For each of the above models (A, B, C), is the circled data value an outlier? Explain.
Graph A: .749 and 10.336; the data is further off the line.
Graph B: .955 and 5.251; the data is nearly perfect and isn't very far off the line.
2) An avid bird watcher counted the number of geese on a pond each morning and wondered if the
temperature could be used to predict the number of geese she would see. A linear regression was
performed using x = temperature (℉) and y = number of geese.