Thomas Watson - Scatterplot practice

Uploaded by

thomas.watson455

Name: _________________

AP Statistics Handout: Lesson 3.3


Topics: residuals & residual plots

Lesson 3.3 Guided Notes


Residuals

Residuals: The error (vertical distance) between a linear model’s PREDICTION and the ACTUAL data point.

Residual = actual − predicted
Residual = y − ŷ

A linear regression was performed on students’ attendance and their test scores, resulting in the linear equation: ŷ = −7.69 + 0.57x, where ŷ = predicted raw score and x = percent of school days attended.
a) A new student comes to the school. If his attendance
rate is 80%, what is his predicted test score? Show your
work below and by drawing on the graph.

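ŷ = −7.69 + 0.57(80) = 37.91, so the predicted score is about 38. (Because x is a percent, use 80, not 0.80.)

b) The student gets 44 questions correct on his exam. Find, draw, and interpret our model’s prediction error (residual) for this student.

Residual = 44 − 37.91 = 6.09. The student scored about 6 points higher than the model predicted.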
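
The arithmetic in parts (a) and (b) can be checked with a short sketch. The function names `predict_score` and `residual` are illustrative and not part of the handout; the only inputs are the handout's equation and the student's values (80% attendance, 44 questions correct).

```python
def predict_score(attendance_percent):
    """Predicted raw score from the handout's line: y-hat = -7.69 + 0.57x."""
    return -7.69 + 0.57 * attendance_percent


def residual(actual, predicted):
    """Residual = actual - predicted."""
    return actual - predicted


# x is measured in percent, so 80% attendance is entered as 80, not 0.80.
predicted = predict_score(80)
error = residual(44, predicted)

print(f"predicted score = {predicted:.2f}")  # 37.91
print(f"residual        = {error:.2f}")      # 6.09
```

A positive residual means the actual value sits above the line, so the model underpredicted for this student.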

Residual Plots

[Figure: scatterplot (y-values vs. x-values) and its residual plot (residual values vs. x-values)]

Does this linear model provide a good fit for this data? Justify why or why not.

Yes, because the residual plot shows no pattern; the residuals are scattered randomly.


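[Figure: scatterplot (y-values vs. x-values) and its residual plot (residual values vs. x-values)]

Does this linear model provide a good fit for this data? Justify why or why not.

No, the residual plot resembles an upside-down quadratic (a curved pattern), so a linear model is not a good fit.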
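
One way to see why a curved residual plot signals a poor linear fit is to fit a line to data that are curved on purpose. The sketch below uses made-up data (not the worksheet's data) and a hypothetical helper, `fit_line`, that applies the standard least-squares formulas.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y-hat = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that follow an upside-down parabola, not a line.
xs = list(range(11))
ys = [25 - (x - 5) ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Print the sign of each residual from left to right.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # --+++++++--
```

The residuals are negative, then positive, then negative again. That run of same-sign residuals is the pattern a residual plot makes visible; random scatter would mix the signs.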

Dataset – University of California Admissions

Student     High School GPA   SAT Score   Composite (HSGPA & SAT)   Family Income   First Year College GPA
Student 1   3.12              1240        64.8%                     $45,696         1.85
Student 2   3.24              1460        70.9%                     $115,754        2.84
Student 3   3.66              1670        80.5%                     $48,209         3.27
Student 4   3.43              1860        81.6%                     $63,582         2.61
Student 5   3.35              1890        81.2%                     $33,641         3.41

Note: This is simulated data that closely matches key summary statistics from UC testing task force report’s latest years (2015 & 2016). Full data file, including citations and summary, is available on the lesson page: skewthescript.org/3-3.

• In 2020, one of the world’s largest school systems, the University of California, released a study¹ of standardized tests and admission in their schools. The report gave the public a rare look at admissions statistics, which are normally kept private.

• Today, we’ll analyze a simulated version of their full dataset, with n = 1,000 students. Simulation is required, since the raw data is still private. However, the simulated data matches the key summary statistics from the report.

Lesson 3.3 (Day 1) Discussion

[Figure: scatterplot and residual plot]

The University of California report found a positive correlation between family income and SAT scores.
For this reason, the report expressed concern that using the SAT in admissions could disadvantage the
kids who couldn’t afford tutors / test prep.

Discussion Question: To level the playing field, many colleges “consider student backgrounds” when
evaluating their SAT scores.

a) How could a college use this data to consider income when evaluating SAT scores? Hint: Think
about how they could use the residual plot.

They'll see the over- and under-predictions and could just give out free ACT/SAT prep books, because a book costs about $200, which is roughly the cost of 3 tests. It sounds like market capitalization.

b) Is the method you proposed in part (a) “fair” to both wealthy and low-income students?
Explain your thinking.
If the books are free and distributed equally to schools, then there should be an increase.

1. “Report of the UC Academic Council Standardized Testing Task Force,” University of California Academic Senate (2020):
https://senate.universityofcalifornia.edu/_files/committees/sttf/sttf-report.pdf

Lesson 3.3 (Day 1) Practice

1) An avid bird watcher counted the number of geese on a pond each morning and wondered if the
temperature could be used to predict the number of geese she would see. A linear regression was
performed resulting in the least-squares regression line of ŷ = −9.89 + 0.25x, where x = temperature in
degrees Fahrenheit and y = the number of geese on the pond.

a) When it is 55℉ in the morning, how many geese are predicted to be on the pond?

ŷ = −9.89 + 0.25(55) = 3.86, or about 4 geese

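b) On a morning that was 55℉, there were actually 4 geese on the pond. Calculate and interpret the residual for this day, and then draw the residual on the scatterplot.

Residual = 4 − 3.86 = 0.14. There were about 0.14 more geese than the model predicted, so the model slightly underpredicted.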

c) On a morning that was 61℉, there were also 4 geese on the pond. Calculate and interpret the
model’s residual for this day, and then draw the residual on the scatterplot.

Predicted: ŷ = −9.89 + 0.25(61) = 5.36 geese. Residual = 4 − 5.36 = −1.36, meaning the model overpredicted by about 1.36 geese.

d) The residual plot for the number of geese vs. temperature data set is given below. Does the linear
model provide a good fit for the data? Why or why not?

Yes, because the residual plot doesn't show a pattern, so a linear model is appropriate.

2) The size of a home or apartment is often described using square footage of the floor plan. Suppose
we want to predict the monthly rent for an apartment from its square footage. The least squares
regression line for x = square footage and y = monthly rent was calculated to be ŷ = −995.89 + 2.33x
for a sample of small 1-2 bedroom apartments in a city. Using the residual plot below, is this linear
model a good fit for the data? Why or why not?

No. The residual plot has a pattern, which indicates a non-linear relationship, so the linear model is not a good fit.
3) Linear regression was performed on two quantitative variables resulting in the LSRL of
ŷ = −4.28 + 7.77x and the residual plot shown below.

Which do you think would be more reliable using the LSRL: a prediction using x = 1.5 or a prediction
using x = 8.5? Justify your answer.

A prediction using x = 1.5, since the residuals near x = 1.5 show slightly better scatter and less of a pattern than the residuals near x = 8.5.

AP Statistics Handout: Lesson 3.3 (Day 2)


Topics: standard deviation of residuals (s), coefficient of determination (r²), outliers

Lesson 3.3 Guided Notes


After testing centers closed in 2020 due to the Covid-19 pandemic, many colleges dropped their SAT /
ACT testing requirements. Now, many colleges continue to list these tests as optional for applicants. But,
slowly, more and more colleges have started requiring these tests again. Let’s explore why.

Dataset – University of California Admissions

Student     High School GPA   SAT Score   Composite (HSGPA & SAT)   Family Income   First Year College GPA
Student 1   3.12              1240        64.8%                     $45,696         1.85
Student 2   3.24              1460        70.9%                     $115,754        2.84
Student 3   3.66              1670        80.5%                     $48,209         3.27
Student 4   3.43              1860        81.6%                     $63,582         2.61
Student 5   3.35              1890        81.2%                     $33,641         3.41

Note: This is simulated data that closely matches key summary statistics from UC testing task force report’s latest years (2015 & 2016). Full data file, including citations and summary, is available on the lesson page: skewthescript.org/3-3.

• In 2020, one of the world’s largest school systems, the University of California, released a study¹ of standardized tests and admission in their schools. The report gave the public a rare look at admissions statistics, which are normally kept private.

• Today, we’ll analyze a simulated version of their full dataset, with n = 1,000 students. Simulation is required, since the raw data is still private. However, the simulated data matches the key summary statistics from the report.

Colleges use high school GPA and SAT/ACT scores to predict how well applicants will perform in college
classes. Above are the relationships between high school GPA, SAT score, and college GPA from our
dataset. The x-axis on the rightmost graph represents the “composite” - the percent of points earned for
high school GPA (out of 4.0) and SAT score (out of 2400), where GPA & SAT are evenly weighted.
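
The composite described above can be sketched as a simple average of two shares: GPA out of 4.0 and SAT out of 2400. Reading "evenly weighted" as a plain average is an assumption, and the function name `composite_percent` is illustrative. The five rows are the sample students from the dataset table.

```python
def composite_percent(hs_gpa, sat_score):
    """Evenly weighted percent of points earned: GPA out of 4.0, SAT out of 2400."""
    gpa_share = hs_gpa / 4.0
    sat_share = sat_score / 2400
    return 100 * (gpa_share + sat_share) / 2


# (high school GPA, SAT score, composite % listed in the handout's table)
students = [
    (3.12, 1240, 64.8),
    (3.24, 1460, 70.9),
    (3.66, 1670, 80.5),
    (3.43, 1860, 81.6),
    (3.35, 1890, 81.2),
]

for number, (gpa, sat, listed) in enumerate(students, start=1):
    computed = composite_percent(gpa, sat)
    print(f"Student {number}: computed {computed:.2f}%, listed {listed}%")
```

Under this assumption, the computed values agree with the listed composites to within rounding.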

a) Use the first LSRL model to predict the first year college GPA of a student who has a 3.5 GPA in high
school. You can visually approximate using the graph (show your thinking by drawing on the graph).

b) Which of the explanatory variables above (high school GPA, SAT score, or composite of GPA/SAT)
looks like the best predictor of college students’ GPAs? Explain your thinking.

The composite, because its data points seem to sit closer to the regression line.

1. “Report of the UC Academic Council Standardized Testing Task Force,” University of California Academic Senate (2020):
https://senate.universityofcalifornia.edu/_files/committees/sttf/sttf-report.pdf

Standard Deviation of the Residuals (s)

Standard deviation of the residuals (s): _________________ error between data points and their LSRL.

s: The typical residual length

Stronger correlation → s: _________________
Weaker correlation → s: _________________

Stem for interpreting s: When using the LSRL with [explanatory variable] to predict [response variable], we will typically be off by about [value of s] [units of the response variable (y)].

1) The standard deviation of the residuals for the LSRL between attendance and test scores is s = 1.99. Interpret this value.
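
To make "typical residual length" concrete, here is a minimal sketch of how s is usually computed: the square root of the sum of squared residuals divided by n − 2. The five data points are made up for illustration (they are not the attendance data), and the helper names are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares intercept and slope for y-hat = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_std_dev(xs, ys):
    """s = sqrt(sum of squared residuals / (n - 2))."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)
    return sqrt(sse / (len(xs) - 2))


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s = residual_std_dev(xs, ys)
print(f"s = {s:.3f}")  # s = 0.894
```

For this made-up data, predictions from the line are typically off by about 0.89 units of y.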

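[Figure: three scatterplots predicting college GPA, with some residuals drawn. High school GPA: s = 0.563 | SAT score: s = 0.531 | Composite: s = 0.518]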

Above, the three predictors of college GPA (high school GPA, SAT scores, and a composite of both) are
displayed, along with some residuals. The standard deviation of the residuals is also shown.

2) Which of the predictors above (high school GPA, SAT score, or composite of GPA/SAT) is the best
predictor of college GPA? Why might this explain why some colleges are requiring the SAT again?

The coefficient of determination (r²)

[Figure: seven scatterplots with different correlations. Graphic inspired by mathisfun.com]

r = 1 | r = 0.91 | r = 0.48 | r = 0 | r = −0.48 | r = −0.91 | r = −1
r² = 1 | r² = _____ | r² = _____ | r² = _____ | r² = _____ | r² = _____ | r² = _____

r² close to 0 → _______ correlation | r² close to 1 → _______ correlation

r² = 1.00 = 100%
The linear model ______________ explains the data’s pattern.

r² = 0.72 = 72%
The linear model explains _________ of the data’s pattern, but not all of it. There is some error.

Stem for interpreting r²: r²% of the variation in [response variable] can be explained by the linear relationship with [explanatory variable].
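
The "percent of variation explained" reading of r² can be checked numerically. The sketch below uses made-up data (not the UC dataset) and computes r² two ways: by squaring the correlation r, and as 1 − (sum of squared residuals) / (total variation in y). For a least-squares line the two agree.

```python
from math import sqrt

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in y
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Correlation coefficient.
r = sxy / sqrt(sxx * syy)

# Least-squares line and its leftover (unexplained) variation.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Fraction of the variation in y that the line explains.
r_squared = 1 - sse / syy

print(f"r = {r:.3f}, r^2 = {r_squared:.2f}")  # r = 0.775, r^2 = 0.60
```

Here 60% of the variation in y is explained by the linear relationship with x; the remaining 40% is error the line does not capture.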

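[Figure: the three scatterplots predicting college GPA. High school GPA: r² = 0.13 | SAT score: r² = 0.22 | Composite: r² = 0.26]

3) Interpret the r² value for the relationship between high school GPA and college GPA.

4) Does including SAT scores substantially improve the strength of predictions? Justify your answer.

[Figure: three scatterplots labeled A, B, and C, each with one circled data value]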

The effect of outliers

5) For each of the above models (A, B, C), is the circled data value an outlier? Explain.

6) Are these measures (r, r², and s) resistant to outliers? How can you tell?

No, because they change greatly when the outlier is included.

No Outlier:   r = 0.79 | r² = 0.62 | s = 1.31
With Outlier: r = 0.61 | r² = 0.37 | s = 1.85
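
The same effect can be reproduced on a small scale. This sketch uses made-up points (not the data behind the graphs above) and a hypothetical helper, `summarize`, to compare r, r², and s before and after adding one point that falls far from the trend.

```python
from math import sqrt


def summarize(points):
    """Return (r, r_squared, s) for the least-squares line through the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    r = sxy / sqrt(sxx * syy)
    s = sqrt(sse / (n - 2))
    return r, r ** 2, s


# Made-up, nearly linear data, plus one point far below the trend.
clean = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
         (5, 10.1), (6, 12.2), (7, 13.8), (8, 16.1)]
with_outlier = clean + [(9, 2.0)]

for label, data in [("No outlier", clean), ("With outlier", with_outlier)]:
    r, r2, s = summarize(data)
    print(f"{label:12s} r = {r:.2f} | r^2 = {r2:.2f} | s = {s:.2f}")
```

A single added point lowers r and r² and raises s, which is what "not resistant" means for these measures.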

Lesson 3.3 (Day 2) Discussion

The UC report found that high school GPA used to be almost as strong as the SAT in predicting college GPA. See the graph at left for an older (2007) cohort of applicants.

Discussion: Give one possible reason that high school GPA became a weaker predictor of college GPA over time.
Hint: Think about the shape of the data along the x-axis.

The changes in the grading scale

Note: This is simulated data, matching summary statistics from the University of California report.

Lesson 3.3 (Day 2) Practice


1) Two scatterplots and their corresponding least-squares regression lines are shown. One graph (A or B)
has 𝑟 = 0.955 and the other has 𝑟 = 0.749. One graph (A or B) has 𝑠 = 5.251 and the other has 𝑠 =
10.336. Match each graph with its respective 𝑟 and s values. Justify your answers.

Graph A Graph B

Graph A: r = 0.749 and s = 10.336, because the data are farther off the line.
Graph B: r = 0.955 and s = 5.251, because the data are nearly perfectly linear and aren't very far off the line.

2) An avid bird watcher counted the number of geese on a pond each morning and wondered if the
temperature could be used to predict the number of geese she would see. A linear regression was
performed using x = temperature (℉) and y = number of geese.

a) The resulting value of 𝑟 was 0.85. Interpret this value in context.


r = 0.85 indicates a strong, positive linear relationship between temperature and the number of geese on the pond.

b) The resulting value of 𝑠 was 0.82. Interpret this value in context.

When using the LSRL with temperature to predict the number of geese, we will typically be off by about 0.82 geese.
