IPS7e_LecturePPT_ch02
IPS Chapter 2
2.1: Scatterplots
2.2: Correlation
2.3: Least‐Squares Regression
2.4: Cautions About Correlation and Regression
2.5: Data Analysis for Two‐Way Tables
2.6: The Question of Causation
© 2012 W.H. Freeman and Company
Looking at Data–Relationships
2.1 Scatterplots
Scatterplots
Explanatory and response variables
Interpreting scatterplots
Outliers
Categorical variables in scatterplots
Scatterplot smoothers
Examining Relationships
Most statistical studies involve more than one variable.
Questions:
Student  Beers  BAC
1        5      0.1
2        2      0.03
3        9      0.19
6        7      0.095
7        3      0.07
9        3      0.02
11       4      0.07
13       5      0.085
4        8      0.12
5        3      0.04
8        5      0.06
10       5      0.05
12       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05
Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the
relationship by examining the form, direction, and strength of the
association. We look for an overall pattern …
Form: linear, curved, clusters, no pattern
Nonlinear
Positive association: High values of one variable tend to occur together
with high values of the other variable.
Using an inappropriate scale for a scatterplot can give an incorrect impression.
In a scatterplot, outliers are points that fall outside of the overall pattern
of the relationship.
Not an outlier:
Outliers
Comparison of income (quantitative response variable) for different education levels (five categories).
When both variables are quantitative, the order of the data points is defined
entirely by their value. This is not true for categorical data.
Scatterplot smoothers
When an association is more complex than linear, we can still describe
the overall pattern by smoothing the scatterplot.
You can simply average the y values separately for each x value.
When a data set does not have many y values for a given x, software
smoothers form an overall pattern by looking at the y values for points in
the neighborhood of each x value. Smoothers are resistant to outliers.
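The simplest smoother described above, averaging the y values separately for each x value, can be sketched in Python. This is only an illustration, not what any particular software smoother does. It assumes the three-column data listed earlier in this section are student, number of beers, and blood alcohol content; the function name `average_y_by_x` is made up for the example.

```python
from collections import defaultdict
from statistics import mean

# (number of beers, blood alcohol content) pairs, assumed column meaning
data = [(5, 0.10), (2, 0.03), (9, 0.19), (7, 0.095), (3, 0.07), (3, 0.02),
        (4, 0.07), (5, 0.085), (8, 0.12), (3, 0.04), (5, 0.06), (5, 0.05),
        (6, 0.10), (7, 0.09), (1, 0.01), (4, 0.05)]

def average_y_by_x(pairs):
    """Return {x: mean of the y values observed at that x}, ordered by x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in sorted(groups.items())}

smoothed = average_y_by_x(data)
for x, y_bar in smoothed.items():
    print(f"x = {x}: mean y = {y_bar:.4f}")
```

Connecting the averaged points from left to right gives the smoothed pattern. With few y values per x, as here, each average rests on very little data, which is why software smoothers pool neighboring x values instead.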
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$
[Figure: the same data plotted twice with the axes swapped; r = -0.75 in each plot.]
"Time to swim" is the explanatory variable here, and belongs on the x axis.
However, in either plot r is the same (r=-0.75).
"r" has no unit
r = -0.75
Changing the units of variables does not change the correlation coefficient "r", because we get rid of all our units when we standardize (get z-scores).
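$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$
The first factor is the z-score for time and the second is the z-score for pulse.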
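The two properties stated above, that r is the same whichever variable goes on the x axis and that changing units leaves r unchanged, can be checked numerically. This is a minimal sketch of the z-score formula. The swim-time and pulse data are not reproduced in these notes, so it reuses the beers and blood alcohol values listed earlier (assumed column meaning), and the unit conversions are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of (z-score of x) * (z-score of y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n-1)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

r = correlation(beers, bac)
r_swapped = correlation(bac, beers)  # swap the roles of x and y

# Hypothetical unit changes: beers -> ounces (12 per beer), BAC -> per mille
r_rescaled = correlation([12 * b for b in beers], [10 * y for y in bac])

print(abs(r - r_swapped) < 1e-9)   # same r with axes swapped
print(abs(r - r_rescaled) < 1e-9)  # same r after changing units
```

Standardizing divides each deviation by a standard deviation in the same units, so any positive rescaling cancels out of every z-score.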
Estimate r.
(in 1000’s)
age_man = age_woman + 2 (equation for a straight line)
Thought quiz on correlation
Regression lines
Correlation and r2
Transforming relationships
Correlation tells us about strength (scatter) and direction of the linear relationship between two quantitative variables.
[Figure: scatterplot of blood alcohol content (response, or dependent, variable, y) against number of beers (x).]
Do calories explain sodium amounts?
$\hat{y} = b_0 + b_1 x$
Once we know b1, the slope, we can calculate b0, the y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
Not all calculators and software use the same convention. Some use:
$\hat{y} = a + bx$
And some use:
$\hat{y} = ax + b$
Make sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.
Software output
[Figure: annotated software output with the intercept, slope, r, and R2 marked.]
The equation completely describes the regression line.
To plot the regression line you only need to plug two x values into the
equation, get y, and draw the line that goes through those points.
Hint: The regression line always passes through the mean of x and y.
Regression examines the distance of all points from the line in the y
direction only.
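The points above (the line is fully described by its equation, two x values are enough to draw it, and it passes through the mean of x and y) can be illustrated with a short least-squares fit. This is a sketch using the standard formulas for slope and intercept, applied to the beers and blood alcohol values listed earlier (assumed column meaning). The function name `least_squares` is made up for the example.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (b0, b1) for the line minimizing squared vertical distances."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # y-intercept
    return b0, b1

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

b0, b1 = least_squares(beers, bac)
x_bar, y_bar = mean(beers), mean(bac)

# The fitted line passes through (x_bar, y_bar), up to rounding error.
print(abs((b0 + b1 * x_bar) - y_bar) < 1e-12)

# Two x values are enough to draw the line.
for x in (1, 9):
    print(f"x = {x}: y-hat = {b0 + b1 * x:.4f}")
```

Only vertical distances enter the calculation, so swapping the roles of x and y would in general give a different line.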
$\hat{y} = 0.125(500) - 41.4 = 62.5 - 41.4 = 21.1$
Roughly 21 manatees.
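The same prediction can be written as a small function, which makes it easy to try other x values. The slope 0.125 and intercept -41.4 are the ones quoted above; the function name `predict` is made up for this sketch.

```python
def predict(x, slope=0.125, intercept=-41.4):
    """Predicted response y-hat for a given x, using the line quoted above."""
    return slope * x + intercept

y_hat = predict(500)
print(round(y_hat, 1))  # 21.1
```

The function will return a number for any x, including values far outside the range of the data, so it is up to the user to avoid extrapolating.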
Extrapolation
[Figure: two plots with the axis "Height in Inches"; the extrapolated regions are marked "!!!".]
to do, as seen here.
The y intercept
y-intercept shows
But the negative value is negative blood alcohol
r = 0, r² = 0
Changes in x explain 0% of the variations in y. The value(s) y takes is (are) entirely independent of what value x takes.
Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
r = 0.7, r² = 0.49
There is quite some variation in BAC for the same number of beers consumed. A person’s blood volume is a factor in the equation that was overlooked here.
We changed number of beers to number of beers/weight of person in lb.
r = 0.9, r² = 0.81
In the first plot, number of beers only explains 49% of the variation in blood alcohol content. But number of beers / weight explains 81% of the variation in blood alcohol content.
Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).
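The phrase "explains 49% of the variation" has a concrete meaning: r² is the fraction of the total variation in y that is not left over as vertical scatter around the regression line. The sketch below checks that identity on the 16 beers and blood alcohol pairs listed earlier (assumed column meaning). The value it computes for those pairs need not match the r values quoted for the plots above, which is why only the identity is checked. The function names are made up for the example.

```python
from statistics import mean

def explained_fraction(xs, ys):
    """1 - (variation left around the fitted line) / (total variation in y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sst = sum((y - y_bar) ** 2 for y in ys)                        # total
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))    # unexplained
    return 1 - sse / sst

def correlation(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

frac = explained_fraction(beers, bac)
r = correlation(beers, bac)
print(abs(frac - r ** 2) < 1e-9)  # the explained fraction equals r squared

# The r values quoted above, squared:
for quoted_r in (0.7, 0.9):
    print(quoted_r, round(quoted_r ** 2, 2))
```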
Grade performance
A weak correlation.
Transforming relationships
A scatterplot might show a clear relationship between two quantitative
variables, but issues of influential points or nonlinearity prevent us from
using correlation and regression tools.
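One common fix for nonlinearity is to transform the response, for example by taking logarithms. The sketch below uses hypothetical data (a count that doubles every 30 minutes; not taken from the slides) to show how a log transformation can turn a curved relationship into a straight one.

```python
from math import log10
from statistics import mean

def correlation(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5

# HYPOTHETICAL data: exponential growth, doubling every 30 minutes
time_min = list(range(0, 241, 30))
count = [20 * 2 ** (t / 30) for t in time_min]

r_raw = correlation(time_min, count)                       # curved relationship
r_log = correlation(time_min, [log10(c) for c in count])   # straight after log

print(f"raw scale:   r = {r_raw:.3f}")
print(f"log10 scale: r = {r_log:.3f}")
```

On the log scale the points fall on a line, so correlation and regression describe the relationship well; on the raw scale r understates how tightly y follows x.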
[Figure: two plots against Time (min), 0 to 240; one vertical axis runs from 0 to 5000, the other from 0 to 4.]
Residuals
Lurking variables
Each dot represents an average. The variation among boys per age class is not shown.
These histograms illustrate that each mean represents a distribution of boys of a particular age.
Should parents be worried if their son does not match the point for his age?
If the raw values were used in the correlation instead of the mean, there would be
a lot of spread in the y-direction, and thus the correlation would be smaller.
That's why typically growth charts show a range of values (here from 5th to 95th percentiles).
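The effect of correlating averages instead of raw values can be seen in a small example. The heights below are hypothetical (five boys at each of four ages, not data from the slides), built so that the age-group means fall exactly on a line while individual boys vary around them.

```python
from statistics import mean

def correlation(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5

# HYPOTHETICAL data: heights in cm for five boys at each age
ages = [2, 3, 4, 5]
offsets = [-8, -4, 0, 4, 8]  # individual variation around the age-group mean
raw_age, raw_height = [], []
for age in ages:
    for d in offsets:
        raw_age.append(age)
        raw_height.append(75 + 7 * age + d)

# One mean height per age group
mean_height = [mean(h for a, h in zip(raw_age, raw_height) if a == age)
               for age in ages]

r_means = correlation(ages, mean_height)
r_raw = correlation(raw_age, raw_height)
print(f"using means:      r = {r_means:.3f}")
print(f"using raw values: r = {r_raw:.3f}")
```

Averaging removes the spread within each age group, so the correlation based on means overstates how well age predicts the height of an individual boy.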
residual = observed y − predicted ŷ, that is, $y - \hat{y}$ (the vertical distance from the point to the line)
Residual plots
Residuals are the distances between y-observed and y-predicted. We
plot them in a residual plot.
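As a sketch of how the values for a residual plot are obtained, the code below fits a least-squares line to the beers and blood alcohol pairs listed earlier (assumed column meaning) and computes observed minus predicted for each point. It prints the values instead of drawing the plot, to stay within the standard library.

```python
from statistics import mean

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

# Least-squares slope and intercept
x_bar, y_bar = mean(beers), mean(bac)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))
      / sum((x - x_bar) ** 2 for x in beers))
b0 = y_bar - b1 * x_bar

# residual = observed y - predicted y-hat
residuals = [y - (b0 + b1 * x) for x, y in zip(beers, bac)]

# Residuals from a least-squares line sum to zero, up to rounding error.
print(abs(sum(residuals)) < 1e-9)

# A residual plot puts x on the horizontal axis and the residual on the vertical.
for x, e in sorted(zip(beers, residuals)):
    print(f"x = {x}: residual = {e:+.4f}")
```

A residual plot with no visible pattern suggests the straight line is a reasonable summary; a curve or fan shape suggests it is not.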
Child 19 = outlier in y direction
Child 18 = outlier in x direction
Child 19 is an outlier of the relationship. Child 18 is only an outlier in the x direction and thus might be an influential point.
Are these points influential?
[Figure: regression lines fitted to all data, without child 18, and without child 19; annotations "outlier in y-direction" and "influential".]
Always plot your data
So make sure to always plot your data before you run a correlation or regression analysis.
Always plot your data!
The correlations all give r ≈ 0.816, and the regression lines are all approximately ŷ = 3 + 0.5x. For all four sets, we would predict ŷ = 8 when x = 10.
However, making the scatterplots shows us that the correlation/regression analysis is not appropriate for all data sets.
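The numbers quoted above match the well-known Anscombe quartet. Assuming those are the four data sets on the slide (the slide text does not name them), the sketch below recomputes r and the regression line for each set from Anscombe's published values.

```python
from statistics import mean

def fit_and_r(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1, s_xy / (s_xx * s_yy) ** 0.5

# ASSUMPTION: the four data sets are Anscombe's quartet
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_and_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (b0, b1, r) in results.items():
    print(f"Set {name}: y-hat = {b0:.2f} + {b1:.3f}x, r = {r:.3f}")
```

The summaries agree to two or three decimal places even though the scatterplots look nothing alike, which is the reason for plotting first.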
Now we change number of beers to number of beers/weight of person in lb.
Two-way tables
Joint distributions
Marginal distributions
Conditional distributions
Simpson’s paradox
Two-way tables
An experiment has a two-way, or block, design if two categorical
factors are studied with several levels of each factor.
First factor: age (group by age)
Second factor: education (record education)
Two-way tables
We call education the row variable and age group the column
variable.
For each cell, we can compute a proportion by dividing the cell entry
by the total sample size. The collection of these proportions would
be the joint distribution of the two variables.
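The joint distribution calculation is just each cell count divided by the overall total. The education-by-age counts are not reproduced in these notes, so this sketch uses the hospital counts that appear later in this section as a stand-in two-way table.

```python
# Cell counts from the hospital example: (row, column) -> count
counts = {
    ("Died", "Hospital A"): 63,
    ("Died", "Hospital B"): 16,
    ("Survived", "Hospital A"): 2037,
    ("Survived", "Hospital B"): 784,
}

total = sum(counts.values())  # overall sample size

# Joint distribution: each cell as a proportion of the whole table
joint = {cell: n / total for cell, n in counts.items()}

for (row, col), p in joint.items():
    print(f"{row:8s} {col}: {p:.4f}")
```

The joint proportions always add up to 1, since every individual falls in exactly one cell.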
Marginal distributions
We can look at each categorical variable separately in a two-way table
by studying the row totals and the column totals. They represent the
marginal distributions, expressed in counts or percentages. (They
are written as if in a margin.)
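Marginal distributions come from the row and column totals alone. A minimal sketch, again using the hospital counts from later in this section as a stand-in for the education-by-age table:

```python
from collections import Counter

counts = {
    ("Died", "Hospital A"): 63,
    ("Died", "Hospital B"): 16,
    ("Survived", "Hospital A"): 2037,
    ("Survived", "Hospital B"): 784,
}

row_totals, col_totals = Counter(), Counter()
for (row, col), n in counts.items():
    row_totals[row] += n   # totals written in the right-hand margin
    col_totals[col] += n   # totals written in the bottom margin

grand_total = sum(counts.values())
row_marginal = {r: n / grand_total for r, n in row_totals.items()}
col_marginal = {c: n / grand_total for c, n in col_totals.items()}

print("row totals:   ", dict(row_totals))
print("column totals:", dict(col_totals))
print("row marginal distribution:   ", row_marginal)
print("column marginal distribution:", col_marginal)
```

Each marginal distribution describes one variable on its own and says nothing about how the two variables are related.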
Here the percents are calculated by age range (columns).
$29.30\% = \frac{11071}{37785} = \frac{\text{cell total}}{\text{column total}}$
The conditional distributions can be graphically compared using side by
side bar graphs of one variable for each value of the other variable.
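A conditional distribution divides each cell by its own column total instead of the overall total. The sketch below does this for the hospital counts from this section and also rechecks the single education-by-age percentage quoted above.

```python
# Counts by column (hospital), then by row (outcome)
counts = {
    "Hospital A": {"Died": 63, "Survived": 2037},
    "Hospital B": {"Died": 16, "Survived": 784},
}

# Conditional distribution of outcome, given the hospital
conditional = {}
for hospital, outcomes in counts.items():
    column_total = sum(outcomes.values())
    conditional[hospital] = {o: n / column_total for o, n in outcomes.items()}

for hospital, dist in conditional.items():
    print(hospital, {o: f"{100 * p:.1f}%" for o, p in dist.items()})

# The worked example above: cell total / column total
pct = 100 * 11071 / 37785
print(f"{pct:.2f}%")  # 29.30%
```

Comparing the conditional distributions column by column is what a side-by-side bar graph shows visually.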
Example: Hospital death rates
           Hospital A  Hospital B
Died               63          16
Survived         2037         784
Total            2100         800
% surv.         97.0%       98.0%
On the surface, Hospital B would seem to have a better record.
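Simpson's paradox is what can happen when a table like this is split by a lurking variable. The breakdown by patient condition used below is an assumption for illustration: it is not given in these notes and was chosen only so that the groups add up to the totals in the table above. Under that assumed split, Hospital A has the higher survival rate within each group even though Hospital B looks better overall.

```python
# Totals from the table above: (died, survived) per hospital
overall = {"Hospital A": (63, 2037), "Hospital B": (16, 784)}

# ASSUMPTION: illustrative split by patient condition, not from the slides
by_condition = {
    "good": {"Hospital A": (6, 594), "Hospital B": (8, 592)},
    "poor": {"Hospital A": (57, 1443), "Hospital B": (8, 192)},
}

def survival_rate(died, survived):
    return survived / (died + survived)

# Check that the assumed groups add up to the totals in the table.
for hospital, (died, survived) in overall.items():
    assert died == sum(g[hospital][0] for g in by_condition.values())
    assert survived == sum(g[hospital][1] for g in by_condition.values())

overall_rates = {h: survival_rate(*c) for h, c in overall.items()}
group_rates = {cond: {h: survival_rate(*c) for h, c in group.items()}
               for cond, group in by_condition.items()}

print("overall:", {h: f"{100 * r:.1f}%" for h, r in overall_rates.items()})
for cond, rates in group_rates.items():
    print(f"{cond}:", {h: f"{100 * r:.1f}%" for h, r in rates.items()})
```

The reversal arises because, in this assumed split, Hospital A treats far more patients in poor condition, which pulls its overall survival rate down.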
Causation
Common response
Confounding
Establishing causation
Explaining association: causation
Example 2: Married men earn more than single men. Can a man
raise his income by getting married?
[Figure: scatterplot of reading index (0 to 1) against shoe size (0 to 7).]
Students who have high SAT scores in high school have high GPAs
in their first year of college.
Example: Studies have found that religious people live longer than
nonreligious people.
Religious people also take better care of themselves and are less
likely to smoke or be overweight.
Some possible explanations for an observed association. The
dashed lines show an association. The solid arrows show a cause-
and-effect link. x is explanatory, y is response, and z is a lurking
variable.
Figure 2.28
Introduction to the Practice of Statistics, Sixth Edition
© 2009 W.H. Freeman and Company
Establishing causation
It appears that lung cancer is associated with smoking.
How do we know that both of these variables are not being affected by an
unobserved third (lurking) variable?
For instance, what if there is a genetic predisposition that causes people to
both get lung cancer and become addicted to smoking, but the smoking itself
doesn’t CAUSE lung cancer?
Software output (JMP, CrunchIt!)