IPS7e_LecturePPT_ch02

Uploaded by

burgulamey

Looking at Data–Relationships

IPS Chapter 2
2.1: Scatterplots
2.2: Correlation
2.3: Least‐Squares Regression
2.4: Cautions About Correlation and Regression
2.5: Data Analysis for Two‐Way Tables
2.6: The Question of Causation
© 2012 W.H. Freeman and Company
Looking at Data–Relationships
2.1 Scatterplots


Objectives
2.1 Scatterplots

Scatterplots
Explanatory and response variables
Interpreting scatterplots
Outliers
Categorical variables in scatterplots
Scatterplot smoothers
Examining Relationships
Most statistical studies involve more than one variable.

Questions:

What cases do the data describe?

What variables are present and how are they measured?

Are all of the variables quantitative?

Do some of the variables explain or even cause changes in other variables?
Here, we have two quantitative variables for each of 16 students:

1) How many beers they drank, and
2) Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Student  Beers  Blood Alcohol
1        5      0.1
2        2      0.03
3        9      0.19
6        7      0.095
7        3      0.07
9        3      0.02
11       4      0.07
13       5      0.085
4        8      0.12
5        3      0.04
8        5      0.06
10       5      0.05
12       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05
Looking at relationships

Start with a graph

Look for an overall pattern and deviations from the pattern

Use numerical descriptions of the data and overall pattern (if

appropriate)
Scatterplots
In a scatterplot, one axis is used to represent each of the variables,
and the data are plotted as points on the graph.

Student Beers BAC

1 5 0.1
2 2 0.03
3 9 0.19
6 7 0.095
7 3 0.07
9 3 0.02
11 4 0.07
13 5 0.085
4 8 0.12
5 3 0.04
8 5 0.06
10 5 0.05
12 6 0.1
14 7 0.09
15 1 0.01
16 4 0.05
Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the
relationship by examining the form, direction, and strength of the
association. We look for an overall pattern …
Form: linear, curved, clusters, no pattern

Direction: positive, negative, no direction

Strength: how closely the points fit the “form”

… and deviations from that pattern.

Outliers
Form and direction of an association
Linear
No relationship

Nonlinear
Positive association: High values of one variable tend to occur together
with high values of the other variable.

Negative association: High values of one variable tend to occur together

with low values of the other variable.
No relationship: X and Y vary independently. Knowing X tells you
nothing about Y.
Strength of the association

The strength of the relationship between the two variables can be

seen by how much variation, or scatter, there is around the main form.

With a strong relationship, you can get a pretty good estimate of y if you know x.

With a weak relationship, for any x you might get a wide range of y values.

This is a weak relationship. For a particular state median household income, you can’t predict the state per capita income very well.

This is a very strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.
How to scale a scatterplot
Same data in all four plots

Using an inappropriate
scale for a scatterplot
can give an incorrect
impression.

Both variables should be

given a similar amount of
space:
• Plot roughly square
• Points should occupy all
the plot space (no blank
space)
Outliers
An outlier is a data value that has a very low probability of occurrence
(i.e., it is unusual or unexpected).

In a scatterplot, outliers are points that fall outside of the overall pattern
of the relationship.
Not an outlier:

The upper right-hand point here is not an outlier of the relationship. It is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.

Outlier:

This point is not in line with the others, so it is an outlier of the relationship.
IQ score and
Grade point average

a) Describe in words what this

plot shows.

b) Describe the direction,

shape, and strength. Are
there outliers?

c) What is the deal with these

people?
Categorical variables in scatterplots
Often, things are not simple and one-dimensional. We need to group
the data into categories to reveal trends.

What may look like a positive linear

relationship is in fact a series of
negative linear associations.

Plotting different habitats in

different colors allows us to make
that important distinction.
Comparison of men and women
racing records over time.
Each group shows a very strong
negative linear relationship that
would not be apparent without the
gender categorization.

Relationship between lean body mass

and metabolic rate in men and women.
Both men and women follow the same
positive linear trend, but women show
a stronger association. As a group,
males typically have larger values for
both variables.
Categorical explanatory variables

When the explanatory variable is categorical, you cannot make a

scatterplot, but you can compare the different categories side by side on
the same graph (boxplots, or mean +/− standard deviation).

Comparison of income
(quantitative response variable)
for different education levels (five
categories).

But be careful in your

interpretation: This is NOT a
positive association, because
education is not quantitative.
Example: Beetles trapped on boards of different colors
Beetles were trapped on sticky boards scattered throughout a field. The sticky
boards were of four different colors (categorical explanatory variable). The
number of beetles trapped (response variable) is shown on the graph below.

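What association? What relationship?

[Figure: number of beetles trapped by board color, shown twice with the categories in a different order: Blue, White, Green, Yellow and Blue, Green, White, Yellow. x-axis: Board color.]

→ Describe one category at a time.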

When both variables are quantitative, the order of the data points is defined
entirely by their value. This is not true for categorical data.
Scatterplot smoothers
When an association is more complex than linear, we can still describe
the overall pattern by smoothing the scatterplot.
You can simply average the y values separately for each x value.

When a data set does not have many y values for a given x, software
smoothers form an overall pattern by looking at the y values for points in
the neighborhood of each x value. Smoothers are resistant to outliers.

Time plot of the acceleration of the

head of a crash test dummy as a
motorcycle hits a wall.

The overall pattern was calculated

by a software scatterplot smoother.
Looking at Data—Relationships
2.2 Correlation


Objectives
2.2 Correlation

The correlation coefficient “r”

r does not distinguish between x and y
r has no units of measurement
r ranges from -1 to +1
Influential points
The correlation coefficient "r"

The correlation coefficient is a measure of the direction and strength

of a linear relationship.

It is calculated using the mean and the standard deviation of both

the x and y variables.

Correlation can only be used to describe quantitative variables.

Categorical variables don’t have means and standard deviations.
The correlation coefficient "r"

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

Time to swim: $\bar{x} = 35$, $s_x = 0.7$

Pulse rate: $\bar{y} = 140$, $s_y = 9.5$

In the formula, $(x_i - \bar{x})/s_x$ is the z for time and $(y_i - \bar{y})/s_y$ is the z for pulse.

Part of the calculation

involves finding z, the
standardized score we used
when working with the
normal distribution.

You DON'T want to do this by hand.

Make sure you learn how to use
your calculator or software.
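
As a minimal sketch of letting software do this, the formula can be coded directly from its definition. The data are the beers and BAC values for the 16 students tabulated earlier; the helper names (mean, sample_sd, correlation) are illustrative and not part of the original slides.

```python
from math import sqrt

# Beers and blood alcohol content for the 16 students, in table order.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = 1/(n-1) * sum over i of (z-score of x_i) * (z-score of y_i)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


r = correlation(beers, bac)
print(f"r = {r:.3f}")
```

Each term in the sum is the product of two z-scores, so the units of x and y cancel before anything is added up.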
Standardization:
Allows us to compare
correlations between data
sets where variables are
measured in different units
or when variables are
different.

For instance, we might

want to compare the
correlation between [swim
time and pulse], with the
correlation between [swim
time and breathing rate].
“r” does not distinguish x & y
The correlation coefficient, r, treats x and y symmetrically.

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

r = -0.75 r = -0.75

"Time to swim" is the explanatory variable here, and belongs on the x axis.
However, in either plot r is the same (r=-0.75).
"r" has no unit
r = -0.75
Changing the units of variables does
not change the correlation coefficient
"r", because we get rid of all our units
when we standardize (get z-scores).
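$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

The first factor is the z for time and the second is the z for pulse. The z-score plot is the same for both plots.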
r = -0.75
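
A short sketch can make the two properties above concrete: r is unchanged by a change of units and by swapping x and y. The swim times and pulse rates below are hypothetical values invented for illustration, not the data behind the slide's plots.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient computed from z-scores."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical swim times (minutes) and pulse rates (beats per minute).
time_min = [34.1, 34.6, 35.0, 35.4, 35.9, 36.2]
pulse = [152, 148, 139, 142, 133, 126]

# Same times expressed in seconds: only the unit changes.
time_sec = [t * 60 for t in time_min]

r_minutes = correlation(time_min, pulse)
r_seconds = correlation(time_sec, pulse)
r_swapped = correlation(pulse, time_min)  # x and y exchanged

print(r_minutes, r_seconds, r_swapped)
```

All three values agree up to floating-point rounding, because standardizing removes the units and the product of z-scores does not care which variable is called x.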
"r" ranges
from -1 to +1
"r" quantifies the strength
and direction of a linear
relationship between 2
quantitative variables.

Strength: how closely the points

follow a straight line.

Direction: is positive when

individuals with higher X values
tend to have higher values of Y.
When variability in one or both variables decreases, the correlation coefficient gets stronger (→ closer to +1 or -1).
Correlation only describes linear relationships

No matter how strong the association,

r does not describe curved relationships.

Note: You can sometimes transform a non-linear association to a linear form,

for instance by taking the logarithm. You can then calculate a correlation using
the transformed data.
Influential points
Correlations are calculated using
means and standard deviations,
and thus are NOT resistant to
outliers.

Just moving one point away from the

general trend here decreases the
correlation from -0.91 to -0.75
Try it out for yourself—companion book website:
http://www.whfreeman.com/ips7e

Adding two outliers decreases r from 0.95 to 0.61.
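
The lack of resistance is easy to reproduce. The sketch below uses a small hypothetical data set (not the data plotted on the slides): eight points close to a straight line, then the same points plus one point far from the trend.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient computed from z-scores."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical points lying close to a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.1]
r_clean = correlation(x, y)

# The same points plus a single point far from the overall pattern.
r_with_outlier = correlation(x + [9], y + [2.0])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_with_outlier:.2f}")
```

One added point is enough to pull r well away from its original value, which is why the scatterplot should be inspected before r is reported.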

Review examples

1) What is the explanatory variable?

Describe the form, direction, and strength

of the relationship.

Estimate r.

(in 1000’s)

2) If women always marry men 2 years older

than themselves, what is the correlation of the
ages between husband and wife?

age_man = age_woman + 2
(the equation for a straight line)
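
A quick numerical check of this example, with hypothetical ages chosen only for illustration: when every husband is exactly 2 years older than his wife, all points fall on a straight line with positive slope, so r should come out as 1.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient computed from z-scores."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical wives' ages; each husband is exactly 2 years older.
age_woman = [22, 25, 28, 31, 35, 40, 47]
age_man = [a + 2 for a in age_woman]

r_ages = correlation(age_woman, age_man)
print(f"r = {r_ages:.6f}")
```

Adding a constant shifts the mean but leaves every z-score unchanged, so each pair contributes the square of the same z-score and the sum works out to exactly n - 1.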
Thought quiz on correlation

Why is there no distinction between explanatory and response

variables in correlation?

Why do both variables have to be quantitative?

How does changing the units of measurement affect correlation?

What is the effect of outliers on correlations?

Why doesn’t a tight fit to a horizontal line imply a strong correlation?

Looking at Data–Relationships
2.3 Least-Squares Regression


Objectives
2.3 Least-squares regression

Regression lines

Prediction and Extrapolation

Correlation and r^2

Transforming relationships
Correlation tells us about
strength (scatter) and direction
of the linear relationship
between two quantitative
variables.

In addition, we would like to have a numerical description of how both

variables vary together. For instance, is one variable increasing faster
than the other one? And we would like to make predictions based on that
numerical description.
But which line best
describes our data?
Explanatory and response variables
A response variable measures or records an outcome of a study. An
explanatory variable explains changes in the response variable.

Typically, the explanatory or independent variable is plotted on the x

axis, and the response or dependent variable is plotted on the y axis.
[Figure: scatterplot "Blood Alcohol as a function of Number of Beers". x-axis: Number of Beers, the explanatory (independent) variable. y-axis: Blood Alcohol Level (mg/ml), the response (dependent) variable.]
Some plots don’t have clear explanatory and response variables.

Do calories explain
sodium amounts?

Does percent return on Treasury

bills explain percent return
on common stocks?
The regression line

A regression line is a straight line that describes how a response

variable y changes as an explanatory variable x changes.

We often use a regression line to predict the value of y for a given

value of x.

In regression, the distinction between explanatory and response

variables is important.
The regression line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) distances between the data points and the
line is as small as possible.

Distances between the points and

line are squared so all are positive
values. This is done so that
distances can be properly added
(Pythagoras).
Properties
The least-squares regression line can be shown to have this equation:

$$ \hat{y} = b_0 + b_1 x $$

$\hat{y}$ is the predicted y value (y hat)

$b_1$ is the slope
$b_0$ is the y-intercept
How to:

First we calculate the slope of the line, b1; from statistics we already know:

$$ b_1 = r \frac{s_y}{s_x} $$

r is the correlation.
s_y is the standard deviation of the response variable y.
s_x is the standard deviation of the explanatory variable x.

Once we know b1, the slope, we can calculate b0, the y-intercept:

$$ b_0 = \bar{y} - b_1 \bar{x} $$

where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y variables.
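
The two-step recipe above can be sketched in a few lines. The data are the beers and BAC values for the 16 students tabulated earlier; the variable names and the cross-check against the direct least-squares formula (sum of cross-products over sum of squares) are additions for illustration.

```python
from math import sqrt

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

n = len(beers)
x_bar = sum(beers) / n
y_bar = sum(bac) / n

# Sums of squares and cross-products around the means.
s_xx = sum((x - x_bar) ** 2 for x in beers)
s_yy = sum((y - y_bar) ** 2 for y in bac)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))

s_x = sqrt(s_xx / (n - 1))          # standard deviation of x
s_y = sqrt(s_yy / (n - 1))          # standard deviation of y
r = s_xy / ((n - 1) * s_x * s_y)    # correlation

b1 = r * s_y / s_x                  # slope: b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar             # intercept: b0 = y_bar - b1 * x_bar

# Cross-check: the direct least-squares slope is s_xy / s_xx.
slope_direct = s_xy / s_xx

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
```

Because b0 is defined as the mean of y minus b1 times the mean of x, the fitted line necessarily passes through the point of means.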

Typically, we use a 2-var stats calculator or stats software.

BEWARE!!!
Not all calculators and software use the same convention. Some use:

$$ \hat{y} = a + bx $$

And some use:

$$ \hat{y} = ax + b $$
Make sure you know what YOUR
calculator gives you for a and b before
you answer homework or exam questions.
Software output

[Figure: two annotated software outputs, with the intercept, slope, r, and R^2 highlighted.]
The equation completely describes the regression line.

To plot the regression line you only need to plug two x values into the
equation, get y, and draw the line that goes through those points.
Hint: The regression line always passes through the mean of x and y.

The points you use for

drawing the regression
line are derived from the
equation.

They are NOT points from

your sample data (except
by pure coincidence).
The distinction between explanatory and response variables is crucial in
regression. If you exchange y for x in calculating the regression line, you
will get the wrong line.

Regression examines the distance of all points from the line in the y
direction only.

Hubble telescope data about

galaxies moving away from earth:

These two lines are the two

regression lines calculated either
correctly (x = distance, y = velocity,
solid line) or incorrectly (x =
velocity, y = distance, dotted line).
Correlation versus regression

The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship.

In regression we examine the variation in the response variable (y) given change in the explanatory variable (x).
Making predictions
The equation of the least-squares regression allows you to predict y
for any x within the range studied.

$$ \hat{y} = 0.0144x + 0.0008 $$

Nobody in the study drank 6.5 beers, but by finding the value of ŷ from the regression line for x = 6.5 we would expect a blood alcohol content of 0.094 mg/ml.

$$ \hat{y} = 0.0144 \times 6.5 + 0.0008 = 0.0936 + 0.0008 = 0.0944 \text{ mg/ml} $$
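
The same prediction as a small sketch, using the slope and intercept quoted on this slide. The function name predict_bac is made up for illustration.

```python
# Coefficients quoted on the slide: y-hat = 0.0144 x + 0.0008
SLOPE = 0.0144
INTERCEPT = 0.0008


def predict_bac(beers):
    """Predicted blood alcohol content (mg/ml) for a given number of beers."""
    return SLOPE * beers + INTERCEPT


predicted = predict_bac(6.5)
print(f"{predicted:.4f}")  # prints 0.0944
```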
Year  Powerboats (in 1000s)  Dead Manatees
1977  447                    13
1978  460                    21
1979  481                    24
1980  498                    16
1981  513                    24
1982  512                    20
1983  526                    15
1984  559                    34
1985  585                    33
1986  614                    33
1987  645                    39
1988  675                    43
1989  711                    50
1990  719                    47

$$ \hat{y} = 0.125x - 41.4 $$

There is a positive linear relationship between the number of powerboats

registered and the number of manatee deaths.

The least-squares regression line has the equation: $\hat{y} = 0.125x - 41.4$

Thus if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths?

$$ \hat{y} = 0.125(500) - 41.4 = 62.5 - 41.4 = 21.1 $$

Roughly 21 manatees.
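
As a check on the quoted line, the sketch below refits the least-squares line from the table of powerboat registrations and manatee deaths. The function name least_squares is illustrative; the fitted coefficients should agree with the slide's rounded values.

```python
# Powerboat registrations (in 1000s) and manatee deaths, 1977-1990.
powerboats = [447, 460, 481, 498, 513, 512, 526, 559,
              585, 614, 645, 675, 711, 719]
manatees = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


intercept, slope = least_squares(powerboats, manatees)
predicted_at_500 = intercept + slope * 500

print(f"y-hat = {slope:.3f} x + ({intercept:.1f})")
print(f"predicted deaths at 500 thousand boats: {predicted_at_500:.1f}")
```

The unrounded coefficients give a prediction slightly different from 21.1, since the slide's arithmetic uses the rounded slope and intercept; both round to about 21 manatees.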
Extrapolation
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.

This can be a very stupid thing to do, as seen here.

[Figure: two plots with "Height in Inches" on the vertical axis; the extrapolated regions are flagged with "!!!".]
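
One way to guard against accidental extrapolation is to carry the observed x range along with the fitted line. The sketch below reuses the manatee line from the preceding example (slope 0.125, intercept -41.4, powerboat registrations between 447 and 719 thousand); the function name and the returned flag are assumptions made for illustration.

```python
# Line and observed x range from the manatee example.
SLOPE = 0.125
INTERCEPT = -41.4
X_MIN, X_MAX = 447, 719  # powerboat registrations (in 1000s) in the data


def predict_manatee_deaths(powerboats_thousands):
    """Return (prediction, is_extrapolation) for a given x value."""
    y_hat = SLOPE * powerboats_thousands + INTERCEPT
    is_extrapolation = not (X_MIN <= powerboats_thousands <= X_MAX)
    return y_hat, is_extrapolation


inside = predict_manatee_deaths(500)    # within the range studied
outside = predict_manatee_deaths(2000)  # far beyond the range studied

print(inside)
print(outside)
```

The line still returns a number for x = 2000, but the flag signals that nothing in the data supports trusting it there.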
The y intercept

Sometimes the y-intercept is not biologically possible. Here we have

negative blood alcohol content, which makes no sense…

[Figure annotation: y-intercept shows negative blood alcohol.]

But the negative value is appropriate for the equation of the regression line.

There is a lot of scatter in the

data, and the line is just an
estimate.
Coefficient of determination, r^2
r^2, the coefficient of determination, is the square of the correlation coefficient.

r^2 represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.

$$ b_1 = r \frac{s_y}{s_x} $$

r = -1, r^2 = 1: Changes in x explain 100% of the variations in y. Y can be entirely predicted for any given value of x.

r = 0, r^2 = 0: Changes in x explain 0% of the variations in y. The value(s) y takes is (are) entirely independent of what value x takes.

r = 0.87, r^2 = 0.76: Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
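
The "percentage of variance explained" reading can be checked numerically. In the sketch below, which uses the beers and BAC data for the 16 students tabulated earlier, the explained fraction is computed as 1 minus the ratio of the squared vertical scatter around the line (SSE) to the total squared variation of y around its mean (SST), and compared with the square of r. The names sse, sst, and explained_fraction are illustrative.

```python
from math import sqrt

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

n = len(beers)
x_bar, y_bar = sum(beers) / n, sum(bac) / n
s_xx = sum((x - x_bar) ** 2 for x in beers)
s_yy = sum((y - y_bar) ** 2 for y in bac)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))

r = s_xy / sqrt(s_xx * s_yy)   # correlation
b1 = s_xy / s_xx               # least-squares slope
b0 = y_bar - b1 * x_bar        # least-squares intercept

# Total variation of y, and variation left over around the line.
sst = s_yy
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(beers, bac))
explained_fraction = 1 - sse / sst

print(f"r^2 = {r ** 2:.3f}, explained fraction = {explained_fraction:.3f}")
```

For a least-squares line the two quantities coincide, which is what justifies reading r^2 as the share of the variation in y accounted for by x.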
r = 0.7, r^2 = 0.49

There is quite some variation in BAC for the same number of beers drunk. A person’s blood volume is a factor in the equation that was overlooked here.

We changed number of beers to number of beers/weight of person in lb.

r = 0.9, r^2 = 0.81

In the first plot, number of beers only explains 49% of the variation in blood alcohol content. But number of beers / weight explains 81% of the variation in blood alcohol content.

Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).
Grade performance

If class attendance explains 16% of the variation in grades, what is

the correlation between percent of classes attended and grade?

1. We need to make an assumption: attendance and grades are

positively correlated. So r will be positive too.

2. r^2 = 0.16, so r = +√0.16 = +0.4

A weak correlation.
Transforming relationships
A scatterplot might show a clear relationship between two quantitative
variables, but issues of influential points or nonlinearity prevent us from
using correlation and regression tools.

Transforming the data—changing the scale in which one or both of the

variables are expressed—can make the shape of the relationship linear
in some cases.

Example: Patterns of growth are often exponential, at least in their initial

phase. Changing the response variable y into log(y) or ln(y) will transform
the pattern from an upward-curved exponential to a straight line.
Exponential bacterial growth
In ideal environments, bacteria multiply through binary fission. The
number of bacteria can double every 20 minutes in that way.

[Figure: two plots against Time (min), from 0 to 240. Left: Bacterial count (0 to 5000). Right: Log of bacterial count (0 to 4).]

1 - 2 - 4 - 8 - 16 - 32 - 64 - …
Exponential growth 2^n, not suitable for regression.

log(2^n) = n*log(2) ≈ 0.3n
Taking the log changes the growth pattern into a straight line.
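
A minimal sketch of the transformation, assuming the culture starts from a single bacterium and doubles every 20 minutes over the 240 minutes shown on the time axis (the starting count is an assumption, not stated on the slide):

```python
from math import log10

# One observation every 20 minutes from 0 to 240 minutes.
minutes = list(range(0, 241, 20))

# Count doubles once per 20-minute interval: 1, 2, 4, 8, ...
counts = [2 ** (t // 20) for t in minutes]

# Base-10 log of each count.
log_counts = [log10(c) for c in counts]

# Increase in log(count) from one observation to the next.
steps = [b - a for a, b in zip(log_counts, log_counts[1:])]

print(counts)
print([round(s, 3) for s in steps])
```

On the log scale every 20-minute step adds the same amount, log(2) ≈ 0.3, so the transformed points lie on a straight line and regression becomes appropriate.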
Body weight and brain weight
in 96 mammal species
r = 0.86, but this is misleading.

The elephant is an influential point. Most

mammals are very small in comparison.
Without this point, r = 0.50 only.

Now we plot the log of brain weight

against the log of body weight.

The pattern is linear, with r = 0.96.

The vertical scatter is homogeneous
→ good for predictions of brain weight
from body weight (in the log scale).
Looking at Data–Relationships
2.4 Cautions about
Correlation and Regression


Objectives
2.4 Cautions about correlation and regression

Residuals

Outliers and influential points

Lurking variables

Correlation/regression using averages

The restricted range problem

Correlation/regression using averages
Many regression or correlation studies use average data.

While this is appropriate, you should know that correlations based on averages are usually considerably higher than those computed from the raw data.

The correlation is a measure of spread

(scatter) in a linear relationship. Using
averages greatly reduces the scatter.

Therefore, r and r2 are typically greatly

increased when averages are used.
[Figure: two plots, both titled "Boys".]

Each dot represents an average. The variation among boys per age class is not shown.

These histograms illustrate that each mean represents a distribution of boys of a particular age.

Should parents be worried if their son does not match the point for his age?
If the raw values were used in the correlation instead of the mean, there would be
a lot of spread in the y-direction, and thus the correlation would be smaller.
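
The effect of averaging can be illustrated with a small constructed example. The heights below are hypothetical: for each age, three boys are placed 6 cm below, at, and 6 cm above an assumed mean height of 80 + 7 × age. The correlation is then computed once on the individual values and once on the per-age means.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient from sums of squares and cross-products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


ages = [2, 3, 4, 5, 6]
offsets = [-6.0, 0.0, 6.0]  # hypothetical spread among boys of one age

# Raw data: one (age, height) pair per boy.
raw_ages, raw_heights = [], []
for age in ages:
    for offset in offsets:
        raw_ages.append(age)
        raw_heights.append(80 + 7 * age + offset)

# Averaged data: one mean height per age class.
mean_heights = []
for age in ages:
    group = [h for a, h in zip(raw_ages, raw_heights) if a == age]
    mean_heights.append(sum(group) / len(group))

r_raw = correlation(raw_ages, raw_heights)
r_means = correlation(ages, mean_heights)

print(f"raw data: r = {r_raw:.3f}")
print(f"averages: r = {r_means:.3f}")
```

Averaging removes the within-age scatter entirely in this constructed case, so the means fall exactly on a line even though individual boys do not.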
That's why typically growth
charts show a range of values
(here from 5th to 95th
percentiles).

This is a more comprehensive

way of displaying the same
information.
Residuals
The distances from each point to the least-squares regression line give
us potentially useful information about the contribution of individual data
points to the overall pattern of scatter.
These distances are called “residuals.”

The sum of these residuals is always 0.

Points above the line have a positive residual.

Points below the line have a negative residual.

residual = observed y − predicted ŷ, that is, the vertical distance (y − ŷ).
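
The zero-sum property can be verified on the beers and BAC data for the 16 students tabulated earlier. This is a sketch; the variable names are illustrative.

```python
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

n = len(beers)
x_bar, y_bar = sum(beers) / n, sum(bac) / n

# Least-squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))
      / sum((x - x_bar) ** 2 for x in beers))
b0 = y_bar - b1 * x_bar

# residual = observed y - predicted y-hat
residuals = [y - (b0 + b1 * x) for x, y in zip(beers, bac)]

n_above = sum(1 for e in residuals if e > 0)  # points above the line
n_below = sum(1 for e in residuals if e < 0)  # points below the line

print(f"sum of residuals = {sum(residuals):.2e}")
print(f"{n_above} points above the line, {n_below} below")
```

The sum is zero only up to floating-point rounding, and only for the least-squares line; a line drawn by eye would generally not have this property.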
Residual plots
Residuals are the distances between y-observed and y-predicted. We
plot them in a residual plot.

If residuals are scattered randomly around 0, chances are your data fit a linear model, were normally distributed, and you didn’t have outliers.
The x-axis in a residual plot is the
same as on the scatterplot.

Only the y-axis is different.

Residuals are randomly scattered—good!

Curved pattern—means the relationship

you are looking at is not linear.

A change in variability across a plot is a

warning sign. You need to find out why it
is, and remember that predictions made
in areas of larger variability will not be as
good.
Outliers and influential points
Outlier: observation that lies outside the overall pattern of observations.
“Influential individual”: observation that markedly changes the
regression if removed. This is often an outlier on the x-axis.

Child 19 = outlier in y direction. Child 19 is an outlier of the relationship.

Child 18 = outlier in x direction. Child 18 is only an outlier in the x direction and thus might be an influential point.

[Figure: regression lines fitted to all data, without child 18, and without child 19.]

Are these points influential?

Child 19: outlier in y-direction.
Child 18: influential.
Always plot your data

A correlation coefficient and a regression line can be calculated for any

relationship between two quantitative variables. However, outliers
greatly influence the results, and running a linear regression on a
nonlinear association is not only meaningless but misleading.

So make sure to
always plot your data
before you run a
correlation or
regression analysis.
Always plot your data!

The correlations all give r ≈ 0.816, and the regression lines are all approximately ŷ = 3 + 0.5x. For all four sets, we would predict ŷ = 8 when x = 10.
However, making the scatterplots shows us that the correlation/regression analysis is not appropriate for all data sets.

1. Moderate linear association; regression OK.
2. Obvious nonlinear relationship; regression not OK.
3. One point deviates from the highly linear pattern; this outlier must be examined closely before proceeding.
4. Just one very influential point; all other points have the same x value; a redesign is due here.
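
The summary statistics quoted here (r ≈ 0.816, ŷ ≈ 3 + 0.5x) match the well-known Anscombe quartet, so the sketch below assumes those are the four data sets shown; the slides themselves do not name them. It computes r and the regression line for each set to show that the numbers alone cannot tell the four situations apart.

```python
from math import sqrt

# Anscombe's quartet (assumed to be the four data sets on the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68]),
    "set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                     6.13, 3.10, 9.13, 7.26, 4.74]),
    "set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                     6.08, 5.39, 8.15, 6.42, 5.73]),
    "set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
               5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (r, intercept, slope) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_xy / s_xx
    return s_xy / sqrt(s_xx * s_yy), y_bar - slope * x_bar, slope


results = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (r, b0, b1) in results.items():
    print(f"{name}: r = {r:.3f}, y-hat = {b0:.2f} + {b1:.3f} x")
```

Only a plot of each set reveals the curvature, the single outlier, and the lone influential point described above.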
Lurking variables
A lurking variable is a variable not included in the study design that
does have an effect on the variables studied.
Lurking variables can falsely suggest a relationship.

What is the lurking variable in these examples?

How could you answer if you didn’t know anything about the topic?

Strong positive association between

number of firefighters at a fire site and the
amount of damage a fire does.

Negative association between moderate

amounts of wine drinking and death rates
from heart disease in developed nations.
There is quite some variation in BAC for the same number of beers drunk. A person’s blood volume is a factor in the equation that we have overlooked.

Now we change
number of beers
to number of
beers/weight of
person in lb.

The scatter is much smaller now. One’s

weight was indeed influencing the
response variable “blood alcohol content.”
Vocabulary: lurking vs. confounding

A lurking variable is a variable that is not among the explanatory or

response variables in a study and yet may influence the
interpretation of relationships among those variables.

Two variables are confounded when their effects on a response

variable cannot be distinguished from each other. The confounded
variables may be either explanatory variables or lurking variables.

Association is not causation. Even if an association is very strong,

this is not by itself good evidence that a change in x will cause a
change in y.
Caution before rushing into a correlation or a
regression analysis

Do not use a regression on inappropriate data.

✓ Pattern in the residuals
✓ Presence of large outliers
✓ Clumped data falsely appearing linear

Use residual plots for help.

Beware of lurking variables.

Avoid extrapolating (going beyond interpolation).

Recognize when the correlation/regression is performed on averages.

A relationship, however strong it is, does not itself imply causation.

Looking at Data–Relationships
2.5 Data analysis for two-way
tables

Objectives
2.5 Data analysis for two-way tables

Two-way tables

Joint distributions

Marginal distributions

Relationships between categorical variables

Conditional distributions

Simpson’s paradox
Two-way tables
An experiment has a two-way, or block, design if two categorical
factors are studied with several levels of each factor.

Two-way tables organize data about two categorical variables obtained

from a two-way, or block, design. (There are now two ways to group the
data).

[Figure: two-way table annotated "Group by age" and "Record education".]

First factor: age

Second factor: education
Two-way tables
We call education the row variable and age group the column
variable.

Each combination of values for these two variables is called a cell.

For each cell, we can compute a proportion by dividing the cell entry
by the total sample size. The collection of these proportions would
be the joint distribution of the two variables.
Marginal distributions
We can look at each categorical variable separately in a two-way table
by studying the row totals and the column totals. They represent the
marginal distributions, expressed in counts or percentages. (They
are written as if in a margin.)

2000 U.S. census

The marginal distributions can then be displayed on separate bar graphs, typically
expressed as percents instead of raw counts. Each graph represents only one of
the two variables, completely ignoring the second one.
Parental smoking
Does parental smoking influence the smoking habits of their high school children?

Summary two-way table:

High school students were
asked whether they
smoke and whether their
parents smoke.

Marginal distribution for the categorical

variable “parental smoking”:
The row totals are used and re-expressed as
percent of the grand total.
                     Both parents smoke   One parent smokes   Neither parent smokes
Percent of students  33.1%                41.7%               25.2%
Relationships between categorical variables
The marginal distributions summarize each categorical variable
independently. But the two-way table actually describes the relationship
between both categorical variables.

The cells of a two-way table represent the intersection of a given level

of one categorical factor and a given level of the other categorical
factor.
Conditional Distribution
In the table below, the 25 to 34 age group occupies the first column. To find
the complete distribution of education in this age group, look only at that
column. Compute each count as a percent of the column total.
These percents should add up to 100% because all persons in this age
group fall into one of the education categories. These four percents together
are the conditional distribution of education, given the 25 to 34 age group.

2000 U.S. census

Conditional distributions
The percents within the table represent the conditional distributions.
Comparing the conditional distributions allows you to describe the
“relationship” between both categorical variables.

Here the
percents are
calculated by age
range (columns).
$$ 29.30\% = \frac{11071}{37785} = \frac{\text{cell total}}{\text{column total}} $$
The conditional distributions can be graphically compared using side by
side bar graphs of one variable for each value of the other variable.

Here, the percents are

calculated by age range
(columns).
Music and wine purchase decision

What is the relationship between type of music

played in supermarkets and type of wine purchased?

We want to compare the conditional distributions of the response

variable (wine purchased) for each value of the explanatory
variable (music played). Therefore, we calculate column percents.

Calculations: When no music was played, there were 84 bottles of wine sold. Of these, 30 were French wine.

$$ \frac{30}{84} = 0.357 = 35.7\% = \frac{\text{cell total}}{\text{column total}} $$

→ 35.7% of the wine sold was French when no music was played.
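
The column-percent calculation is the same for every cell, so it can be written once as a small helper. This sketch uses only the two cell and column totals quoted in the slides (30 of 84 bottles, and 11071 of 37785 persons in the census example); the function name is illustrative.

```python
def column_percent(cell_count, column_total):
    """Cell count expressed as a percent of its column total."""
    return 100 * cell_count / column_total


# French wine among the 84 bottles sold when no music was played.
french_no_music = column_percent(30, 84)

# Cell and column totals quoted in the 2000 U.S. census example.
census_cell_share = column_percent(11071, 37785)

print(f"{french_no_music:.1f}%")
print(f"{census_cell_share:.1f}%")
```

Applying the helper down one whole column gives that column's conditional distribution, whose percents add up to 100%.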

We calculate the column

conditional percents similarly for
each of the nine cells in the table:
For every two-way table, there are two
sets of possible conditional distributions.

Does background music in

supermarkets influence
customer purchasing
decisions?

Wine purchased for each kind of

music played (column percents)

Music played for each

kind of wine purchased
(row percents)
Simpson’s paradox
An association or comparison that holds for all of several groups can
reverse direction when the data are combined (aggregated) to form a
single group. This reversal is called Simpson’s paradox.

Example: Hospital death rates

           Hospital A   Hospital B
Died       63           16
Survived   2037         784
Total      2100         800
% surv.    97.0%        98.0%

On the surface, Hospital B would seem to have a better record.

Patients in good condition
           Hospital A   Hospital B
Died       6            8
Survived   594          592
Total      600          600
% surv.    99.0%        98.7%

Patients in poor condition
           Hospital A   Hospital B
Died       57           8
Survived   1443         192
Total      1500         200
% surv.    96.2%        96.0%

But once patient condition is taken into account, we see that hospital A has in fact a better record for both patient conditions (good and poor).

Here, patient condition was the lurking variable.
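
The reversal can be reproduced from the counts in the tables above. The sketch computes survival rates within each patient condition and then after combining the two conditions; the data structure and names are illustrative.

```python
# (died, survived) counts by patient condition and hospital.
counts = {
    "good": {"A": (6, 594), "B": (8, 592)},
    "poor": {"A": (57, 1443), "B": (8, 192)},
}


def survival_rate(died, survived):
    return survived / (died + survived)


# Survival rate within each condition.
by_condition = {
    condition: {h: survival_rate(*cell) for h, cell in hospitals.items()}
    for condition, hospitals in counts.items()
}

# Survival rate after aggregating over patient condition.
overall = {}
for hospital in ("A", "B"):
    died = sum(counts[c][hospital][0] for c in counts)
    survived = sum(counts[c][hospital][1] for c in counts)
    overall[hospital] = survival_rate(died, survived)

for condition, rates in by_condition.items():
    print(condition, {h: f"{100 * r:.1f}%" for h, r in rates.items()})
print("overall", {h: f"{100 * r:.1f}%" for h, r in overall.items()})
```

Hospital A treats far more patients in poor condition (1500 of its 2100) than Hospital B (200 of 800), which is what drags its combined rate below B's even though it does better within each group.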

Looking at Data–Relationships
2.6 The Question of
Causation

Objectives
2.6 The question of causation

Causation

Common response

Confounding

Establishing causation
Explaining association: causation

Association, however strong, does NOT imply causation.

Example 1: Daughter’s body mass index depends on mother’s body

mass index. This is an example of direct causation.

Example 2: Married men earn more than single men. Can a man
raise his income by getting married?

Only careful experimentation can show causation.

Association and causation

Strong positive linear relationship

[Figure: scatterplot "Children reading skills with shoe size". x-axis: shoe size (0 to 7). y-axis: reading index (0 to 1).]

Not all examples are so obvious…

Explaining association: common response

Students who have high SAT scores in high school have high GPAs
in their first year of college.

This positive correlation can be explained as a common response to

students’ ability and knowledge.

The observed association between two variables x and y could be

explained by a third lurking variable z.

Both x and y change in response to changes in z. This creates an

association even though there is no direct causal link.
Explaining association: confounding

Two variables are confounded when their effects on a response

variable cannot be distinguished from each other. The confounded
variables may be either explanatory variables or lurking variables.

Example: Studies have found that religious people live longer than
nonreligious people.

Religious people also take better care of themselves and are less
likely to smoke or be overweight.
Some possible explanations for an observed association. The
dashed lines show an association. The solid arrows show a
cause-and-effect link. x is explanatory, y is response, and z is a
lurking variable.

Figure 2.28
Introduction to the Practice of Statistics, Sixth Edition
© 2009 W.H. Freeman and Company
Establishing causation
It appears that lung cancer is associated with smoking.
How do we know that both of these variables are not being affected by an
unobserved third (lurking) variable?
For instance, what if there is a genetic predisposition that causes people to
both get lung cancer and become addicted to smoking, but the smoking itself
doesn’t CAUSE lung cancer?

We can evaluate the association using the following criteria:

1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) Alleged cause precedes the effect.
5) The alleged cause is plausible.
Alternate Slides

The following slides offer alternate software

output data and examples for this presentation.
Software output
CrunchIt!

JMP

