Two Quantitative Variables: Scatterplot, Correlation, and Linear Regression
Example. When a US president runs for re-election, how strong is the relationship between
the president’s approval rating and the outcome of the election? The table below includes
all the presidential elections since 1940 in which an incumbent was running and shows the
presidential approval rating at the time of the election.
1. What was the highest approval rating for any of the losing presidents? What was the
lowest approval rating for any of the winning presidents? Make a conjecture about the
approval rating needed by a sitting president in order to win re-election.
2. Approval rating and margin of victory are both quantitative variables. Does there seem
to be an association between the two variables?
Scatterplot
A scatterplot is a graph of the relationship between two quantitative variables.
A scatterplot includes a pair of axes with appropriate numerical scales, one for each variable.
The paired data for each case are plotted as a point on the scatterplot. If there are explanatory
and response variables, we put the explanatory variable on the x-axis and the response
variable on the y-axis.
Example. Draw a scatterplot for the data on approval rating and margin of victory in the
table above.
Interpreting a Scatterplot
When looking at a scatterplot we often address the following questions:
• Do the points form a clear trend with a particular direction, are they more scattered
about a general trend, or is there no obvious pattern?
• If there is a trend, does it seem to follow a straight line, which we call a linear
association, or some other curve or pattern?
• Are there any outlier points that are clearly distinct from a general pattern in the data?
Example. Four scatterplots are shown in the figure below. For each pair of variables, discuss
the information contained in the scatterplot. If there appears to be a positive or negative
association, discuss what that means in the specific context.
Summarizing a Relationship between Two Quantitative Variables: Correlation
Just as the mean or median summarizes the center and the standard deviation or IQR
measures the spread of the distribution for a single quantitative variable, we need a numerical
statistic to measure the strength and direction of association between two quantitative
variables. One such statistic is the correlation.
Correlation
The correlation is a measure of the strength and direction of linear association between two
quantitative variables.
The correlations for each of the pairs of variables that have been displayed in scatterplots
earlier in this section are displayed below.
• The correlation r has no units and is independent of the scale of either variable.
• The correlation is symmetric: The correlation between variables x and y is the same
as the correlation between y and x.
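The two properties above can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper `correlation` and the small data set are made up purely for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data (illustrative values only).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Change the units of y (for example, inches to centimeters).
y_rescaled = [2.54 * value for value in y]

r_original = correlation(x, y)
r_rescaled = correlation(x, y_rescaled)   # same r: no units, scale-free
r_swapped = correlation(y, x)             # same r: symmetric

print(round(r_original, 4), round(r_rescaled, 4), round(r_swapped, 4))
```

All three printed values agree, since rescaling a variable or swapping the roles of x and y leaves r unchanged.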
Example. Common folk wisdom claims that one can determine the temperature on a summer
evening by counting how fast the crickets are chirping. Is there really an association
between chirp rate and temperature? The data below were collected by E.A. Bessey and
C.A. Bessey, who measured chirp rates for crickets and temperature during the summer of
1898.
1. Use the scatterplot to estimate the correlation between chirp rate and temperature.
Explain your reasoning.
Example. The figure below shows the estimated average life expectancy (in years) for a
sample of 40 countries against the average amount of fat (measured in grams per capita per
day) in the food supply for each country. The scatterplot shows a clear positive association
(r = 0.70) between these two variables. The countries with short life expectancies all have
below-average fat consumption, while the countries consuming more than 100 grams of fat
on average all have life expectancies over 70 years. Does this mean that we should eat more
fat to live longer?
Correlation Caution #1
A strong positive or negative correlation does not (necessarily) imply a cause and effect
relationship between the two variables.
Example. Core body temperature for an individual person tends to fluctuate during the
day according to a regular circadian rhythm. Suppose that the body temperature for an
adult woman is recorded every hour of the day, starting at 6 am. The results are shown in
the figure below. Does there appear to be an association between the time of day and body
temperature? Estimate the correlation between the hour of the day and the woman’s body
temperature.
Correlation Caution #2
A correlation near zero does not (necessarily) mean that the two variables are not associated,
since the correlation measures only the strength of a linear relationship.
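A small numerical illustration of this caution, offered as a sketch with made-up data: y is completely determined by x (y = x squared), yet the correlation is zero because the relationship is curved rather than linear. The helper `correlation` is a hypothetical name introduced only for this example.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data with a perfect but non-linear (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = correlation(x, y)
print(r)  # 0.0
```

The positive and negative products cancel exactly, so r = 0 even though knowing x tells us y precisely.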
Example. The figure below shows the alcohol consumption (drinks per week) and average
daily caloric intake for 91 subjects who are at least 60 years old, from the data in
NutritionStudy. Notice the distinct outlier who claims to imbibe 203 drinks per week as part of
a 6662 calorie diet! This is almost certainly an incorrect observation. The second plot shows
these same data with the outlier removed. How do you think the correlation between calories
and alcohol consumption changes when the outlier is deleted?
Correlation Caution #3
Correlation can be heavily influenced by outliers. Always plot your data.
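To see how much a single point can matter, here is a minimal sketch. The five ordinary cases are invented for illustration (they are not the NutritionStudy data); only the outlier of 203 drinks and 6662 calories comes from the example above. The helper `correlation` is a hypothetical name.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented "ordinary" cases: drinks per week and daily calories.
drinks = [0, 1, 2, 3, 4]
calories = [2000, 2400, 1800, 2300, 2000]

r_without = correlation(drinks, calories)

# Add the single extreme case described in the example.
r_with = correlation(drinks + [203], calories + [6662])

print(round(r_without, 3), round(r_with, 3))
```

With these invented values the correlation is close to zero without the outlier and close to 1 with it, which is why a plot should always accompany a correlation.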
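The correlation can be computed as
$r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$
This formula essentially involves converting all values for both variables to z-scores, which
puts the correlation on a fixed ±1 scale and makes it independent of the scale of measurement.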
For a positive association, large values of x tend to occur with large values of y (both z-scores
are positive) and small values (with negative z-scores) tend to occur together. In either case,
the products are positive, which leads to a positive sum. For a negative association, the
z-scores tend to have opposite signs (small x with large y and vice versa) so the products
tend to be negative.
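The z-score description above can be sketched in a few lines. This assumes the usual sample form of the calculation (sample standard deviations, and the sum of products divided by n − 1); the function name and the data are hypothetical.

```python
from statistics import mean, stdev

def correlation_from_z_scores(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)          # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]    # z-scores for x
    z_y = [(y - y_bar) / s_y for y in ys]    # z-scores for y
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Made-up data: one perfectly increasing line, one perfectly decreasing line.
x = [1, 2, 3, 4]
r_up = correlation_from_z_scores(x, [3, 5, 7, 9])
r_down = correlation_from_z_scores(x, [9, 7, 5, 3])

print(round(r_up, 6), round(r_down, 6))
```

When the z-scores share signs the products are positive and r is positive; when they have opposite signs the products, and therefore r, are negative.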
The Regression Line
The process of fitting a line to a set of data is called linear regression, and the line of best
fit is called the regression line. The regression line provides a model of a linear association
between two variables, and we can use the regression line on a scatterplot to give a predicted
value of the response variable, based on a given value of the explanatory variable.
Example. Use the regression line in the figure below to estimate the predicted tip amount
on a $60 bill.
Explanatory and Response Variables
The regression line to predict y from x is NOT the same as the regression line to predict x
from y. Be sure to always pay attention to which is the explanatory variable and which is
the response variable.
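The point that the two regression lines differ can be checked with a short sketch. The data are made up, and `least_squares` is a hypothetical helper that uses the standard least-squares formulas (which these notes do not state explicitly).

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line: predicted ys = a + b * xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Made-up paired data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

a_yx, b_yx = least_squares(x, y)   # predict y from x
a_xy, b_xy = least_squares(y, x)   # predict x from y

# If the two lines were the same, the slope for predicting x from y
# would be the reciprocal of the slope for predicting y from x.
print(b_yx, b_xy, 1 / b_yx)
```

Here the slope for predicting x from y is about 0.8, not the 1.25 obtained by solving the first line for x, so the two lines are different.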
A regression line is always in the form
$\widehat{\text{Response}} = a + b \cdot \text{Explanatory}$
The equation of the regression line is often called a prediction equation because we can use
it to make predictions. We substitute the value of the explanatory variable into the prediction
equation to calculate the predicted response.
Example. Three different bill amounts from the RestaurantTips dataset are given. In
each case, use the regression line $\widehat{\text{Tip}} = -0.292 + 0.182 \cdot \text{Bill}$ to predict the tip.
1. A bill of $59.33
2. A bill of $9.52
3. A bill of $23.70
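The substitution can be sketched as follows, using the prediction equation from this example (intercept −0.292, slope 0.182). The function name `predict_tip` is a hypothetical choice for illustration.

```python
def predict_tip(bill):
    """Predicted tip (dollars) from the regression line in the example."""
    return -0.292 + 0.182 * bill

bills = [59.33, 9.52, 23.70]
predictions = [round(predict_tip(bill), 2) for bill in bills]

for bill, tip in zip(bills, predictions):
    print(f"Bill ${bill:.2f} -> predicted tip ${tip:.2f}")
# Bill $59.33 -> predicted tip $10.51
# Bill $9.52 -> predicted tip $1.44
# Bill $23.70 -> predicted tip $4.02
```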
Residuals
The residual at a data value is the difference between the observed and predicted values of
the response variable:
Residual = Observed − Predicted = $y - \hat{y}$
On a scatterplot, the residual represents the vertical deviation from the line to a data point.
Points above the line have positive residuals and points below the line have negative residuals.
If the predicted values closely match the observed data values, the residuals will be small.
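A minimal sketch of the residual calculation, using the tip prediction equation from the earlier example (intercept −0.292, slope 0.182). The observed tips of $12.00 and $9.00 are hypothetical values chosen for illustration, not values from the dataset.

```python
def predict_tip(bill):
    """Predicted tip (dollars) from the regression line in the earlier example."""
    return -0.292 + 0.182 * bill

def residual(observed, predicted):
    """Residual = observed response minus predicted response."""
    return observed - predicted

bill = 59.33
predicted = predict_tip(bill)

# Hypothetical observed tips for the same bill amount.
residual_above = residual(12.00, predicted)   # point lies above the line
residual_below = residual(9.00, predicted)    # point lies below the line

print(round(residual_above, 2), round(residual_below, 2))  # 1.49 -1.51
```

A tip larger than predicted gives a positive residual, and a tip smaller than predicted gives a negative one.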
Example. In the previous example, we found the predicted tip amount for three different
bills in the RestaurantTips dataset. The actual tips left by each of these customers are
shown below. Use the information to calculate the residuals for each of these sample points.
Example. The data from ElectionMargin are given below.
1. Use the regression line $\widehat{\text{Margin}} = -36.5 + 0.836 \cdot \text{Approval}$ to
calculate the predicted values and the residuals for all the data points.
2. Which residual is the largest? For this largest residual, is the observed margin higher
or lower than the margin predicted by the regression line? To which president and year
does this residual correspond?
• The slope b represents the predicted change in the response variable y given a one unit
increase in the explanatory variable x.
• The intercept a represents the predicted value of the response variable y if the
explanatory variable x is zero. The interpretation may be nonsensical since it is often not
reasonable for the explanatory variable to be zero.
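These two interpretations can be illustrated with the election regression line from the earlier example (intercept −36.5, slope 0.836). This is only a sketch; the function name `predict_margin` and the approval values 50 and 51 are chosen for illustration.

```python
def predict_margin(approval):
    """Predicted margin of victory from the election regression line."""
    return -36.5 + 0.836 * approval

# Slope: change in the prediction for a one-unit increase in approval.
change_per_point = predict_margin(51) - predict_margin(50)

# Intercept: the prediction when the explanatory variable is zero.
margin_at_zero = predict_margin(0)

print(round(change_per_point, 3), margin_at_zero)  # 0.836 -36.5
```

An approval rating of zero is far outside any realistic value, so the intercept of −36.5 is better read as a feature of the line than as a meaningful prediction.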
Example. In an earlier example, we considered some scatterplots from the dataset
FloridaLakes showing relationships between acidity, alkalinity, and fish mercury levels in n = 53
Florida lakes. We wish to predict a quantity that is difficult to measure (mercury level of
fish) using a value that is more easily obtained from a water sample (acidity). We saw that
there appears to be a negative linear association between these two variables, so a regression
line is appropriate.
1. Use technology to find the regression line to predict Mercury from pH, and plot it on
a scatterplot of the data.
2. Interpret the slope of the regression line in the context of Florida lakes.
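For readers curious about what the technology does, here is one possible sketch. It assumes the standard least-squares formulas, slope b = r · s_y / s_x and intercept a = ȳ − b · x̄, which are not stated in these notes. The pH and mercury values are invented for illustration and are not the FloridaLakes data.

```python
from statistics import mean, stdev

def regression_line(xs, ys):
    """Return (a, b) for the least-squares line: predicted y = a + b * x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation from z-scores, then slope and intercept from r.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Invented values (NOT the FloridaLakes data).
ph = [4.5, 5.5, 6.0, 7.0, 8.0]
mercury = [1.1, 0.8, 0.6, 0.4, 0.2]

a, b = regression_line(ph, mercury)
residuals = [y - (a + b * x) for x, y in zip(ph, mercury)]

print(round(a, 3), round(b, 3))
```

With these invented values the slope is negative, matching the direction of association described in the example, and the residuals sum to zero as they do for any least-squares line.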
Regression Caution #1
Avoid trying to apply a regression line to predict values far from those that were used to
create it.
Example. In the previous example, we used the acidity (pH) of Florida lakes to predict
mercury levels in fish. Suppose that, instead of mercury, we use acidity to predict the calcium
concentration (mg/l) in Florida lakes. The figure below shows a scatterplot of these data
with the regression line $\widehat{\text{Calcium}} = -51.4 + 11.17 \cdot \text{pH}$ for the 53 lakes in our sample. Give an
interpretation for the slope in this situation. Does the intercept make sense? Comment on
how well the linear prediction equation describes the relationship between these two variables.
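One way to probe the intercept question is to evaluate the calcium prediction equation (intercept −51.4, slope 11.17) at a few pH values. This is a sketch only; the function name and the particular pH values tried are assumptions made for illustration.

```python
def predict_calcium(ph):
    """Predicted calcium concentration (mg/l) from the regression line."""
    return -51.4 + 11.17 * ph

at_zero = predict_calcium(0)      # the intercept
at_four = predict_calcium(4.0)    # still a negative prediction
at_seven = predict_calcium(7.0)   # a positive prediction

# pH at which the line crosses zero; below this the prediction is negative.
zero_crossing = 51.4 / 11.17

print(round(at_zero, 2), round(at_four, 2), round(at_seven, 2),
      round(zero_crossing, 2))
# -51.4 -6.72 26.79 4.6
```

A negative concentration is impossible, so the line cannot be trusted for pH values below about 4.6, and in particular not at pH = 0.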
Regression Caution #2
Always plot the data. Although the regression line can be calculated for any set of paired
quantitative variables, it is only appropriate to use a regression line when there is a linear
trend in the data.
Regression Caution #3
Outliers can have a strong influence on the regression line, just as we saw for correlation. In
particular, data points for which the explanatory value is an outlier are often called influential
points because they exert an overly strong effect on the regression line.
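A small sketch of this effect, using made-up data and a hypothetical helper `least_squares` based on the standard least-squares formulas. One extra point is added in two ways: once with an unusual response value in the middle of the x range, and once with an extreme explanatory value.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line: predicted y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up data with a positive trend.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

_, slope_original = least_squares(x, y)

# Unusual y value, but x sits at the center of the data.
_, slope_y_outlier = least_squares(x + [3], y + [12])

# Extreme x value: an influential point.
_, slope_x_outlier = least_squares(x + [20], y + [0])

print(round(slope_original, 3), round(slope_y_outlier, 3),
      round(slope_x_outlier, 3))
```

In this example the unusual point at the center of the x values leaves the slope unchanged (it only raises the line), while the single point with an extreme x value turns a positive slope into a negative one.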