Anscombe's Quartet: Data Sets
073/MSMS/859
Assignment 01 (Data Science)
Anscombe's Quartet:
Anscombe’s quartet is a classic example of the drawback of reporting only the correlation.
Francis Anscombe illustrated, in his 1973 paper in The American Statistician, how four different
pairs of variables can yield the same correlation coefficient while the relationship within each
pair is completely different. The quartet was constructed to demonstrate both the importance of
graphing data before analyzing it and the effect of outliers on statistical properties.
The quartet contains four datasets that have nearly identical simple descriptive statistics, yet
appear very different when graphed.
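The claim of nearly identical descriptive statistics can be checked directly. The sketch below is only an illustration: it assumes the standard published values of the quartet (they are typed in by hand here), and the names `quartet` and `summarize` are made up for this example.

```python
from statistics import mean, variance

# Assumption: the standard published values of Anscombe's quartet.
# Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, Pearson r and the least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    # Print each statistic rounded to three decimals for easy comparison.
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under these assumptions, all four sets should print almost the same means, variances, correlation (about 0.816) and fitted line, even though their scatter plots look nothing alike.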
He described the article as being intended to counter the impression among statisticians that
"numerical calculations are exact, but graphs are rough."
Data Sets:
X1 Y1 X2 Y2 X3 Y3 X4 Y4
[Figure: X1 vs Y1, scatter plot with linear trend line]
[Figure: X2 vs Y2, scatter plot with linear trend line; y = 0.5x + 3.0009, R² = 0.6662]
The second graph is not distributed normally; while a relationship between the two variables is
obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more
general regression and the corresponding coefficient of determination would be more
appropriate.
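As one possible instance of a "more general regression", the sketch below fits a quadratic to the second data set and compares its coefficient of determination with that of the straight line. The quadratic form is an assumption chosen for illustration, the data are the standard published values typed in by hand, and the helper names (`solve`, `polyfit`, `r_squared`) are hypothetical.

```python
# Assumption: standard published values of the second data set.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


def polyfit(x, y, degree):
    """Least-squares polynomial fit via the normal equations.

    Coefficients are returned from the constant term upwards.
    """
    p = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(p)] for i in range(p)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(p)]
    return solve(a, b)


def r_squared(x, y, coef):
    """Coefficient of determination of the polynomial given by coef."""
    mean_y = sum(y) / len(y)
    fitted = [sum(c * xi ** k for k, c in enumerate(coef)) for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


linear = polyfit(x, y, 1)
quadratic = polyfit(x, y, 2)
r2_linear = r_squared(x, y, linear)
r2_quadratic = r_squared(x, y, quadratic)
print("linear R^2:", round(r2_linear, 4))
print("quadratic R^2:", round(r2_quadratic, 4))
```

With these values the quadratic should explain almost all of the variation, while the straight line explains only about two thirds of it, which is why the linear correlation is a poor summary here.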
[Figure: X3 vs Y3, scatter plot with linear trend line; y = 0.4997x + 3.0025, R² = 0.6663]
In the third graph, the relationship is linear, but it should have a different regression line (a
robust regression would have been called for). The calculated regression is offset by the one
outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
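The effect of the single outlier can be sketched in two ways: by dropping the point with the largest residual and recomputing the correlation, and by fitting a robust line. The Theil-Sen estimator (median of pairwise slopes) is used below only as one example of a robust regression, not as the method the original analysis had in mind. The data are the standard published values typed in by hand, and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, median

# Assumption: standard published values of the third data set.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def least_squares(x, y):
    """Return (slope, intercept, Pearson r) of the ordinary least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5


def theil_sen(x, y):
    """Theil-Sen estimator: median pairwise slope, then median intercept."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


slope, intercept, r_with = least_squares(x, y)

# Treat the point with the largest absolute residual as the outlier.
residuals = [abs(yi - (slope * xi + intercept)) for xi, yi in zip(x, y)]
worst = residuals.index(max(residuals))
outlier = (x[worst], y[worst])

x_clean = [xi for i, xi in enumerate(x) if i != worst]
y_clean = [yi for i, yi in enumerate(y) if i != worst]
_, _, r_without = least_squares(x_clean, y_clean)

robust_slope, robust_intercept = theil_sen(x, y)

print("outlier:", outlier)
print("r with outlier:", round(r_with, 3))
print("r without outlier:", round(r_without, 4))
print("robust line:", round(robust_slope, 3), round(robust_intercept, 3))
```

The robust line is computed on all eleven points, yet it should follow the ten collinear points closely because a median is barely affected by a single extreme slope.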
[Figure: X4 vs Y4, scatter plot with linear trend line; y = 0.4999x + 3.0017, R² = 0.6667]
Finally, the fourth graph shows an example in which one outlier is enough to produce a high
correlation coefficient, even though the other data points do not indicate any relationship
between the variables.
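A small check makes this concrete. The sketch below assumes the standard published values of the fourth data set (typed in by hand) and uses a hypothetical helper, `pearson_r`, that returns `None` when the correlation is undefined because one variable does not vary.

```python
from statistics import mean

# Assumption: standard published values of the fourth data set.
x = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(x, y):
    """Pearson correlation, or None when either variable has zero variance."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant variable
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


r_all = pearson_r(x, y)

# Drop the single point whose x value differs from all the others.
rest = [(a, b) for a, b in zip(x, y) if a != 19]
x_rest = [a for a, _ in rest]
y_rest = [b for _, b in rest]
r_rest = pearson_r(x_rest, y_rest)

print("r with all points:", round(r_all, 3))
print("r without the outlier:", r_rest)
```

Once that point is removed, every remaining x value is 8, so x has no variance and the correlation cannot even be computed. The high coefficient for the full set rests entirely on one observation.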
It is not known how Anscombe created his datasets. Since its publication, several methods to
generate similar data sets with identical statistics and dissimilar graphics have been developed.