Bi Variate 1
Bivariate Data
In this PowerPoint we look at sets of data
which contain two variables.
Outliers Causation
Variables
We are only going to consider
quantitative variables in this AS
Quantitative (numerical): measurements and counts
Qualitative (categorical): define groups
This point
is an outlier
This is called
correlation
We could describe
the rest of the
data as having
a linear form.
Scatter plots
• Use hollow circles for points
• Label axes correctly with units
• What you want to predict goes on the y-axis
(response variable)
• Title of graph
• No background; No gridlines
• No legend unless you need to show categories
• Show different categories on a single graph in
different colours rather than on separate graphs.
• Adjust scale and size of font (14pt for pasting)
What to look for in your plot?
• Direction of the relationship - positive or negative
• Form of the graph - linear or curved
• The strength - whether it is strong, moderate or
weak
• Scatter - constant scatter, a fan effect…
• Outliers
• Groupings
What do you see in this scatter plot?
[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations; x-axis: Latitude (°S), y-axis: Temperature (°C)]
• There appears to be a linear trend.
• There appears to be moderate constant scatter.
• Negative association.
• No outliers or groupings visible.
What do you see in this scatter plot?
[Scatter plot: x-axis: GDP per capita (thousands of dollars), y-axis: Internet Users (%)]
• There appears to be a non-linear trend.
• There appears to be non-constant scatter about the trend line.
• Positive association.
• One possible outlier (large GDP, low % Internet Users).
What do you see in this scatter plot?
[Scatter plot: Average Age New Zealanders are First Married; y-axis: Age]
• Two non-linear trends (Male and Female).
• Very little scatter about the trend lines.
• Negative association.
Describe these relationships
• Perfect, negative, linear relationship
• No relationship
• Perfect, positive, linear relationship
• Moderate, negative linear relationship
• Weak, positive linear relationship
Describe this relationship.
As the hours of study increase,
the test score . . . .? . . .
Pearson’s product-moment correlation
coefficient, r
Correlation measures the strength of the linear association between two
quantitative variables.
[Diagram: scale of r. At each end, points fall exactly on a straight line; in the middle, no linear relationship (uncorrelated).]
[Two example scatter plots of y against x: r = 0.99 and r = 0.57]
Interpreting r
• 0.75 to 1: Strong positive linear association
• 0.5 to 0.75: Moderate positive linear association
• 0.25 to 0.5: Weak positive linear association
• -0.25 to 0.25: No association or weak linear association
• -0.5 to -0.25: Weak negative linear association
• -0.75 to -0.5: Moderate negative linear association
• -1 to -0.75: Strong negative linear association
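The bands above can be turned into a small lookup. This is only a sketch: the function name describe_r is made up for illustration, and because the bands share their end points, the choice to put a boundary value such as 0.5 in the weaker band is an assumption, not something the table states.

```python
def describe_r(r):
    """Return the verbal description of r using the bands listed above."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    # Assumption: a value exactly on a boundary falls in the weaker band.
    if size <= 0.25:
        return "no association or weak linear association"
    if size <= 0.5:
        strength = "weak"
    elif size <= 0.75:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear association"


# The two r values shown on the earlier scatter plots.
print(describe_r(0.99))
print(describe_r(0.57))
```

These descriptions are a rule of thumb; always look at the scatter plot as well.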
Useful websites
• http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
Regression by eye
• http://istics.net/stat/Correlations/
Guessing correlations
• http://illuminations.nctm.org/LessonDetail.aspx?ID=L455#whatif
Effect of outliers
Assumptions
• linear relationship between x and y
• continuous random variables
• The residuals must be normally distributed
• each (x, y) pair must be independent of the other pairs
• all individuals must be selected at random
from the population
• all individuals must have equal chance of
being selected
What is correlation?
Positive Negative
Correlation treats x and y symmetrically.
The correlation of x and y is the same as
the correlation of y with x.
r is a multiple of the slope
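Both statements can be checked numerically. The sketch below uses made-up hours-of-study and test-score values (not data from these slides) and hypothetical helper names. It shows that swapping x and y leaves r unchanged, and that the least-squares slope equals r multiplied by the ratio of the standard deviations of y and x.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson's correlation coefficient from sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def least_squares_slope(x, y):
    """Slope of the least-squares line for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Made-up illustration data.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 60, 68, 71]

r_xy = pearson_r(hours, score)
r_yx = pearson_r(score, hours)
slope = least_squares_slope(hours, score)
sd_ratio = statistics.stdev(score) / statistics.stdev(hours)

print("r(x, y) =", round(r_xy, 4))
print("r(y, x) =", round(r_yx, 4))
print("slope =", round(slope, 4), " r * sd_y / sd_x =", round(r_xy * sd_ratio, 4))
```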
Variables can have a strong
association but still have a small
correlation if the association isn’t
linear.
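A quick way to see this is with a perfect curve. In the sketch below (made-up values, hypothetical helper name), y is completely determined by x through y = x², yet the correlation coefficient is zero because the relationship is not linear.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient from sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# y depends perfectly on x, but along a curve rather than a line.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r_curve = pearson_r(x, y)
print("r for y = x squared:", r_curve)
```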
[Figure: four scatter plots, each marked X or √: Male ($) against Female ($); Temperature (°C) against Latitude (°S); Distance (million miles) against Position Number; Dominant Hand against Non-dominant Hand]
Correlation is sensitive to outliers.
A single outlying value can make a
small correlation large or make a
large one small.
You should be cautious in
interpreting the correlation - these
graphs all have the same
correlation coefficient (0.817)
Data set 1
Data set 2
Data set 3
Data set 4
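The slide does not say where its four data sets come from. As an assumed stand-in, the sketch below uses Anscombe's quartet, a well-known group of four data sets that look very different when plotted but share almost the same correlation coefficient (roughly 0.816, close to the value quoted above). The helper name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient from sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet (assumed stand-in for the slide's four data sets).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
anscombe = {
    "Data set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68]),
    "Data set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                          6.13, 3.10, 9.13, 7.26, 4.74]),
    "Data set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                          6.08, 5.39, 8.15, 6.42, 5.73]),
    "Data set 4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                        5.25, 12.50, 5.56, 7.91, 6.89]),
}

coefficients = {name: pearson_r(x, y) for name, (x, y) in anscombe.items()}
for name, r in coefficients.items():
    print(f"{name}: r = {r:.3f}")
```

Plotting each data set is the only way to tell them apart; the coefficient alone cannot.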
Outliers can distort the
correlation dramatically. An
outlier can make an otherwise
small correlation look big or hide
a large correlation. It can even
give an otherwise positive
association a negative correlation
coefficient (and vice versa).
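The sign flip is easy to reproduce. In this sketch (made-up values, hypothetical helper name), five points lie close to a rising straight line, and adding one extreme point is enough to make the correlation coefficient negative.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient from sums of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: five points with a clear positive linear trend.
x = [1, 2, 3, 4, 5]
y = [1.2, 1.9, 3.1, 3.9, 5.2]
r_without = pearson_r(x, y)

# Add a single outlier far from the rest of the data.
r_with = pearson_r(x + [10], y + [-10])

print("r without the outlier:", round(r_without, 3))
print("r with the outlier:   ", round(r_with, 3))
```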
What do you see in this scatterplot?
[Scatter plot: Height and Foot Size for 30 Year 10 Students; x-axis: Foot size (cm), y-axis: Height (cm)]
• Appears to be a linear trend, with a possible outlier (tall person with a small foot size).
• Appears to be
• Positive association.
What will happen to the correlation
coefficient if the tallest Year 10 student is
removed?
[Scatter plot: Height and Foot Size for 30 Year 10 Students; y-axis: Height (cm)]
• It will get smaller
• It will get bigger
[Scatter plot: x-axis: Gestation (Days), y-axis: Life Expectancy (Years); one point labelled Elephant]
• Outlier in X (the elephant).
• Appears to be constant scatter.
• Positive association.
What will happen to the correlation
coefficient if the elephant is removed?
[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals; x-axis: Gestation (Days)]
• It will get bigger
• It will stay the same
How does the outlier affect the r-value?
When you see an outlier, it’s
often a good idea to report the
correlations with and without the
point.
Don’t confuse correlation with
causation. Scatterplots and
correlation never prove
causation.
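Using the information in the plot, can you
suggest what needs to be done in a
country to increase life expectancy?
[Scatter plot: x-axis: People per Doctor]
It looks like if you decrease the number of people
per doctor (i.e. have more Doctors per person),
then the life expectancy will increase.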
Using the information in this plot,
can you make another suggestion as
to what needs to be done in a
country to increase life expectancy?
[Scatter plot: Life Expectancy and Availability of Televisions for a Sample of 40 Countries; y-axis: Life Expectancy]
It looks like if you decrease the number of people
per television (i.e. have more TVs per person),
r = -0.93
TV watching
Causal relationships
• Two general types of studies: experiments
and observational studies
• In an experiment, the experimenter
determines which experimental units
receive which treatments.
• In an observational study, we simply
compare units that happen to have received
each of the treatments.
• Only properly designed and carefully
executed experiments can reliably
demonstrate causation.
• An observational study is often useful for
identifying possible causes of effects, but it
cannot reliably establish causation
• In observational studies, strong
relationships are not necessarily causal
relationships.
• Correlation does not imply causation.
• Be aware of the possibility of lurking
variables.
Watch out for lurking variables.
Damage ($) vs number of firemen
would show a strong correlation,
but damage doesn’t cause firemen,
and although firemen do seem to cause
some damage (spraying water and
chopping holes), the underlying
variable is the size of the blaze.
Although there was plenty of
evidence that increased smoking
was associated with increased
levels of lung cancer, it took
years to provide evidence that
smoking actually causes lung
cancer.
It would be a good idea to read
the two pages of notes you have
that discuss correlation and
causation!
So now you want to know how to
calculate the correlation
coefficient, r.
Here is one version of the
formula!
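The formula from the slide is not reproduced in this text. One commonly used version, given here as an assumption about what the slide showed, writes r in terms of deviations from the means, where x̄ and ȳ are the means of the x and y values and the sums run over all n data pairs:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when large x values tend to go with large y values and negative when they tend to go with small y values; the denominator scales the result so that r always lies between -1 and 1.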
Luckily the computer will
calculate R and you can square
2