Bi Variate 1

Uploaded by

Alex Manihuruk
Copyright
© © All Rights Reserved

Bivariate Data analysis

Bivariate Data
In this PowerPoint we look at sets of data
which contain two variables.

• Scatter plots
• Correlation
• Outliers
• Causation
Variables
We are only going to consider
quantitative variables in this AS

Quantitative (numerical): measurements and counts
• Continuous
• Discrete
Qualitative (categorical): define groups
• Categorical (no idea of order)
• Ordinal (fall in natural order)
Quantitative
Discrete
• Many repeated values
• Age groups
• Marks
Continuous
• Few repeated values
• Height
• Length
• Weight
Qualitative
Categorical
• Gender
• Religious denomination
• Blood types
• Sports numbers (e.g. he wears the number ‘8’ jersey)
Ordinal
• Grades
• Places in a race (e.g. 1st, 2nd, 3rd)
We often want to know if there is
a relationship between two
numerical variables.

A scatter plot, which gives a visual display


of the relationship between two variables,
provides a good starting point.
In a relationship involving two variables, if
the values of one variable ‘depend’ on the
values of another variable, then the former
variable is referred to as the dependent (or
response) variable and the latter variable is
referred to as the independent (or explanatory)
variable.

y - axis dependent (response) variable


x - axis independent (explanatory) variable
Consider data on ‘hours of study’ vs ‘test score’

Hours  Score    Hours  Score    Hours  Score
18     59       14     54       17     59
16     67       17     72       16     76
22     74       14     63       14     59
27     90       19     72       29     89
15     62       20     58       30     93
28     89       10     47       30     96
18     71       28     85       23     82
19     60       25     75       26     35
22     84       18     63       22     78
30     98       19     61
We may want to see if we could predict
the test score (response variable) based on the
hours of study (explanatory variable).

y - axis: Test score


x - axis: Hours of study
We look for a pattern in the way the points lie.
Certain patterns tell us about the relationship.
This is called correlation.

This point is an outlier.
We could describe the rest of the data as having
a linear form.
Scatter plots
• Use hollow circles for points
• Label axes correctly with units
• What you want to predict goes on the y-axis
(response variable)
• Title of graph
• No background; No gridlines
• No legend, unless you need to show categories
• Show different categories on a single graph in
different colours rather than on separate graphs.
• Adjust scale and size of font (14pt for pasting)
What to look for in your plot?
• Direction of the relationship - positive or negative
• Form of the graph - linear or curved
• The strength - whether it is strong, moderate or
weak
• Scatter - constant scatter, a fan effect…
• Outliers
• Groupings
What do you see in this scatter plot?

[Figure: scatter plot, "Mean January Air Temperatures for 30 New Zealand Locations". x-axis: Latitude (°S); y-axis: Temperature (°C)]

• There appears to be a linear trend.
• There appears to be moderate constant scatter.
• Negative association.
• No outliers or groupings visible.
What do you see in this scatter plot?

[Figure: scatter plot, "% of population who are Internet Users vs GDP per capita for 202 Countries". x-axis: GDP per capita (thousands of dollars); y-axis: Internet Users (%)]

• There appears to be a non-linear trend.
• There appears to be non-constant scatter about the trend line.
• Positive association.
• One possible outlier (large GDP, low % Internet Users).
What do you see in this scatter plot?

[Figure: scatter plot, "Average Age New Zealanders are First Married". x-axis: Year (1930 to 1990); y-axis: Age]

• Two non-linear trends (Male and Female).
• Very little scatter about the trend lines.
• Negative association until about 1970, then a positive association.
• Gap in the data collection (Second World War).
Rank these relationships from
weakest (1) to strongest (4):

[Figure: four scatter plots, labelled with the ranks 2, 4, 1 and 3]
Describe these relationships

[Figure: five scatter plots, labelled as follows]
• Perfect, negative, linear relationship
• No relationship
• Perfect, positive, linear relationship
• Moderate, negative linear relationship
• Weak, positive linear relationship
Describe this relationship.
As the hours of study increase,
the test score . . . .? . . .
Pearson’s product-moment correlation
coefficient, r
Correlation measures the strength of
the linear association between two
quantitative variables.

The correlation coefficient may take any value
between -1.0 and +1.0.

[Figure: scatter plots for r = -1, r = -0.7, r = -0.4, r = 0, r = 0.3, r = 0.8 and r = 1. At r = -1 and r = 1 the points fall exactly on a straight line; at r = 0 there is no linear relationship (uncorrelated).]
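As a rough sketch of what the coefficient measures, the Python below computes r from the usual textbook definition: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the two tiny data sets are made up for illustration and do not come from these slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: points lying exactly on a straight line.
r_up = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])    # rising line
r_down = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])  # falling line
print(r_up, r_down)
```

Points on a rising line give r at the top of the range and points on a falling line give r at the bottom, matching the two "points fall exactly on a straight line" cases.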
r - what does it tell you?

How close the points in the
scatter plot come to lying on the line.

[Figure: two scatter plots of y against x, one with r = 0.99 and one with r = 0.57]
Interpreting r
• 0.75 to 1: Strong positive linear association
• 0.5 to 0.75: Moderate positive linear association
• 0.25 to 0.5: Weak positive linear association
• -0.25 to 0.25: No association or weak linear association
• -0.5 to -0.25: Weak negative linear association
• -0.75 to -0.5: Moderate negative linear association
• -1 to -0.75: Strong negative linear association
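The bands above can be turned into a small lookup, sketched here in Python. The function name describe_r is hypothetical, and because the bands share their end points, the code assumes that a value exactly on a boundary (such as 0.5) belongs to the stronger band; the slides do not say which way to round.

```python
def describe_r(r):
    """Return the verbal description from the table above for a value of r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    # Assumption: a value exactly on a boundary goes to the stronger band.
    if size < 0.25:
        return "No association or weak linear association"
    if size < 0.5:
        strength = "Weak"
    elif size < 0.75:
        strength = "Moderate"
    else:
        strength = "Strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear association"

print(describe_r(0.85))
print(describe_r(-0.4))
```

A description like this is only sensible after the scatter plot has been checked for a linear form.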
Useful websites
• http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
Regression by eye

• http://istics.net/stat/Correlations/
Guessing

• http://illuminations.nctm.org/LessonDetail.aspx?ID=L455#whatif
Effect of outliers
Assumptions
• linear relationship between x and y
• continuous random variables
• The residuals must be normally distributed
• the observations (the x, y pairs) must be independent of each other
• all individuals must be selected at random
from the population
• all individuals must have equal chance of
being selected
What is correlation?

A measure of the strength of a


LINEAR association between two
quantitative variables.
Sure, you can calculate a
correlation coefficient for any
pair of variables, but correlation
measures only the strength of the
linear association and will be
misleading if the relationship is
not linear.
Do you know that:
• Correlation applies only to quantitative
variables. Check you know the units and
what they measure.
• Outliers can distort the correlation
dramatically.
Some facts about the correlation
coefficient
• The sign gives the direction of the association.
• Correlation is always between -1 and 1.
• Correlation treats x and y symmetrically. The correlation
of x and y is the same as the correlation of y with x.
• Correlation has no units and is generally given as a
decimal.
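• r is a multiple of the slope of the least-squares line:
r = slope × (s_x / s_y), where s_x and s_y are the standard deviations of x and y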
• Note: variables can have a strong association but still have
a small correlation if the association isn’t linear.
• Correlation is sensitive to outliers. A single outlying value
can make a small correlation large or make a large one
small.
The sign gives the direction of the
association.

Positive Negative
Correlation treats x and y symmetrically.
The correlation of x and y is the same as
the correlation of y with x.
r is a multiple of the slope
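Both of these facts can be checked numerically. The sketch below uses made-up data and the standard identity r = slope × (s_x / s_y), where the slope is that of the least-squares line of y on x; the identity is a textbook result rather than something spelled out on the slide, and the helper name sum_of_products is hypothetical.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]  # made-up values for illustration

def sum_of_products(us, vs):
    """Sum of products of deviations from the means of us and vs."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))

sxx = sum_of_products(xs, xs)
syy = sum_of_products(ys, ys)
sxy = sum_of_products(xs, ys)

# Symmetry: swapping the roles of x and y gives the same coefficient.
r_xy = sxy / math.sqrt(sxx * syy)
r_yx = sum_of_products(ys, xs) / math.sqrt(syy * sxx)

# Slope of the least-squares line of y on x, rescaled by s_x / s_y.
slope = sxy / sxx
r_from_slope = slope * math.sqrt(sxx / syy)  # s_x / s_y equals sqrt(sxx / syy)

print(r_xy, r_yx, r_from_slope)
```

Because s_x and s_y are both positive, r always has the same sign as the slope.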
Variables can have a strong
association but still have a small
correlation if the association isn’t
linear.

Always plot the data before


looking at the correlation!
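A small made-up example shows how far this can go: below, y is completely determined by x (y = x²), yet the correlation coefficient comes out as zero because the relationship is a symmetric curve rather than a line. The data and the function name pearson_r are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: a perfect, but curved, relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Only a plot would reveal the strong curved pattern that the coefficient misses.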
Would it be OK to use a correlation
coefficient to describe the strength
of the relationship?

[Figure: four scatter plots]
• Distances of Planets from the Sun (x-axis: Position Number; y-axis: Distance (million miles))
• Reaction Times (seconds) for 30 Year 10 Students (x-axis: Non-dominant Hand; y-axis: Dominant Hand)
• Average Weekly Income for Employed New Zealanders in 2001 (x-axis: Female ($); y-axis: Male ($))
• Mean January Air Temperatures for 30 New Zealand Locations (x-axis: Latitude (°S); y-axis: Temperature (°C))
Correlation is sensitive to outliers.
A single outlying value can make a
small correlation large or make a
large one small.
You should be cautious in
interpreting the correlation - these
graphs all have the same
correlation coefficient (0.817)
Data set 1
Data set 2
Data set 3
Data set 4
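The four graphs themselves are not reproduced in this text. As a stand-in, the sketch below uses Anscombe's quartet, a well-known set of four small data sets built to have nearly identical correlation coefficients despite very different shapes; that the slide's graphs are the same data is an assumption. The function name pearson_r is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet (assumed here to resemble the slide's four data sets).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "Data set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Data set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Data set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Data set 4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

The printed values agree to about two decimal places, even though only one of the four sets looks like a simple linear scatter when plotted.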
Outliers can distort the
correlation dramatically. An
outlier can make an otherwise
small correlation look big or hide
a large correlation. It can even
give an otherwise positive
association a negative correlation
coefficient (and vice versa).
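The sign change can be seen with a very small made-up example: five points on a rising line, then the same five points plus one extreme outlier. All values and names here are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: a perfect positive linear association.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = pearson_r(xs, ys)

# Add one made-up outlier far to the right and far below the other points.
r_with_outlier = pearson_r(xs + [20], ys + [-20])

print(r_clean, r_with_outlier)
```

One point out of six is enough to turn a perfect positive correlation into a negative one.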
What do you see in this scatterplot?

[Figure: scatter plot, "Height and Foot Size for 30 Year 10 Students". x-axis: Foot size (cm); y-axis: Height (cm)]

• Appears to be a linear trend, with a possible outlier (tall person with a small foot size).
• Appears to be constant scatter.
• Positive association.
What will happen to the correlation
coefficient if the tallest Year 10 student is
removed?

[Figure: scatter plot, "Height and Foot Size for 30 Year 10 Students". x-axis: Foot size (cm); y-axis: Height (cm)]

• It will get smaller
• It will get bigger
• It will stay the same


What do you see in this scatter plot?

[Figure: scatter plot, "Life Expectancies and Gestation Period for a sample of non-human Mammals". x-axis: Gestation (Days); y-axis: Life Expectancy (Years); one point is labelled "Elephant"]

• Appears to be a strong linear trend.
• Outlier in X (the elephant).
• Appears to be constant scatter.
• Positive association.
What will happen to the correlation
coefficient if the elephant is removed?

[Figure: scatter plot, "Life Expectancies and Gestation Period for a sample of non-human Mammals". x-axis: Gestation (Days); y-axis: Life Expectancy (Years); one point is labelled "Elephant"]

• It will get smaller
• It will get bigger
• It will stay the same
How does the outlier affect the r - value?
When you see an outlier, it’s
often a good idea to report the
correlations with and without the
point.
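As an illustration, the sketch below reports both correlations for the hours of study and test score table given earlier in these slides. It assumes the outlier marked on that plot is the pair (26 hours, 35 marks), which is not stated in the text; the function name pearson_r is also illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# (hours of study, test score) pairs from the table in these slides.
data = [
    (18, 59), (16, 67), (22, 74), (27, 90), (15, 62),
    (28, 89), (18, 71), (19, 60), (22, 84), (30, 98),
    (14, 54), (17, 72), (14, 63), (19, 72), (20, 58),
    (10, 47), (28, 85), (25, 75), (18, 63), (19, 61),
    (17, 59), (16, 76), (14, 59), (29, 89), (30, 93),
    (30, 96), (23, 82), (26, 35), (22, 78),
]

# Assumption: this is the point labelled as the outlier on the slide's plot.
outlier = (26, 35)
kept = [pair for pair in data if pair != outlier]

r_with = pearson_r([h for h, _ in data], [s for _, s in data])
r_without = pearson_r([h for h, _ in kept], [s for _, s in kept])

print(f"with the outlier:    r = {r_with:.2f}")
print(f"without the outlier: r = {r_without:.2f}")
```

Reporting both values lets the reader see how much of the apparent weakness in the relationship comes from a single student.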
Don’t confuse correlation with
causation. Scatterplots and
correlation never prove
causation.
Using the information in the plot, can you
suggest what needs to be done in a
country to increase the life expectancy?
Explain.

[Figure: scatter plot, "Life Expectancy and Availability of Doctors for a Sample of 40 Countries". x-axis: People per Doctor; y-axis: Life Expectancy]

Perhaps if you have fewer people per doctor
(i.e. more doctors per person), then the
life expectancy will increase.
Using the information in this plot,
can you make another suggestion as
to what needs to be done in a
country to increase life expectancy?

[Figure: scatter plot, "Life Expectancy and Availability of Televisions for a Sample of 40 Countries". x-axis: People per Television; y-axis: Life Expectancy]

It looks like if you decrease the number of people
per television (i.e. have more TVs per person),
then the life expectancy will increase!


Can you suggest another variable
that is linked to life expectancy and
the availability of doctors (and
televisions) which explains the
association between the life
expectancy and the availability of
doctors
(and televisions)?

Some measure of the wealth of a country,
e.g. average income per person or GDP.


Damaged for life by too much TV

• Watching too much television as a child


causes serious health problems years later,
and raises the risk of heart disease, a New
Zealand study of 1000 children has
found….
• It links the amount of time spent in front of
the box as a child with obesity, high
cholesterol, poor fitness and smoking….
Damaged for life by too much TV
[Figure: scatter plot of Health Score against TV watching; r = -0.93]
Causal relationships
• Two general types of studies: experiments
and observational studies
• In an experiment, the experimenter
determines which experimental units
receive which treatments.
• In an observational study, we simply
compare units that happen to have received
each of the treatments.
• Only properly designed and carefully
executed experiments can reliably
demonstrate causation.
• An observational study is often useful for
identifying possible causes of effects, but it
cannot reliably establish causation
• In observational studies, strong
relationships are not necessarily causal
relationships.
• Correlation does not imply causation.
• Be aware of the possibility of lurking
variables.
Watch out for lurking variables.
Damage ($) vs number of firemen
would show a strong correlation,
but damage doesn’t cause firemen
and firemen do seem to cause
damage (spraying water and
chopping holes). The underlying
variable is the size of the blaze.
Although there was plenty of
evidence that increased smoking
was associated with increased
levels of lung cancer, it took
years to provide evidence that
smoking actually causes lung
cancer.
It would be a good idea to read
the two pages of notes you have
that discuss correlation and
causation!
So now you want to know how to
calculate the correlation
coefficient, r.
Here is one version of the
formula!
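The formula on the slide was an image and is not reproduced in this text. One common version, given here as an assumption about what was shown, writes r in terms of deviations from the means of x and y:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here n is the number of (x, y) pairs, and \bar{x} and \bar{y} are the means of the x-values and the y-values.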
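Luckily the computer will
calculate R² and you can take the
square root of this to get r.
The square root is never negative, so give r
the same sign as the slope of the trend line.

Remember: only when the
association is linear.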
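A short sketch of that step, using made-up data with a falling trend: R² is computed from a least-squares line, and r is recovered as its square root with the sign of the slope attached, since the square root alone is never negative. Variable names are illustrative.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [10, 8, 9, 5, 4]  # made-up values with a falling trend

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Least-squares line y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R squared: the share of the variation in y explained by the line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

# Square root of R squared, with the sign of the slope attached.
r_from_r_squared = math.copysign(math.sqrt(r_squared), slope)

# Direct calculation of r for comparison.
r_direct = sxy / math.sqrt(sxx * syy)

print(r_squared, r_from_r_squared, r_direct)
```

Forgetting the sign would report a positive correlation for data that clearly trend downwards.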
r measures the strength of the
relationship, NOT R²!
The words you use
• There is a strong, positive, linear relationship
between ‘x’ and ‘y’ and when the x- values
increase, the y-values increase also. This is
indicated by the value of the correlation
coefficient i.e. r = 0.85 which is close to 1.
• (Note: Do not use ‘x’ and ‘y’; use what they
represent.)
