Two variables Chap3
Markus Brede
[email protected]
Relationships between variables
• So far we have looked at ways of
characterizing the distribution of a single
variable, and testing hypotheses about the
population based on a sample.
• We're now moving on to the ways in which
two variables can be examined together.
• This comes up a lot in research!
Relationships between variables
• You might want to know:
o To what extent the change in a patient's blood
pressure is linked to the dosage level of a drug
they've been given.
o To what degree the number of plant species in an
ecosystem is related to the number of animal
species.
o Whether temperature affects the rate of a chemical
reaction.
Relationships between variables
• We assume that for each case we have at
least two real-valued variables.
• For example: both height (cm) and weight
(kg) recorded for a group of people.
• The standard way to display this is using a
dot plot or scatterplot.
Positive Relationship
Negative Relationship
No Relationship
Measuring relationships?
• We're going to need a way of measuring
whether one variable changes when another
one does.
• Another way of putting it: when we know the
value of variable A, how much information do
we have about variable B's value?
Recap of the one-variable case
• Perhaps we can borrow some ideas about
the way we characterized variation in the
single-variable case.
• With one variable, we start out by finding the
mean, which is also the expectation of the
distribution.
Sum of the squared deviations
• Then find the sum
of all the squared
deviations from the
mean.
• This gives us a
measure of the
total variation: it will
be higher for bigger
samples.
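As a concrete sketch, the sum of squared deviations can be computed directly; the sample values here are hypothetical, chosen only for illustration:

```python
# Sum of squared deviations from the mean: a measure of total variation
# that grows with sample size. These data values are hypothetical.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(xs) / len(xs)                   # sample mean (5.0 here)
ss = sum((x - mean) ** 2 for x in xs)      # sum of squared deviations (32.0 here)
print(mean, ss)
```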
Sum of the squared deviations
• SS = ∑i (Xi − X̄)²
The variance
• This is a good measure of how much
variation exists in the sample, normalized by
sample size.
• It has the nice property of being additive.
• The only problem is that the variance is
measured in units squared.
• So we take the square root to get...
The standard deviation
• s = √( 1/(N−1) ∑i (Xi − X̄)² )
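Continuing with the same hypothetical data, the variance and standard deviation (using Bessel's correction for a sample) might be computed as:

```python
import math

# Hypothetical sample values, as in the sum-of-squared-deviations sketch.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
mean = sum(xs) / n
ss = sum((x - mean) ** 2 for x in xs)

variance = ss / (n - 1)      # sample variance with Bessel's correction (N - 1)
sd = math.sqrt(variance)     # standard deviation, back in the original units
```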
The standard deviation
• With a good estimate of the population SD,
we can reason about the standard deviation
of the distribution of sample means.
• That's a number that gets smaller as the
sample sizes get bigger.
• To calculate this from the sample standard
deviation we divide through by the square
root of N, the sample size, to get...
The standard error
• This measures the precision of our
estimation of the true population mean.
• Plus or minus 1.96 standard errors from the
sample mean should capture the true
population mean 95% of the time.
• The standard error is itself the standard
deviation of the distribution of the sample
means.
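The standard error and the resulting 95% interval can be sketched for the same hypothetical sample as follows:

```python
import math

# Hypothetical sample values, as in the earlier sketches.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
mean = sum(xs) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

se = sd / math.sqrt(n)                   # standard error of the mean
ci_low = mean - 1.96 * se                # approximate 95% interval for
ci_high = mean + 1.96 * se               # the true population mean
```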
Variation in one variable
• So, these four measures all describe
aspects of the variation in a single variable:
a. Sum of the squared deviations
b. Variance
c. Standard deviation
d. Standard error
• Can we adapt them for thinking about the
way in which two variables might vary
together?
Two variable example
• Consider a small sample of four records with
two variables recorded, X and Y.
• X and Y could be anything.
• Let's say X is hours spent fishing, Y is
number of fish caught.
• Values: (1,1) (4,3) (7,5) (8,7).
Two variable example
• We can see there's a positive relationship
but how should we quantify it?
• We can start by calculating the mean for
each variable.
• Mean of X = 5.
• Mean of Y = 4.
Two variable example
• In the one-variable case, the next step would
be to find the deviations from the mean and
then square them.
• In the two-variable case, we need to connect
the variables.
• We do this by multiplying each X-deviation
by its associated Y-deviation.
Calculating covariance
• −4 × −3 = 12
• −1 × −1 = 1
• 2 × 1 = 2
• 3 × 3 = 9
• Total of the cross-multiplied deviates = 24.
∑i (Xi − X̄)(Yi − Ȳ) = 24
In Formulae
• Variance:
V[X] = E[(X − X̄)²]
V[X] = 1/(N−1) ∑i (Xi − X̄)²
• Covariance:
Cov[X, Y] = E[(X − X̄)(Y − Ȳ)]
Cov[X, Y] = 1/(N−1) ∑i (Xi − X̄)(Yi − Ȳ)
• Note Bessel's correction (dividing by N−1 rather than N) in the sample versions.
Calculating covariance
• Divide by N if this is the population, or divide
by N-1 if this is a sample and we're
estimating the population.
• If this were the population, we would get 24 / 4 = 6.
• If this is a sample and we want to estimate
the true population value, we get 24 / 3 = 8.
• Assuming this is a sample, we have a
measure of 8 "fish-hours" for the estimated
covariance between X and Y.
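The whole calculation for the fishing example from the slides can be sketched in Python:

```python
# Covariance for the example from the slides:
# X = hours spent fishing, Y = number of fish caught.
xs = [1.0, 4.0, 7.0, 8.0]
ys = [1.0, 3.0, 5.0, 7.0]
n = len(xs)

mean_x = sum(xs) / n     # 5.0
mean_y = sum(ys) / n     # 4.0

# Sum of the cross-multiplied deviations.
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # 24.0

cov_population = cross / n        # 6.0 if these four records are the population
cov_sample = cross / (n - 1)      # 8.0 "fish-hours", estimating the population
```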
Properties of covariance
• You might remember the formula for the variance of the sum of two independent random variates: V[X + Y] = V[X] + V[Y]. If they are correlated, we instead have V[X + Y] = V[X] + V[Y] + 2 Cov[X, Y].
• Normalizing the covariance by the two standard deviations,
r = Cov[X, Y] / ( √V[X] √V[Y] )
• ... we obtain a correlation coefficient
• ... or more technically: a Pearson product-moment correlation coefficient.
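Applying this normalization to the fishing example (using the sample, N−1, versions throughout) gives a sketch like:

```python
import math

# Pearson correlation for the fishing example from the slides.
xs = [1.0, 4.0, 7.0, 8.0]
ys = [1.0, 3.0, 5.0, 7.0]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)  # 8.0
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)                      # 10.0
var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)                      # 20/3

r = cov / (math.sqrt(var_x) * math.sqrt(var_y))   # roughly 0.98 here
```

Note that the Bessel corrections cancel between numerator and denominator, so r comes out the same whether the N or N−1 versions are used.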
The correlation coefficient
• What magnitude will the measure have?