Lecture Notes PDF
dispersion are the variance and its square root, the standard deviation.
Since the variance is just the square of the standard deviation, these
quantities contain essentially the same information, just on different
scales.
The range and IQR each take only two data points into account.
How might we measure the spread in the data accounting for the
value of every observation?
Consider the insurance claim data again:
[Table columns: observation number $i$, $x_i$, $\bar{x}$, $x_i - \bar{x}$, $(x_i - \bar{x})^2$]
One way to measure spread is to calculate the mean and then determine
how far each observation is from the mean.
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{9}(1100 + 1900 + \cdots + 1050) = 99756.67.$
Deviations from the mean: $x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_9 - \bar{x}.$
One idea is to compute the average of these deviations from the mean.
That is, compute
$$\frac{1}{9}\{(x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_9 - \bar{x})\} = \frac{1}{9}\sum_{i=1}^{9}(x_i - \bar{x}).$$
Problem: $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ (always!).
Deviations from the mean always necessarily sum to zero. The pos-
itive and negative values cancel each other out.
Solution: Make all of the deviations from the mean positive by squaring
them before averaging.
That is, compute
$$(x_1 - \bar{x})^2,\ (x_2 - \bar{x})^2,\ \ldots,\ (x_9 - \bar{x})^2$$
and then average. This gives the quantity
$$\frac{1}{9}\sum_{i=1}^{9}(x_i - \bar{x})^2 = 78060733578.$$
If $x_1, x_2, \ldots, x_9$ is the entire population, then $\bar{x} = \mu$, the population mean,
and the population size is $N = 9$. In this case, our formula becomes
$$\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,$$
denoted by $\sigma^2$.
Why is this called $\sigma^2$ rather than $\sigma$? Why the 2 exponent?
Because this is the average squared deviation from the mean (in the claims
data example, the units of this quantity are squared dollars).
To put the variance on the same scale as the original data, we some-
times prefer to work with the population standard deviation
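which is denoted as $\sigma$ and is just the square root of the population
variance $\sigma^2$:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}.$$
For a sample of size $n$, the corresponding quantity is the sample variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$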
The sample standard deviation is simply the square root of this quan-
tity:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
The sample variance of the 9 insurance claims is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{9-1}\{(x_1 - \bar{x})^2 + \cdots + (x_9 - \bar{x})^2\}$$
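The steps above (deviations from the mean, then squared deviations averaged with divisor N or n - 1) can be sketched in a few lines. The data values here are hypothetical, chosen only so the arithmetic is easy to check; they are not the insurance claims.

```python
from statistics import mean

# Hypothetical data (NOT the insurance claims from the notes).
data = [2, 4, 4, 4, 5, 5, 7, 9]

xbar = mean(data)
deviations = [x - xbar for x in data]
sum_dev = sum(deviations)                      # deviations always sum to 0
sum_sq = sum(d ** 2 for d in deviations)

pop_var = sum_sq / len(data)                   # population variance: divide by N
samp_var = sum_sq / (len(data) - 1)            # sample variance: divide by n - 1
samp_sd = samp_var ** 0.5                      # sample standard deviation

print(xbar, sum_dev, pop_var, samp_var, samp_sd)
```

The only difference between the two variances is the divisor, so the sample variance is always slightly larger.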
A Note on Computation:
The formula for $s^2$ that we just presented,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
n;1
When using this formula or any other that requires a series of calcu-
lations, keep all intermediate steps in the memory of your calculator
until the end to avoid round-off errors.
Coefficient of Variation: In some cases the variance of a variable
changes with its mean.
For example, suppose we are measuring the weights of children of
various ages.
5 year old children (relatively light, on average)
15 year old children (much heavier, on average)
Clearly, there's much more variability in the weights of 15 year olds,
but a valid question to ask is "Do 15 year old children's weights have
more variability relative to their average?"
The coefficient of variation allows such comparisons to be made:
population CV = $\frac{\sigma}{\mu} \times 100\%$
sample CV = $\frac{s}{\bar{x}} \times 100\%.$
Mean and Variance (or SD) for Grouped Data:
Example: Lead Content in Boston Drinking Water
Consider the following data on the lead content (mg/liter) in 12
samples of drinking water in the city of Boston, MA.
data: .035, .060, .055, .035, .031, .039, .038, .049, .073, .047, .031, .016
sorted data: .016, .031, .031, .035, .035, .038, .039, .047, .049, .055, .060, .073
Notice that there are some values here that occur more than once.
Consider how the mean is calculated in such a situation:
$$\bar{x} = \frac{.016 + .031 + .031 + .035 + .035 + .038 + \cdots + .073}{12} = .042$$
$$= \frac{.016(1) + .031(2) + .035(2) + .038(1) + \cdots + .073(1)}{(1) + (2) + (2) + (1) + \cdots + (1)}$$
$$= \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where
k = the number of distinct values in the data
$m_i$ = the ith distinct value
$f_i$ = the frequency with which $m_i$ occurs
The sample variance is computed in the same way:
$$s^2 = \frac{(.016 - .042)^2(1) + (.031 - .042)^2(2) + (.035 - .042)^2(2) + \cdots + (.073 - .042)^2(1)}{[(1) + (2) + (2) + \cdots + (1)] - 1}$$
$$= \frac{\sum_{i=1}^{k}(m_i - \bar{x})^2 f_i}{\left[\sum_{i=1}^{k} f_i\right] - 1}$$
Another Example: Apgar Scores of Low Birthweight Infants
Here is a frequency distribution of the Apgar scores for 100 low
birthweight infants in data set lowbwt.
Apgar Score    Frequency
0              6
1              1
2              3
3              4
4              5
5              10
6              11
7              23
8              24
9              13
Total          100
Using the formula for the mean for grouped data we have
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
which agrees with the value we reported previously for these data.
Similarly, the sample SD is
$$s = \sqrt{\frac{\sum_{i=1}^{k}(m_i - \bar{x})^2 f_i}{\left[\sum_{i=1}^{k} f_i\right] - 1}} = 2.43$$
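As a check on the grouped-data formulas, the following sketch recomputes the mean and SD directly from the Apgar frequency table. The variable names are illustrative only.

```python
from math import sqrt

# Apgar score -> frequency, copied from the frequency table above.
freq = {0: 6, 1: 1, 2: 3, 3: 4, 4: 5, 5: 10, 6: 11, 7: 23, 8: 24, 9: 13}

n = sum(freq.values())                                   # sum of f_i
grouped_mean = sum(m * f for m, f in freq.items()) / n   # sum(m_i f_i) / sum(f_i)
grouped_var = sum((m - grouped_mean) ** 2 * f
                  for m, f in freq.items()) / (n - 1)    # divisor [sum f_i] - 1
grouped_sd = sqrt(grouped_var)

print(n, grouped_mean, round(grouped_sd, 2))
```

With these frequencies the mean works out to 6.25 and the SD rounds to 2.43, matching the value quoted above.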
z-Scores and Chebychev's Inequality:
The National Center for Health Statistics at the CDC gives the following
estimate of the body mass index ($\frac{\text{weight}}{\text{height}^2}$) for 15 year old boys:
$\bar{x} = 19.83$
Suppose that a particular 15 year old boy, Fred, has a BMI equal to 25.
How overweight is Fred?
We know he is heavier than average for his age/gender group, but how
much heavier?
Relative to the variability in BMI for 15 year old boys in general,
Fred's BMI may be close to the mean or far away.
Case 1: Suppose s = 10.
This implies that the typical deviation from the mean is about 10.
Fred's deviation from the mean is 5.17, so Fred doesn't seem to be
unusually heavy.
Case 2: Suppose s = 2.
This implies that the typical deviation from the mean is about 2.
Fred's deviation from the mean is 5.17, so Fred does seem to be
unusually heavy.
Thus, the extremeness of Fred's BMI is quantified by its distance
from the mean BMI relative to the SD of BMI.
The z-score gives us this kind of information.
$$z_i = \frac{x_i - \bar{x}}{s}$$
where
$x_i$ = value of the variable of interest for subject i,
$\bar{x}$ = sample mean
$s$ = sample standard deviation
Case 1: $z = \frac{25 - 19.83}{10} = .517$. Fred's BMI is .517 SDs above the mean.
Case 2: $z = \frac{25 - 19.83}{2} = 2.585$. Fred's BMI is 2.585 SDs above the mean.
From NCHS data, the true SD for 15 year old boys is $s = 3.43$. So,
Fred's BMI is $z = \frac{25 - 19.83}{3.43} = 1.51$ SDs above the mean.
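The three z-score calculations for Fred can be reproduced with a one-line function. The function name `z_score` is just an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of SDs by which x lies above (positive) or below (negative) the mean."""
    return (x - mean) / sd

fred_bmi, mean_bmi = 25, 19.83

# Case 1 (s = 10), Case 2 (s = 2), and the NCHS value (s = 3.43).
for sd in (10, 2, 3.43):
    print(sd, round(z_score(fred_bmi, mean_bmi, sd), 3))
```

The same deviation of 5.17 looks unremarkable when the SD is 10 and extreme when the SD is 2, which is the point of standardizing.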
Chebychev's Theorem: At least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the values of any variable
must be within k SDs of the mean, for any k > 1.
- For the BMI example (taking k = 2), we'd expect at least 75% of 15 year old
males to have BMIs between $\bar{x} - 2s = 19.83 - 2(3.43) = 12.97$
and $\bar{x} + 2s = 19.83 + 2(3.43) = 26.69$.
At least 89% of the observations must be within 3 SDs, since for
k = 3
$$\left(1 - \frac{1}{k^2}\right) \times 100\% = \left(1 - \frac{1}{3^2}\right) \times 100\% = \left(1 - \frac{1}{9}\right) \times 100\% = 89\%.$$
- For the BMI example, we'd expect at least 89% of 15 year old
males to have BMIs between $\bar{x} - 3s = 19.83 - 3(3.43) = 9.54$
and $\bar{x} + 3s = 19.83 + 3(3.43) = 30.12$.
Note that Chebychev's Thm just gives a lower bound on the per-
centage falling within k SDs of the mean. At least 75% should fall
within 2 SDs, but perhaps more.
- Since it only gives a bound and not a more exact statement
about a distribution, Chebychev's Thm is of limited practical
value.
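The Chebychev bound and the corresponding BMI intervals can be tabulated with a short sketch. The function names are illustrative assumptions.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k SDs of the mean (requires k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def k_sd_interval(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd)."""
    return (mean - k * sd, mean + k * sd)

for k in (2, 3):
    low, high = k_sd_interval(19.83, 3.43, k)
    print(k, round(100 * chebyshev_lower_bound(k)), round(low, 2), round(high, 2))
```

For k = 2 and k = 3 this gives the 75% and 89% bounds and the intervals (12.97, 26.69) and (9.54, 30.12) quoted above.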
We can make a much more precise statement if we know that the distribution
of the variable in which we're interested is bell-shaped. That is, shaped
roughly like this:
[Figure: two bell-shaped frequency density curves, one for X (horizontal axis: x = value of X) and one for Y (horizontal axis: y = value of Y).]
For data that follow the normal distribution, the following precise state-
ments can be made:
Exactly 68% of the observations lie within 1 SD of the mean.
Exactly 95% of the observations lie within 2 SDs of the mean.
Exactly 99.7% of the observations lie within 3 SDs of the mean.
In fact, for normally distributed data we can calculate the percentage of
the observations that fall in any range whatsoever.
This is very helpful if we know our data are normally distributed.
However, even if the data aren't known to be exactly normal, but are
known to be bell-shaped, then the exact results stated above will be ap-
proximately true. This is known as the empirical rule.
Empirical rule: for data following a bell-shaped distribution:
Approximately 68% of the observations will fall within 1 SD of the
mean.
Approximately 95% of the observations will fall within 2 SDs of the
mean.
Nearly all of the observations will fall within 3 SDs of the mean.
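The 68%, 95% and 99.7% figures are rounded values. For a normal distribution the proportion within k SDs of the mean equals erf(k / sqrt(2)), so a quick check is possible with the standard library; the function name below is an illustrative choice.

```python
from math import erf, sqrt

def normal_within(k):
    """Proportion of a normal distribution lying within k SDs of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * normal_within(k), 2))
```

To two decimals the percentages are 68.27, 95.45 and 99.73, noticeably larger than the Chebychev lower bounds for k = 2 and k = 3.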
BMIs of 15 Year-old Boys:
At age 15, suppose that BMI follows an approximately bell-shaped
distribution.
- Then we would expect approximately 68% of 15 year old boys
to have BMIs falling in the interval $(16.40, 23.26) = \bar{x} \pm 1s$.
Fred's BMI was 25, so his BMI is more extreme than two-thirds
of boys his age.
- We would expect 95% of 15 year-old boys to have BMIs falling
in the interval $(12.97, 26.69) = \bar{x} \pm 2s$ and nearly all to fall in
the interval $(9.54, 30.12) = \bar{x} \pm 3s$. (Compare these results with
the Chebychev bounds).
In fact, BMI is probably not quite bell-shaped for 15 year olds. It
may be for 5 year olds, but by age 15, there are many obese children
who probably skew the distribution to the right (lots of large values
in the right tail). Therefore, the empirical rule may be somewhat
inaccurate for this variable.
Introduction to Probability*
Note: we're going to skip ch.s 4 & 5 for now, but we'll come back to
them later.
We all have an intuitive notion of probability.
"There's a 75% chance of rain today."
"The odds of Smarty Jones winning the Kentucky Derby are 2 to 1."
"The chances of winning the Pick-5 Lottery game are 1 in 2.3 million."
"The probability of being dealt four of a kind in a 5 card poker hand
is 1/4165."
All of these statements are examples of quantifying the uncertainty in a
random phenomenon. We'll refer to the random phenomenon of interest
as the experiment, but don't confuse this use with an experiment as a type
of research study.
The experiments in the examples above are
- An observation of today's weather
- The results of the Kentucky Derby
- A single play of the Pick-5 Lottery game
- The rank of a 5-card poker hand dealt from a shuffled deck of
cards
The union of events A and B is denoted
$A \cup B.$
The union of A and B is the event that A occurs or B occurs (or
both).
- $\cup$ can be read as "or" (inclusive).
The intersection of events A and B is denoted
$A \cap B.$
The intersection of A and B is the event that A occurs and B occurs.
- $\cap$ can be read as "and".
The following Venn diagrams describe these operations pictorially.
There are a number of legitimate ways to assign probabilities to events:
the classical method
the relative frequency method
the subjective method
Whatever method we use, we require
1. The probability assigned to each experimental outcome must be between
0 and 1 (inclusive). That is, if $O_i$ represents the ith possible
outcome, we must have
$0 \le P(O_i) \le 1$ for all i.
2. The probabilities of all n experimental outcomes must sum to 1:
$P(O_1) + P(O_2) + \cdots + P(O_n) = 1.$
The classical method is really a special case of the more general Relative
Frequency Method. The probability of an event is the relative frequency
with which that event occurs if we were to repeat the experiment a very
large number of times under identical circumstances.
I.e., if the event A occurs m times in n identical replications of an
experiment, then
$P(A) = \frac{m}{n}$ as $n \to \infty$.
Suppose that the gender ratio at birth is 50:50. That is, suppose
that giving birth to a boy and giving birth to a girl are equally likely
events. Then by the classical method
$P(\text{Girl}) = \frac{1}{2}.$
This is also the long run relative frequency. As $n \to \infty$ we should
expect that
$\frac{\text{number of girls}}{\text{number of births}} \to \frac{1}{2}.$
There are several rules of probability associated with the union, intersec-
tion, and complement operations on events.
Addition Rule: For two events A and B
$P(A \cup B) = P(A) + P(B) - P(A \cap B).$
Venn Diagram:
Example: Consider the experiment of having two children and let
A = event that first child is a girl
B = event that second child is a girl
Assume $P(A) = P(B) = 1/2$ (doesn't depend on birth order and
gender of second child not influenced by gender of first child).
Then the probability of having at least one girl is
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4},$
since $P(A \cap B) = P\{(F, F)\} = \frac{1}{4}$.
Notice that this agrees with the answer we would have obtained by
summing the probabilities of the outcomes corresponding to at least
one girl:
$P(A \cup B) = P\{(M, F)\} + P\{(F, M)\} + P\{(F, F)\} = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4}.$
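The two routes to $P(A \cup B)$ can be compared by enumerating the sample space. This is a minimal sketch that assumes the four ordered outcomes are equally likely, as in the example; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two births: (first child, second child), each M or F.
outcomes = list(product("MF", repeat=2))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_girl(o):   # event A
    return o[0] == "F"

def second_girl(o):  # event B
    return o[1] == "F"

# Direct count of outcomes in A union B.
p_union_direct = prob(lambda o: first_girl(o) or second_girl(o))
# Addition rule: P(A) + P(B) - P(A and B).
p_union_rule = (prob(first_girl) + prob(second_girl)
                - prob(lambda o: first_girl(o) and second_girl(o)))

print(p_union_direct, p_union_rule)
```

Both calculations give 3/4; without subtracting the intersection, the outcome (F, F) would be counted twice.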
Complement Rule: For an event A and its complement $A^c$
$P(A^c) = 1 - P(A).$
This is simply a consequence of the addition rule and the facts that
$P(A \cup A^c) = P(\text{entire sample space}) = 1$ and $P(A \cap A^c) = 0$
$\Rightarrow P(A) = 1 - P(A^c)$
A third rule is the multiplication rule, but for that we need the definitions
of conditional probability and statistical independence.
Conditional Probability:
For some events, whether or not one event has occurred is clearly relevant
to the probability that a second event will occur.
We just computed the probability of having at least one girl in
two births as $\frac{3}{4}$.
Now suppose I know that my first child was a boy.
Clearly, knowing that I've had a boy affects the chances of having at
least one girl (it decreases them). Such a probability is known as a
conditional probability.
The conditional probability of an event A given that another event B has
occurred is denoted $P(A \mid B)$, where $\mid$ is read as "given".
Independence of Events Two events A and B are independent if know-
ing that B has occurred gives no information relevant to whether or not
A will occur (and vice versa). In symbols A and B are independent if
$P(A \mid B) = P(A).$
$$P(A \cup B \mid A^c) = \frac{P\{(A \cup B) \cap A^c\}}{P(A^c)}.$$
We know that the probability that the first child is a girl is $P(A) = \frac{1}{2}$,
so
$P(A^c) = 1 - P(A) = 1 - \frac{1}{2} = \frac{1}{2}.$
Both of these events can happen simultaneously only if the first child is
a boy and the second child is a girl. That is, only if the outcome of the
experiment is $\{(M, F)\}$. Thus,
$P\{(A \cup B) \cap A^c\} = P(A^c \cap B) = P\{(M, F)\} = \frac{1}{4}.$
Therefore, the conditional probability of at least one girl given that the
first child is a boy is
$$P(A \cup B \mid A^c) = \frac{P\{(A \cup B) \cap A^c\}}{P(A^c)} = \frac{1/4}{1/2} = \frac{1}{2}.$$
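The conditional probability just computed can be verified by counting outcomes. This sketch again assumes four equally likely ordered outcomes; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("MF", repeat=2))   # (first child, second child)

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def at_least_one_girl(o):   # A union B
    return "F" in o

def first_boy(o):           # complement of A
    return o[0] == "M"

# P(A union B | A^c) = P((A union B) and A^c) / P(A^c)
p_joint = prob(lambda o: at_least_one_girl(o) and first_boy(o))
p_conditional = p_joint / prob(first_boy)

print(p_joint, p_conditional)
```

Conditioning on a first-born boy shrinks the sample space to (M, M) and (M, F), so the chance of at least one girl drops from 3/4 to 1/2.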
Another Example:
Suppose that among US adults, 1 in 3 obese individuals has high blood
pressure, while 1 in 7 normal weight individuals has high blood pressure.
Suppose also that the percentage of US adults who are obese, or preva-
lence of obesity, is 20%.
What is the probability that a randomly selected US adult is obese
and has high blood pressure?
Let
A = event that a randomly selected US adult is obese
B = event that a randomly selected US adult has high b.p.
$P(A) = \frac{1}{5}, \quad P(B \mid A) = \frac{1}{3}, \quad P(B \mid A^c) = \frac{1}{7}.$
Note that in general, given that an event A has occurred, either B occurs or
$B^c$ must occur, so the complement rule applies to conditional probabilities
too:
$P(B^c \mid A) = 1 - P(B \mid A).$
With this insight in hand, we can compute all other joint probabilities
relating to obesity and high blood pressure:
The probability that a randomly selected US adult is obese and does
not have high blood pressure is
$$P(A \cap B^c) = P(B^c \cap A) = P(B^c \mid A)P(A) = [1 - P(B \mid A)]P(A) = \frac{2}{3} \cdot \frac{1}{5} = \frac{2}{15}.$$
5 = 35
Independence: Two events A and B are said to be independent if
knowing whether or not A has occurred tells us nothing about whether or
not B has or will occur and vice versa.
In symbols, A and B are independent if
$P(A \mid B) = P(A)$ and $P(B \mid A) = P(B).$
Note also that the terms mutually exclusive and independent are
often confused, but they mean different things.
- Mutually exclusive events A and B are events that can't happen
simultaneously. Therefore, if I know A has occurred, that
tells me something about B, namely, that B can't have occurred.
So mutually exclusive events (each with nonzero probability) are
necessarily dependent (not independent).
Obesity and High Blood Pressure Example: The fact that obesity and
high b.p. are not independent can be verified by checking that
$\frac{1}{3} = P(B \mid A) \ne P(B) = \frac{19}{105}.$
Alternatively, we can check independence by checking whether $P(A \cap B) = P(A)P(B)$.
In this example,
$0.0667 = \frac{1}{15} = P(A \cap B) \ne P(A)P(B) = \frac{1}{5} \cdot \frac{19}{105} = 0.0362.$
Bayes' Theorem:
We have seen that when two events A and B are dependent, then $P(A \mid B) \ne P(A)$.
That is, the information that B has occurred affects the probability
that A will occur.
Bayes' Theorem provides a way to use new information (event B has occurred)
to go from our probability before the new information was available
($P(A)$, which is called the prior probability) to a probability that takes
the new information into account ($P(A \mid B)$, which is called the posterior
probability).
Bayes' Theorem allows us to take the information about $P(A)$ and
$P(B \mid A)$ and compute $P(A \mid B)$.
Obesity and High B.P. Example:
Recall
A = event that a randomly selected US adult is obese
B = event that a randomly selected US adult has high b.p.
and
$P(A) = \frac{1}{5}, \quad P(B \mid A) = \frac{1}{3}, \quad P(B \mid A^c) = \frac{1}{7}.$
Suppose that I am a doctor seeing the chart of a patient, and the only
information contained there is that the patient has high b.p.
Assuming this patient is randomly selected from the US adult pop-
ulation, what is the probability that the patient is obese?
That is, what is $P(A \mid B)$?
By the multiplication rule, we know that
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Let's examine the numerator and denominator of this expression and see
if we can use the information available to compute these quantities.
First, notice that the denominator is P (B ), the probability of high blood
pressure. If a random subject has high b.p., then the subject either
a. has high b.p. and is obese, or
b. has high b.p. and is not obese.
That is,
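$B = (B \cap A) \cup (B \cap A^c),$
where the two events on the right are mutually exclusive, so $P\{(B \cap A) \cap (B \cap A^c)\} = 0$.
Therefore,
$P(B) = P(B \cap A) + P(B \cap A^c)$
and so
$$P(A \mid B) = \frac{P(A \cap B)}{P(B \cap A) + P(B \cap A^c)}. \qquad (**)$$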
Now consider the numerator, $P(A \cap B)$. By the multiplication rule and
using the fact that $(A \cap B) = (B \cap A)$, we have
$P(A \cap B) = P(B \cap A) = P(B \mid A)P(A),$
which is useful because we know these quantities.
Applying the same logic to the two joint probabilities in the denominator
of (**), we have that
$P(B \cap A) = P(B \mid A)P(A)$ and $P(B \cap A^c) = P(B \mid A^c)P(A^c).$
Therefore,
$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)} = \frac{(1/3)(1/5)}{(1/3)(1/5) + (1/7)(4/5)} = \frac{1/15}{19/105} = 0.368.$$
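The obesity and high b.p. calculation can be reproduced with exact fractions. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(1, 5)               # P(A): prior probability of obesity
p_b_given_a = Fraction(1, 3)       # P(B | A)
p_b_given_not_a = Fraction(1, 7)   # P(B | A^c)

# Denominator: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: posterior probability of obesity given high b.p.
p_a_given_b = p_b_given_a * p_a / p_b

print(p_b, p_a_given_b, round(float(p_a_given_b), 3))
```

Knowing that the patient has high b.p. raises the probability of obesity from the prior 1/5 to the posterior 7/19, about 0.368.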
In the example above, we used the law of total probability to compute
P (B ) as
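$P(B) = P(B \cap A) + P(B \cap A^c) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c).$
More generally, if $A_1, A_2, \ldots, A_k$ are mutually exclusive and exhaustive events, then
$$P(A_i \mid B) = \frac{P(B \mid A_i)P(A_i)}{P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + \cdots + P(B \mid A_k)P(A_k)}.$$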
Another Example: Obesity and Smoking Status
Let
B = event that a randomly selected US adult is obese
$A_1$ = event that a randomly selected US adult has never smoked
$A_2$ = event that a randomly selected US adult is an ex-smoker
$A_3$ = event that a randomly selected US adult is a current smoker
and suppose
$P(B) = .209, \quad P(B \mid A_1) = .208, \quad P(B \mid A_2) = .239, \quad P(B \mid A_3) = .178$
$P(A_1) = 0.520, \quad P(A_2) = 0.250, \quad P(A_3) = 0.230.$
$$P(A_2 \mid B) = \frac{P(B \mid A_2)P(A_2)}{P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + P(B \mid A_3)P(A_3)}$$
$$= \frac{(.239)(.250)}{(.208)(.520) + (.239)(.250) + (.178)(.230)} = .286$$
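The general form of Bayes' Theorem is easy to apply to any number of categories. The sketch below uses the smoking-status figures from this example; the function name `posterior` is an illustrative assumption.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for each i, given P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(B | A_i) P(A_i)
    total = sum(joint)                                     # law of total probability: P(B)
    return [j / total for j in joint]

priors = [0.520, 0.250, 0.230]        # never smoked, ex-smoker, current smoker
likelihoods = [0.208, 0.239, 0.178]   # P(obese | smoking status)

post = posterior(priors, likelihoods)
print([round(p, 3) for p in post])
```

The second entry is $P(A_2 \mid B) = .286$, and the denominator (about .209) agrees with the stated $P(B)$.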
Diagnostic Tests
One important application of Bayes' Theorem is to diagnostic or screening
tests.
Screening is the application of a test to individuals who have not
yet exhibited any clinical symptoms in order to classify them with
respect to their probability of having a particular disease.
- Examples: Mammograms for breast cancer, Pap smears for cervical
cancer, Prostate-Specific Antigen (PSA) Test for prostate
cancer, exercise stress test for coronary heart disease, etc.
Consider the problem of detecting the presence or absence of a particular
disease or condition.
Suppose there is a "gold standard" method that is always correct.
E.g., surgery, biopsy, autopsy, or other expensive, time-consuming
and/or unpleasant method.
Suppose there is also a quick, inexpensive screening test.
Ideally, the test should correctly classify individuals as positive or
negative for the disease. In practice, however, tests are subject to
misclassification errors.
Definitions:
A test result is a true positive if it is positive and the individual
has the disease.
A test result is a true negative if it is negative and the individual
does not have the disease.
A test result is a false positive if it is positive and the individual
does not have the disease.
A test result is a false negative if it is negative and the individual
does have the disease.
The sensitivity of a test is the conditional probability that the test
is positive, given that the individual has the disease.
The specificity of a test is the conditional probability that the test
is negative, given that the individual does not have the disease.
The predictive value of a positive test is the conditional prob-
ability that an individual has the disease, given that the test is pos-
itive.
The predictive value of a negative test is the conditional prob-
ability that an individual does not have the disease, given that the
test is negative.
Notation: Let
A = event that a random individual's test is positive
B = event that a random individual has the disease
Then
sensitivity = $P(A \mid B)$, predictive value positive = $P(B \mid A)$
specificity = $P(A^c \mid B^c)$, predictive value negative = $P(B^c \mid A^c)$
Estimating the Properties of a Screening Test:
Suppose data are obtained to evaluate a screening test where the true
disease status of each patient is known. Such data may be displayed as
follows:
Truth
Diseased (event $B$)    Not Diseased (event $B^c$)
1. Suppose a random sample of n subjects is obtained, and each subject
is tested via both the screening test and the gold standard.
In this case,
estimated sensitivity =
estimated specificity =
estimated predictive value positive =
estimated predictive value negative =
Notice that only in case 1 is it possible to obtain estimates of all four
quantities from simple proportions in the contingency table.
However, this approach is not particularly quick, easy or efficient because,
for a rare disease, it will require a large n to obtain a sufficient
sample of truly diseased subjects.
Approach 2 is generally easiest, and predictive values can be com-
puted from this approach using Bayes' Theorem if the prevalence of
the disease is known as well.
Suppose we take approach 2. As before, let
A = event that a random individual's test is positive
B = event that a random individual has the disease
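Suppose $P(B)$, the prevalence of disease, is known. In addition, suppose
the sensitivity $P(A \mid B)$ and specificity $P(A^c \mid B^c)$ are known (or have been
estimated). Then, by Bayes' Theorem, the predictive value of a negative test is given
by
$$P(B^c \mid A^c) = \frac{P(A^c \mid B^c)P(B^c)}{P(A^c \mid B^c)P(B^c) + P(A^c \mid B)P(B)}$$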
Suppose that a new screening test for diabetes has been developed. To
establish its properties, n1 = 100 known diabetics and n2 = 100 known
non-diabetics were tested with the screening test. The following data were
obtained:
Truth
Diabetic (event $B$)    Nondiabetic (event $B^c$)
This result says that if you've tested positive with this test, then
there's an estimated chance of 37.6% that you have diabetes.
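Predictive values follow from sensitivity, specificity and prevalence via Bayes' Theorem. The sketch below uses the estimated sensitivity 0.8 and specificity 0.9 reported for the diabetes test. The prevalence of 0.07 is an assumed value for illustration (it is not stated in the surviving text), though it does reproduce a predictive value positive of about 37.6%.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (predictive value positive, predictive value negative)."""
    # P(test positive) by the law of total probability.
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_pos                 # P(B | A)
    npv = specificity * (1 - prevalence) / (1 - p_pos)     # P(B^c | A^c)
    return ppv, npv

# prevalence=0.07 is a hypothetical input, chosen for illustration only.
ppv, npv = predictive_values(sensitivity=0.8, specificity=0.9, prevalence=0.07)
print(round(ppv, 3), round(npv, 3))
```

With a low prevalence, most positive results come from the much larger nondiseased group, which is why the predictive value positive is well below the sensitivity.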
ROC Curves:
There is an inherent trade-off between sensitivity and specificity.
Example: CT Scans. The following data are ratings of computed
tomography (CT) scans by a single radiologist in a sample of 109 subjects
with possible neurological problems. The true status of these patients is
also known.
                         True Disease Status
Radiologist's Rating     Normal    Abnormal    Total
1                        33        3           36
2                        6         2           8
3                        6         2           8
4                        11        11          22
5                        2         33          35
Total                    58        51          109
Here, the radiologist's rating is an ordered categorical variable where
1 = definitely normal
2 = probably normal
3 = questionable
4 = probably abnormal
5 = definitely abnormal
Suppose we diagnose every patient with a rating ≥ 1 as abnormal.
Obviously, we will catch all true abnormals this way, so the sensitivity
of this test will be 1.
However, we'll also categorize all normals as abnormal, so the specificity
will be 0.
Suppose instead we diagnose every patient with a rating ≤ 5 as normal.
Obviously, we won't incorrectly diagnose any normals as abnormal,
so the specificity will be 1.
However, we won't detect any true abnormalities, so the sensitivity
of this test will be 0.
Clearly, we'd prefer to use some threshold between 1 and 5 to diagnose
abnormality.
We can always increase the sensitivity by setting the threshold low,
but this will decrease the specificity.
Similarly, a high threshold will increase the specificity at the cost of
sensitivity.
For each possible threshold value, we can compute the sensitivity and
specificity as follows:
Test Positive Criterion    Sensitivity    Specificity
rating ≥ 1                 1.00           0.00
rating ≥ 2                 0.94           0.57
rating ≥ 3                 0.90           0.67
rating ≥ 4                 0.86           0.78
rating ≥ 5                 0.65           0.97
rating > 5                 0.00           1.00
A plot of the sensitivity versus (1 − specificity) is called a receiver
operating characteristic curve, or ROC curve. The ROC curve for this
example is as follows:
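The sensitivity and specificity at each threshold, and hence the points on the ROC curve, can be recomputed from the CT rating counts. In this sketch a scan is called positive when its rating is at least the threshold, and a threshold of 6 stands in for the "> 5" row; the names are illustrative.

```python
# Rating -> count, by true disease status (from the CT scan table).
normal = {1: 33, 2: 6, 3: 6, 4: 11, 5: 2}
abnormal = {1: 3, 2: 2, 3: 2, 4: 11, 5: 33}

def roc_point(threshold):
    """Return (sensitivity, specificity) when 'rating >= threshold' is a positive test."""
    true_pos = sum(n for r, n in abnormal.items() if r >= threshold)
    true_neg = sum(n for r, n in normal.items() if r < threshold)
    return true_pos / sum(abnormal.values()), true_neg / sum(normal.values())

points = {t: roc_point(t) for t in range(1, 7)}   # threshold 6 means "rating > 5"
for t, (sens, spec) in points.items():
    # ROC curve plots sensitivity against 1 - specificity.
    print(t, round(sens, 2), round(spec, 2), round(1 - spec, 2))
```

As the threshold rises, sensitivity falls and specificity rises, which traces the curve from the upper right corner to the lower left.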
Estimation of Prevalence from a Screening Test:
Suppose we apply a screening test with known sensitivity and specificity
to a new population for which the prevalence of the disease is unknown.
Without applying the gold standard test, can we estimate the prevalence?
Let's reconsider the diabetes example. Recall how we defined events:
A = event that a random individual's test is positive
B = event that a random individual has the disease
Previously, we obtained
estimated sensitivity $= 0.8 = \hat{P}(A \mid B)$
estimated specificity $= 0.9 = \hat{P}(A^c \mid B^c).$
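Using the law of total probability followed by the multiplication rule, we
have
$$P(A) = P(A \cap B) + P(A \cap B^c)$$
$$= P(A \mid B)P(B) + P(A \mid B^c)P(B^c)$$
$$= P(A \mid B)P(B) + [1 - P(A^c \mid B^c)][1 - P(B)]$$
and the other quantities in this expression, $P(A \mid B)$ and $P(A^c \mid B^c)$, are the
sensitivity and specificity of the test. Solving for $P(B)$ and substituting estimates gives
$$\hat{P}(B) = \frac{\hat{P}(A) - [1 - \hat{P}(A^c \mid B^c)]}{\hat{P}(A \mid B) - [1 - \hat{P}(A^c \mid B^c)]} = \frac{\frac{105}{580} - [1 - .9]}{.8 - [1 - .9]} = .116$$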
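The prevalence estimate is a simple rearrangement of the law of total probability, sketched below with the figures from this example (105 positive tests out of 580, sensitivity 0.8, specificity 0.9). The function name is an illustrative assumption.

```python
def estimate_prevalence(p_positive, sensitivity, specificity):
    """Solve P(A) = Se * P(B) + (1 - Sp) * (1 - P(B)) for the prevalence P(B)."""
    false_positive_rate = 1 - specificity
    return (p_positive - false_positive_rate) / (sensitivity - false_positive_rate)

prevalence = estimate_prevalence(p_positive=105 / 580, sensitivity=0.8, specificity=0.9)
print(round(prevalence, 3))
```

The observed positive rate (about 18%) overstates the prevalence because it includes false positives; the correction brings the estimate down to about 11.6%.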
Risk Difference, Relative Risk and Odds Ratio:
Three quantities that are often used to describe the difference in
the probability (or risk) of disease between two populations are the risk
difference, risk ratio, and odds ratio.
We will call the two populations the exposed and unexposed populations,
but they could be whites and non-whites, males and females,
or any two populations (i.e., the "exposure" could be being male).
1. Risk difference: One simple way to quantify the difference between
two probabilities (risks) is to take their difference.
Risk difference = $P(\text{disease} \mid \text{exposed}) - P(\text{disease} \mid \text{unexposed}).$
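2. Relative risk: (also known as risk ratio).
$$RR = \frac{P(\text{disease} \mid \text{exposed})}{P(\text{disease} \mid \text{unexposed})}$$
$RR = \frac{.002679}{.000154} = 17.4, \quad \text{risk difference} = .002679 - .000154 = .002525.$
so the OR is
$$OR = \frac{P(\text{disease} \mid \text{exposed}) / [1 - P(\text{disease} \mid \text{exposed})]}{P(\text{disease} \mid \text{unexposed}) / [1 - P(\text{disease} \mid \text{unexposed})]} \qquad (*)$$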
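All three measures can be computed from the two risks quoted above (.002679 for the exposed and .000154 for the unexposed). This is a minimal sketch; the helper `odds` and the variable names are illustrative.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

p_exposed = 0.002679      # P(disease | exposed)
p_unexposed = 0.000154    # P(disease | unexposed)

risk_difference = p_exposed - p_unexposed
relative_risk = p_exposed / p_unexposed
odds_ratio = odds(p_exposed) / odds(p_unexposed)   # formula (*)

print(round(risk_difference, 6), round(relative_risk, 1), round(odds_ratio, 2))
```

Because both risks are small, the odds are close to the probabilities and the OR is only slightly larger than the RR.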
Independence between exposure status and disease status corresponds
to an odds ratio of 1.
The OR conveys similar information to that of the RR. The main
advantages of the OR are that
a. It has better statistical properties. We'll explain this later, but
for now take my word for it.
b. It can be calculated in cases when the RR cannot.
The latter advantage comes from the fact that using Bayes' Theorem,
it can be shown that
$$OR = \frac{P(\text{exposure} \mid \text{diseased}) / [1 - P(\text{exposure} \mid \text{diseased})]}{P(\text{exposure} \mid \text{nondiseased}) / [1 - P(\text{exposure} \mid \text{nondiseased})]} \qquad (**)$$
- I.e., (*) and (**) are mathematically equivalent formulas.
- This equivalence is useful because in some contexts, the probability
of exposure can be estimated among diseased and nondiseased
but the probability of disease given exposure status cannot.
This occurs in case-control studies.
Example: Contraceptive Use and Heart Attack
A case-control study of oral contraceptive use and heart attack. 58 female
heart attack victims were identified and each of these "cases" was matched
to one "control" subject of similar age, etc. who had not suffered a heart
attack.
                           Heart Attack
                           Yes    No
Contraceptive Use   Yes    23     11
                    No     35     47
                    Total  58     58
In this case, the column totals are fixed by the study design. Therefore,
the probability of heart attack given whether or not oral contraceptives
have been used cannot be estimated.
Why?
Thus, we cannot estimate the risk of disease in either the exposed
or unexposed group, and therefore cannot estimate the RR or risk
difference.
However, we can estimate probabilities of contraceptive use given presence
or absence of heart attack:
$\hat{P}(\text{contraceptive use} \mid \text{heart attack}) = 23/58 = .397$
$\hat{P}(\text{contraceptive use} \mid \text{no heart attack}) = 11/58 = .190.$
Interpretation: The odds of heart attack are 2.8 times higher for
women who took oral contraceptives than for women who did not.
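The odds ratio behind this interpretation can be computed from the exposure probabilities using formula (**). The sketch below uses the counts from the table; the names are illustrative.

```python
def odds(p):
    return p / (1 - p)

# Counts from the case-control table: contraceptive users among cases and controls.
cases_exposed, cases_total = 23, 58
controls_exposed, controls_total = 11, 58

p_exp_given_case = cases_exposed / cases_total            # 23/58
p_exp_given_control = controls_exposed / controls_total   # 11/58

# Formula (**): odds of exposure among cases / odds of exposure among controls.
odds_ratio = odds(p_exp_given_case) / odds(p_exp_given_control)

# Equivalent cross-product form for a 2x2 table.
cross_product = (23 * 47) / (35 * 11)

print(round(odds_ratio, 2), round(cross_product, 2))
```

Both forms give an odds ratio of about 2.8, even though neither risk of heart attack can be estimated from this design.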
Theoretical Probability Distributions*
Probability Distributions:
Some definitions:
A variable is any characteristic that can be measured or observed
and which may vary (or differ) among the units measured or observed.
A random variable is a variable that takes on different numerical
values according to a chance mechanism.
- E.g., any variable measured on the elements of a randomly
selected sample.
- Discrete random variables are random variables that can take
on a finite or countable number of possible outcomes (e.g.,
number of pregnancies).
- A continuous random variable can (theoretically, at least) take
on any value in a continuum or interval (e.g., BMI).
A probability function is a function which assigns a probability
to each possible value that can be assumed by a discrete random
variable.
The probability function of a discrete random variable (r.v.):
- defines all possible values of the r.v.
- gives the probabilities with which the r.v. takes on each of
those values.
Expected Value, Variance
The expected value of a random variable is the mean, or average value
of the r.v. over the population of units on which the r.v. is defined.
For a random variable X, its expected value is usually denoted $E(X)$,
or $\mu_X$, or simply $\mu$.
The expected value for a discrete r.v. can be computed from its probability
distribution as follows:
$$E(X) = \sum_{\text{all } x} x P(X = x)$$
where this sum is taken over all possible values x of the r.v. X.
E.g., the expected number of ears affected by ear infection during
the first two years of life is computed as follows:
x    P(X = x)    xP(X = x)
0    .13         0(.13)
1    .48         1(.48)
2    .39         2(.39)
                 E(X) = 1.26
- Interpretation: the mean number of ears affected by otitis media
during the first two years of life is 1.26.
The variance of a random variable is the population variance of the r.v.
over the population of units on which the r.v. is dened.
The variance of X is usually denoted $\mathrm{var}(X)$, or $\sigma_X^2$, or simply $\sigma^2$.
$$\mathrm{var}(X) = \sum_{\text{all } x} (x - \mu_X)^2 P(X = x)$$
where again this sum is taken over all possible values x of the r.v.
X.
E.g., the variance of the number of ears affected by ear infection
during the first two years of life is computed as follows:
x    P(X = x)    $(x - \mu_X)^2 P(X = x)$
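Both sums can be evaluated directly from the probability function for the ear infection example. This is a minimal sketch; the dictionary `pmf` and the other names are illustrative.

```python
# Probability function for X = number of ears affected (from the table above).
pmf = {0: 0.13, 1: 0.48, 2: 0.39}

expected = sum(x * p for x, p in pmf.items())                    # E(X)
variance = sum((x - expected) ** 2 * p for x, p in pmf.items())  # var(X)
sd = variance ** 0.5

print(round(expected, 2), round(variance, 4), round(sd, 3))
```

This reproduces E(X) = 1.26 and gives a variance of about 0.4524.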
The Binomial Probability Distribution
Many random variables that can be described as event counts where there
is a max number of events that can occur, can be thought of as arising
from a binomial experiment.
A binomial experiment has the following properties:
1. The experiment consists of a sequence of n identical trials.
2. Two outcomes are possible on each trial, one a \success" and the
other a \failure".
3. The probability of success, denoted by p, is the same for each trial.
{ Since the probability of a failure is just 1 ; p, this means that
the failure probability is the same for each trial as well.
4. The trials are independent (what happens on one trial doesn't affect
what happens on any other trial).
In a binomial experiment we are interested in X, the r.v. defined to be
the total number of successes that occur over the n trials.
Note that "success" and "failure" are just convenient labels. A success
could be identified as the birth of a girl, and failure as the birth
of a boy, or vice versa. That is, "success" simply denotes the event
of interest that is being counted.
X in a binomial experiment is a discrete random variable with possible
values 0, 1, 2, ..., n.
For any experiment with the above properties, X will necessarily have a
particular distribution, the binomial probability distribution that is
completely determined by n and p.
Examples:
A. The number of heads that occur in 4 coin flips
1. Each coin flip is an identical trial.
2. Two outcomes (Heads, Tails) are possible, where "success" =
Heads.
3. Probability of success = P(Heads) = 1/2 on each trial.
4. Coin flips are independent.
B. The number of obese subjects out of 3 randomly selected US adults.
1. Observing obesity status of each randomly selected US adult
is an identical trial.
2. Two outcomes are possible (obese, not obese) where \success"
= subject is obese.
3. Probability of success= P (obese) = :209 on each trial.
4. Because selection of subjects is at random, obesity status is
independent from subject to subject.
Counter Examples:
C. The number of lifetime miscarriages experienced by a randomly se-
lected woman over the age of 50. Suppose the woman had had 5
lifetime pregnancies.
1. The n = 5 pregnancies are the trials, but they are not identical.
They occur at different ages, under different circumstances
(woman's health status differs, environmental exposures differ,
fathers may differ, etc.).
2. Two outcomes are possible (miscarriage, not miscarriage) where
\success" = miscarriage.
3. Probability of success not constant on each trial. Probability of
miscarriage may be higher when woman is older, may depend
on birth order, etc.
4. Pregnancy outcome may not be independent from one preg-
nancy to the next (if previous pregnancy was a miscarriage,
that may increase the probability that next pregnancy will be
miscarriage).
D. Out of the n hurricanes that will form in the Atlantic next year, how
many will make landfall in the state of Florida?
1. Each hurricane represents a trial, but the trials are not identical.
2. Two outcomes possible (hit FL, not hit FL). "Success" = hit FL.
3. Probabilities of hitting Florida may not be constant from hurricane
to hurricane, depending upon when and where they form, but a priori
it may be reasonable to assume that these probabilities are equal
from one hurricane to the next.
4. Hurricane paths are probably not independent. If the previous
hurricane hit FL, that may increase the chances that the next
hurricane will follow the same path and hit FL as well.
For any binomial experiment, the probability of any given number of
"successes" out of n trials is given by the binomial probability function.
Let the random variable X = the number of successes out of n trials,
where p is the success probability on each trial. Then the probability of x
successes is given by
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}.$$
Here, $\binom{n}{x}$ (read "n choose x") is shorthand notation for
$\frac{n!}{x!(n - x)!}$, where a! ("a factorial") is given by
$$a! = a(a - 1)(a - 2) \cdots (2)(1),$$
and, by convention, we define 0! = 1.
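For example, the probability of x = 3 heads in the n = 4 coin flips of
example A (p = 1/2) is
$$P(X = 3) = \binom{4}{3} \left(\tfrac{1}{2}\right)^3 \left(1 - \tfrac{1}{2}\right)^{4 - 3} = \frac{4!}{3!(4 - 3)!} \left(\tfrac{1}{2}\right)^3 \left(\tfrac{1}{2}\right)^1$$
$$= \frac{4(3)(2)(1)}{\{(3)(2)(1)\}\{(1)\}} \left(\tfrac{1}{2}\right)^3 \left(\tfrac{1}{2}\right)^1 = 4 \left(\tfrac{1}{2}\right)^4 = 0.25.$$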
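The binomial probability function translates almost directly into code. The following is a minimal Python sketch, not part of the notes themselves; the function name `binomial_pmf` is an illustrative choice, and `math.comb` supplies the "n choose x" count.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial r.v.: C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Example A: probability of 3 heads in 4 fair coin flips.
prob_3_heads = binomial_pmf(3, 4, 0.5)
print(prob_3_heads)  # 0.25

# Evaluating the function at x = 0, 1, ..., n gives the whole distribution,
# and those probabilities sum to 1.
distribution = [binomial_pmf(x, 4, 0.5) for x in range(5)]
print(distribution, sum(distribution))
```

Calling the function for every x from 0 to n is one way to see that n and p alone determine the entire distribution.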
Let's consider example B.
Let X = number of obese subjects out of n = 3 randomly chosen US adults,
where p = .209.
Forgetting the formula for a minute, how could we compute P(X = 2),
say?
One way is to list all of the possible outcomes of the experiment of
observing 3 subjects and add up the probabilities for the outcomes that
correspond to 2 obese subjects.
Possible outcomes:
Outcome   First     Second    Third     Probability
Number    Subject   Subject   Subject   of Outcome
1         O         O         O
2         O         O         N
3         O         N         O
4         O         N         N
5         N         O         O
6         N         O         N
7         N         N         O
8         N         N         N
Outcomes 2, 3, and 5 correspond to getting a total of X = 2 obese
subjects out of n = 3. What are the probabilities of these three
outcomes?
Probability of (O, O, N):
Recall that for independent events, the joint probability of the events
is the product of the individual probabilities of each event. Here,
whether the subject is obese is independent from subject to subject.
So, the probability of observing (O, O, N) is
$$p \times p \times (1 - p) = p^2 (1 - p)^1 = p^x (1 - p)^{n - x},$$
where n = 3, x = 2.
Probability of (O, N, O):
$$p \times (1 - p) \times p = p^2 (1 - p)^1 = p^x (1 - p)^{n - x},$$
where n = 3, x = 2.
Probability of (N, O, O):
$$(1 - p) \times p \times p = p^2 (1 - p)^1 = p^x (1 - p)^{n - x},$$
where n = 3, x = 2.
Adding the probabilities of these mutually exclusive events together
(addition rule) we get
$$P(X = 2) = p^2 (1 - p)^1 + p^2 (1 - p)^1 + p^2 (1 - p)^1 = 3 p^2 (1 - p)^1,$$
where for n = 3, x = 2
$$\binom{n}{x} = \binom{3}{2} = \frac{3!}{2!(3 - 2)!} = \frac{3(2)(1)}{\{(2)(1)\}\{(1)\}} = 3.$$
$\binom{3}{2}$ is the number of ways to arrange a sequence with 2 'O's and 1
'N'.
More generally, $\binom{n}{x}$ gives the number of ways to choose x objects out
of n.
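The listing-and-adding argument can be reproduced by brute force. The sketch below is an illustration in Python (variable names are arbitrary): it enumerates all 8 sequences of O (obese) and N (not obese), multiplies the per-subject probabilities under independence, and compares the total for X = 2 with the binomial formula.

```python
from itertools import product
from math import comb

p = 0.209      # P(obese) on each trial, from example B
n, x = 3, 2    # 3 subjects, interested in exactly 2 obese

total = 0.0
matching = []
for outcome in product("ON", repeat=n):      # OOO, OON, ONO, ..., NNN
    if outcome.count("O") == x:
        prob = 1.0
        for status in outcome:               # independence: multiply
            prob *= p if status == "O" else (1 - p)
        matching.append("".join(outcome))
        total += prob                        # addition rule

formula = comb(n, x) * p**x * (1 - p) ** (n - x)
print(matching)          # ['OON', 'ONO', 'NOO']
print(total, formula)    # both about 0.1037
```

The enumeration visits the outcomes in the same order as the table, so the three matching sequences are outcomes 2, 3, and 5.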
The binomial formula can be used to compute the probability of x successes
out of n trials where the success probability on each trial is p for any value
of n and p.
However, it is convenient to have a table to give the answer for any given
value of n and p, or, even better, a computer function that allows us to
input n and p and outputs the answer.
Table A.1 in Appendix A of our book gives binomial probabilities
for selected values of n and p.
E.g., we computed the probability of x = 3 heads out of n = 4 coin flips
to be .25. Table A.1 uses k instead of x, so we look up n = 4 and k = 3 on
the left side of the table, p = .5 on the top, and find the probability
equals .2500, just as we computed.
Note that the table only gives selected values of p where p ≤ .5.
What if we are interested in p = .75, say?
We can handle such a case by considering the number of failures rather
than the number of successes.
That is, if X equals the number of successes out of n trials with success
probability p, then
$$Y = n - X = \text{number of failures},$$
where the failure probability is q = 1 - p. We observe X = x successes
out of n trials if and only if we observe Y = n - x failures. So,
$$P(X = x) = P(Y = n - x) = \binom{n}{n - x} q^{n - x} (1 - q)^{n - (n - x)} = \binom{n}{n - x} q^{n - x} (1 - q)^{x}.$$
Example: Suppose that 55% of UGA undergraduates are women. In a
random sample of 7 UGA undergraduates, what's the probability that 3
of them are women?
Here X = number of women (success) out of n = 7 "trials", where the
probability of a woman on each trial is p = .55. If x = 3 women are observed,
then we necessarily have observed Y = n - x = 7 - 3 = 4 men, where the
probability of observing a man is
$$q = 1 - p = 1 - .55 = .45.$$
So, the desired probability can be computed based on X:
$$P(X = 3) = \binom{n}{x} p^x (1 - p)^{n - x} = \binom{7}{3} (.55)^3 (1 - .55)^{7 - 3}.$$
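Counting successes and counting failures give the same answer, which is what makes a table limited to p ≤ .5 usable here. A short Python check of this example follows; it is a sketch with an illustrative helper name, not a replacement for the table lookup.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 7, 0.55
q = 1 - p                                   # probability of a man, .45

via_successes = binomial_pmf(3, n, p)       # P(X = 3), X = number of women
via_failures = binomial_pmf(n - 3, n, q)    # P(Y = 4), Y = number of men

print(round(via_successes, 4), round(via_failures, 4))  # 0.2388 0.2388
```

The second call uses q = .45, which falls in the range of p values a table restricted to p ≤ .5 would cover.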
The binomial probability function gives P(X = x) for all possible
values of x: 0, 1, 2, ..., n. So, the probability function gives the entire
probability distribution of X.
Once we know the probability distribution of a discrete r.v., we can
compute its expected value and variance.
Recall:
$$E(X) = \sum_{\text{all } x} x P(X = x) = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) + \cdots + n \cdot P(X = n) = \mu_X$$
and
$$\mathrm{var}(X) = \sum_{\text{all } x} (x - \mu_X)^2 P(X = x) = (0 - \mu_X)^2 P(X = 0) + \cdots + (n - \mu_X)^2 P(X = n) = \sigma_X^2.$$
Example: Obesity Again
Suppose I take a random sample of n = 4 US adults. How many obese
subjects should I expect to observe on average?
Here n = 4, p = .209, so, since a binomial r.v. has mean E(X) = np,
I expect to observe
$$E(X) = np = 4(.209) = 0.836$$
obese adults out of a sample of n = 4.
In a sample of n = 1000, I'd expect to observe np = 1000(.209) = 209
obese adults. (Make sense?)
The variance of the number of obese adults observed out of n = 1000
would be
$$\mathrm{var}(X) = np(1 - p) = 1000(.209)(1 - .209) = 165.319.$$
That is, the standard deviation is $\sqrt{165.319} = 12.9$.
The interpretation here is that I could select n = 1000 US adults
and count the number of obese subjects over and over again. Over
the long run, the standard deviation of the number of obese subjects
observed when repeating this binomial experiment again and again
is 12.9.
- That is, I expect to get about 209 out of 1000 obese subjects,
but the actual number obtained is going to vary around 209,
with typical deviation from 209 equal to 12.9.
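The shortcuts np and np(1 - p) can be checked against the general definitions of expected value and variance for a discrete r.v. The Python sketch below sums over every possible x; the helper names are illustrative, and the direct summation is only meant as a check, not as an efficient method.

```python
from math import comb, sqrt

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def mean_and_variance(n: int, p: float) -> tuple[float, float]:
    """Apply the definitions: sum of x*P(X=x) and sum of (x-mean)^2*P(X=x)."""
    probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
    mean = sum(x * px for x, px in enumerate(probs))
    variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))
    return mean, variance

small_mean, small_var = mean_and_variance(4, 0.209)
big_mean, big_var = mean_and_variance(1000, 0.209)

print(small_mean)                       # close to np = 0.836
print(big_mean, big_var, sqrt(big_var))  # close to 209, 165.319, 12.9
```

Up to floating-point rounding, the sums agree with np and np(1 - p) for both sample sizes.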
The Poisson Probability Distribution
Another important discrete probability distribution that arises often in
practice is the Poisson probability distribution.
The binomial probability function gave the probability for the number
of successes out of n trials.
- Pertains to counts (of the number of successes) that are subject
to an upper bound n.
The Poisson probability function gives the probability for the number
of events that occur in a given interval (often a period of time),
assuming that events occur at a constant rate during that interval.
- Pertains to counts that are unbounded. Any number of events
could, theoretically, occur during the period of interest.
In the binomial case, we know p = probability of the event (success)
in each trial.
In the Poisson case, we know $\lambda$ = the mean (or expected) number of
events that occur in the interval.
- Or, equivalently, we could know the rate of events per unit of
time. Then $\lambda$, the mean number of events during an interval
of length t, would just be t × rate.
Example: Traffic Accidents
Based on long-run traffic history, suppose that we know that an average
of 7 traffic accidents per month occur at Broad and Lumpkin. That is,
$\lambda$ = 7 per month. We assume this value is constant throughout the year.
What's the probability that in a given month we observe exactly 8
accidents?
Such probabilities can be computed by the Poisson probability function.
If X = the number of events that occur according to a Poisson experiment
with mean $\lambda$, then
$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}.$$
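The question posed for the traffic example can be answered by evaluating the Poisson probability function at x = 8 with mean 7. A minimal Python sketch follows; `poisson_pmf` and its argument name `mean` are illustrative choices.

```python
from math import exp, factorial

def poisson_pmf(x: int, mean: float) -> float:
    """P(X = x) for a Poisson r.v.: e^(-mean) * mean^x / x!."""
    return exp(-mean) * mean**x / factorial(x)

# Probability of exactly 8 accidents in a month when the mean is 7 per month.
p_eight = poisson_pmf(8, 7)
print(round(p_eight, 4))  # 0.1304
```

So in roughly 13% of months we would expect to see exactly 8 accidents at this intersection, assuming the Poisson model holds.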
The Poisson distribution has the remarkable property that its expected
value (mean) and variance are the same. That is, for X
following a Poisson distribution with mean $\lambda$,
$$E(X) = \mathrm{var}(X) = \lambda.$$
0! + 1! + 2! = :894301
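The equality of the Poisson mean and variance can be checked numerically from the definitions of expected value and variance. The Python sketch below uses a mean of 7 (the traffic example) and truncates the infinite sum at x = 99, an assumption that is harmless here because the remaining probabilities are negligibly small.

```python
from math import exp, factorial

mean_rate = 7.0     # Poisson mean, as in the traffic example
max_count = 100     # truncation point for the (in principle infinite) sum

probs = [exp(-mean_rate) * mean_rate**x / factorial(x) for x in range(max_count)]

expected = sum(x * px for x, px in enumerate(probs))
variance = sum((x - expected) ** 2 * px for x, px in enumerate(probs))

print(expected, variance)   # both very close to 7
```

Changing `mean_rate` to another positive value should give the same pattern, provided `max_count` is well above the mean.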
Continuous Probability Distributions
Recall that for a discrete r.v. X , the probability function of X gave P (X =
x) for all possible x, thus describing the entire distribution of the r.v. X .
We'd like to do the same for a continuous r.v.
How do we calculate probabilities for continuous random variables?
For a continuous r.v., the probability that it takes on any particular
value is 0! Therefore, we can't use a probability function to describe
it!
- E.g., the probability that a randomly selected subject from this
class weighs 146.923578234785079074... lbs is 0.
Instead of a probability function that gives the probability for each
particular value of X, we quantify the probability that X falls in some
interval or region of all possible values of X.
This works because while the probability that a random student
weighs 146.923578234785079074... lbs is 0, the probability that he/she
weighs between 145 and 150 lbs, say, is not 0.
So, instead of describing the distribution of a continuous r.v. with a
probability function, we use what is called the probability density function.
The probability density function for a continuous r.v. X gives a curve
such that the area under the curve corresponding to some interval
on the horizontal axis gives the probability that X takes a value in
that interval.
E.g., suppose the probability density function for X = body weight for a
randomly selected student in this class looks like this:
[Figure: probability density curve; horizontal axis: Weight (lbs); vertical axis: Probability density]
The dashed vertical lines are at weight=145 lbs and weight=150 lbs.
The area under the curve between these lines gives the probability
that a randomly selected student weighs between 145 and 150 lbs.
In general, the area under the curve between x1 and x2, where x1 < x2,
gives
$$P(x_1 < X < x_2).$$
Note that the curve extends to the left and right, getting closer and
closer to zero.
- That is, weights greater than x lbs, say, are possible (have
nonzero probability) no matter how big x is, but they are
increasingly unlikely as x gets bigger.
- Similarly, smaller and smaller weights are decreasingly probable.
The entire area under the probability density function is 1,
representing the fact that
$$P(-\infty < X < \infty) = 1.$$
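To make "probability as area" concrete, the area under a density can be approximated numerically. The Python sketch below is only an illustration: the notes do not give a formula for the weight density, so the bell-shaped curve centered at 150 lbs with a spread of 20 lbs is a made-up assumption, and the trapezoid rule is one simple choice of approximation.

```python
from math import exp, pi, sqrt

# Hypothetical weight density (NOT from the notes): bell-shaped,
# centered at 150 lbs with a spread of 20 lbs.
CENTER, SPREAD = 150.0, 20.0

def density(w: float) -> float:
    return exp(-((w - CENTER) ** 2) / (2 * SPREAD**2)) / (SPREAD * sqrt(2 * pi))

def area(f, a: float, b: float, steps: int = 10_000) -> float:
    """Trapezoid-rule approximation of the area under f between a and b."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

p_145_150 = area(density, 145, 150)     # P(145 < X < 150), a nonzero area
p_exact_value = area(density, 146.9, 146.9)  # zero-width interval
total_area = area(density, -50, 350)    # a very wide interval

print(p_145_150, p_exact_value, total_area)
```

Under this assumed density the interval from 145 to 150 lbs carries a probability of roughly 0.10, a single point carries probability 0, and the area over a very wide interval is essentially 1.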
Note that for a continuous r.v. X, P(X = x) = 0 for all x. Therefore,
$$P(X \le x) = P(X = x) + P(X < x) = 0 + P(X < x) = P(X < x).$$
Similarly,
$$P(X \ge x) = P(X = x) + P(X > x) = 0 + P(X > x) = P(X > x).$$
[Figure: bell-shaped normal probability density curve; horizontal axis from -4 to 4]
The normal distribution is not the only distribution whose p.d.f.
looks bell-shaped, but it is the most important one, and many real
world random variables follow the normal distribution, at least
approximately.
The normal distribution, like the binomial and Poisson, is an example
of a parametric probability distribution. It is completely described
by a small number of parameters.
- In the case of the binomial, there were two parameters, n and p.
- In the case of the Poisson, there was just one parameter, $\lambda$, the
mean of the distribution.
- In the case of the normal, there are two parameters:
$\mu$ = the mean of the distribution, and
$\sigma^2$ = the variance of the distribution.
That is, if X is a r.v. that follows the normal distribution, then that
means that we know exactly the shape of the p.d.f. of X except for
$\mu = E(X)$, the mean of X, and $\sigma^2 = \mathrm{var}(X)$, the variance of X.
- We will use the notation
$$X \sim N(\mu, \sigma^2)$$
to denote that the r.v. X follows a normal distribution with
mean $\mu$ and variance $\sigma^2$.
- E.g., $X \sim N(3, 9)$ means that X has a normal distribution
with mean 3 and variance 9 (or SD = 3).
The normal curve given above has mean 0 and variance 1. I.e., it is
N(0, 1), which is called the standard normal distribution.
Normal distributions with different means have different locations.
Normal distributions with different variances have different degrees
of spread (dispersion).
- Below are three normal probability distributions with different
means and variances.
[Figure: Normal probability densities with different means and variances (mean=0, var=1; mean=0, var=4; mean=4, var=1); vertical axis: f(x)]
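The normal p.d.f. with mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}},$$
where again, e denotes the constant 2.71828... and $\pi$ denotes the constant
3.14159....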
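The heights of the three curves can be evaluated directly from the normal density, written here as exp(-(x - mean)^2 / (2 var)) / sqrt(2 pi var). The Python sketch below uses an illustrative function name, `normal_pdf`, and cross-checks it against `statistics.NormalDist` from the standard library, which takes the standard deviation rather than the variance.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Height of the normal density with the given mean and variance at x."""
    return exp(-((x - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)

# The three (mean, variance) pairs shown in the figure.
curves = [(0, 1), (0, 4), (4, 1)]

for mean, variance in curves:
    peak = normal_pdf(mean, mean, variance)   # density at the mean
    library_peak = NormalDist(mu=mean, sigma=sqrt(variance)).pdf(mean)
    print(mean, variance, round(peak, 4), round(library_peak, 4))
```

The two unit-variance curves peak at about 0.399, while the variance-4 curve peaks at about 0.199: a larger variance spreads the same total area of 1 over a wider range, so the curve is lower.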
Facts about the normal distribution:
1. It is symmetric and unimodal.
- As a consequence of this, the mean, median and mode are all
equal and occur at the peak of the normal p.d.f.
2. The normal p.d.f. can be located (have mean) anywhere along the
real line between $-\infty$ and $\infty$, and extends indefinitely away from
its mean in either direction without ever touching the horizontal axis.
- That is, if $X \sim N(\mu, \sigma^2)$, then any value of X is possible,
although values far from $\mu$ will not be very probable.
3. As with any p.d.f., the area under the normal curve between any two
numbers x1, x2, where x1 < x2, gives
$$P(x_1 < X < x_2)$$
and the total area under the p.d.f. is 1.
In particular, here are a few notable normal probabilities:
- For $x_1 = \mu - 1\sigma$, $x_2 = \mu + 1\sigma$,
$$P(\mu - 1\sigma < X < \mu + 1\sigma) = .6826$$
That is, 68.26% of the time a normally distributed r.v. falls
within 1 SD of its mean (i.e., has z score between -1 and 1).
- For $x_1 = \mu - 2\sigma$, $x_2 = \mu + 2\sigma$,
$$P(\mu - 2\sigma < X < \mu + 2\sigma) = .9544$$
That is, 95.44% of the time a normally distributed r.v. falls
within 2 SDs of its mean (i.e., has z score between -2 and 2).
- For $x_1 = \mu - 3\sigma$, $x_2 = \mu + 3\sigma$,
$$P(\mu - 3\sigma < X < \mu + 3\sigma) = .9972$$
That is, 99.72% of the time a normally distributed r.v. falls
within 3 SDs of its mean (i.e., has z score between -3 and 3).
These results are where the "empirical rule" comes from.
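These three probabilities can be recomputed from the standard normal cumulative distribution function. The sketch below uses Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1

within = {}
for k in (1, 2, 3):
    # Area under the curve between z = -k and z = k.
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(within[k], 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The computed values agree with the figures quoted above to about three decimal places; the small differences in the fourth decimal most likely reflect rounding in printed normal tables.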
Example: Height
Suppose that US adult women have heights that are normally distributed
where the population mean height is 65 inches and the population standard
deviation for women's height is 2.5 inches.
Suppose that US adult men have heights that are normally distributed
with population mean 70 inches and population SD of 3 inches.
Let X = the height of a randomly selected adult US woman, and Y = the
height of a randomly selected adult US man. Then
$$X \sim N(\mu_X, \sigma_X^2) = N(65, 2.5^2), \qquad Y \sim N(\mu_Y, \sigma_Y^2) = N(70, 3^2).$$
[Figure: normal probability densities of height for Women and Men; horizontal axis: height (55 to 80 inches); vertical axis: f(height)]
Clearly, the area under the curve between 62.5 and 67.5 inches for
men is much less than 68.26%.
- In fact the area under the male height curve between 62.5 and
67.5 inches turns out to be .1961 or 19.61%.
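Both areas can be computed as differences of the normal cumulative distribution function. The Python sketch below uses `statistics.NormalDist` with the means and standard deviations from this example; the helper `prob_between` is an illustrative name.

```python
from statistics import NormalDist

women = NormalDist(mu=65, sigma=2.5)   # X ~ N(65, 2.5^2)
men = NormalDist(mu=70, sigma=3)       # Y ~ N(70, 3^2)

def prob_between(dist: NormalDist, low: float, high: float) -> float:
    """P(low < height < high): area under the density between the two values."""
    return dist.cdf(high) - dist.cdf(low)

p_women = prob_between(women, 62.5, 67.5)   # within 1 SD of the women's mean
p_men = prob_between(men, 62.5, 67.5)       # same interval, men's curve

print(round(p_women, 4), round(p_men, 4))   # 0.6827 0.1961
```

For women the interval 62.5 to 67.5 inches is exactly the mean plus or minus one SD, which is why it captures about 68% of the area; for men the same interval lies well below the mean of 70 inches, so it captures far less.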