
Statistics and Data Analysis I – IDC – 2017

Avner Halevy

Lecture 4 – Relative Position and the Normal Distribution


Suppose a friend tells you he got a 77 on an exam. Is this enough information to know if he did well?
Maybe not. If the average on the exam was 93, then 77 is not a stellar achievement. On the other
hand, if the average was 52, then 77 is fantastic. That’s why your first question would probably be:
what was the average? You would want to know where 77 stands relative to the average. Our first task
today is to describe two methods for measuring the relative position of an individual observation
in a dataset. Our second task is to introduce the normal distribution.

Percentiles

If you think about it, we’ve already seen a few examples of relative position. We know the median
is found right between the lower 50% and the upper 50% of the data. If 77 is higher than the
median, your friend did better than at least 50% of the students who took the exam. We’ve also
computed quartiles. If 77 is higher than the third quartile, Q3, your friend did better than at least
75% of the students.
In fact, we can use the same procedure that we used to find the median and quartiles in order to find
any percentile: given a number 0 < p < 100, a value denoted by xp is the pth percentile if p% of
the values in the dataset are less than or equal to it. For example, the median is the 50th percentile
and Q3 is the 75th percentile. Note that a percentile is itself not a percentage. Rather, it is a value
of the variable under study that marks a certain percentage of the observations of that variable in the
dataset. The example below will demonstrate this.
In its most general form, the formula that we’ve used before lets us compute any percentile. If
0 < p < 100 (for example, p = 50 gives the median), the pth percentile xp is given by
 
xp = L0 + ((p − F0) / f) · (L1 − L0)

L0 = bottom border of xp’s class
L1 = top border of xp’s class
F0 = cumulative frequency (in %) up to (but not including) xp’s class
f = relative frequency (in %) of xp’s class
For example, consider the following familiar table:
age (years)   f (in thousands)   f (%)   F (%)
18-25              31.0             8       8
25-35             138.4            36      44
35-55             182.4            48      92
55-65              31.6             8     100
Sum               383.4           100

Suppose we wish to find the 90th percentile, so p = 90. The F column in the table tells us that the
90th percentile, denoted by x90 , falls somewhere in the class 35-55. Using the formula,
 
x90 = 35 + ((90 − 44) / 48) · (55 − 35) = 54.17 years.
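As a minimal sketch, the same computation can be carried out in Python. The function name and the tuple layout are my own choices, not from the lecture; the data is the age table above.

```python
def grouped_percentile(p, classes):
    """Return the pth percentile from grouped data.

    classes: list of (L0, L1, f, F0) tuples, where L0 and L1 are the
    class borders, f is the class's relative frequency (in %) and F0 is
    the cumulative frequency (in %) up to, but not including, the class.
    """
    for L0, L1, f, F0 in classes:
        if p <= F0 + f:  # the percentile falls in this class
            return L0 + (p - F0) / f * (L1 - L0)
    raise ValueError("p must be between 0 and 100")

# Age table from the lecture: (L0, L1, f%, F0%)
age_classes = [(18, 25, 8, 0), (25, 35, 36, 8),
               (35, 55, 48, 44), (55, 65, 8, 92)]

print(round(grouped_percentile(90, age_classes), 2))  # 54.17
```

The same call with p = 50 recovers the median of the grouped data (37.5 years).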

But if we go back to the example we started with, we would actually like to go in the other direction.
That is, we have a value, 77, and we wish to know what percentile it represents. In other words, for
what p is 77 = xp ? If we had the data about the other exams, we could figure out the answer by
reasoning in an analogous way to the one that led us to the above formula. To demonstrate this, let’s
continue with the above table and ask instead: what is the percentile of age 45? Since 45 falls right
in the middle of 35-55, I’m sure you’ve already figured out the answer. But let’s reason carefully in a
general way that will always work.
First, since 45 falls in the class 35-55, we know the percentile of 45 will be somewhere between 44 and
92, which are the cumulative frequencies at the beginning and end of the class. We simply ask: what
fraction of the way between 35 and 55 have we gone? The answer is
(45 − 35) / (55 − 35).
Therefore we should take this fraction of the relative frequency available in this class, which is f = 48:
 
((45 − 35) / (55 − 35)) · 48
Finally, we need to add this percentage to what we already had at the beginning of this class, 44:
 
p = 44 + ((45 − 35) / (55 − 35)) · 48 = 44 + 24 = 68.
Thus, 68% of the women in the study had age less than or equal to 45 years.
We can summarize what we did in a general formula:

 
p = F0 + f · ((xp − L0) / (L1 − L0))

L0 = bottom border of xp’s class
L1 = top border of xp’s class
F0 = cumulative frequency (in %) up to (but not including) xp’s class
f = relative frequency (in %) of xp’s class
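The reverse direction can be sketched the same way (again the function name and tuple layout are my own):

```python
def percentile_of(x, classes):
    """Return the percentile p such that x = x_p, from grouped data.

    classes: list of (L0, L1, f, F0) tuples, where L0 and L1 are the
    class borders, f is the class's relative frequency (in %) and F0 is
    the cumulative frequency (in %) up to, but not including, the class.
    """
    for L0, L1, f, F0 in classes:
        if L0 <= x <= L1:
            return F0 + f * (x - L0) / (L1 - L0)
    raise ValueError("x lies outside all classes")

# Age table from the lecture: (L0, L1, f%, F0%)
age_classes = [(18, 25, 8, 0), (25, 35, 36, 8),
               (35, 55, 48, 44), (55, 65, 8, 92)]

print(percentile_of(45, age_classes))  # 68.0
```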

Standard Score

Another measure of relative position builds on our natural tendency to compare an observation to the
average. The standard score is a number that tells us two things:
• The sign (positive or negative) of the number tells us which side of the average the observation
falls on (if the number is 0, the observation is precisely average).

• The magnitude of the number tells us how far from the average the observation is found, where
the distance is measured in standard deviations.
Given an observation x from a dataset with mean x̄ and standard deviation s, the standard score,
also known as a z-score and denoted by z, is

z = (x − x̄) / s.
For example, suppose the average on your friend’s exam was x̄ = 68 and the standard deviation of the
exam scores was s = 6. Then the standard score of 77 is

z = (77 − 68) / 6 = 1.5.
Since 1.5 is positive, this z-score tells us that 77 was above the average. The magnitude of this number
tells us that 77 was 1.5 standard deviations away from the mean.
Note that we can also use this formula in the other direction. For example, if we didn’t know your
friend’s raw score on the exam, but knew the standard score as well as the mean and standard
deviation, we would solve the above formula for x

x = x̄ + zs

and use it to find the raw score


x = 68 + (1.5)(6) = 77.
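Both directions of the formula can be sketched in two lines of Python (the function names are my own):

```python
def z_score(x, mean, s):
    """Standard score: signed distance from the mean, in standard deviations."""
    return (x - mean) / s

def raw_score(z, mean, s):
    """Invert the z-score formula: x = mean + z * s."""
    return mean + z * s

print(z_score(77, 68, 6))     # 1.5
print(raw_score(1.5, 68, 6))  # 77.0
```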

Let’s go back to your friend. Suppose the 77 was a grade on a math exam, and the same friend also
got a 70 on an econ exam. In which course did your friend do better? As it stands, the answer is not
clear, because we cannot tell where 70 stands relative to the rest of the scores. But suppose we also
know that the average on the econ exam was 64 and the standard deviation 3. Then the standard
score is
(70 − 64) / 3 = 2.0.
This means that your friend scored two standard deviations above the mean in econ, so relatively
speaking your friend did better in econ than he did in math, even though the raw scores suggest a
different story. This is something else standard scores let us do: they let us compare values from
different distributions by using a common standard.
Given an arbitrary dataset consisting of the n observations

x1 , x2 , . . . , xn

suppose we standardize each one of them and obtain the n z-scores

z1 , z2 , . . . , zn .

Then the following will always hold:


• The mean of the z-scores will be 0.
• The standard deviation of the z-scores will be 1.
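These two facts are easy to check numerically. The dataset below is made up for illustration; any dataset would do.

```python
from statistics import mean, stdev

data = [77, 70, 85, 62, 91, 58]        # an arbitrary dataset (made up here)
m, s = mean(data), stdev(data)
z = [(x - m) / s for x in data]         # standardize every observation

# Up to floating-point rounding, the z-scores always have
# mean 0 and standard deviation 1.
print(mean(z), stdev(z))
```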

Figure 1: Percentiles and z-scores for the normal distribution

In Figure 1 we see a comparison of percentiles and z-scores for a continuous distribution which is
ubiquitous in statistics and which we now introduce.

The Normal Distribution

We have seen that when the values of a continuous variable are grouped into increasingly finer classes,
the shape of the corresponding histogram often begins to look like the familiar bell curve. For example,
the distribution of grades on an exam might look like this:

Figure 2: Histogram approximated by a bell curve

This makes the bell curve an extremely useful tool in statistics. The distribution that generates this
curve precisely is called the normal distribution. It is completely determined by two numbers,

called the parameters of the distribution:
• The mean of the normal distribution, denoted by µ, determines where the center of the bell is
located.
• The standard deviation of the distribution, denoted by σ, determines how wide the bell is.
Putting everything together, we would write

X ∼ N(µ, σ²)

and say that the variable X is normally distributed with mean µ and standard deviation σ. (Note that
by convention, when using the above notation we specify the variance σ² rather than the standard
deviation σ.) An example of such a distribution is shown in Figure 3. Note that the normal distribution
is always symmetric about the mean, which is a property we shall exploit extensively.

Figure 3: The Normal Distribution

In Figure 4 we compare three different normal distributions. The blue and the red distributions have
the same mean, but different standard deviations. Since the blue curve has a wider spread, it must
have a larger standard deviation. On the other hand, the green and the red curves have the same
standard deviation but different means. Since the red curve is located to the right of the green curve,
it must have a larger mean.
In Figure 5 we can see that, regardless of what the values of µ and σ are, certain percentages of
the data will always lie within certain distances of the mean, when these distances are measured in
standard deviations.
One particular normal distribution has a special name: the standard normal distribution has a
mean of 0 and a standard deviation of 1. The standard normal distribution is denoted by Z. Thus

Z ∼ N (0, 1)

Figure 4: Comparing normal distributions

Figure 5: From standard deviations to relative frequencies

We will soon see that we can answer any question about the standard normal distribution. Thus, it is
useful to translate questions about other normal distributions to questions about the standard normal

distribution. This procedure is called standardization and is very simple (we demonstrate it below):

if X ∼ N(µ, σ²), then Z = (X − µ) / σ ∼ N(0, 1)
(This formula should remind you of the one we saw earlier today for computing a z-score.) In words: to
standardize a value from a general normal distribution, subtract the mean and divide by the standard
deviation.

Probability and the Normal Distribution

Suppose we are told that a certain variable has the standard normal distribution, Z. If we look
at a random value out of this distribution, we could ask a question like: what is the probability that
this value is less than or equal to 0.35? A concise notation for this question is

P (Z ≤ 0.35) =?

Since this is a question about the standard normal distribution, we can answer it using the standard
normal table, the top portion of which is shown in Figure 6 (the complete table is posted on Moodle
under “Resources”).

Figure 6: The standard normal table

The values in the left column and top row together specify z-values: the left column specifies the units
digit and the first digit after the decimal point, and the top row specifies the second digit after the
decimal point. The values inside the table specify probabilities. For example, to answer our question,
find 0.3 in the left column, and 0.05 in the top row, which together make up 0.35. Now find the value
where the row of 0.3 and the column of 0.05 intersect: 0.6368. As the partly shaded graph above the

table suggests, this value represents the area under the curve to the left of 0.35. Since areas correspond
to probabilities (which are simply relative frequencies expressed as proportions), this is exactly what
we needed:
P (Z ≤ 0.35) = 0.6368.
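The table value can be reproduced, up to rounding, from the standard normal CDF, which can be written with Python’s math.erf. This is a sketch for checking table lookups, not the method used in the lecture.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: the area under the curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(0.35), 4))  # 0.6368, matching the table
```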

We can answer other questions, with the key idea always being the equivalence of area under the curve
and probability. Any region under the curve will have area between 0 and 1, and every corresponding
probability will likewise be between 0 and 1.
For example, what is P (Z > 0.35)? We can’t use the table since it only has probabilities of the form
P (Z ≤ z), where z is some non-negative number. However, we recognize the complementary event:
Z ≤ 0.35. These are complementary since any Z will necessarily satisfy exactly one of the following
inequalities:
Z ≤ 0.35 or Z > 0.35
This means that the probabilities of these events must sum to 1:
P (Z ≤ 0.35) + P (Z > 0.35) = 1
Therefore
P (Z > 0.35) = 1 − P (Z ≤ 0.35)
Remember that we already know P (Z ≤ 0.35) = 0.6368. Thus,
P (Z > 0.35) = 1 − P (Z ≤ 0.35) = 1 − 0.6368 = 0.3632.
How about P (Z < −0.35)? The problem now is that the table only has positive z-values. However, we
remember that every normal distribution is symmetric about the mean. In particular, the standard
normal is symmetric about 0: see Figure 7, which depicts the general situation.

Figure 7: Exploiting symmetry

Thus, the area under the curve to the left of -0.35 must be equal to the area under the curve to the
right of 0.35, so if we knew P (Z > 0.35) we’d be done. Using the last problem we solved,
By symmetry and then the complement rule,

P (Z < −0.35) = P (Z > 0.35) = 1 − P (Z ≤ 0.35) = 1 − 0.6368 = 0.3632.
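The symmetry argument can be checked numerically with the same erf-based CDF sketch as before:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Symmetry about 0: the area left of -0.35 equals the area right of +0.35
print(round(phi(-0.35), 4))     # 0.3632
print(round(1 - phi(0.35), 4))  # 0.3632
```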

Another kind of question might be: what is the probability that Z is between 0.35 and 0.55 (inclusive)?
We need to find P (0.35 ≤ Z ≤ 0.55). Again, we can’t use the table just yet, but Figure 8, which
depicts the general situation, should help.

Figure 8: The area under the curve between two points

We realize that we can find the area under the curve between 0.35 and 0.55 by subtracting the area
to the left of 0.35 from the area to the left of 0.55:

P (0.35 ≤ Z ≤ 0.55) = P (Z ≤ 0.55) − P (Z < 0.35)

From the table, we see that P (Z ≤ 0.55) = 0.7088. We already know that P (Z ≤ 0.35) = 0.6368
and it turns out that (for any continuous distribution) excluding the possibility of equality does not
change the probability. Therefore

P (0.35 ≤ Z ≤ 0.55) = P (Z ≤ 0.55) − P (Z < 0.35) = 0.7088 − 0.6368 = 0.072.
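The same difference-of-areas idea is one line of code, again using an erf-based sketch of the standard normal CDF:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(0.35 <= Z <= 0.55) as a difference of two left-tail areas
print(round(phi(0.55) - phi(0.35), 4))  # 0.072
```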

Let us now consider a normal distribution which is not standard. Suppose the scores on a certain
exam are normally distributed with mean 80 and standard deviation 10. Thus, if we let X denote the
scores on the exam, we are given that X ∼ N(80, 10²). What is the probability that a random student
scored below 84? In this case, before we can use the table, we need to standardize the variable:
 
P (X ≤ 84) = P ((X − µ)/σ ≤ (84 − 80)/10) = P (Z ≤ 0.40) = 0.6554
where the last value was read directly from the table. Continuing with the same example, one might
ask: what was the 70th percentile on the exam? First, what is the 70th percentile of the standard normal? This
time we look inside the table for the smallest value greater than or equal to 0.70, which is 0.7019, and
then follow its row and column to the value 0.53. This is the z-value such that roughly 0.7 of the area
under the curve lies to its left. Since area represents probability, this must be the 70th percentile for
Z. Now we solve the standardization equation for X in terms of Z:

X = µ + Zσ

Then the 70th percentile of X is


80 + (0.53)(10) = 85.3
What about the 30th percentile? Since 30+70=100, we can use complementary events and symmetry
to conclude that the 30th percentile must be

80 + (−0.53)(10) = 74.7.
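The exam-score calculations above can be checked with Python’s statistics.NormalDist (available since Python 3.8), which standardizes internally. It uses the exact inverse CDF rather than a table lookup, so its percentiles differ slightly from the table-based 85.3 and 74.7, whose z = ±0.53 is rounded.

```python
from statistics import NormalDist

scores = NormalDist(mu=80, sigma=10)   # X ~ N(80, 10^2)

print(round(scores.cdf(84), 4))        # 0.6554  -> P(X <= 84)

# Exact 70th and 30th percentiles; the table lookup gave 85.3 and 74.7
print(round(scores.inv_cdf(0.70), 2))  # 85.24
print(round(scores.inv_cdf(0.30), 2))  # 74.76
```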
