
Statistics and Data Analysis I – IDC – 2017

Avner Halevy

Lecture 4 – Relative Position and the Normal Distribution


Suppose a friend tells you he got a 77 on an exam. Is this enough information to know if he did well?
Maybe not. If the average on the exam was 93, then 77 is not a stellar achievement. On the other
hand, if the average was 52, then 77 is fantastic. That’s why your first question would probably be:
what was the average? You would want to know where 77 stands relative to the average. Our first task
today is to describe two methods for measuring the relative position of an individual observation
in a dataset. Our second task is to introduce the normal distribution.

Percentiles

If you think about it, we’ve already seen a few examples of relative position. We know the median
is found right between the lower 50% and the upper 50% of the data. If 77 is higher than the
median, your friend did better than at least 50% of the students who took the exam. We’ve also
computed quartiles. If 77 is higher than the third quartile, Q3, your friend did better than at least
75% of the students.
In fact, we can use the same procedure that we used to find the median and quartiles in order to find
any percentile: given a number 0 < p < 100, a value denoted by xp is the pth percentile if p% of
the values in the dataset are less than or equal to it. For example, the median is the 50th percentile
and Q3 is the 75th percentile. Note that a percentile is itself not a percentage. Rather, it is a value
of the variable under study that marks a certain percentage of the observations of that variable in the
dataset. The example below will demonstrate this.
In its most general form, the formula that we’ve used before lets us compute any percentile. If
0 < p < 100 (for example, p = 50 gives the median), the pth percentile xp is given by
 
xp = L0 + ((p − F0) / f) · (L1 − L0)

L0 = bottom border of xp’s class
L1 = top border of xp’s class
F0 = cumulative frequency (in %) up to (but not including) xp’s class
f = relative frequency (in %) of xp’s class
For example, consider the following familiar table:
age (years)   f (in thousands)   f (%)   F (%)
18-25              31.0             8       8
25-35             138.4            36      44
35-55             182.4            48      92
55-65              31.6             8     100
Sum               383.4           100

Suppose we wish to find the 90th percentile, so p = 90. The F column in the table tells us that the
90th percentile, denoted by x90 , falls somewhere in the class 35-55. Using the formula,
 
x90 = 35 + ((90 − 44) / 48) · (55 − 35) = 54.17 years.
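As a minimal sketch, the same computation can be carried out in Python. The function name and the tuple layout are my own choices, not from the lecture; the data is the age table above.

```python
def grouped_percentile(p, classes):
    """Return the pth percentile from grouped data.

    classes: list of (L0, L1, f, F0) tuples, where L0 and L1 are the
    class borders, f is the class's relative frequency (in %) and F0 is
    the cumulative frequency (in %) up to, but not including, the class.
    """
    for L0, L1, f, F0 in classes:
        if p <= F0 + f:  # the percentile falls in this class
            return L0 + (p - F0) / f * (L1 - L0)
    raise ValueError("p must be between 0 and 100")

# Age table from the lecture: (L0, L1, f%, F0%)
age_classes = [(18, 25, 8, 0), (25, 35, 36, 8),
               (35, 55, 48, 44), (55, 65, 8, 92)]

print(round(grouped_percentile(90, age_classes), 2))  # 54.17
```

The same call with p = 50 recovers the median of the grouped data (37.5 years).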

But if we go back to the example we started with, we would actually like to go in the other direction.
That is, we have a value, 77, and we wish to know what percentile it represents. In other words, for
what p is 77 = xp ? If we had the data about the other exams, we could figure out the answer by
reasoning in an analogous way to the one that led us to the above formula. To demonstrate this, let’s
continue with the above table and ask instead: what is the percentile of age 45? Since 45 falls right
in the middle of 35-55, I’m sure you’ve already figured out the answer. But let’s reason carefully in a
general way that will always work.
First, since 45 falls in the class 35-55, we know the percentile of 45 will be somewhere between 44 and
92, which are the cumulative frequencies at the beginning and end of the class. We simply ask: what
fraction of the way between 35 and 55 have we gone? The answer is
(45 − 35) / (55 − 35).
Therefore we should take this fraction of the relative frequency available in this class, which is f = 48:
 
((45 − 35) / (55 − 35)) · 48
Finally, we need to add this percentage to what we already had at the beginning of this class, 44:
 
p = 44 + ((45 − 35) / (55 − 35)) · 48 = 44 + 24 = 68.
Thus, 68% of the women in the study had age less than or equal to 45 years.
We can summarize what we did in a general formula:

 
p = F0 + f · ((xp − L0) / (L1 − L0))

L0 = bottom border of xp’s class
L1 = top border of xp’s class
F0 = cumulative frequency (in %) up to (but not including) xp’s class
f = relative frequency (in %) of xp’s class
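The reverse direction can be sketched the same way (again the function name and tuple layout are my own):

```python
def percentile_of(x, classes):
    """Return the percentile p such that x = x_p, from grouped data.

    classes: list of (L0, L1, f, F0) tuples, where L0 and L1 are the
    class borders, f is the class's relative frequency (in %) and F0 is
    the cumulative frequency (in %) up to, but not including, the class.
    """
    for L0, L1, f, F0 in classes:
        if L0 <= x <= L1:
            return F0 + f * (x - L0) / (L1 - L0)
    raise ValueError("x lies outside all classes")

# Age table from the lecture: (L0, L1, f%, F0%)
age_classes = [(18, 25, 8, 0), (25, 35, 36, 8),
               (35, 55, 48, 44), (55, 65, 8, 92)]

print(percentile_of(45, age_classes))  # 68.0
```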

Standard Score

Another measure of relative position builds on our natural tendency to compare an observation to the
average. The standard score is a number that tells us two things:
• The sign (positive or negative) of the number tells us which side of the average the observation
falls on (if the number is 0, the observation is precisely average).

• The magnitude of the number tells us how far from the average the observation is found, where
the distance is measured in standard deviations.
Given an observation x from a dataset with mean x̄ and standard deviation s, the standard score,
also known as a z-score and denoted by z, is

z = (x − x̄) / s.
For example, suppose the average on your friend’s exam was x̄ = 68 and the standard deviation of the
exam scores was s = 6. Then the standard score of 77 is

z = (77 − 68) / 6 = 1.5.
Since 1.5 is positive, this z-score tells us that 77 was above the average. The magnitude of this number
tells us that 77 was 1.5 standard deviations away from the mean.
Note that we can also use this formula in the other direction. For example, if we didn’t know your
friend’s raw score on the exam, but knew the standard score as well as the mean and standard
deviation, we would solve the above formula for x

x = x̄ + zs

and use it to find the raw score


x = 68 + (1.5)(6) = 77.
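Both directions of the formula can be sketched in two lines of Python (the function names are my own):

```python
def z_score(x, mean, s):
    """Standard score: signed distance from the mean, in standard deviations."""
    return (x - mean) / s

def raw_score(z, mean, s):
    """Invert the z-score formula: x = mean + z * s."""
    return mean + z * s

print(z_score(77, 68, 6))     # 1.5
print(raw_score(1.5, 68, 6))  # 77.0
```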

Let’s go back to your friend. Suppose the 77 was a grade on a math exam, and the same friend also
got a 70 on an econ exam. In which course did your friend do better? As it stands, the answer is not
clear, because we cannot tell where 70 stands relative to the rest of the scores. But suppose we also
know that the average on the econ exam was 64 and the standard deviation 3. Then the standard
score is
(70 − 64) / 3 = 2.0.
This means that your friend scored two standard deviations above the mean in econ, so relatively
speaking your friend did better in econ than he did in math, even though the raw scores suggest a
different story. This is something else standard scores let us do: they let us compare values from
different distributions by using a common standard.
Given an arbitrary dataset consisting of the n observations

x1 , x2 , . . . , xn

suppose we standardize each one of them and obtain the n z-scores

z1 , z2 , . . . , zn .

Then the following will always hold:


• The mean of the z-scores will be 0.
• The standard deviation of the z-scores will be 1.
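These two facts are easy to check numerically. The dataset below is made up for illustration; any dataset would do.

```python
from statistics import mean, stdev

data = [77, 70, 85, 62, 91, 58]        # an arbitrary dataset (made up here)
m, s = mean(data), stdev(data)
z = [(x - m) / s for x in data]         # standardize every observation

# Up to floating-point rounding, the z-scores always have
# mean 0 and standard deviation 1.
print(mean(z), stdev(z))
```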

Figure 1: Percentiles and z-scores for the normal distribution

In Figure 1 we see a comparison of percentiles and z-scores for a continuous distribution which is
ubiquitous in statistics and which we now introduce.

The Normal Distribution

We have seen that when the values of a continuous variable are grouped into increasingly finer classes,
the shape of the corresponding histogram often begins to look like the familiar bell curve. For example,
the distribution of grades on an exam might look like this:

Figure 2: Histogram approximated by a bell curve

This makes the bell curve an extremely useful tool in statistics. The distribution that generates this
curve precisely is called the normal distribution. It is completely determined by two numbers,

called the parameters of the distribution:
• The mean of the normal distribution, denoted by µ, determines where the center of the bell is
located.
• The standard deviation of the distribution, denoted by σ, determines how wide the bell is.
Putting everything together, we would write

X ∼ N(µ, σ²)

and say that the variable X is normally distributed with mean µ and standard deviation σ. (Note that
by convention, when using the above notation we specify the variance σ² rather than the standard
deviation σ.) An example of such a distribution is shown in Figure 3. Note that the normal distribution
is always symmetric about the mean, which is a property we shall exploit extensively.

Figure 3: The Normal Distribution

In Figure 4 we compare three different normal distributions. The blue and the red distributions have
the same mean, but different standard deviations. Since the blue curve has a wider spread, it must
have a larger standard deviation. On the other hand, the green and the red curves have the same
standard deviation but different means. Since the red curve is located to the right of the green curve,
it must have a larger mean.
In Figure 5 we can see that, regardless of what the values of µ and σ are, certain percentages of
the data will always lie within certain distances of the mean, when these distances are measured in
standard deviations.
One particular normal distribution has a special name: the standard normal distribution has a
mean of 0 and a standard deviation of 1. The standard normal distribution is denoted by Z. Thus

Z ∼ N (0, 1)

Figure 4: Comparing normal distributions

Figure 5: From standard deviations to relative frequencies

We will soon see that we can answer any question about the standard normal distribution. Thus, it is
useful to translate questions about other normal distributions to questions about the standard normal

distribution. This procedure is called standardization and is very simple (we demonstrate it below):

if X ∼ N(µ, σ²), then Z = (X − µ) / σ ∼ N(0, 1)
(This formula should remind you of the one we saw earlier today for computing a z-score.) In words: to
standardize a value from a general normal distribution, subtract the mean and divide by the standard
deviation.

Probability and the Normal Distribution

Suppose we are told that a certain variable has the standard normal distribution, Z. If we look
at a random value out of this distribution, we could ask a question like: what is the probability that
this value is less than or equal to 0.35? A concise notation for this question is

P (Z ≤ 0.35) =?

Since this is a question about the standard normal distribution, we can answer it using the standard
normal table, the top portion of which is shown in Figure 6 (the complete table is posted on Moodle
under “Resources”).

Figure 6: The standard normal table

The values in the left column and top row together specify z-values: the left column specifies the units
digit and the first digit after the decimal point, and the top row specifies the second digit after the
decimal point. The values inside the table specify probabilities. For example, to answer our question,
find 0.3 in the left column, and 0.05 in the top row, which together make up 0.35. Now find the value
where the row of 0.3 and the column of 0.05 intersect: 0.6368. As the partly shaded graph above the

table suggests, this value represents the area under the curve to the left of 0.35. Since areas correspond
to probabilities (which are simply relative frequencies expressed as proportions), this is exactly what
we needed:
P (Z ≤ 0.35) = 0.6368.
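The table value can be reproduced, up to rounding, from the standard normal CDF, which can be written with Python’s math.erf. This is a sketch for checking table lookups, not the method used in the lecture.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: the area under the curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(0.35), 4))  # 0.6368, matching the table
```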

We can answer other questions, with the key idea always being the equivalence of area under the curve
and probability. Any region under the curve will have area between 0 and 1, and every corresponding
probability will likewise be between 0 and 1.
For example, what is P (Z > 0.35)? We can’t use the table since it only has probabilities of the form
P (Z ≤ z), where z is some non-negative number. However, we recognize the complementary event:
Z ≤ 0.35. These are complementary since any Z will necessarily satisfy exactly one of the following
inequalities:
Z ≤ 0.35 or Z > 0.35
This means that the probabilities of these events must sum to 1:
P (Z ≤ 0.35) + P (Z > 0.35) = 1
Therefore
P (Z > 0.35) = 1 − P (Z ≤ 0.35)
Remember that we already know P (Z ≤ 0.35) = 0.6368. Thus,
P (Z > 0.35) = 1 − P (Z ≤ 0.35) = 1 − 0.6368 = 0.3632.
How about P (Z < −0.35)? The problem now is that the table only has positive z-values. However, we
remember that every normal distribution is symmetric about the mean. In particular, the standard
normal is symmetric about 0: see Figure 7, which depicts the general situation.

Figure 7: Exploiting symmetry

Thus, the area under the curve to the left of -0.35 must be equal to the area under the curve to the
right of 0.35, so if we knew P (Z > 0.35) we’d be done. Using the last problem we solved,
By symmetry and then the complement rule,

P (Z < −0.35) = P (Z > 0.35) = 1 − P (Z ≤ 0.35) = 1 − 0.6368 = 0.3632.
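The symmetry argument can be checked numerically with the same erf-based CDF sketch as before:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Symmetry about 0: the area left of -0.35 equals the area right of +0.35
print(round(phi(-0.35), 4))     # 0.3632
print(round(1 - phi(0.35), 4))  # 0.3632
```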

Another kind of question might be: what is the probability that Z is between 0.35 and 0.55 (inclusive)?
We need to find P (0.35 ≤ Z ≤ 0.55). Again, we can’t use the table just yet, but Figure 8, which
depicts the general situation, should help.

Figure 8: The area under the curve between two points

We realize that we can find the area under the curve between 0.35 and 0.55 by subtracting the area
to the left of 0.35 from the area to the left of 0.55:

P (0.35 ≤ Z ≤ 0.55) = P (Z ≤ 0.55) − P (Z < 0.35)

From the table, we see that P (Z ≤ 0.55) = 0.7088. We already know that P (Z ≤ 0.35) = 0.6368
and it turns out that (for any continuous distribution) excluding the possibility of equality does not
change the probability. Therefore

P (0.35 ≤ Z ≤ 0.55) = P (Z ≤ 0.55) − P (Z < 0.35) = 0.7088 − 0.6368 = 0.072.
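The same difference-of-areas idea is one line of code, again using an erf-based sketch of the standard normal CDF:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(0.35 <= Z <= 0.55) as a difference of two left-tail areas
print(round(phi(0.55) - phi(0.35), 4))  # 0.072
```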

Let us now consider a normal distribution which is not standard. Suppose the scores on a certain
exam are normally distributed with mean 80 and standard deviation 10. Thus, if we let X denote the
scores on the exam, we are given that X ∼ N(80, 10²). What is the probability that a random student
scored below 84? In this case, before we can use the table, we need to standardize the variable:
 
P (X ≤ 84) = P ((X − µ)/σ ≤ (84 − 80)/10) = P (Z ≤ 0.40) = 0.6554
where the last value was read directly from the table. Continuing with the same example, one might
ask: what was the 70th percentile on the exam? First, what is the 70th percentile of the standard normal? This
time we look inside the table for the smallest value greater than or equal to 0.70, which is 0.7019, and
then follow its row and column to the value 0.53. This is the z-value such that roughly 0.7 of the area
under the curve lies to its left. Since area represents probability, this must be the 70th percentile for
Z. Now we solve the standardization equation for X in terms of Z:

X = µ + Zσ

Then the 70th percentile of X is


80 + (0.53)(10) = 85.3
What about the 30th percentile? Since 30+70=100, we can use complementary events and symmetry
to conclude that the 30th percentile must be

80 + (−0.53)(10) = 74.7.
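The exam-score calculations above can be checked with Python’s statistics.NormalDist (available since Python 3.8), which standardizes internally. It uses the exact inverse CDF rather than a table lookup, so its percentiles differ slightly from the table-based 85.3 and 74.7, whose z = ±0.53 is rounded.

```python
from statistics import NormalDist

scores = NormalDist(mu=80, sigma=10)   # X ~ N(80, 10^2)

print(round(scores.cdf(84), 4))        # 0.6554  -> P(X <= 84)

# Exact 70th and 30th percentiles; the table lookup gave 85.3 and 74.7
print(round(scores.inv_cdf(0.70), 2))  # 85.24
print(round(scores.inv_cdf(0.30), 2))  # 74.76
```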
